The Corpus for Early Modern Icelandic

The Corpus for Early Modern Icelandic (CEMI; Icelandic: Málheild síðari alda, MSA) is a PoS-tagged corpus of texts originally written between 1550 and 1900. PoS-tagged means that each word form is accompanied by a lemma (e.g., the nominative singular for nouns and the infinitive for verbs) and a tag that shows the part of speech and often also grammatical features, such as case, number, and gender for nouns, and person, number, and tense for verbs. Each text in the corpus is also accompanied by metadata about the work from which it is derived; for published texts, this metadata is usually called bibliographic information. The corpus is intended for linguistic research and for use in language technology projects.

The corpus consists mainly of OCR-processed texts. The OCR-processed content is divided into three main parts:

  1. Printed books that had already been digitized and were OCR-processed during the preparation of the corpus.
  2. Handwritten texts and letters that were digitized and OCR-processed using Transkribus during the preparation of the corpus.
  3. Old works republished in the 20th and 21st centuries that were scanned and OCR-processed from those editions.

Work on the project began in 2024. Text collection and software development were primarily carried out at the Árni Magnússon Institute (AMI). The project builds on data collection and preparatory work already conducted at the National and University Library of Iceland (LBS-HBS) and the National Archives of Iceland (ÞJSK).

The project draws on infrastructure and know-how developed by the Centre for Digital Humanities and Arts (MSHL), including experience (2022–2023) with the AI-based software Transkribus for OCR-processing handwritten manuscripts. It also draws on the expertise in tagging, lemmatization, and OCR correction that the Árni Magnússon Institute has built up over the past decade.

The corpus is processed using automated methods. The main steps are:

  1. OCR-processing of texts
    • Transkribus for handwritten texts and letters.
    • Either Tesseract or Google Cloud Vision for printed texts.
  2. OCR correction - Common OCR errors are automatically corrected.
  3. Spelling normalization - A version of the texts with modernized spelling is created.

These steps are largely automated, with occasional manual adjustments. The CEMI texts are then divided into sentences and word forms, which are subsequently tagged and lemmatized. Tags and lemmas are not manually corrected.
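
The post-OCR steps can be illustrated with a minimal Python sketch. It is not the project's software: the word tables OCR_FIXES and SPELLING, the helper names, and the naive sentence and token rules are all assumptions made for illustration, since the actual correction and normalization rules are not described here.

```python
import re

# Hypothetical lookup tables; the project's real rules are not documented here.
OCR_FIXES = {"a1lir": "allir"}        # an assumed OCR misreading (digit 1 for l)
SPELLING = {"eg": "ég", "ad": "að"}   # assumed older spelling -> modern spelling

TOKEN_RE = re.compile(r"\w+|[^\w\s]")  # words, or single punctuation marks


def replace_words(text, table):
    """Replace whole words found in `table`, keeping an initial capital."""
    def swap(match):
        word = match.group(0)
        new = table.get(word.lower())
        if new is None:
            return word
        return new.capitalize() if word[0].isupper() else new
    return re.sub(r"\w+", swap, text)


def split_sentences(text):
    """Naive split: a sentence ends at '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Split a sentence into word forms and punctuation marks."""
    return TOKEN_RE.findall(sentence)


def process(ocr_text):
    """OCR correction, then spelling normalization, then sentence/word division."""
    corrected = replace_words(ocr_text, OCR_FIXES)
    normalized = replace_words(corrected, SPELLING)
    return [tokenize(s) for s in split_sentences(normalized)]


if __name__ == "__main__":
    for sentence in process("Eg sa ad a1lir komu. Þeir foru heim."):
        print(sentence)
```

In a real pipeline the resulting word forms would then be passed to a tagger and a lemmatizer; that stage is left out because the tools used are not named here.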

Subcorpora

CEMI will eventually be subdivided into various subcorpora.

Licensing

Most of the corpus is derived from works made publicly available by the National Library of Iceland on the website Bækur.is, and the licensing of these texts will align with open access.

Some individual works in the corpus are derived from republished editions of old works. The licensing of such texts is determined on a case-by-case basis.

Versions

An unofficial preliminary version has been made available on the Árni Magnússon Institute's Corpus Website. There, it is possible to search two texts in the corpus and view the output in the KORP user interface.

Related Datasets:

19th century corpus: Texts published between 1800 and 1920. Mainly data from Timarit.is.

The Icelandic Gigaword Corpus (IGC): The largest PoS-tagged corpus for Icelandic. See the Gigaword Corpus information website.

MIM-GOLD 20.05 is a gold standard for PoS-tagging Icelandic texts. It contains approximately 1 million running words with manually annotated PoS tags. MIM-GOLD 20.05 uses a tagset revised in 2019–2020. Train/test splits are also available. Previous versions of MIM-GOLD, 0.9 and 1.0, are also available.

Icelandic Tagged Corpus contains approximately 25 million running words. Further information here.

Icelandic Frequency Dictionary has been used for training and testing PoS-taggers for Icelandic since such work started. Training/Testing sets are available with various revision versions of the tagset. The current one is 20.05. Versions 18.10 and 12.11 are also available.

Using the Corpus

The corpus will eventually be available for download in TEI format.
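
Since the download format has not been released yet, the following Python sketch only illustrates how tagged and lemmatized tokens might be read from a TEI file. It assumes the common TEI convention of sentence elements (s) containing word elements (w) with lemma and pos attributes; the sample document, the attribute choice, and the tag strings are hypothetical and may differ from the published files.

```python
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical sample; the real CEMI files may use a different layout.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <s>
      <w lemma="hestur" pos="nken">hestur</w>
      <w lemma="hlaupa" pos="sfg3en">hleypur</w>
      <pc>.</pc>
    </s>
  </p></body></text>
</TEI>"""


def read_tokens(tei_xml):
    """Return one list per sentence of (word form, lemma, tag) triples."""
    root = ET.fromstring(tei_xml)
    sentences = []
    for s in root.iterfind(".//tei:s", TEI_NS):
        sentences.append(
            [(w.text, w.get("lemma"), w.get("pos"))
             for w in s.iterfind("tei:w", TEI_NS)]
        )
    return sentences


if __name__ == "__main__":
    for sentence in read_tokens(SAMPLE):
        for form, lemma, tag in sentence:
            print(form, lemma, tag, sep="\t")
```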

It is also accessible through a search interface in which the tags (linguistic annotation) can be used to define a search more precisely. The search results are presented as words or phrases in context (keyword in context, KWIC), along with information about the source of each text example. The texts are displayed in two parallel versions: verbatim and normalized. Verbatim texts keep the original spelling of the respective text, while normalized texts use modernized spelling.

The search interface is powered by the Swedish search system Korp.
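
The idea of a keyword-in-context display over two parallel versions can be sketched in a few lines of Python. This is not how Korp is implemented; the function kwic, its parameters, and the sample tokens are assumptions for illustration, and the sketch assumes the verbatim and normalized versions are aligned token by token.

```python
def kwic(verbatim, normalized, keyword, width=3):
    """Search the normalized tokens and show context from both versions.

    `verbatim` and `normalized` are assumed to be equally long token lists
    in which position i of one corresponds to position i of the other.
    """
    if len(verbatim) != len(normalized):
        raise ValueError("token lists must be aligned one-to-one")
    hits = []
    for i, token in enumerate(normalized):
        if token.lower() == keyword.lower():
            lo, hi = max(0, i - width), i + width + 1
            hits.append({
                "verbatim": " ".join(verbatim[lo:hi]),
                "normalized": " ".join(normalized[lo:hi]),
            })
    return hits


if __name__ == "__main__":
    # Hypothetical aligned sample, not taken from the corpus.
    verbatim = ["Eg", "sa", "ad", "allir", "komu"]
    normalized = ["Ég", "sá", "að", "allir", "komu"]
    for hit in kwic(verbatim, normalized, "að", width=1):
        print(hit["verbatim"], "|", hit["normalized"])
```

Searching the normalized version while displaying both means that a single modern spelling can retrieve examples written with varying historical spellings.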

The People Behind the Corpus

The following individuals have worked on the corpus:

Jóhannes Bjarni Sigtryggsson, project management, manual correction (AMI)
Ellert Þór Jóhannsson, project management, manual correction, licensing (AMI)
Steinþór Steingrímsson, project management, text collection, software development (AMI)
Einar Freyr Sigurðsson, project management, text collection, manual correction (AMI)
Bragi Þorgrímur Ólafsson, project management (LBS-HBS)
Kristinn Sigurðsson, project management (LBS-HBS)
Örn Hrafnkelsson, project management (LBS-HBS)
Unnar Ingvarsson, project management (ÞJSK)
Hinrik Hafsteinsson, software development, text collection, OCR processing (AMI)
Bjarki Ármannsson, text collection, OCR processing (AMI)
Starkaður Barkarson, software development (AMI)

References and Further Reading

When publishing research results based on the Corpus for Early Modern Icelandic, please cite the following paper:

Sigtryggsson, Jóhannes B. and Ellert Þór Jóhannsson. 2024. Selecting texts for a Corpus of Early Modern Icelandic. DHNB 2024 Conference, Session #11: Corpora. Reykjavík, Iceland.

Cooperation and Funding

Work on the Corpus for Early Modern Icelandic is carried out at the Árni Magnússon Institute (AMI), the National and University Library of Iceland (LBS-HBS), and the National Archives of Iceland (ÞJSK). The work is funded by the Icelandic Infrastructure Fund (162821011031, project manager Jóhannes Bjarni Sigtryggsson).

Preparations for the project took place in the spring of 2024. Staff began creating the database in September of the same year. The project is expected to be completed in the spring of 2025, at which point the database will be made accessible. The project utilizes Icelandic models in Transkribus, which were developed with support from the Icelandic Infrastructure Fund.