ATLAS: Article Tracking, Linking, and Analysis of Swedish Encyclopedias
Pith reviewed 2026-05-09 16:15 UTC · model grok-4.3
The pith
A pipeline extracts headwords from digitized historical encyclopedias, matches entries across editions, and links them to Wikidata.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors constructed an automated pipeline that extracts headwords from the raw OCR text of the four major editions of Nordisk familjebok, categorizes the corresponding entities, matches entries across editions, and links them to Wikidata items. On the full corpus the pipeline attained an F1 score of 97.8 percent for headword extraction and 93.4 percent for headword classification. A small-scale evaluation showed 93 percent precision for cross-edition matching, and 85 percent precision with 16.5 percent recall for Wikidata linking. The results indicate that such a pipeline can restore usable structure from digitized historical encyclopedias and thereby support analysis of how knowledge evolved.
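For a single-number comparison with the extraction and classification steps, the linking precision and recall quoted above can be combined into an F1 score. The source does not report this value; the sketch below only applies the standard harmonic-mean definition to the two published figures, and the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Wikidata linking figures quoted in the core claim: 85% precision, 16.5% recall.
linking_f1 = f1_score(0.85, 0.165)
print(f"Implied linking F1: {linking_f1:.3f}")
```

Under that definition the implied F1 is roughly 0.28, which shows how strongly the low recall weighs on the linking step despite its high precision.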
What carries the argument
The end-to-end pipeline that performs headword extraction, entity categorization, cross-edition matching, and Wikidata linking on multi-edition historical encyclopedias.
If this is right
- Entries can be tracked systematically across successive editions to reveal how descriptions and facts changed over decades.
- Historical encyclopedia content becomes linkable to contemporary structured knowledge bases such as Wikidata.
- Preservation and computational study of general knowledge become feasible at the scale of entire multi-edition works.
- Knowledge transmission patterns across time can be quantified without exhaustive manual reading of every volume.
Where Pith is reading between the lines
- The same pipeline structure could be adapted to encyclopedias in other languages once equivalent training data for headword detection is collected.
- Released datasets of matched and linked entries could serve as training material for broader historical text-alignment tasks.
- Repeated application over many encyclopedias might surface large-scale regularities in how scientific and cultural facts were revised.
Load-bearing premise
The pipeline assumes the OCR output is clean enough for reliable headword detection and that the small-scale evaluations of matching and linking generalize to the full corpus.
What would settle it
A larger-scale manual evaluation on thousands of entries from additional editions or from a different historical encyclopedia that shows precision or recall falling substantially below the reported figures.
Original abstract
The digitization of old encyclopedias represents an important step to improve access to historically structured knowledge. Often, however, this process does not go beyond an optical character recognition, leaving all the underlying structure unexploited. In addition, many encyclopedias had multiple editions reflecting the evolution of knowledge. The lack of structure in the raw text makes it difficult to track changes across these editions. In this work, we built a pipeline to restore the text structure, where we extract the headwords and identify entries; categorize the entities; match entries across editions; and link entries to a Wikidata item. We applied this pipeline to the four major editions of Nordisk familjebok, an authoritative Swedish encyclopedia published between 1876 and 1951. We could extract the headwords with an F1 score of 97.8% and we obtained an F1 score of 93.4% on the headword classification. On a small-scale evaluation, we reached a 93% precision on the cross-edition matching, 85% precision and 16.5% recall on the Wikidata linking. This shows that an automated approach to digitized historical knowledge is possible. This should facilitate the preservation of general knowledge and the understanding of knowledge transmission. The datasets and programs are available online.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents ATLAS, a pipeline for restoring structure to OCR-digitized historical encyclopedias by extracting headwords (F1 97.8%), classifying entries (F1 93.4%), matching entries across the four editions of Nordisk familjebok (93% precision on small-scale eval), and linking to Wikidata (85% precision, 16.5% recall on small-scale eval). The work aims to enable tracking of knowledge evolution and releases the datasets and code.
Significance. If the reported performance generalizes, the work would be a useful contribution to digital humanities and computational linguistics by demonstrating an automated pipeline for structuring multi-edition encyclopedic texts. The open release of datasets and programs is a clear strength that supports reproducibility. High F1 scores on headword extraction and classification provide solid evidence for the core steps, though the small-scale nature of the matching and linking results limits the strength of claims about full-corpus applicability.
major comments (3)
- [cross-edition matching evaluation] The cross-edition matching reports 93% precision on a small-scale evaluation, but no sample size, sampling procedure, breakdown by edition or entry length, or error analysis is provided, which is load-bearing for the claim that the pipeline enables reliable tracking across editions.
- [Wikidata linking evaluation] Wikidata linking reports 85% precision and 16.5% recall on small-scale evaluation without details on test-set size, selection criteria, or failure modes, undermining assessment of whether the linking step scales to the full corpus.
- [pipeline description and headword extraction] The pipeline assumes OCR output is clean enough for reliable regex- and embedding-based extraction, yet no quantitative OCR quality audit, error-rate breakdown, or ablation on noisy vs. post-processed text is reported.
minor comments (1)
- [abstract and evaluation sections] The abstract and results sections refer to 'small-scale evaluation' without quantifying the scale or providing a table of evaluation sizes; adding this would improve clarity.
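To illustrate why the evaluation size matters, the sketch below computes a Wilson score interval for an observed precision. The sample sizes of 100 and 1000 are hypothetical, since the actual evaluation sizes are not given here; only the 93% matching precision comes from the source.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion p_hat observed on n samples.

    z = 1.96 corresponds to an approximate 95% confidence level.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical sample sizes; 0.93 is the reported matching precision.
for n in (100, 1000):
    low, high = wilson_interval(0.93, n)
    print(f"n={n}: [{low:.3f}, {high:.3f}]")
```

The interval narrows as the sample grows, which is why a stated sample size would help readers judge how far the reported precision can be trusted.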
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating revisions where appropriate to strengthen the reporting of our evaluations and methods.
- Referee: [cross-edition matching evaluation] The cross-edition matching reports 93% precision on a small-scale evaluation, but no sample size, sampling procedure, breakdown by edition or entry length, or error analysis is provided, which is load-bearing for the claim that the pipeline enables reliable tracking across editions.
  Authors: We agree that additional details on the small-scale evaluation are needed to support claims about cross-edition tracking. The evaluation was performed on a manually annotated subset of entry pairs drawn from the four editions. In the revised manuscript we will report the exact sample size, the sampling procedure (random selection from candidate pairs with stratification by edition pair), a breakdown by edition and entry characteristics where feasible, and a concise error analysis. These additions will clarify the scope and limitations of the reported 93% precision without altering the core results.
  Revision: yes
- Referee: [Wikidata linking evaluation] Wikidata linking reports 85% precision and 16.5% recall on small-scale evaluation without details on test-set size, selection criteria, or failure modes, undermining assessment of whether the linking step scales to the full corpus.
  Authors: We acknowledge the need for greater transparency on the Wikidata linking evaluation. The reported figures were obtained from a manually verified sample of headwords. We will expand the manuscript to specify the test-set size, the selection criteria (random sampling from the extracted headwords), and the main failure modes observed (such as entities absent from Wikidata or ambiguous name matches). We will also note that the modest recall is expected given the historical nature of many entries and does not contradict the utility of the high-precision links that are produced.
  Revision: yes
- Referee: [pipeline description and headword extraction] The pipeline assumes OCR output is clean enough for reliable regex- and embedding-based extraction, yet no quantitative OCR quality audit, error-rate breakdown, or ablation on noisy vs. post-processed text is reported.
  Authors: The manuscript focuses on post-OCR structuring steps rather than a full OCR audit, as the source texts were already digitized. We will add a dedicated paragraph discussing the robustness of the regex and embedding components to typical OCR artifacts (e.g., character substitutions and line-break errors), supported by qualitative examples from manual inspection. A quantitative audit or ablation would require ground-truth clean transcriptions for a representative sample, which are not available; we therefore treat this as a limitation and will state it explicitly rather than perform new experiments.
  Revision: partial
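The sampling procedure the authors describe in their first response, random selection from candidate pairs with stratification by edition pair, could look roughly like the following. This is a minimal sketch, not the authors' code: the `(edition_a, edition_b, pair_id)` tuple layout, the `per_stratum` parameter, and the fixed seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(pairs, per_stratum, seed=0):
    """Draw up to `per_stratum` candidate pairs from each edition pair.

    `pairs` is an iterable of (edition_a, edition_b, pair_id) tuples
    (hypothetical layout). The stratum key is the edition pair.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    strata = defaultdict(list)
    for edition_a, edition_b, pair_id in pairs:
        strata[(edition_a, edition_b)].append(pair_id)

    sample = {}
    for key in sorted(strata):
        items = strata[key]
        k = min(per_stratum, len(items))
        sample[key] = rng.sample(items, k)
    return sample


def precision(judgements):
    """Share of sampled matches that an annotator marked as correct."""
    return sum(judgements) / len(judgements) if judgements else 0.0


# Toy candidate pairs; the identifiers are invented for the example.
candidates = [("E1", "E2", i) for i in range(50)] + [("E3", "E4", i) for i in range(5)]
picked = stratified_sample(candidates, per_stratum=10)
print({key: len(ids) for key, ids in picked.items()})
```

Reporting the per-stratum counts alongside the precision would directly answer the referee's request for sample size and a breakdown by edition.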
Circularity Check
No circularity; empirical results from independent annotations
The paper presents a standard NLP pipeline for headword extraction, classification, cross-edition matching, and Wikidata linking on digitized encyclopedias. All reported metrics (F1 97.8% extraction, F1 93.4% classification, 93% matching precision, 85% linking precision) are obtained by direct comparison against held-out annotated data or small-scale manual evaluations. No equations, fitted parameters, self-definitions, or derivation steps appear that reduce the outputs to the inputs by construction. The work is self-contained as an applied engineering contribution whose validity rests on external ground truth rather than internal renaming or self-citation chains.
Reference graph
Works this paper leans on
-
[1]
Introduction. Old encyclopedias are valuable pieces of historical knowledge, reflecting the life and ideas of their time. However, much of this knowledge remains locked in unstructured text, making it difficult to analyze it systematically and draw usable conclusions. Nordisk familjebok is the most comprehensive Swedish encyclopedia of its time. It holds an im- por...
-
[2]
We scraped the four editions of the encyclopedia. We preprocessed this dataset and cleaned it to get rid of irrelevant content
-
[3]
We created a dataset of segmented entries annotated with their headword. We used it to train models to extract the headwords
-
[4]
We annotated a second dataset of 6000 entries with entity classes and we trained a classifier
-
[5]
We matched entries across editions. We compared each entry of a given edition to all other editions using a sentence embedder
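The matching step above compares entries across editions with a sentence embedder. A minimal sketch of the comparison itself, assuming the embeddings are already computed: each entry is paired with its most similar entry in the other edition by cosine similarity, and low-scoring pairs are discarded. The embedding model, the threshold value of 0.8, the entry identifiers, and the toy three-dimensional vectors are assumptions for illustration and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_matches(source, target, threshold=0.8):
    """Map each source entry ID to its most similar target entry ID.

    `source` and `target` map entry IDs to embedding vectors.
    Entries whose best score is below `threshold` get None.
    """
    matches = {}
    for src_id, src_vec in source.items():
        best_id, best_score = None, threshold
        for tgt_id, tgt_vec in target.items():
            score = cosine(src_vec, tgt_vec)
            if score >= best_score:
                best_id, best_score = tgt_id, score
        matches[src_id] = best_id
    return matches


# Toy vectors standing in for real sentence embeddings.
edition_1 = {"E1_a": [1.0, 0.0, 0.1], "E1_b": [0.0, 1.0, 0.0]}
edition_2 = {"E2_x": [0.9, 0.1, 0.1], "E2_y": [0.1, 0.0, 1.0]}
print(best_matches(edition_1, edition_2))
```

The all-pairs loop is quadratic in the number of entries, so a run over full editions would likely need a vectorized or approximate nearest-neighbour search instead.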
-
[6]
We linked Wikidata items that had a reference to the encyclopedia entries using the same approach as in the previous step. The datasets and code required to reproduce the experiments are publicly available on Hugging Face at https://huggingface.co/albinandersson/datasets and GitHub at https://github.com/SalamSki/EDAN70 (arXiv:2605.02466v1 [cs.CL], 4 May 2026)
-
[7]
Previous Work. Our system builds on work in three main areas: digitization of historical texts, recognition of named entities (NER) in historical documents, and linking of historical encyclopedias to Wikidata. 2.1. Digitization of Historical Texts. There are now scores of book digitization projects. Project Runeberg has digitized numerous Nordic texts with ...
-
[8]
Datasets. Nordisk familjebok is organized as a sequence of entries, where the headwords are ordered alphabetically. To recover this structure, we identified the headwords and we segmented the raw text into entries. We then categorized these entries. To recognize the entries, a possible solution could be to analyze the image layout of the scans as in Wang et al. (20...
-
[9]
If they contain at least one Location but no Person, then the headword is a Location
-
[10]
If they contain at least one Person but no Location, the headword is a Person
-
[11]
If they contain at least one Person and one Location, this indicates an uncertainty. We classify the headword as Other
-
[12]
We default to Other if neither Location nor Person is present. In the second step, following the zero-shot NER predictions, we extracted a balanced subset of 6000 entries that we verified manually. After manually correcting the annotation, the distribution was slightly altered. We used thi...
Footnote 2: https://huggingface.co/datasets/albinandersson/nf-headword-extraction
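The four labeling rules above can be expressed as a short function. This is a sketch under stated assumptions: the input is taken to be the list of entity labels that the zero-shot NER model returned for an entry, and the function name `classify_headword` is hypothetical.

```python
def classify_headword(entity_labels):
    """Apply the four labeling rules to the NER labels found in an entry.

    `entity_labels` is an iterable of strings such as "Location" or
    "Person" (assumed label names, following the text).
    """
    labels = set(entity_labels)
    has_location = "Location" in labels
    has_person = "Person" in labels

    if has_location and not has_person:
        return "Location"  # rule 1
    if has_person and not has_location:
        return "Person"    # rule 2
    # Rule 3: both present signals uncertainty; rule 4: neither present.
    return "Other"


print(classify_headword(["Location", "Location"]))  # Location
print(classify_headword(["Person", "Location"]))    # Other
```

Rules 3 and 4 collapse into the same fallback, which keeps the function to two explicit checks.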
-
[13]
Method. The ATLAS system consists of a pipeline of components. Figure 1 shows its architecture that we describe now. 4.1. Headword Extractor and Entry Segmenter. We modeled the headword extraction task as a sequence annotation task, where each token in an input sentence is classified as either part of the headword (1) or not (0). This enables the model to...
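As a rough illustration of the binary labeling scheme described above, the sketch below recovers a headword from per-token 0/1 predictions. Whitespace tokenization, the function name, and the example entry text are assumptions; the paper's actual tokenizer and model are not shown here.

```python
def decode_headword(tokens, labels):
    """Join the tokens that a sequence labeler marked as headword (label 1)."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    return " ".join(tok for tok, lab in zip(tokens, labels) if lab == 1)


# Invented example entry, tokenized on whitespace.
tokens = "Acheron , flod i underjorden".split()
labels = [1, 0, 0, 0, 0]
print(decode_headword(tokens, labels))
```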
-
[14]
Results. We broke down the results of each step in our pipeline, namely scraping, headword extraction, NER classification, cross-edition matching, and Wikidata linking.
Entry ID  headword    Type  Edition  E1_match  E2_match  E3_match  E4_match  QID
E1_385    Achenwall   2     E1       –         E2_622    E3_416    E4_473    Q215933
E1_386    Acheron     1     E1       –         E2_623    E3_417    E4_476    –
E1_387    Acherontia  0     ...
-
[15]
Discussion. Table 4 shows there is a significant difference in the extraction results between E1-E2 and E3-E4: around 20% roughly. We explain this with the structure of the training set, which mainly contains entries from the first two editions. The high extraction percentage of the first edition (88%) could be due to an overfit. However, the results m...
-
[16]
Limitations and Future Work. We used a semi-automatic labeling to build the training set of headwords and segmented entries. The initial rule posits that headwords are marked with <b> tags in E1 and E2. Unfortunately, it creates a few false negatives. This can be even more confusing when two identical entries are marked differently. Cascading this proble...
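The initial labeling rule, that headwords are marked with <b> tags in E1 and E2, might be implemented along these lines. The regular expression, the function name, and the sample strings are assumptions for illustration; the second call shows the kind of false negative the authors mention when an entry lacks the tag.

```python
import re

# First bold span in the raw entry text, non-greedy, case-insensitive.
BOLD_SPAN = re.compile(r"<b>(.*?)</b>", re.IGNORECASE | re.DOTALL)


def headword_from_bold(raw_entry):
    """Return the text of the first <b>...</b> span, or None if there is none."""
    match = BOLD_SPAN.search(raw_entry)
    if match is None:
        return None
    return match.group(1).strip().rstrip(",.")


print(headword_from_bold("<b>Acheron</b>, flod i underjorden"))  # Acheron
print(headword_from_bold("Acheron, flod i underjorden"))          # None
```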
-
[17]
Conclusion. In this work, we described a comprehensive pipeline for processing historical encyclopedias. It consists of four major steps, notably an automated headword extraction, where we achieved an F1 score of 97.8% and an entity type classification with an F1 score of 93.4%. In a small-scale evaluation of the cross-edition matching, we obtained an accuracy b...
-
[18]
Ethics Statement. The collection of Nordisk familjebok editions is in the public domain. Our work contributes to the development of tools for language resources and their annotation. We hope it can improve the understanding of human knowledge transmission through the extraction of versions of biographies and locations. Nonetheless,
-
[19]
The corpus we used contains dated and possibly false information. This can notably be the case for scientific theories or technological developments
-
[20]
The Swedish historical context and ideas of years 1870-1950 may convey biases and old-fashioned viewpoints, possibly offensive. Users must be informed of this context
-
[21]
Acknowledgements. This work was partially supported by Vetenskapsrådet, the Swedish Research Council, registration number 2021-04533
-
[22]
References. Axel Ahlin, Alfred Myrne Blåder, and Pierre Nugues
-
[23]
Mapping the past: Geographically linking an early 20th century Swedish encyclopedia with Wikidata. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 11040–11048, Torino, Italia. ELRA and ICCL. Tom Ayoola, Joseph Fisher, and Andrea Pierleoni. 2022a. Improving...
-
[24]
Matching and linking entries in historical Swedish encyclopedias. In Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025), pages 1–10, Albuquerque, New Mexico. Association for Computational Linguistics. Jan A. Botha, Zifei Shan, and Daniel Gillick. 2020....
-
[25]
Language Resource References Andersson, Albin and Jonasson, Salam and Wastring, Fredrik and Nugues, Pierre. 2026a. Nordisk Familjebok Category Classification Dataset. Hugging Face. Andersson, Albin and Jonasson, Salam and Wastring, Fredrik and Nugues, Pierre. 2026b. Nordisk Familjebok Headword Classified Matched Linked Dataset. Hugging Face. Andersson, Al...