
arxiv: 2605.02466 · v1 · submitted 2026-05-04 · 💻 cs.CL

ATLAS: Article Tracking, Linking, and Analysis of Swedish Encyclopedias

Pith reviewed 2026-05-09 16:15 UTC · model grok-4.3

classification 💻 cs.CL
keywords: headword extraction, cross-edition matching, Wikidata linking, historical encyclopedias, Nordisk familjebok, digitized text, entity classification, knowledge evolution

The pith

A pipeline extracts headwords from digitized historical encyclopedias, matches entries across editions, and links them to Wikidata.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a pipeline that processes OCR-digitized text from old encyclopedias to recover their internal structure. It extracts headwords and identifies entries, classifies the entities they represent, matches the same entries across different editions of the encyclopedia, and links those entries to items in Wikidata. The authors applied the pipeline to the four main editions of the Swedish Nordisk familjebok published between 1876 and 1951. Performance reached 97.8 percent F1 on headword extraction and 93.4 percent F1 on classification, with 93 percent precision on cross-edition matching in a small-scale test. A sympathetic reader would care because the work shows how automated methods can turn unstructured historical texts into trackable, linkable knowledge resources without exhaustive manual annotation.

Core claim

The authors constructed an automated pipeline that extracts headwords from the raw OCR text of the four major editions of Nordisk familjebok, categorizes the corresponding entities, matches entries across editions, and links them to Wikidata items. Evaluated against annotated data, the pipeline attained an F1 score of 97.8 percent for headword extraction and 93.4 percent for headword classification. A small-scale evaluation showed 93 percent precision for cross-edition matching, and 85 percent precision with 16.5 percent recall for Wikidata linking. The results indicate that such a pipeline can restore usable structure from digitized historical encyclopedias and thereby support analysis of how knowledge evolved.
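To put the Wikidata linking figures on the same scale as the F1 scores quoted for the other steps, the harmonic mean of the reported precision and recall can be computed directly. The paper does not report an F1 for linking; the value below is derived here from the two published numbers, and the function name is only illustrative.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for Wikidata linking (small-scale evaluation).
linking_f1 = f1_score(0.85, 0.165)
print(f"Derived Wikidata linking F1: {linking_f1:.3f}")
```

The derived value is about 27.6 percent, which shows how strongly the low recall dominates the linking step even though the links that are produced are mostly correct.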

What carries the argument

The end-to-end pipeline that performs headword extraction, entity categorization, cross-edition matching, and Wikidata linking on multi-edition historical encyclopedias.
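As a rough illustration of the cross-edition matching step, the sketch below scores one entry against candidate entries from another edition and keeps the best candidate above a threshold. The paper's excerpts say a sentence embedder is used; here a character-trigram count vector stands in for it so the example stays self-contained. The `embed` stand-in, the threshold of 0.5, and the Swedish sample strings are all assumptions, not the authors' implementation; only the `E2_…` identifier format follows the paper's results table.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Stand-in embedder: character trigram counts (NOT a neural sentence embedder)."""
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def best_match(entry: str, candidates: dict[str, str], threshold: float = 0.5):
    """Return (candidate_id, score) for the most similar candidate, or None below threshold."""
    query = embed(entry)
    scored = [(cosine(query, embed(text)), cid) for cid, text in candidates.items()]
    if not scored:
        return None
    score, cid = max(scored)
    return (cid, score) if score >= threshold else None


# Hypothetical entry texts, for illustration only.
edition_2 = {
    "E2_622": "Achenwall, Gottfried, tysk statistiker",
    "E2_623": "Acheron, en flod i underjorden i den grekiska mytologien",
}
result = best_match("Acheron, flod i underjorden enligt grekisk mytologi", edition_2)
print(result)
```

Returning `None` below the threshold favors precision over recall, which is consistent with the high-precision figures the paper reports, though the actual decision rule used by the authors is not given in this summary.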

If this is right

  • Entries can be tracked systematically across successive editions to reveal how descriptions and facts changed over decades.
  • Historical encyclopedia content becomes linkable to contemporary structured knowledge bases such as Wikidata.
  • Preservation and computational study of general knowledge become feasible at the scale of entire multi-edition works.
  • Knowledge transmission patterns across time can be quantified without exhaustive manual reading of every volume.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pipeline structure could be adapted to encyclopedias in other languages once equivalent training data for headword detection is collected.
  • Released datasets of matched and linked entries could serve as training material for broader historical text-alignment tasks.
  • Repeated application over many encyclopedias might surface large-scale regularities in how scientific and cultural facts were revised.

Load-bearing premise

The pipeline assumes the OCR output is clean enough for reliable headword detection and that the small-scale evaluations of matching and linking generalize to the full corpus.

What would settle it

A larger-scale manual evaluation on thousands of entries from additional editions or from a different historical encyclopedia that shows precision or recall falling substantially below the reported figures.

Figures

Figures reproduced from arXiv: 2605.02466 by Albin Andersson, Fredrik Wastring, Pierre Nugues, Salam Jonasson.

Figure 1: An overview of the ATLAS pipeline. Using the links, we downloaded the HTML content of each page and extracted the OCRed text. The HTML markup structure is regular and makes it trivial to find the beginning and end of the text. The OCRed content contains additional HTML tags. We observed that bold tags, <b>, were used to encapsulate headwords at the beginning of an entry. We kept them and removed all the …
Figure 2: The input tokenization and output mask. … head. We employed an unfreezing strategy, where we experimented with different configurations to determine the best number of trainable layers. We applied these models to the raw text of the four editions. Both methods enabled us to determine the headwords and segment the text into entries. 4.2. Headword and Entry Classification. We then classified the resulting entri…
Figure 3: Performance metrics across different model configurations for headword extraction. On the x-axis, “S” represents the LSTM model while the figures indicate the number of unfrozen layers in the fine-tuned KB-BERT model. 5.1. Headword Extraction. We trained and evaluated different model architectures on the headword dataset of Sect. 3.3. We compared the LSTM model with various configurations of the fine-tune…
Figure 4: The resulting entity recognition on the extracted entries for each edition. … where for each edition, around 50-60% of the articles are classified as Other, 18-23% as Locations, and 20-30% as Persons
Figure 5: Additions and removals for person (left) and location (right) entries across editions.
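The captions for Figures 1 and 2 mention two ideas that combine naturally: headwords are wrapped in `<b>` tags at the start of an entry, and the extraction models predict a binary mask over tokens. A minimal sketch of how a bold-tag rule could yield such token/mask pairs is given below. The whitespace tokenization, the function name `silver_label`, and the sample line are assumptions; the paper's own tokenization and labeling code are not shown in this summary.

```python
import re

BOLD = re.compile(r"<b>(.*?)</b>", re.DOTALL)  # first bold span, non-greedy
TAG = re.compile(r"<[^>]+>")                   # any HTML tag


def silver_label(line: str) -> tuple[list[str], list[int]]:
    """Return whitespace tokens and a 0/1 mask marking a leading bold headword."""
    match = BOLD.match(line.lstrip())
    # Tokens inside the leading <b>...</b> span; empty if the line has no leading bold.
    headword_tokens = TAG.sub(" ", match.group(1)).split() if match else []
    # Replace tags with spaces so punctuation after </b> becomes its own token.
    tokens = TAG.sub(" ", line).split()
    mask = [1 if i < len(headword_tokens) else 0 for i in range(len(tokens))]
    return tokens, mask


# Hypothetical entry line, for illustration only.
tokens, mask = silver_label("<b>Acheron</b>, flod i underjorden.")
print(list(zip(tokens, mask)))
```

A rule of this kind only labels entries whose headword is actually marked in bold, so unmarked headwords end up as false negatives in the resulting training data.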
Original abstract

The digitization of old encyclopedias represents an important step to improve access to historically structured knowledge. Often, however, this process does not go beyond an optical character recognition, leaving all the underlying structure unexploited. In addition, many encyclopedias had multiple editions reflecting the evolution of knowledge. The lack of structure in the raw text makes it difficult to track changes across these editions. In this work, we built a pipeline to restore the text structure, where we extract the headwords and identify entries; categorize the entities; match entries across editions; and link entries to a Wikidata item. We applied this pipeline to the four major editions of Nordisk familjebok, an authoritative Swedish encyclopedia published between 1876 and 1951. We could extract the headwords with an F1 score of 97.8% and we obtained an F1 score of 93.4% on the headword classification. On a small-scale evaluation, we reached a 93% precision on the cross-edition matching, 85% precision and 16.5% recall on the Wikidata linking. This shows that an automated approach to digitized historical knowledge is possible. This should facilitate the preservation of general knowledge and the understanding of knowledge transmission. The datasets and programs are available online.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper presents ATLAS, a pipeline for restoring structure to OCR-digitized historical encyclopedias by extracting headwords (F1 97.8%), classifying entries (F1 93.4%), matching entries across the four editions of Nordisk familjebok (93% precision on small-scale eval), and linking to Wikidata (85% precision, 16.5% recall on small-scale eval). The work aims to enable tracking of knowledge evolution and releases the datasets and code.

Significance. If the reported performance generalizes, the work would be a useful contribution to digital humanities and computational linguistics by demonstrating an automated pipeline for structuring multi-edition encyclopedic texts. The open release of datasets and programs is a clear strength that supports reproducibility. High F1 scores on headword extraction and classification provide solid evidence for the core steps, though the small-scale nature of the matching and linking results limits the strength of claims about full-corpus applicability.

major comments (3)
  1. [cross-edition matching evaluation] The cross-edition matching reports 93% precision on a small-scale evaluation, but no sample size, sampling procedure, breakdown by edition or entry length, or error analysis is provided, which is load-bearing for the claim that the pipeline enables reliable tracking across editions.
  2. [Wikidata linking evaluation] Wikidata linking reports 85% precision and 16.5% recall on small-scale evaluation without details on test-set size, selection criteria, or failure modes, undermining assessment of whether the linking step scales to the full corpus.
  3. [pipeline description and headword extraction] The pipeline assumes OCR output is clean enough for reliable regex- and embedding-based extraction, yet no quantitative OCR quality audit, error-rate breakdown, or ablation on noisy vs. post-processed text is reported.
minor comments (1)
  1. [abstract and evaluation sections] The abstract and results sections refer to 'small-scale evaluation' without quantifying the scale or providing a table of evaluation sizes; adding this would improve clarity.
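The referee's concern about unreported sample sizes can be made concrete with a confidence interval: the same 93% precision is much less certain on 100 pairs than on 1,000. The sketch below uses the Wilson score interval for a binomial proportion. The sample sizes are hypothetical, since the summary does not state how many pairs were evaluated.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# Hypothetical sample sizes; the evaluated sample size is not given in this summary.
for n in (100, 1000):
    low, high = wilson_interval(round(0.93 * n), n)
    print(f"n={n}: 93% precision, 95% interval [{low:.3f}, {high:.3f}]")
```

Reporting the interval alongside the point estimate would let readers judge how far the small-scale precision can be trusted for the full corpus.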

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating revisions where appropriate to strengthen the reporting of our evaluations and methods.

Point-by-point responses
  1. Referee: [cross-edition matching evaluation] The cross-edition matching reports 93% precision on a small-scale evaluation, but no sample size, sampling procedure, breakdown by edition or entry length, or error analysis is provided, which is load-bearing for the claim that the pipeline enables reliable tracking across editions.

    Authors: We agree that additional details on the small-scale evaluation are needed to support claims about cross-edition tracking. The evaluation was performed on a manually annotated subset of entry pairs drawn from the four editions. In the revised manuscript we will report the exact sample size, the sampling procedure (random selection from candidate pairs with stratification by edition pair), a breakdown by edition and entry characteristics where feasible, and a concise error analysis. These additions will clarify the scope and limitations of the reported 93% precision without altering the core results. revision: yes

  2. Referee: [Wikidata linking evaluation] Wikidata linking reports 85% precision and 16.5% recall on small-scale evaluation without details on test-set size, selection criteria, or failure modes, undermining assessment of whether the linking step scales to the full corpus.

    Authors: We acknowledge the need for greater transparency on the Wikidata linking evaluation. The reported figures were obtained from a manually verified sample of headwords. We will expand the manuscript to specify the test-set size, the selection criteria (random sampling from the extracted headwords), and the main failure modes observed (such as entities absent from Wikidata or ambiguous name matches). We will also note that the modest recall is expected given the historical nature of many entries and does not contradict the utility of the high-precision links that are produced. revision: yes

  3. Referee: [pipeline description and headword extraction] The pipeline assumes OCR output is clean enough for reliable regex- and embedding-based extraction, yet no quantitative OCR quality audit, error-rate breakdown, or ablation on noisy vs. post-processed text is reported.

    Authors: The manuscript focuses on post-OCR structuring steps rather than a full OCR audit, as the source texts were already digitized. We will add a dedicated paragraph discussing the robustness of the regex and embedding components to typical OCR artifacts (e.g., character substitutions and line-break errors), supported by qualitative examples from manual inspection. A quantitative audit or ablation would require ground-truth clean transcriptions for a representative sample, which are not available; we therefore treat this as a limitation and will state it explicitly rather than perform new experiments. revision: partial
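The first response describes random selection from candidate pairs with stratification by edition pair. One way such a sample could be drawn is sketched below. The function name, the per-stratum quota, and the toy pairs are assumptions; the only detail taken from the paper is the entry ID format (for example `E1_386`), whose prefix identifies the edition.

```python
import random
from collections import defaultdict


def stratified_sample(pairs, per_stratum: int, seed: int = 0):
    """Draw up to `per_stratum` matched pairs from each (source edition, target edition)."""
    rng = random.Random(seed)  # fixed seed so the evaluation sample is reproducible
    strata = defaultdict(list)
    for source_id, target_id in pairs:
        # The edition is the ID prefix, e.g. "E1" in "E1_386".
        key = (source_id.split("_")[0], target_id.split("_")[0])
        strata[key].append((source_id, target_id))
    sample = []
    for key in sorted(strata):
        items = strata[key]
        sample.extend(rng.sample(items, min(per_stratum, len(items))))
    return sample


# Toy candidate pairs, for illustration only.
candidate_pairs = [(f"E1_{i}", f"E2_{i + 200}") for i in range(50)]
candidate_pairs += [(f"E1_{i}", f"E3_{i + 30}") for i in range(5)]
sample = stratified_sample(candidate_pairs, per_stratum=10)
print(len(sample))
```

Capping each stratum at the same quota keeps rarer edition pairs represented, at the cost of needing reweighting if a corpus-level precision estimate is wanted.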

Circularity Check

0 steps flagged

No circularity; empirical results from independent annotations

Full rationale

The paper presents a standard NLP pipeline for headword extraction, classification, cross-edition matching, and Wikidata linking on digitized encyclopedias. All reported metrics (F1 97.8% extraction, F1 93.4% classification, 93% matching precision, 85% linking precision) are obtained by direct comparison against held-out annotated data or small-scale manual evaluations. No equations, fitted parameters, self-definitions, or derivation steps appear that reduce the outputs to the inputs by construction. The work is self-contained as an applied engineering contribution whose validity rests on external ground truth rather than internal renaming or self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on standard assumptions in NLP pipelines such as the quality of input OCR text and the representativeness of evaluation sets, without introducing new free parameters, axioms, or invented entities.

pith-pipeline@v0.9.0 · 5539 in / 1201 out tokens · 64648 ms · 2026-05-09T16:15:40.466869+00:00 · methodology


Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 1 internal anchor

  1. [1]

    However, much of this knowledge remains locked in unstructured text, making it difficult to analyze it systematically and draw usable conclusions

    Introduction. Old encyclopedias are valuable pieces of historical knowledge, reflecting the life and ideas of their time. However, much of this knowledge remains locked in unstructured text, making it difficult to analyze it systematically and draw usable conclusions. Nordisk familjebok is the most comprehensive Swedish encyclopedia of its time. It holds an impor...

  2. [2]

    We preprocessed this dataset and cleaned it to get rid of irrelevant content

    We scraped the four editions of the encyclopedia. We preprocessed this dataset and cleaned it to get rid of irrelevant content

  3. [3]

    We used it to train models to extract the headwords

    We created a dataset of segmented entries annotated with their headword. We used it to train models to extract the headwords

  4. [4]

    We annotated a second dataset of 6000 entries with entity classes and we trained a classifier

  5. [5]

    We compared each entry of a given edition to all other editions using a sentence embedder

    We matched entries across editions. We compared each entry of a given edition to all other editions using a sentence embedder

  6. [6]

    ATLAS: Article Tracking, Linking, and Analysis of Swedish Encyclopedias

    We linked Wikidata items that had a reference to the encyclopedia entries using the same approach as in the previous step. The datasets and code required to reproduce the experiments are publicly available on Hugging Face at https://huggingface.co/albinandersson/datasets and GitHub at https://github.com/SalamSki/EDAN70

  7. [7]

    Previous Work. Our system builds on work in three main areas: digitization of historical texts, recognition of named entities (NER) in historical documents, and linking of historical encyclopedias to Wikidata. 2.1. Digitization of Historical Texts. There are now scores of book digitization projects. Project Runeberg has digitized numerous Nordic texts with ...

  8. [8]

    silver standard

    Datasets. Nordisk familjebok is organized as a sequence of entries, where the headwords are ordered alphabetically. To recover this structure, we identified the headwords and we segmented the raw text into entries. We then categorized these entries. To recognize the entries, a possible solution could be to analyze the image layout of the scans as in Wang et al. (20...

  9. [9]

    If they contain at least one Location but no Person, then the headword is a Location

  10. [10]

    If they contain at least one Person but no Location, the headword is a Person

  11. [11]

    We classify the headword as Other

    If they contain at least one Person and one Location, this indicates an uncertainty. We classify the headword as Other

  12. [12]

    In the second step, following the zero-shot NER predictions, we extracted a balanced subset of 6000 entries that we verified manually

    We default to Other if neither Location nor Person is present. In the second step, following the zero-shot NER predictions, we extracted a balanced subset of 6000 entries that we verified manually. After manually correcting the annotation, the distribution was slightly altered. [Footnote 2: https://huggingface.co/datasets/albinandersson/nf-headword-extraction] We used thi...

  13. [13]

    described in

    Method. The ATLAS system consists of a pipeline of components. Figure 1 shows its architecture that we describe now. 4.1. Headword Extractor and Entry Segmenter. We modeled the headword extraction task as a sequence annotation task, where each token in an input sentence is classified as either part of the headword (1) or not (0). This enables the model to...

  14. [14]

    Results. We broke down the results of each step in our pipeline, namely scraping, headword extraction, NER classification, cross-edition matching, and Wikidata linking.

    Entry ID  headword    Type  Edition  E1_match  E2_match  E3_match  E4_match  QID
    E1_385    Achenwall   2     E1       –         E2_622    E3_416    E4_473    Q215933
    E1_386    Acheron     1     E1       –         E2_623    E3_417    E4_476    –
    E1_387    Acherontia  0     ...

  15. [15]

    We explain this with the structure of the training set, which mainly contains entries from the first two editions

    Discussion. Table 4 shows there is a significant difference in the extraction results between E1-E2 and E3-E4: around 20% roughly. We explain this with the structure of the training set, which mainly contains entries from the first two editions. The high extraction percentage of the first edition (88%) could be due to an overfit. However, the results m...

  16. [16]

    The initial rule posits that headwords are marked with <b> tags in E1 and E2

    Limitations and Future Work. We used a semi-automatic labeling to build the training set of headwords and segmented entries. The initial rule posits that headwords are marked with <b> tags in E1 and E2. Unfortunately, it creates a few false negatives. This can be even more confusing when two identical entries are marked differently. Cascading this proble...

  17. [17]

    It consists of four major steps, notably an automated headword extraction, where we achieved an F1 score of 97.8% and an entity type classification with an F1 score of 93.4%

    Conclusion. In this work, we described a comprehensive pipeline for processing historical encyclopedias. It consists of four major steps, notably an automated headword extraction, where we achieved an F1 score of 97.8% and an entity type classification with an F1 score of 93.4%. In a small-scale evaluation of the cross-edition matching, we obtained an accuracy b...

  18. [18]

    Our work contributes to the development of tools for language resources and their annotation

    Ethics Statement. The collection of Nordisk familjebok editions is in the public domain. Our work contributes to the development of tools for language resources and their annotation. We hope it can improve the understanding of human knowledge transmission through the extraction of versions of biographies and locations. Nonetheless,

  19. [19]

    This can notably be the case for scientific theories or technological developments

    The corpus we used contains dated and possibly false information. This can notably be the case for scientific theories or technological developments

  20. [20]

    Users must be informed of this context

    The Swedish historical context and ideas of years 1870-1950 may convey biases and old-fashioned viewpoints, possibly offensive. Users must be informed of this context

  21. [21]

    Acknowledgements. This work was partially supported by Vetenskapsrådet, the Swedish Research Council, registration number 2021-04533

  22. [22]

    References. Axel Ahlin, Alfred Myrne Blåder, and Pierre Nugues

  23. [23]

    In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 11040–11048, Torino, Italia

    Mapping the past: Geographically linking an early 20th century Swedish encyclopedia with Wikidata. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 11040–11048, Torino, Italia. ELRA and ICCL. Tom Ayoola, Joseph Fisher, and Andrea Pierleoni. 2022a. Improving...

  24. [24]

    Matching and linking entries in historical Swedish encyclopedias. In Proceedings of the 9th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature (LaTeCH-CLfL 2025), pages 1–10, Albuquerque, New Mexico. Association for Computational Linguistics. Jan A. Botha, Zifei Shan, and Daniel Gillick. 2020....

  25. [25]

    Language Resource References Andersson, Albin and Jonasson, Salam and Wastring, Fredrik and Nugues, Pierre. 2026a. Nordisk Familjebok Category Classification Dataset. Hugging Face. Andersson, Albin and Jonasson, Salam and Wastring, Fredrik and Nugues, Pierre. 2026b. Nordisk Familjebok Headword Classified Matched Linked Dataset. Hugging Face. Andersson, Al...
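The labeling rules quoted in items 9 to 12 of the list above reduce to a small decision function over the set of entity labels predicted for an entry. A minimal sketch follows. The function name and the representation of predictions as a set of label strings are assumptions, and the excerpts do not make clear exactly which text span the NER predictions are taken from.

```python
def classify_headword(ner_labels: set[str]) -> str:
    """Map the set of NER labels found for an entry to a headword class."""
    has_location = "Location" in ner_labels
    has_person = "Person" in ner_labels
    if has_location and not has_person:
        return "Location"
    if has_person and not has_location:
        return "Person"
    # Both present (uncertain) or neither present: fall back to Other.
    return "Other"


for labels in ({"Location"}, {"Person"}, {"Person", "Location"}, set()):
    print(sorted(labels), "->", classify_headword(labels))
```

According to the excerpts, labels produced this way served only as a first pass: a balanced subset of 6000 entries was then verified manually before training the classifier.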