arxiv: 2508.18090 · v2 · submitted 2025-08-25 · 💻 cs.DL · cs.AI · cs.CL

Named Entity Recognition of Historical Texts via Large Language Model

Pith reviewed 2026-05-18 21:13 UTC · model grok-4.3

classification 💻 cs.DL · cs.AI · cs.CL
keywords named entity recognition · large language models · historical texts · zero-shot prompting · few-shot prompting · information extraction · low-resource settings

The pith

Large language models can recognize named entities in historical texts using zero-shot and few-shot prompting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates the use of large language models for named entity recognition on historical documents, where annotated training data is scarce because manual labeling is costly and requires expertise, and where language variability such as inconsistent spelling and archaic vocabulary makes reliable systems harder to build. It applies zero-shot and few-shot prompting strategies to the HIPE-2022 dataset without task-specific training. Experiments show that LLMs reach reasonably strong performance in this setting, although they fall short of fully supervised models. A sympathetic reader would care because the approach offers an efficient route to information extraction from large historical corpora where supervised methods cannot be applied. The work highlights that LLMs sidestep the need for domain-specific annotations while still supporting downstream tasks like retrieval from unstructured text.

Core claim

Experiments conducted on the HIPE-2022 dataset show that large language models can achieve reasonably strong performance on named entity recognition tasks in historical documents using zero-shot and few-shot prompting strategies. While their performance falls short of fully supervised models trained on domain-specific annotations, the results are nevertheless promising and suggest that LLMs offer a viable and efficient alternative for information extraction in low-resource or historically significant corpora where traditional supervised methods are infeasible.

What carries the argument

Zero-shot and few-shot prompting strategies applied to large language models, which perform named entity recognition without requiring large amounts of annotated training data.
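
To make the distinction concrete, the following is a minimal sketch of how a zero-shot prompt and a few-shot prompt for NER might be assembled and how a model reply might be parsed. It is not the paper's template: the label set `LABELS`, the JSON answer format, and the helper names `build_prompt` and `parse_entities` are assumptions introduced here for illustration, and the call to an actual LLM is deliberately left out.

```python
import json
from typing import Dict, List, Sequence, Tuple

# A demonstration is a text plus its gold entities (hypothetical structure).
Example = Tuple[str, List[Dict[str, str]]]

# Assumed label set for illustration only; not taken from the paper.
LABELS = ("PER", "LOC", "ORG")


def build_prompt(text: str, examples: Sequence[Example] = ()) -> str:
    """Return a zero-shot prompt (no examples) or a few-shot prompt."""
    lines = [
        "Extract the named entities from the historical text below.",
        f"Allowed types: {', '.join(LABELS)}.",
        'Answer with a JSON list of objects with keys "text" and "type".',
        "",
    ]
    # Few-shot: prepend solved demonstrations in the same layout as the query.
    for demo_text, demo_entities in examples:
        lines.append(f"Text: {demo_text}")
        lines.append(f"Entities: {json.dumps(demo_entities, ensure_ascii=False)}")
        lines.append("")
    lines.append(f"Text: {text}")
    lines.append("Entities:")
    return "\n".join(lines)


def parse_entities(response: str) -> List[Tuple[str, str]]:
    """Parse a model reply into (text, type) pairs; return [] if malformed."""
    try:
        items = json.loads(response)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []
    entities = []
    for item in items:
        # Keep only well-formed items whose type is in the allowed label set.
        if (
            isinstance(item, dict)
            and isinstance(item.get("text"), str)
            and item.get("type") in LABELS
        ):
            entities.append((item["text"], item["type"]))
    return entities
```

Calling `build_prompt(text)` with no demonstrations gives the zero-shot variant; passing a handful of annotated sentences gives the few-shot variant. The only difference between the two settings in this sketch is the presence of solved examples ahead of the query.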

If this is right

  • LLMs enable information extraction from historical texts where annotated data does not exist.
  • The prompting method reduces dependence on costly manual labeling of historical sources.
  • Reasonable performance supports downstream tasks such as information retrieval from unstructured historical documents.
  • The approach remains useful even when historical language shows inconsistent spelling and archaic vocabulary.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompting technique could be applied to historical texts in languages or periods not covered by HIPE-2022 to check broader applicability.
  • This low-resource method might extend to related tasks like entity linking in historical archives.
  • Widespread use could accelerate digitization and analysis of large collections of historical documents that lack prior annotations.

Load-bearing premise

The HIPE-2022 dataset and the chosen zero-shot and few-shot prompts are representative of the variability and noise in broader historical corpora.

What would settle it

Substantially lower performance when the same prompting approach is tested on a different historical NER dataset containing greater spelling inconsistencies or more archaic vocabulary.

Original abstract

Large language models (LLMs) have demonstrated remarkable versatility across a wide range of natural language processing tasks and domains. One such task is Named Entity Recognition (NER), which involves identifying and classifying proper names in text, such as people, organizations, locations, dates, and other specific entities. NER plays a crucial role in extracting information from unstructured textual data, enabling downstream applications such as information retrieval from unstructured text. Traditionally, NER is addressed using supervised machine learning approaches, which require large amounts of annotated training data. However, historical texts present a unique challenge, as the annotated datasets are often scarce or nonexistent, due to the high cost and expertise required for manual labeling. In addition, the variability and noise inherent in historical language, such as inconsistent spelling and archaic vocabulary, further complicate the development of reliable NER systems for these sources. In this study, we explore the feasibility of applying LLMs to NER in historical documents using zero-shot and few-shot prompting strategies, which require little to no task-specific training data. Our experiments, conducted on the HIPE-2022 (Identifying Historical People, Places and other Entities) dataset, show that LLMs can achieve reasonably strong performance on NER tasks in this setting. While their performance falls short of fully supervised models trained on domain-specific annotations, the results are nevertheless promising. These findings suggest that LLMs offer a viable and efficient alternative for information extraction in low-resource or historically significant corpora, where traditional supervised methods are infeasible.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript explores the use of large language models for named entity recognition (NER) on historical texts via zero-shot and few-shot prompting. Experiments are performed on the HIPE-2022 dataset, with the central claim that LLMs achieve reasonably strong performance—though below that of fully supervised models—and thus offer a viable, low-resource alternative for information extraction from historical corpora.

Significance. If the reported results are substantiated with quantitative evidence and shown to generalize, the work could meaningfully advance digital humanities by enabling NER on scarce or noisy historical collections without requiring extensive manual annotation. The approach aligns with growing interest in prompt-based methods for low-resource domains.

major comments (2)
  1. [Abstract] Abstract: the claim that LLMs 'achieve reasonably strong performance' on the HIPE-2022 dataset is stated without any quantitative metrics (F1, precision, recall), error analysis, or concrete comparison numbers against supervised baselines. This absence directly weakens evaluation of the central empirical claim.
  2. [Experiments / Results] The evaluation is confined to a single dataset (HIPE-2022). No cross-dataset testing, characterization of HIPE-2022 against other historical collections (e.g., varying OCR noise, spelling variation, or entity distributions), or analysis of performance sensitivity to these factors is described. This is load-bearing for the generalization claim that LLMs provide a 'viable alternative for low-resource or historically significant corpora.'
minor comments (2)
  1. [Title] The title uses the singular 'Model' while the abstract and text refer to 'models' (plural); harmonize for consistency.
  2. [Methods] Clarify the exact prompting templates and model versions used (e.g., GPT-4, Llama-3) so that the zero-/few-shot setup is fully reproducible.
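
The metrics requested in major comment 1 can be sketched as follows. This is an illustrative entity-level precision, recall, and F1 computation under a strict exact-match assumption (same offsets and same type); the `Span` representation and the function name `prf1` are hypothetical, and the official HIPE-2022 scoring procedure may differ from this simplification.

```python
from typing import Iterable, Tuple

# (start offset, end offset, entity type); a hypothetical span representation.
Span = Tuple[int, int, str]


def prf1(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float, float]:
    """Strict entity-level precision, recall, and F1 over exact span matches."""
    gold_set, pred_set = set(gold), set(predicted)
    # A prediction counts as correct only if offsets and type all match.
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [(0, 5, "PER"), (10, 18, "LOC"), (20, 25, "ORG")]
pred = [(0, 5, "PER"), (10, 18, "ORG")]  # second span has the wrong type
p, r, f = prf1(gold, pred)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")
# prints "precision=0.50 recall=0.33 f1=0.40"
```

Reporting these three numbers for the zero-shot, few-shot, and supervised settings side by side would give readers the concrete comparison the comment asks for.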

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which helps clarify the presentation of our empirical results and the scope of our claims. We address each major comment below and have made targeted revisions to the manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that LLMs 'achieve reasonably strong performance' on the HIPE-2022 dataset is stated without any quantitative metrics (F1, precision, recall), error analysis, or concrete comparison numbers against supervised baselines. This absence directly weakens evaluation of the central empirical claim.

    Authors: We agree that the abstract should be more quantitative to support the central claim. The full manuscript already contains detailed F1, precision, and recall scores for zero-shot and few-shot prompting, along with error analysis and comparisons to supervised baselines in the Experiments and Results sections. In the revised version, we have updated the abstract to explicitly include the key performance metrics (e.g., macro F1 scores) and a brief statement on how they compare to fully supervised models. revision: yes

  2. Referee: [Experiments / Results] The evaluation is confined to a single dataset (HIPE-2022). No cross-dataset testing, characterization of HIPE-2022 against other historical collections (e.g., varying OCR noise, spelling variation, or entity distributions), or analysis of performance sensitivity to these factors is described. This is load-bearing for the generalization claim that LLMs provide a 'viable alternative for low-resource or historically significant corpora.'

    Authors: We acknowledge the limitation of using a single dataset. In the revision, we have added a dedicated paragraph in the Experiments section that characterizes HIPE-2022 with respect to OCR noise levels, spelling variations, entity distributions, and how these compare to other historical corpora. We have also revised the abstract and conclusion to moderate the generalization language, framing the results as promising for similar low-resource historical settings rather than claiming broad viability across all such corpora. A full cross-dataset evaluation and sensitivity analysis are noted as directions for future work, as they fall outside the current paper's scope. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical evaluation on public dataset

full rationale

The paper reports experimental results from zero-shot and few-shot LLM prompting for NER on the publicly named HIPE-2022 dataset. No mathematical derivations, equations, fitted parameters, or self-referential definitions appear in the provided text. Claims rest on direct performance measurements against an external benchmark rather than any reduction of outputs to inputs by construction. The work is therefore self-contained with no load-bearing steps that collapse into prior results or self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on standard assumptions about LLM prompting behavior and dataset representativeness rather than introducing new free parameters or invented entities. No fitted constants or new postulated objects are described.

axioms (1)
  • domain assumption: Zero-shot and few-shot prompting strategies transfer effectively to noisy historical language without domain-specific fine-tuning.
    Invoked in the abstract when claiming viability for historical documents.
