Named Entity Recognition of Historical Texts via Large Language Model
Pith reviewed 2026-05-18 21:13 UTC · model grok-4.3
The pith
Large language models can recognize named entities in historical texts using zero-shot and few-shot prompting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experiments conducted on the HIPE-2022 dataset show that large language models can achieve reasonably strong performance on named entity recognition tasks in historical documents using zero-shot and few-shot prompting strategies. While their performance falls short of fully supervised models trained on domain-specific annotations, the results are nevertheless promising and suggest that LLMs offer a viable and efficient alternative for information extraction in low-resource or historically significant corpora where traditional supervised methods are infeasible.
What carries the argument
Zero-shot and few-shot prompting strategies applied to large language models, which perform named entity recognition without requiring large amounts of annotated training data.
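As a rough illustration of what zero-shot versus few-shot prompting means here, the sketch below builds both kinds of prompt from the same template. The label set, the wording of the instructions, the output format, and the names `LABELS`, `format_entities`, and `build_prompt` are all assumptions made for illustration; the paper's actual templates are not reproduced on this page.

```python
from typing import List, Sequence, Tuple

# Hypothetical label set; the paper's exact tag inventory is not given here.
LABELS = ["PER", "LOC", "ORG"]

# A demonstration is a sentence plus its gold entities as (text, label) pairs.
Example = Tuple[str, List[Tuple[str, str]]]


def format_entities(entities: Sequence[Tuple[str, str]]) -> str:
    """Render entities one per line as 'text | LABEL', or 'NONE' if empty."""
    if not entities:
        return "NONE"
    return "\n".join(f"{text} | {label}" for text, label in entities)


def build_prompt(sentence: str, examples: Sequence[Example] = ()) -> str:
    """Build a zero-shot prompt (no examples) or a few-shot prompt (with examples)."""
    parts = [
        "Extract the named entities from the historical text below.",
        f"Allowed labels: {', '.join(LABELS)}.",
        "Answer with one 'entity | LABEL' pair per line, or NONE.",
    ]
    # Few-shot: each demonstration shows an input together with its expected answer.
    for ex_sentence, ex_entities in examples:
        parts.append(f"Text: {ex_sentence}\nEntities:\n{format_entities(ex_entities)}")
    # The query always comes last and leaves the answer slot open for the model.
    parts.append(f"Text: {sentence}\nEntities:")
    return "\n\n".join(parts)


query = "Herr Müller reiste nach Bern."
demo = ("M. Dupont arriva à Genève.", [("Dupont", "PER"), ("Genève", "LOC")])

zero_shot_prompt = build_prompt(query)
one_shot_prompt = build_prompt(query, [demo])
```

The only difference between the two settings is the number of demonstrations placed before the query, which is why neither requires task-specific training.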
If this is right
- LLMs enable information extraction from historical texts where annotated data does not exist.
- The prompting method reduces dependence on costly manual labeling of historical sources.
- Reasonable performance supports downstream tasks such as information retrieval from unstructured historical documents.
- The approach remains useful even when historical language shows inconsistent spelling and archaic vocabulary.
Where Pith is reading between the lines
- The same prompting technique could be applied to historical texts in languages or periods not covered by HIPE-2022 to check broader applicability.
- This low-resource method might extend to related tasks like entity linking in historical archives.
- Widespread use could accelerate digitization and analysis of large collections of historical documents that lack prior annotations.
Load-bearing premise
The HIPE-2022 dataset and the chosen zero-shot and few-shot prompts are representative of the variability and noise in broader historical corpora.
What would settle it
Substantially lower performance when the same prompting approach is tested on a different historical NER dataset containing greater spelling inconsistencies or more archaic vocabulary.
Original abstract
Large language models (LLMs) have demonstrated remarkable versatility across a wide range of natural language processing tasks and domains. One such task is Named Entity Recognition (NER), which involves identifying and classifying proper names in text, such as people, organizations, locations, dates, and other specific entities. NER plays a crucial role in extracting information from unstructured textual data, enabling downstream applications such as information retrieval from unstructured text. Traditionally, NER is addressed using supervised machine learning approaches, which require large amounts of annotated training data. However, historical texts present a unique challenge, as the annotated datasets are often scarce or nonexistent, due to the high cost and expertise required for manual labeling. In addition, the variability and noise inherent in historical language, such as inconsistent spelling and archaic vocabulary, further complicate the development of reliable NER systems for these sources. In this study, we explore the feasibility of applying LLMs to NER in historical documents using zero-shot and few-shot prompting strategies, which require little to no task-specific training data. Our experiments, conducted on the HIPE-2022 (Identifying Historical People, Places and other Entities) dataset, show that LLMs can achieve reasonably strong performance on NER tasks in this setting. While their performance falls short of fully supervised models trained on domain-specific annotations, the results are nevertheless promising. These findings suggest that LLMs offer a viable and efficient alternative for information extraction in low-resource or historically significant corpora, where traditional supervised methods are infeasible.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript explores the use of large language models for named entity recognition (NER) on historical texts via zero-shot and few-shot prompting. Experiments are performed on the HIPE-2022 dataset, with the central claim that LLMs achieve reasonably strong performance, though below that of fully supervised models, and thus offer a viable, low-resource alternative for information extraction from historical corpora.
Significance. If the reported results are substantiated with quantitative evidence and shown to generalize, the work could meaningfully advance digital humanities by enabling NER on scarce or noisy historical collections without requiring extensive manual annotation. The approach aligns with growing interest in prompt-based methods for low-resource domains.
major comments (2)
- [Abstract] The claim that LLMs 'achieve reasonably strong performance' on the HIPE-2022 dataset is stated without any quantitative metrics (F1, precision, recall), error analysis, or concrete comparison numbers against supervised baselines. This absence makes the central empirical claim hard to evaluate.
- [Experiments / Results] The evaluation is confined to a single dataset (HIPE-2022). No cross-dataset testing, characterization of HIPE-2022 against other historical collections (e.g., varying OCR noise, spelling variation, or entity distributions), or analysis of performance sensitivity to these factors is described. This is load-bearing for the generalization claim that LLMs provide a 'viable alternative for low-resource or historically significant corpora.'
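For readers unfamiliar with the metrics the first comment asks for, the following minimal sketch computes entity-level precision, recall, and F1 under strict (exact span and label) matching. The span representation and the function name `precision_recall_f1` are assumptions for illustration; this is not the scoring procedure of the paper or of the HIPE-2022 shared task, and the example spans are invented.

```python
from typing import Set, Tuple

# An entity is identified by (start token index, end token index, label).
Span = Tuple[int, int, str]


def precision_recall_f1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Strict matching: a prediction counts only if span and label both match gold."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented toy data: three gold entities, two predictions, one exact match.
gold_spans = {(0, 2, "PER"), (5, 6, "LOC"), (9, 10, "ORG")}
pred_spans = {(0, 2, "PER"), (5, 6, "ORG")}  # second span has the wrong label

precision, recall, f1 = precision_recall_f1(gold_spans, pred_spans)
```

Under strict matching the mislabeled span counts as both a false positive and a false negative, which is one reason a single headline phrase such as 'reasonably strong' hides a lot of detail.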
minor comments (2)
- [Title] The title uses the singular 'Model' while the abstract and text refer to 'models' (plural); harmonize for consistency.
- [Methods] Clarify the exact prompting templates and model versions used (e.g., GPT-4, Llama-3) so that the zero-/few-shot setup is fully reproducible.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which helps clarify the presentation of our empirical results and the scope of our claims. We address each major comment below and have made targeted revisions to the manuscript.
Point-by-point responses
- Referee: [Abstract] The claim that LLMs 'achieve reasonably strong performance' on the HIPE-2022 dataset is stated without any quantitative metrics (F1, precision, recall), error analysis, or concrete comparison numbers against supervised baselines. This absence makes the central empirical claim hard to evaluate.
  Authors: We agree that the abstract should be more quantitative to support the central claim. The full manuscript already contains detailed F1, precision, and recall scores for zero-shot and few-shot prompting, along with error analysis and comparisons to supervised baselines in the Experiments and Results sections. In the revised version, we have updated the abstract to explicitly include the key performance metrics (e.g., macro F1 scores) and a brief statement on how they compare to fully supervised models. Revision: yes
- Referee: [Experiments / Results] The evaluation is confined to a single dataset (HIPE-2022). No cross-dataset testing, characterization of HIPE-2022 against other historical collections (e.g., varying OCR noise, spelling variation, or entity distributions), or analysis of performance sensitivity to these factors is described. This is load-bearing for the generalization claim that LLMs provide a 'viable alternative for low-resource or historically significant corpora.'
  Authors: We acknowledge the limitation of using a single dataset. In the revision, we have added a dedicated paragraph in the Experiments section that characterizes HIPE-2022 with respect to OCR noise levels, spelling variations, entity distributions, and how these compare to other historical corpora. We have also revised the abstract and conclusion to moderate the generalization language, framing the results as promising for similar low-resource historical settings rather than claiming broad viability across all such corpora. A full cross-dataset evaluation and sensitivity analysis are noted as directions for future work, as they fall outside the current paper's scope. Revision: partial
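One of the characterizations mentioned in the second response, the entity distribution of a corpus, can be sketched as follows. The code assumes the annotations are available as IOB2 tag sequences in which every entity begins with a `B-` tag; the function name `entity_type_distribution` and the toy tags are hypothetical, and nothing here reflects the actual statistics of HIPE-2022.

```python
from collections import Counter
from typing import Dict, Iterable, List


def entity_type_distribution(tag_sequences: Iterable[List[str]]) -> Dict[str, float]:
    """Return the share of entities per type, counting one entity per 'B-' tag."""
    counts: Counter = Counter()
    for tags in tag_sequences:
        for tag in tags:
            # In IOB2, 'B-X' opens an entity of type X; 'I-X' continues it; 'O' is outside.
            if tag.startswith("B-"):
                counts[tag[2:]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in counts.items()}


# Invented toy corpus of two tagged sentences.
toy_corpus = [
    ["B-PER", "I-PER", "O", "B-LOC"],
    ["O", "B-PER", "O"],
]
distribution = entity_type_distribution(toy_corpus)
```

Comparing such distributions across corpora is one simple way to judge whether results on one dataset are likely to carry over to another with a different mix of entity types.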
Circularity Check
No circularity: empirical evaluation on public dataset
Full rationale
The paper reports experimental results from zero-shot and few-shot LLM prompting for NER on the publicly named HIPE-2022 dataset. No mathematical derivations, equations, fitted parameters, or self-referential definitions appear in the provided text. Claims rest on direct performance measurements against an external benchmark rather than any reduction of outputs to inputs by construction. The work is therefore self-contained with no load-bearing steps that collapse into prior results or self-citations.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Zero-shot and few-shot prompting strategies transfer effectively to noisy historical language without domain-specific fine-tuning.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "Our experiments, conducted on the HIPE-2022 dataset, show that LLMs can achieve reasonably strong performance on NER tasks in this setting."
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "few-shot prompting with a single example provides the most consistent improvements over the zero-shot baseline"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.