Pretraining Language Models on Historical Text
Pith reviewed 2026-06-28 10:56 UTC · model grok-4.3
The pith
A 7.24 billion parameter language model is trained exclusively on English text predating 1913.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
TypewriterLM is a 7.24B model trained solely on the TypewriterCorpus of pre-1913 English together with lexically grounded instruction tuning on History-LIMA and History-SelfInstruct that forces outputs to stay grounded in the source documents, plus the History-Event benchmark that measures competence, temporal grounding, and leakage.
What carries the argument
Lexically grounded instructing tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents.
If this is right
- The model produces responses that stay consistent with knowledge available before 1913.
- The History-Event benchmark can measure both capability and temporal leakage in one suite.
- Released datasets and model weights enable other groups to replicate or extend historical language model work.
- The same corpus construction and tuning pipeline can be applied to create models with other fixed historical cutoffs.
Where Pith is reading between the lines
- Models of this kind could serve as controlled testbeds for measuring how language use changes across time periods.
- The grounding technique might be adapted to keep models from mixing facts across different domains even when the cutoff date is modern.
Load-bearing premise
The data cleaning and leakage mitigation steps are sufficient to keep all post-1913 text out of the training corpus.
What would settle it
Detection of any post-1913 linguistic features, dates, or references appearing in TypewriterLM outputs on historical prompts.
Figures
read the original abstract
We introduce TypewriterLM, a 7.24B History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. Furthermore, we introduce lexically grounded instructing tuning, a post-training framework that constraints responses to remain directly grounded in historical source documents. Using this framework we construct two historical instruction tuning datasets: History-LIMA and History-SelfInstruct. To evaluate capability and temporal consistency, we introduce History-Event, a benchmark suite for evaluating competence, temporal grounding and data leakage. We release TypewriterLM and all associated resources to support future research on historical language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TypewriterLM, a 7.24B-parameter language model trained exclusively on English text predating 1913. It describes construction of the 54B-token TypewriterCorpus from archival sources with data cleaning and leakage mitigation, introduces a lexically grounded instruction tuning framework along with the History-LIMA and History-SelfInstruct datasets, and presents the History-Event benchmark for evaluating historical competence, temporal grounding, and leakage. The model and associated resources are released to support research on historical language models.
Significance. If the exclusivity of the pre-1913 training data and the effectiveness of the proposed post-training framework hold, the work would supply a useful open resource for studying diachronic language change and temporally grounded NLP. The explicit release of the model, corpus, and benchmarks is a concrete strength that aids reproducibility.
major comments (2)
- [TypewriterCorpus construction (as described in the abstract and methods)] The headline claim that TypewriterLM is a 'History LM' trained exclusively on pre-1913 text rests on the assertion that TypewriterCorpus contains zero post-1913 material. The manuscript states that 'extensive data cleaning and leakage mitigation procedures' were applied, yet supplies no quantitative audit (date-distribution statistics, manual inspection of edge cases such as modern reprints or OCR artifacts, or external metadata validation). This verification is load-bearing for the central framing.
- [Abstract and evaluation description] No quantitative results, error analysis, or benchmark scores are referenced in the abstract or high-level description, leaving the claims of temporal consistency and model capability without empirical grounding in the provided summary.
minor comments (1)
- [Abstract] The term 'lexically grounded instructing tuning' appears to be a typographical error and should read 'instruction tuning'.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of our central claims. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.
read point-by-point responses
-
Referee: [TypewriterCorpus construction (as described in the abstract and methods)] The headline claim that TypewriterLM is a 'History LM' trained exclusively on pre-1913 text rests on the assertion that TypewriterCorpus contains zero post-1913 material. The manuscript states that 'extensive data cleaning and leakage mitigation procedures' were applied, yet supplies no quantitative audit (date-distribution statistics, manual inspection of edge cases such as modern reprints or OCR artifacts, or external metadata validation). This verification is load-bearing for the central framing.
Authors: We agree that quantitative verification of the pre-1913 exclusivity is essential for the central framing. The methods section describes the cleaning and leakage mitigation steps in detail, but we did not include aggregate statistics or audit summaries in the main text. In revision we will add a new subsection with: (i) date-distribution histograms based on available metadata, (ii) results from manual inspection of sampled edge cases (reprints, OCR artifacts), and (iii) any external metadata cross-checks performed. These additions will be placed in the corpus-construction section and referenced from the abstract. revision: yes
-
Referee: [Abstract and evaluation description] No quantitative results, error analysis, or benchmark scores are referenced in the abstract or high-level description, leaving the claims of temporal consistency and model capability without empirical grounding in the provided summary.
Authors: The abstract follows the conventional high-level style, but we accept that referencing key empirical results would improve grounding. We will revise the abstract to include concise quantitative highlights from the History-Event benchmark (e.g., accuracy on temporal-consistency and leakage-detection tasks) and a brief note on error-analysis findings, while preserving the abstract's length and readability. revision: yes
Circularity Check
No circularity: paper is a resource-construction effort with independent data pipelines and evaluations.
full rationale
The manuscript introduces TypewriterLM, TypewriterCorpus, History-LIMA, History-SelfInstruct, and History-Event via explicit construction steps (data collection, cleaning, instruction tuning, benchmark design). No equations, parameter fits, or predictions are presented that reduce by construction to the inputs; the exclusivity claim rests on described (but externally verifiable) cleaning procedures rather than any self-referential derivation. No self-citation chains or uniqueness theorems are invoked as load-bearing. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
free parameters (1)
- Model parameter count (7.24B)
axioms (1)
- domain assumption Historical text sources can be collected and cleaned to eliminate all post-1913 content and produce a temporally consistent training corpus.
invented entities (1)
-
lexically grounded instructing tuning
no independent evidence
Reference graph
Works this paper leans on
-
[1]
and Shaw, Philip A
Barber, Charles and Beal, Joan C. and Shaw, Philip A. , year=. The English Language: A Historical Introduction , publisher=
-
[2]
The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study
Fischer, Stefan and Knappen, J. The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study. Proceedings of the Twelfth Language Resources and Evaluation Conference. 2020
2020
-
[3]
2025 , eprint=
Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability , author=. 2025 , eprint=
2025
-
[4]
c.\ 1510 -- c.\ 1900 , year =
Digitised Books. c.\ 1510 -- c.\ 1900 , year =
1900
-
[5]
2024 , publisher =
Brezina, Vaclav , title =. 2024 , publisher =
2024
-
[6]
2024 , eprint =
Qwen2.5 Technical Report , author =. 2024 , eprint =
2024
-
[7]
History LLMs , institution =
G. History LLMs , institution =. 2025 , url =
2025
-
[8]
2026 , month=
Introducing talkie: a 13B vintage language model from 1930 , author=. 2026 , month=
1930
-
[9]
The European English Messenger , year =
A European database of descriptors of English electronic texts , author=. The European English Messenger , year =
-
[10]
2016 , note =
Huber, Magnus and Nissel, Magnus and Puga, Karin , title =. 2016 , note =
2016
-
[11]
ICAME Journal , year =
Siemund, Rainer and Claridge, Claudia , title =. ICAME Journal , year =
-
[12]
Neural Machine Translation of Rare Words with Subword Units
Sennrich, Rico and Haddow, Barry and Birch, Alexandra. Neural Machine Translation of Rare Words with Subword Units. ACL. 2016
2016
-
[13]
2024 , eprint =
The Llama 3 Herd of Models , author =. 2024 , eprint =
2024
-
[14]
2020 , eprint=
GLU Variants Improve Transformer , author=. 2020 , eprint=
2020
-
[15]
GQA : Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Ainslie, Joshua and Lee-Thorp, James and de Jong, Michiel and Zemlyanskiy, Yury and Lebron, Federico and Sanghai, Sumit. GQA : Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP. 2023
2023
-
[16]
GitHub repository , howpublished =
OpenAI , title =. GitHub repository , howpublished =. 2023 , publisher =
2023
-
[17]
Root Mean Square Layer Normalization , url =
Zhang, Biao and Sennrich, Rico , booktitle =. Root Mean Square Layer Normalization , url =
-
[18]
2021 , eprint=
RoFormer: Enhanced Transformer with Rotary Position Embedding , author=. 2021 , eprint=
2021
-
[19]
Neurocomputing , volume=
Roformer: Enhanced transformer with rotary position embedding , author=. Neurocomputing , volume=. 2024 , publisher=
2024
-
[20]
ICLR , year=
Decoupled Weight Decay Regularization , author=. ICLR , year=
-
[21]
Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord , title =. arXiv:1803.05457v1 , year =
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
2026 , url=
DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence , author=. 2026 , url=
2026
-
[23]
DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence , author=
-
[24]
2026 , eprint=
DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining , author=. 2026 , eprint=
2026
-
[25]
2025 , eprint=
Chronologically Consistent Large Language Models , author=. 2025 , eprint=
2025
-
[26]
COLM , year=
Dated Data: Tracing Knowledge Cutoffs in Large Language Models , author=. COLM , year=
-
[27]
American stories: A large-scale structured text dataset of historical
Dell, Melissa and Carlson, Jacob and Bryan, Tom and Silcock, Emily and Arora, Abhishek and Shen, Zejiang and D'Amico-Wong, Luca and Le, Quan and Querubin, Pablo and Heldring, Leander , journal=. American stories: A large-scale structured text dataset of historical
-
[28]
Available at SSRN 4881024 , year=
StoriesLM: A family of language models with time-indexed training data , author=. Available at SSRN 4881024 , year=
-
[29]
Gavin Greif and Niclas Griesshaber and Robin Greif , year=. Multimodal. 2504.00414 , archivePrefix=
-
[30]
2025 , eprint=
Multimodal LLMs for Historical Dataset Construction from Archival Image Scans: German Patents (1877-1918) , author=. 2025 , eprint=
1918
-
[31]
2025 , publisher =
Grigorian, Hayk and Yaghoobian, Hamed , title =. 2025 , publisher =
2025
-
[32]
ACL , year=
HellaSwag: Can a Machine Really Finish Your Sentence? , author=. ACL , year=
-
[33]
On Memorization of Large Language Models in Logical Reasoning
Xie, Chulin and Huang, Yangsibo and Zhang, Chiyuan and Yu, Da and Chen, Xinyun and Lin, Bill Yuchen and Li, Bo and Ghazi, Badih and Kumar, Ravi. On Memorization of Large Language Models in Logical Reasoning. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Assoc...
2025
-
[34]
NeurIPS , year=
Scaling Data-Constrained Language Models , author=. NeurIPS , year=
-
[35]
ICLR , year=
A Fine-Grained Analysis on Distribution Shift , author=. ICLR , year=
-
[36]
arXiv preprint arXiv:2505.00030 , year=
Can Language Models Represent the Past without Anachronism? , author=. arXiv preprint arXiv:2505.00030 , year=
-
[37]
ICLR , year=
Finetuned Language Models are Zero-Shot Learners , author=. ICLR , year=
-
[38]
NeurIPS , volume=
Training language models to follow instructions with human feedback , author=. NeurIPS , volume=
-
[39]
Advances in Neural Information Processing Systems , volume=
Lima: Less is more for alignment , author=. Advances in Neural Information Processing Systems , volume=
-
[40]
ACL , year=
Self-instruct: Aligning language models with self-generated instructions , author=. ACL , year=
-
[41]
Edward J Hu and yelong shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen , booktitle=. Lo. 2022 , url=
2022
-
[42]
ICML 2025 Workshop on Reliable and Responsible Foundation Models , year=
Lookahead bias in pretrained language models , author=. ICML 2025 Workshop on Reliable and Responsible Foundation Models , year=
2025
-
[43]
Length-Controlled
Yann Dubois and Percy Liang and Tatsunori Hashimoto , booktitle=. Length-Controlled. 2024 , url=
2024
-
[44]
First Conference on Language Modeling , year=
Length-Controlled AlpacaEval: A Simple Debiasing of Automatic Evaluators , author=. First Conference on Language Modeling , year=
-
[45]
2022 , eprint =
Training Compute-Optimal Large Language Models , author =. 2022 , eprint =
2022
-
[46]
, title =
Venturella, T. , title =. 2026 , howpublished =
2026
-
[47]
2026 , note =
michaelmla , title =. 2026 , note =
2026
-
[48]
OpenAI blog , year=
Language models are unsupervised multitask learners , author=. OpenAI blog , year=
-
[49]
Instruction-Following Evaluation for Large Language Models
Instruction-following evaluation for large language models , author=. arXiv preprint arXiv:2311.07911 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[50]
TIME , year =
Billy Perrigo , title =. TIME , year =
-
[51]
1965 , url =
Bailyn, Bernard , title =. 1965 , url =
1965
-
[52]
Corpora across the centuries: Proceedings of the First International Colloquium on English Diachronic Corpora, St Catharine's College Cambridge, 25-27 March 1993 , editor =
Denison, David , title =. Corpora across the centuries: Proceedings of the First International Colloquium on English Diachronic Corpora, St Catharine's College Cambridge, 25-27 March 1993 , editor =. 1994 , publisher =
1993
-
[53]
ICLR , year=
Proving Test Set Contamination in Black-Box Language Models , author=. ICLR , year=
-
[54]
A Careful Examination of Large Language Model Performance on Grade School Arithmetic , url =
Zhang, Hugh and Da, Jeff and Lee, Dean and Robinson, Vaughn and Wu, Catherine and Song, Will and Zhao, Tiffany and Raja, Pranav and Zhuang, Charlotte and Slack, Dylan and Lyu, Qin and Hendryx, Sean and Kaplan, Russell and Lunati, Michele and Yue, Summer , booktitle =. A Careful Examination of Large Language Model Performance on Grade School Arithmetic , u...
-
[55]
Impact of Pretraining Term Frequencies on Few-Shot Numerical Reasoning
Razeghi, Yasaman and Logan IV, Robert L and Gardner, Matt and Singh, Sameer. Impact of Pretraining Term Frequencies on Few-Shot Numerical Reasoning. EMNLP. 2022
2022
-
[56]
Introducing Claude Sonnet 4.6 , year =
-
[57]
2025 , eprint=
Gemini: A Family of Highly Capable Multimodal Models , author=. 2025 , eprint=
2025
-
[58]
2026 , month =
Introducing. 2026 , month =
2026
-
[59]
Qwen3 Technical Report , author=. arXiv preprint arXiv:2505.09388 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[60]
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics , year =
A Model of the Language Process , author =. Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics , year =
-
[61]
Scaling Point-in-Time Language Models
Kelly, Bryan T and Malamud, Semyon and Schwab, Johannes and Xu, Teng Andrea. Scaling Point-in-Time Language Models. 2026
2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.