arxiv: 2606.02991 · v1 · pith:ET75EIP3new · submitted 2026-06-02 · 💻 cs.CL · cs.AI

Pretraining Language Models on Historical Text

Pith reviewed 2026-06-28 10:56 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords historical language models · temporal leakage · pre-1913 English · instruction tuning · corpus construction · evaluation benchmarks · archival sources

The pith

A 7.24-billion-parameter language model is trained exclusively on English text predating 1913.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to build and evaluate language models whose knowledge is strictly limited to English written before 1913. It does so by assembling a 54-billion-token corpus drawn only from archival sources and applying cleaning steps meant to block any later material. A post-training method called lexically grounded instruction tuning is introduced to keep model answers tied to the original historical documents. A new benchmark tests whether the resulting model stays temporally consistent and free of leakage. The work supplies the model, data, and evaluation tools so that future studies can examine language models whose training cutoff is fixed in the past.

Core claim

TypewriterLM is a 7.24B model trained solely on TypewriterCorpus, a corpus of pre-1913 English. It is post-trained with lexically grounded instruction tuning on History-LIMA and History-SelfInstruct, which constrains outputs to stay grounded in the source documents, and it is evaluated on the History-Event benchmark, which measures competence, temporal grounding, and leakage.

What carries the argument

Lexically grounded instruction tuning, a post-training framework that constrains responses to remain directly grounded in historical source documents.
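
The page does not describe how the grounding constraint is implemented, so the following is only an illustrative sketch of one way "lexical grounding" could be measured: the share of response tokens that also occur in the source document. The function names `tokenize` and `grounding_ratio` and the tokenization rule are assumptions made for this example, not the paper's method.

```python
import re


def tokenize(text):
    """Lowercase the text and split it into simple word tokens (illustrative rule)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def grounding_ratio(response, source):
    """Fraction of response tokens that also appear in the source document.

    Hypothetical proxy for lexical grounding; returns 0.0 for an empty response.
    """
    response_tokens = tokenize(response)
    if not response_tokens:
        return 0.0
    source_vocab = set(tokenize(source))
    grounded = sum(token in source_vocab for token in response_tokens)
    return grounded / len(response_tokens)
```

Under this assumed proxy, an instruction-response pair could be kept only when its ratio exceeds a chosen threshold; any such threshold would itself be a free choice, not something the source specifies.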

If this is right

  • The model produces responses that stay consistent with knowledge available before 1913.
  • The History-Event benchmark can measure both capability and temporal leakage in one suite.
  • Released datasets and model weights enable other groups to replicate or extend historical language model work.
  • The same corpus construction and tuning pipeline can be applied to create models with other fixed historical cutoffs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models of this kind could serve as controlled testbeds for measuring how language use changes across time periods.
  • The grounding technique might be adapted to keep models from mixing facts across different domains even when the cutoff date is modern.

Load-bearing premise

The data cleaning and leakage mitigation steps are sufficient to keep all post-1913 text out of the training corpus.

What would settle it

Detection of any post-1913 linguistic features, dates, or references appearing in TypewriterLM outputs on historical prompts.
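
One crude form of this check can be sketched in code: scan model outputs for explicit year mentions at or after the cutoff. This is a minimal sketch under stated assumptions; `CUTOFF_YEAR`, `YEAR_PATTERN`, and `find_post_cutoff_years` are hypothetical names, and a year scan covers only the "dates" part of the test, not linguistic features or indirect references.

```python
import re

CUTOFF_YEAR = 1913  # assumption: the first year treated as "future" for the model

# Matches standalone four-digit numbers from 1000 to 2999.
YEAR_PATTERN = re.compile(r"\b([12]\d{3})\b")


def find_post_cutoff_years(text, cutoff=CUTOFF_YEAR):
    """Return the sorted distinct years >= cutoff that are mentioned in text."""
    years = {int(match) for match in YEAR_PATTERN.findall(text)}
    return sorted(year for year in years if year >= cutoff)
```

A hit is a flag for manual review rather than proof of leakage, since a four-digit quantity (for example a sum of money) matches the same pattern.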

Figures

Figures reproduced from arXiv: 2606.02991 by Freda Shi, Junchi Yu, Niclas Griesshaber, Philip Torr, Xiaoxi Luo, Yao Lu, Yixuan Wang, Zachary Shinnick.

Figure 1
Figure 1: Probing TypewriterLM on “future” events. The cleanest fix is to train language models under a strict knowledge cutoff that excludes information beyond a chosen date, an approach that is attracting increasing attention from both machine learning and the humanities (Grigorian and Yaghoobian, 2025; Göttlich et al., 2025; Levine et al., 2026). While modern LLMs benefit from massive and diverse web-collected…
Figure 2
Figure 2: Number of tokens by decade (1700–1913) in …
Figure 3
Figure 3: Bits-per-Byte Surprisingness Scores Results
Figure 4
Figure 4: Number of HIST-EVENT events per decade (1700–2025).
Original abstract

We introduce TypewriterLM, a 7.24B History language model (LM) trained exclusively on English text predating 1913. Developing History LMs requires addressing challenges in data quality and availability, preventing temporal leakage, designing temporally consistent post-training pipelines, and constructing reliable evaluations. To address these issues, we construct TypewriterCorpus, a 54B-token historical corpus collected from diverse archival and linguistically annotated sources with extensive data cleaning and leakage mitigation procedures. Furthermore, we introduce lexically grounded instructing tuning, a post-training framework that constraints responses to remain directly grounded in historical source documents. Using this framework we construct two historical instruction tuning datasets: History-LIMA and History-SelfInstruct. To evaluate capability and temporal consistency, we introduce History-Event, a benchmark suite for evaluating competence, temporal grounding and data leakage. We release TypewriterLM and all associated resources to support future research on historical language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces TypewriterLM, a 7.24B-parameter language model trained exclusively on English text predating 1913. It describes construction of the 54B-token TypewriterCorpus from archival sources with data cleaning and leakage mitigation, introduces a lexically grounded instruction tuning framework along with the History-LIMA and History-SelfInstruct datasets, and presents the History-Event benchmark for evaluating historical competence, temporal grounding, and leakage. The model and associated resources are released to support research on historical language models.

Significance. If the exclusivity of the pre-1913 training data and the effectiveness of the proposed post-training framework hold, the work would supply a useful open resource for studying diachronic language change and temporally grounded NLP. The explicit release of the model, corpus, and benchmarks is a concrete strength that aids reproducibility.

major comments (2)
  1. [TypewriterCorpus construction (as described in the abstract and methods)] The headline claim that TypewriterLM is a 'History LM' trained exclusively on pre-1913 text rests on the assertion that TypewriterCorpus contains zero post-1913 material. The manuscript states that 'extensive data cleaning and leakage mitigation procedures' were applied, yet supplies no quantitative audit (date-distribution statistics, manual inspection of edge cases such as modern reprints or OCR artifacts, or external metadata validation). This verification is load-bearing for the central framing.
  2. [Abstract and evaluation description] No quantitative results, error analysis, or benchmark scores are referenced in the abstract or high-level description, leaving the claims of temporal consistency and model capability without empirical grounding in the provided summary.
minor comments (1)
  1. [Abstract] The term 'lexically grounded instructing tuning' appears to be a typographical error and should read 'instruction tuning'.
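
The date-distribution audit requested in major comment 1 could be sketched as follows. This is a minimal illustration, assuming each corpus document carries a metadata record with hypothetical `id` and `year` fields; the actual TypewriterCorpus metadata schema is not described on this page.

```python
from collections import Counter


def audit_dates(records, cutoff=1913):
    """Summarize publication years and flag documents dated at or after the cutoff.

    records: iterable of dicts with hypothetical keys 'id' and 'year'
             (year is an int, or None when the metadata has no date).
    Returns (documents per decade, flagged ids, undated ids).
    """
    per_decade = Counter()
    flagged, undated = [], []
    for record in records:
        year = record.get("year")
        if year is None:
            undated.append(record["id"])  # cannot be verified from metadata alone
            continue
        per_decade[(year // 10) * 10] += 1
        if year >= cutoff:
            flagged.append(record["id"])
    return dict(sorted(per_decade.items())), flagged, undated
```

Metadata dates alone would not catch the edge cases the referee names, such as modern reprints that carry an original publication year, so a check like this complements manual inspection rather than replacing it.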

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of our central claims. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.

Point-by-point responses
  1. Referee: [TypewriterCorpus construction (as described in the abstract and methods)] The headline claim that TypewriterLM is a 'History LM' trained exclusively on pre-1913 text rests on the assertion that TypewriterCorpus contains zero post-1913 material. The manuscript states that 'extensive data cleaning and leakage mitigation procedures' were applied, yet supplies no quantitative audit (date-distribution statistics, manual inspection of edge cases such as modern reprints or OCR artifacts, or external metadata validation). This verification is load-bearing for the central framing.

    Authors: We agree that quantitative verification of the pre-1913 exclusivity is essential for the central framing. The methods section describes the cleaning and leakage mitigation steps in detail, but we did not include aggregate statistics or audit summaries in the main text. In revision we will add a new subsection with: (i) date-distribution histograms based on available metadata, (ii) results from manual inspection of sampled edge cases (reprints, OCR artifacts), and (iii) any external metadata cross-checks performed. These additions will be placed in the corpus-construction section and referenced from the abstract. revision: yes

  2. Referee: [Abstract and evaluation description] No quantitative results, error analysis, or benchmark scores are referenced in the abstract or high-level description, leaving the claims of temporal consistency and model capability without empirical grounding in the provided summary.

    Authors: The abstract follows the conventional high-level style, but we accept that referencing key empirical results would improve grounding. We will revise the abstract to include concise quantitative highlights from the History-Event benchmark (e.g., accuracy on temporal-consistency and leakage-detection tasks) and a brief note on error-analysis findings, while preserving the abstract's length and readability. revision: yes

Circularity Check

0 steps flagged

No circularity: paper is a resource-construction effort with independent data pipelines and evaluations.

full rationale

The manuscript introduces TypewriterLM, TypewriterCorpus, History-LIMA, History-SelfInstruct, and History-Event via explicit construction steps (data collection, cleaning, instruction tuning, benchmark design). No equations, parameter fits, or predictions are presented that reduce by construction to the inputs; the exclusivity claim rests on described (but externally verifiable) cleaning procedures rather than any self-referential derivation. No self-citation chains or uniqueness theorems are invoked as load-bearing. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The claims rest primarily on the domain assumption of effective temporal isolation via cleaning and the introduction of a new tuning framework; model scale is a chosen hyperparameter rather than a fitted value derived from data.

free parameters (1)
  • Model parameter count (7.24B)
    Selected scale for the LM; not derived from data fitting but chosen as the training configuration.
axioms (1)
  • domain assumption: Historical text sources can be collected and cleaned to eliminate all post-1913 content and produce a temporally consistent training corpus.
    Invoked directly in the construction of TypewriterCorpus and leakage mitigation procedures.
invented entities (1)
  • lexically grounded instruction tuning (no independent evidence)
    purpose: Post-training framework that constrains model responses to historical source documents
    New method introduced by the paper with no independent evidence provided beyond the framework description.

pith-pipeline@v0.9.1-grok · 5705 in / 1488 out tokens · 50165 ms · 2026-06-28T10:56:42.125275+00:00 · methodology


Reference graph

Works this paper leans on

61 extracted references · 6 canonical work pages · 3 internal anchors

  1. Barber, Charles and Beal, Joan C. and Shaw, Philip A. The English Language: A Historical Introduction.

  2. Fischer, Stefan and Knappen, J. The Royal Society Corpus 6.0: Providing 300+ Years of Scientific Writing for Humanistic Study. Proceedings of the Twelfth Language Resources and Evaluation Conference. 2020.

  3. Institutional Books 1.0: A 242B token dataset from Harvard Library's collections, refined for accuracy and usability. 2025.

  4. Digitised Books. c. 1510 – c. 1900.

  5. Brezina, Vaclav. 2024.

  6. Qwen2.5 Technical Report. 2024.

  7. G. History LLMs. 2025.

  8. Introducing talkie: a 13B vintage language model from 1930. 2026.

  9. A European database of descriptors of English electronic texts. The European English Messenger.

  10. Huber, Magnus and Nissel, Magnus and Puga, Karin. 2016.

  11. Siemund, Rainer and Claridge, Claudia. ICAME Journal.

  12. Sennrich, Rico and Haddow, Barry and Birch, Alexandra. Neural Machine Translation of Rare Words with Subword Units. ACL. 2016.

  13. The Llama 3 Herd of Models. 2024.

  14. GLU Variants Improve Transformer. 2020.

  15. Ainslie, Joshua and Lee-Thorp, James and de Jong, Michiel and Zemlyanskiy, Yury and Lebron, Federico and Sanghai, Sumit. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. EMNLP. 2023.

  16. OpenAI. GitHub repository. 2023.

  17. Zhang, Biao and Sennrich, Rico. Root Mean Square Layer Normalization.

  18. RoFormer: Enhanced Transformer with Rotary Position Embedding. 2021.

  19. Roformer: Enhanced transformer with rotary position embedding. Neurocomputing. 2024.

  20. Decoupled Weight Decay Regularization. ICLR.

  21. Peter Clark and Isaac Cowhey and Oren Etzioni and Tushar Khot and Ashish Sabharwal and Carissa Schoenick and Oyvind Tafjord. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. arXiv:1803.05457v1.

  22. DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence. 2026.

  23. DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence.

  24. DatedGPT: Preventing Lookahead Bias in Large Language Models with Time-Aware Pretraining. 2026.

  25. Chronologically Consistent Large Language Models. 2025.

  26. Dated Data: Tracing Knowledge Cutoffs in Large Language Models. COLM.

  27. Dell, Melissa and Carlson, Jacob and Bryan, Tom and Silcock, Emily and Arora, Abhishek and Shen, Zejiang and D'Amico-Wong, Luca and Le, Quan and Querubin, Pablo and Heldring, Leander. American stories: A large-scale structured text dataset of historical

  28. StoriesLM: A family of language models with time-indexed training data. Available at SSRN 4881024.

  29. Gavin Greif and Niclas Griesshaber and Robin Greif. Multimodal. 2504.00414.

  30. Multimodal LLMs for Historical Dataset Construction from Archival Image Scans: German Patents (1877-1918). 2025.

  31. Grigorian, Hayk and Yaghoobian, Hamed. 2025.

  32. HellaSwag: Can a Machine Really Finish Your Sentence? ACL.

  33. Xie, Chulin and Huang, Yangsibo and Zhang, Chiyuan and Yu, Da and Chen, Xinyun and Lin, Bill Yuchen and Li, Bo and Ghazi, Badih and Kumar, Ravi. On Memorization of Large Language Models in Logical Reasoning. Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Assoc...

  34. Scaling Data-Constrained Language Models. NeurIPS.

  35. A Fine-Grained Analysis on Distribution Shift. ICLR.

  36. Can Language Models Represent the Past without Anachronism? arXiv preprint arXiv:2505.00030.

  37. Finetuned Language Models are Zero-Shot Learners. ICLR.

  38. Training language models to follow instructions with human feedback. NeurIPS.

  39. Lima: Less is more for alignment. Advances in Neural Information Processing Systems.

  40. Self-instruct: Aligning language models with self-generated instructions. ACL.

  41. Edward J Hu and yelong shen and Phillip Wallis and Zeyuan Allen-Zhu and Yuanzhi Li and Shean Wang and Lu Wang and Weizhu Chen. Lo. 2022.

  42. Lookahead bias in pretrained language models. ICML 2025 Workshop on Reliable and Responsible Foundation Models.

  43. Yann Dubois and Percy Liang and Tatsunori Hashimoto. Length-Controlled. 2024.

  44. Length-Controlled AlpacaEval: A Simple Debiasing of Automatic Evaluators. First Conference on Language Modeling.

  45. Training Compute-Optimal Large Language Models. 2022.

  46. Venturella, T. 2026.

  47. michaelmla. 2026.

  48. Language models are unsupervised multitask learners. OpenAI blog.

  49. Instruction-following evaluation for large language models. arXiv preprint arXiv:2311.07911.

  50. Billy Perrigo. TIME.

  51. Bailyn, Bernard. 1965.

  52. Denison, David. Corpora across the centuries: Proceedings of the First International Colloquium on English Diachronic Corpora, St Catharine's College Cambridge, 25-27 March 1993. 1994.

  53. Proving Test Set Contamination in Black-Box Language Models. ICLR.

  54. Zhang, Hugh and Da, Jeff and Lee, Dean and Robinson, Vaughn and Wu, Catherine and Song, Will and Zhao, Tiffany and Raja, Pranav and Zhuang, Charlotte and Slack, Dylan and Lyu, Qin and Hendryx, Sean and Kaplan, Russell and Lunati, Michele and Yue, Summer. A Careful Examination of Large Language Model Performance on Grade School Arithmetic.

  55. Razeghi, Yasaman and Logan IV, Robert L and Gardner, Matt and Singh, Sameer. Impact of Pretraining Term Frequencies on Few-Shot Numerical Reasoning. EMNLP. 2022.

  56. Introducing Claude Sonnet 4.6.

  57. Gemini: A Family of Highly Capable Multimodal Models. 2025.

  58. Introducing. 2026.

  59. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388.

  60. A Model of the Language Process. Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics.

  61. Kelly, Bryan T and Malamud, Semyon and Schwab, Johannes and Xu, Teng Andrea. Scaling Point-in-Time Language Models. 2026.