Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts
Pith reviewed 2026-05-18 03:01 UTC · model grok-4.3
The pith
A new open corpus of 5.1 billion Korean tokens across 1,300 years tracks the peak of Idu script and the rapid shift to Hangul.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Open Korean Historical Corpus aggregates public-domain texts into a single, openly licensed resource spanning thirteen centuries and multiple scripts; quantitative analysis of this resource shows that Idu usage peaked in the 1860s and then declined sharply, that the Hanja-to-Hangul transition was a rapid change beginning around 1890, and that North Korean lexical divergence produces up to 51 times higher out-of-vocabulary rates in modern tokenizers.
What carries the argument
The Open Korean Historical Corpus, built by combining 19 sources into 17.7 million documents and 5.1 billion tokens, supplies the data that makes frequency counts of Idu, Hanja, and Hangul possible across centuries.
If this is right
- Idu usage reached its highest point in the 1860s and declined sharply afterward.
- The replacement of Hanja by Hangul took the form of a rapid transformation that began around 1890.
- North Korean lexical choices produce up to 51 times higher out-of-vocabulary rates when modern tokenizers are applied.
- The corpus can be used directly as pre-training data to improve large-language-model handling of Sino-Korean vocabulary and of archaic or mixed scripts.
Where Pith is reading between the lines
- Similar corpus-construction methods could be applied to other East-Asian languages that experienced script transitions to produce comparable timelines.
- Pre-training on this material may reduce error rates on downstream tasks that involve reading or translating historical Korean documents.
- Adding further regional or post-2025 sources would allow ongoing monitoring of lexical divergence between the two Koreas.
Load-bearing premise
That the 19 chosen sources give a representative picture of historical Korean usage, and that transcription or digitization errors do not create systematic bias in the measured trends for Idu frequency, script transition dates, or North Korean vocabulary divergence.
What would settle it
A separate, independently assembled collection of Korean historical texts that shows a different peak decade for Idu or a different starting year for the main Hanja-to-Hangul shift, or that produces substantially lower out-of-vocabulary rates on North Korean material.
Original abstract
The history of the Korean language is characterized by a discrepancy between its spoken and written forms and a pivotal shift from Chinese characters to the Hangul alphabet. However, this linguistic evolution has remained largely unexplored in NLP due to a lack of accessible historical corpora. To address this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and 6 languages, as well as under-represented writing systems like Korean-style Sinitic (Idu) and Hanja-Hangul mixed script. This corpus contains 17.7 million documents and 5.1 billion tokens from 19 sources, ranging from the 7th century to 2025. We leverage this resource to quantitatively analyze major linguistic shifts: (1) Idu usage peaked in the 1860s before declining sharply; (2) the transition from Hanja to Hangul was a rapid transformation starting around 1890; and (3) North Korea's lexical divergence causes modern tokenizers to produce up to 51 times higher out-of-vocabulary rates. This work provides a foundational resource for quantitative diachronic analysis by capturing the history of the Korean language. Moreover, it can serve as a pre-training corpus for large language models, potentially improving their understanding of Sino-Korean vocabulary in modern Hangul as well as archaic writing systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the Open Korean Historical Corpus (OKHC), a large-scale openly licensed dataset of 17.7 million documents and 5.1 billion tokens spanning 1,300 years (7th century to 2025) across 6 languages and under-represented scripts including Idu and Hanja-Hangul mixed writing. Using this resource, it reports three quantitative findings on Korean linguistic evolution: Idu usage peaked in the 1860s before sharp decline, the shift from Hanja to Hangul was rapid starting around 1890, and North Korean lexical divergence produces up to 51 times higher OOV rates in modern tokenizers. The corpus is positioned as a foundational resource for diachronic NLP and LLM pre-training.
Significance. If the reported trends hold after validation, the work provides a valuable, openly accessible resource that fills a clear gap in historical Korean corpora, enabling quantitative diachronic studies of script transitions and lexical change. The scale, public-domain licensing, and coverage of Idu and mixed scripts are particular strengths that could support improved modeling of Sino-Korean vocabulary and archaic forms in modern NLP systems.
major comments (3)
- [Abstract and §4] Abstract and analysis sections: The three central quantitative claims (Idu peak in the 1860s, Hangul transition around 1890, and 51x North Korean OOV) are presented without any reported details on document dating methods, script classification procedures, data cleaning steps, or estimated transcription/OCR error rates. These omissions directly affect the ability to assess whether the observed trends reflect genuine linguistic change or artifacts of source selection and digitization.
- [§3] Corpus construction section: No per-source or per-period breakdown of document or token counts is provided, nor is there explicit justification or sampling analysis showing that the 19 sources form a representative sample across time, script, and genre. This leaves the diachronic frequency curves vulnerable to bias if preserved official texts or certain scripts are over-represented.
- [§5] OOV analysis section: The 51x OOV claim for North Korean texts requires clear specification of the tokenizer, reference vocabulary, and verification that modern North Korean texts in the corpus are free from systematic digitization errors; without these, the magnitude of the reported divergence cannot be confidently attributed to lexical change alone.
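To make the quantity at issue in the last comment concrete, here is a minimal sketch of how an out-of-vocabulary ratio could be computed. It assumes pre-tokenized input and a toy reference vocabulary; the paper's actual tokenizer and reference vocabulary are not specified in the text reviewed here, and `oov_rate` and `oov_ratio` are hypothetical helper names.

```python
from typing import Iterable, Set


def oov_rate(tokens: Iterable[str], vocab: Set[str]) -> float:
    """Fraction of tokens that are absent from the reference vocabulary."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    missing = sum(1 for t in tokens if t not in vocab)
    return missing / len(tokens)


def oov_ratio(target: Iterable[str], baseline: Iterable[str], vocab: Set[str]) -> float:
    """How many times higher the target OOV rate is than the baseline rate."""
    base = oov_rate(baseline, vocab)
    if base == 0.0:
        raise ValueError("baseline OOV rate is zero; ratio is undefined")
    return oov_rate(target, vocab) / base


# Toy data (illustrative only, not drawn from the corpus).
vocab = {"a", "b", "c", "d"}
baseline_tokens = ["a", "b", "c", "d", "a", "b", "c", "x"]  # 1 of 8 tokens is OOV
target_tokens = ["a", "x", "y", "z", "b", "w", "c", "d"]    # 4 of 8 tokens are OOV
ratio = oov_ratio(target_tokens, baseline_tokens, vocab)
```

Because the result is a ratio, a very small baseline rate in the denominator inflates it, which is one reason the choice of tokenizer and reference vocabulary matters for interpreting a figure such as 51 times.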
minor comments (3)
- [Throughout] Define 'document' and 'token' consistently and report how tokenization was handled for historical scripts (Idu, mixed Hanja-Hangul) versus modern Hangul.
- [Figures in §4] Add error bars, sensitivity analyses, or robustness checks to the frequency plots for Idu usage and script transition timing.
- [§3] Include a table summarizing source metadata (time span, script types, approximate token counts) to improve transparency.
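One way the robustness check requested above could be carried out is a document-level bootstrap over the peak decade. The sketch below is an assumption about how such a check might look, not the authors' procedure; the record layout `(year, is_idu)` and the function names are hypothetical, and the data are synthetic.

```python
import random
from collections import Counter
from typing import List, Tuple

Doc = Tuple[int, bool]  # (year, document classified as Idu): hypothetical layout


def peak_decade(docs: List[Doc]) -> int:
    """Decade with the highest share of Idu documents (earliest decade wins ties)."""
    total, idu = Counter(), Counter()
    for year, is_idu in docs:
        decade = year // 10 * 10
        total[decade] += 1
        idu[decade] += is_idu
    return max(total, key=lambda d: (idu[d] / total[d], -d))


def bootstrap_peaks(docs: List[Doc], n_resamples: int = 200, seed: int = 0) -> Counter:
    """Resample documents with replacement and tally the peak decade each time."""
    rng = random.Random(seed)
    peaks = Counter()
    for _ in range(n_resamples):
        sample = rng.choices(docs, k=len(docs))
        peaks[peak_decade(sample)] += 1
    return peaks


# Synthetic data: 100 documents per decade, with a made-up count of Idu documents.
idu_counts = {1840: 20, 1850: 30, 1860: 60, 1870: 25, 1880: 10}
docs = [
    (decade + i % 10, i < n_idu)
    for decade, n_idu in idu_counts.items()
    for i in range(100)
]
peaks = bootstrap_peaks(docs)
```

If the tally in `peaks` is spread over several decades, the reported peak is sensitive to which documents happened to survive; if it concentrates on one decade, the peak is stable under resampling, though not necessarily free of source-selection bias.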
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive suggestions. We address each of the major comments below and have revised the manuscript to incorporate additional details and clarifications as appropriate.
Point-by-point responses
- Referee: [Abstract and §4] Abstract and analysis sections: The three central quantitative claims (Idu peak in the 1860s, Hangul transition around 1890, and 51x North Korean OOV) are presented without any reported details on document dating methods, script classification procedures, data cleaning steps, or estimated transcription/OCR error rates. These omissions directly affect the ability to assess whether the observed trends reflect genuine linguistic change or artifacts of source selection and digitization.
Authors: We agree that more detail on these methodological aspects is important for reproducibility and validity assessment. In the revised manuscript, we have expanded §4 to describe the document dating methods, which rely primarily on metadata provided by the original sources, cross-referenced against historical records where available. We have also added information on the script classification procedure, which combines rule-based detection of Idu, Hanja, and Hangul with manual verification on samples. Data cleaning steps, including normalization, deduplication, and removal of low-quality documents, are now explicitly outlined. Regarding error rates, we have included estimates from manual inspection of a random sample of documents, noting that while OCR and transcription errors exist, they are not systematic in a way that would bias the reported diachronic trends. We believe these additions address the concern.
Revision: yes
- Referee: [§3] Corpus construction section: No per-source or per-period breakdown of document or token counts is provided, nor is there explicit justification or sampling analysis showing that the 19 sources form a representative sample across time, script, and genre. This leaves the diachronic frequency curves vulnerable to bias if preserved official texts or certain scripts are over-represented.
Authors: We acknowledge the importance of transparency in corpus composition. We have added a new table in §3 that provides a per-source breakdown of document counts, token counts, time periods covered, and primary scripts. Additionally, we have included a per-period (by century) aggregate of token counts to allow readers to assess the distribution. Regarding representativeness, we have expanded the justification in the text, explaining the selection criteria for the 19 sources to maximize coverage of different eras, genres (e.g., official documents, literature, personal writings), and scripts, while noting the inherent limitations due to historical preservation biases. We discuss potential over-representation of certain official texts and how it might affect interpretations, but argue that the trends observed are robust across multiple sources.
Revision: yes
- Referee: [§5] OOV analysis section: The 51x OOV claim for North Korean texts requires clear specification of the tokenizer, reference vocabulary, and verification that modern North Korean texts in the corpus are free from systematic digitization errors; without these, the magnitude of the reported divergence cannot be confidently attributed to lexical change alone.
Authors: We have revised §5 to clearly specify the tokenizer used (the Korean-adapted tokenizer from our baseline experiments) and the reference vocabulary derived from a large modern South Korean corpus. We have added details on verification procedures for the North Korean texts, including cross-referencing with multiple independent sources to check for digitization artifacts. While complete assurance against all possible errors is difficult, our analysis shows that the elevated OOV rates are consistent across different subsets and align with known lexical differences between North and South Korean variants. We maintain that the reported divergence is attributable to lexical change.
Revision: yes
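The rule-based script detection mentioned in the first response could, under simple assumptions, look like the following sketch. It separates Hangul from Hanja by Unicode block and labels a text as mixed when both scripts exceed a threshold. The function names and the 0.1 threshold are hypothetical, not taken from the paper. Since Idu is itself written with Chinese characters, code-point ranges alone cannot distinguish it from other Hanja text, so this sketch does not attempt Idu detection.

```python
from collections import Counter

# Unicode blocks: Hangul Syllables, Hangul Jamo, Hangul Compatibility Jamo.
HANGUL_RANGES = [(0xAC00, 0xD7A3), (0x1100, 0x11FF), (0x3130, 0x318F)]
# CJK Unified Ideographs, Extension A, CJK Compatibility Ideographs.
HANJA_RANGES = [(0x4E00, 0x9FFF), (0x3400, 0x4DBF), (0xF900, 0xFAFF)]


def char_script(ch: str) -> str:
    """Label a single character as 'hangul', 'hanja', or 'other'."""
    cp = ord(ch)
    if any(lo <= cp <= hi for lo, hi in HANGUL_RANGES):
        return "hangul"
    if any(lo <= cp <= hi for lo, hi in HANJA_RANGES):
        return "hanja"
    return "other"


def script_shares(text: str) -> dict:
    """Share of Hangul and Hanja among the characters that belong to either script."""
    counts = Counter(char_script(ch) for ch in text)
    n = counts["hangul"] + counts["hanja"]
    if n == 0:
        return {"hangul": 0.0, "hanja": 0.0}
    return {"hangul": counts["hangul"] / n, "hanja": counts["hanja"] / n}


def classify(text: str, mixed_threshold: float = 0.1) -> str:
    """Classify a text as 'hangul', 'hanja', 'mixed', or 'other'."""
    shares = script_shares(text)
    if shares["hangul"] == 0.0 and shares["hanja"] == 0.0:
        return "other"
    if min(shares.values()) >= mixed_threshold:
        return "mixed"
    return "hangul" if shares["hangul"] > shares["hanja"] else "hanja"


labels = [classify(t) for t in ["한국어", "漢字", "國漢文 혼용", "abc 123"]]
```

A classifier of this kind is cheap enough to run over every document, which is why manual verification on samples, as the authors describe, is a natural complement: it checks the cases the rules cannot see, such as Idu.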
Circularity Check
No circularity: corpus introduction and descriptive statistics only
full rationale
The paper compiles a new historical corpus from 19 public-domain sources and reports empirical observations on script usage frequencies, transition timing, and tokenizer OOV rates directly computed from the assembled documents. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text; the reported trends (Idu peak, Hangul transition, North Korean OOV) are simple aggregates and counts over the collected data rather than quantities that reduce to the inputs by construction. Self-citations, if present, are not load-bearing for any central claim, and the work is self-contained as a data resource paper.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
We introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and 6 languages... 17.7 million documents and 5.1 billion tokens from 19 sources
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Idu usage peaked in the 1860s before declining sharply; the transition from Hanja to Hangul was a rapid transformation starting around 1890
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.