Open Korean Historical Corpus: A Millennia-Scale Diachronic Collection of Public Domain Texts

Alice Oh; Haneul Yoo; Jiho Jin; Kiwoong Park; Kyunghyun Cho; Nawon Kim; Seyoung Song; Songeun Chae

arxiv: 2510.24541 · v2 · submitted 2025-10-28 · 💻 cs.CL

Seyoung Song, Nawon Kim, Songeun Chae, Kiwoong Park, Jiho Jin, Haneul Yoo, Kyunghyun Cho, Alice Oh

Pith reviewed 2026-05-18 03:01 UTC · model grok-4.3

keywords Korean historical corpus · diachronic linguistics · Hangul script transition · Idu writing system · North Korean OOV rates · public domain texts · NLP resources · millennia-scale collection

The pith

A new open corpus of 5.1 billion tokens spanning 1,300 years of Korean writing tracks the peak of Idu script and the rapid shift to Hangul.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper assembles the Open Korean Historical Corpus from 19 public-domain sources to create an openly licensed collection of 17.7 million documents and 5.1 billion tokens that covers Korean from the seventh century through 2025, including rare scripts such as Idu and Hanja-Hangul mixtures. The authors use this resource to measure three concrete changes in written language: Idu frequency rose to a maximum in the 1860s before falling, the replacement of Hanja by Hangul accelerated sharply after 1890, and texts from North Korea yield out-of-vocabulary rates up to 51 times higher in modern tokenizers. This matters because earlier NLP work lacked large, accessible historical data, which left the timing and extent of these shifts unmeasured and limited both linguistic research and the training of models that must handle archaic or regionally divergent Korean.

Core claim

The Open Korean Historical Corpus aggregates public-domain texts into a single, openly licensed resource spanning thirteen centuries and multiple scripts; quantitative analysis of this resource shows that Idu usage peaked in the 1860s and then declined sharply, that the Hanja-to-Hangul transition was a rapid change beginning around 1890, and that North Korean lexical divergence produces up to 51 times higher out-of-vocabulary rates in modern tokenizers.

What carries the argument

The Open Korean Historical Corpus, built by combining 19 sources into 17.7 million documents and 5.1 billion tokens, supplies the data that makes frequency counts of Idu, Hanja, and Hangul possible across centuries.

If this is right

Idu usage reached its highest point in the 1860s and declined sharply afterward.
The replacement of Hanja by Hangul took the form of a rapid transformation that began around 1890.
North Korean lexical choices produce up to 51 times higher out-of-vocabulary rates when modern tokenizers are applied.
The corpus can be used directly as pre-training data to improve large-language-model handling of Sino-Korean vocabulary and of archaic or mixed scripts.
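
The script-transition claim can be pictured as a per-decade character count. The following is a minimal sketch, assuming documents arrive as (year, text) pairs and that Hangul and Hanja can be told apart by Unicode block alone. The paper's actual classification procedure is not described on this page; `hangul_share_by_decade` and the toy documents are hypothetical.

```python
from collections import defaultdict

def is_hangul(ch: str) -> bool:
    # Hangul Syllables, Hangul Jamo, and Hangul Compatibility Jamo blocks
    cp = ord(ch)
    return 0xAC00 <= cp <= 0xD7A3 or 0x1100 <= cp <= 0x11FF or 0x3130 <= cp <= 0x318F

def is_hanja(ch: str) -> bool:
    # CJK Unified Ideographs, Extension A, and CJK Compatibility Ideographs
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3400 <= cp <= 0x4DBF or 0xF900 <= cp <= 0xFAFF

def hangul_share_by_decade(docs):
    """docs: iterable of (year, text). Returns {decade: Hangul / (Hangul + Hanja)}."""
    counts = defaultdict(lambda: [0, 0])  # decade -> [hangul_chars, hanja_chars]
    for year, text in docs:
        decade = (year // 10) * 10
        for ch in text:
            if is_hangul(ch):
                counts[decade][0] += 1
            elif is_hanja(ch):
                counts[decade][1] += 1
    # Decades with no Hangul or Hanja characters are skipped to avoid dividing by zero.
    return {d: h / (h + j) for d, (h, j) in sorted(counts.items()) if h + j > 0}

# Toy documents, invented for illustration only (not drawn from the corpus).
toy_docs = [(1885, "天地玄黃"), (1895, "나라 國家"), (1925, "우리 나라")]
print(hangul_share_by_decade(toy_docs))  # {1880: 0.0, 1890: 0.5, 1920: 1.0}
```

A character-level ratio is only one possible operationalization; a document-level or token-level count could place the transition at a somewhat different date.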

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar corpus-construction methods could be applied to other East-Asian languages that experienced script transitions to produce comparable timelines.
Pre-training on this material may reduce error rates on downstream tasks that involve reading or translating historical Korean documents.
Adding further regional or post-2025 sources would allow ongoing monitoring of lexical divergence between the two Koreas.

Load-bearing premise

That the 19 chosen sources give a representative picture of historical Korean usage, and that transcription or digitization errors do not create systematic bias in the measured trends for Idu frequency, script transition dates, or North Korean vocabulary divergence.
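
One way to probe this premise is a leave-one-source-out check: recompute the peak decade with each source removed and see whether it moves. The sketch below assumes per-source, per-decade counts of Idu documents and total documents are available. The function names, source labels, and numbers are all hypothetical, and nothing here is taken from the paper's own analysis.

```python
def peak_decade(counts):
    """counts: {decade: (idu_docs, total_docs)} -> decade with the highest Idu share."""
    shares = {d: idu / total for d, (idu, total) in counts.items() if total > 0}
    return max(shares, key=shares.get)

def merge(sources, skip=None):
    """Sum per-decade counts over all sources, optionally leaving one source out."""
    merged = {}
    for name, counts in sources.items():
        if name == skip:
            continue
        for decade, (idu, total) in counts.items():
            i, t = merged.get(decade, (0, 0))
            merged[decade] = (i + idu, t + total)
    return merged

def leave_one_out_peaks(sources):
    """For each source, the peak decade computed from the remaining sources."""
    return {name: peak_decade(merge(sources, skip=name)) for name in sources}

# Invented counts for three hypothetical sources.
toy_sources = {
    "A": {1850: (10, 100), 1860: (30, 100), 1870: (5, 100)},
    "B": {1850: (20, 100), 1860: (25, 100), 1870: (10, 100)},
    "C": {1850: (5, 50), 1860: (20, 50), 1870: (2, 50)},
}
print(peak_decade(merge(toy_sources)))   # 1860
print(leave_one_out_peaks(toy_sources))  # {'A': 1860, 'B': 1860, 'C': 1860}
```

If the peak stays put under every removal, no single source is driving it; if one removal shifts the peak, that source deserves a closer look.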

What would settle it

A separate, independently assembled collection of Korean historical texts that shows a different peak decade for Idu or a different starting year for the main Hanja-to-Hangul shift, or that produces substantially lower out-of-vocabulary rates on North Korean material.

Original abstract

The history of the Korean language is characterized by a discrepancy between its spoken and written forms and a pivotal shift from Chinese characters to the Hangul alphabet. However, this linguistic evolution has remained largely unexplored in NLP due to a lack of accessible historical corpora. To address this gap, we introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and 6 languages, as well as under-represented writing systems like Korean-style Sinitic (Idu) and Hanja-Hangul mixed script. This corpus contains 17.7 million documents and 5.1 billion tokens from 19 sources, ranging from the 7th century to 2025. We leverage this resource to quantitatively analyze major linguistic shifts: (1) Idu usage peaked in the 1860s before declining sharply; (2) the transition from Hanja to Hangul was a rapid transformation starting around 1890; and (3) North Korea's lexical divergence causes modern tokenizers to produce up to 51 times higher out-of-vocabulary rates. This work provides a foundational resource for quantitative diachronic analysis by capturing the history of the Korean language. Moreover, it can serve as a pre-training corpus for large language models, potentially improving their understanding of Sino-Korean vocabulary in modern Hangul as well as archaic writing systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This paper releases a large open corpus of historical Korean spanning 1,300 years, including Idu and mixed-script texts, together with three basic trend measurements.

The main thing here is a new openly licensed corpus of Korean texts from the 7th century to now, built from 19 sources into 17.7 million documents and 5.1 billion tokens. It includes Idu and Hanja-Hangul mixed scripts that have been hard to work with before, and the authors attach three simple quantitative observations: Idu peaking in the 1860s, a fast shift to Hangul around 1890, and North Korean texts producing up to 51 times higher OOV rates in modern tokenizers. The open license and the long time span with those specific scripts are the parts that actually fill a documented gap in the literature. Releasing the data this way is the clearest win, since it gives people raw material for diachronic studies or pre-training that was not available at this scale. The paper keeps the motivation straightforward and shows the trends without overclaiming.

On the softer side, the timing claims and the OOV multiplier rest on the sources being representative and on transcription plus dating being accurate enough not to move the curves. The abstract gives only aggregate numbers with no per-source breakdown or error rates, so it is still unclear how much digitization issues or selection bias in the 19 sources could affect the reported peaks and transitions. If older mixed-script material has higher error rates, those frequency plots could shift.

This is mainly a data resource paper with descriptive statistics rather than new methods or theory. People working on Korean historical NLP, language change, or models that need archaic or Sino-Korean forms would get direct use from the corpus itself. It deserves a serious referee because the scale and licensing make the release worth checking in detail, even if the trend analyses stay preliminary. I would send it to peer review and ask reviewers to look closely at the collection and validation steps.

Referee Report

3 major / 3 minor

Summary. The paper introduces the Open Korean Historical Corpus (OKHC), a large-scale openly licensed dataset of 17.7 million documents and 5.1 billion tokens spanning 1,300 years (7th century to 2025) across 6 languages and under-represented scripts including Idu and Hanja-Hangul mixed writing. Using this resource, it reports three quantitative findings on Korean linguistic evolution: Idu usage peaked in the 1860s before sharp decline, the shift from Hanja to Hangul was rapid starting around 1890, and North Korean lexical divergence produces up to 51 times higher OOV rates in modern tokenizers. The corpus is positioned as a foundational resource for diachronic NLP and LLM pre-training.

Significance. If the reported trends hold after validation, the work provides a valuable, openly accessible resource that fills a clear gap in historical Korean corpora, enabling quantitative diachronic studies of script transitions and lexical change. The scale, public-domain licensing, and coverage of Idu and mixed scripts are particular strengths that could support improved modeling of Sino-Korean vocabulary and archaic forms in modern NLP systems.

major comments (3)

[Abstract and §4] Abstract and analysis sections: The three central quantitative claims (Idu peak in the 1860s, Hangul transition around 1890, and 51x North Korean OOV) are presented without any reported details on document dating methods, script classification procedures, data cleaning steps, or estimated transcription/OCR error rates. These omissions directly affect the ability to assess whether the observed trends reflect genuine linguistic change or artifacts of source selection and digitization.
[§3] Corpus construction section: No per-source or per-period breakdown of document or token counts is provided, nor is there explicit justification or sampling analysis showing that the 19 sources form a representative sample across time, script, and genre. This leaves the diachronic frequency curves vulnerable to bias if preserved official texts or certain scripts are over-represented.
[§5] OOV analysis section: The 51x OOV claim for North Korean texts requires clear specification of the tokenizer, reference vocabulary, and verification that modern North Korean texts in the corpus are free from systematic digitization errors; without these, the magnitude of the reported divergence cannot be confidently attributed to lexical change alone.
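
The quantity at stake in the OOV comment can be made concrete. This is a minimal sketch, assuming the OOV rate is the fraction of tokens absent from a fixed reference vocabulary and that the "51 times" figure is a ratio of two such rates. The abstract does not specify the tokenizer or vocabulary, so `oov_rate`, `oov_ratio`, and the placeholder tokens are hypothetical.

```python
def oov_rate(tokens, vocab):
    """Fraction of tokens that do not appear in the reference vocabulary."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return sum(1 for t in tokens if t not in vocab) / len(tokens)

def oov_ratio(target_tokens, baseline_tokens, vocab):
    """How many times higher the target OOV rate is than the baseline OOV rate."""
    baseline = oov_rate(baseline_tokens, vocab)
    if baseline == 0:
        # The ratio is undefined when the baseline has no OOV tokens at all.
        raise ValueError("baseline OOV rate is zero")
    return oov_rate(target_tokens, vocab) / baseline

# Placeholder tokens, invented for illustration only.
vocab = {"a", "b", "c", "d"}
baseline_tokens = ["a", "b", "c", "x"]  # 1 of 4 tokens is OOV
target_tokens = ["x", "y", "a", "z"]    # 3 of 4 tokens are OOV
print(oov_ratio(target_tokens, baseline_tokens, vocab))  # 3.0
```

Because the baseline rate sits in the denominator, a very small baseline inflates the multiplier, which is one reason the choice of tokenizer and reference vocabulary needs to be stated.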

minor comments (3)

[Throughout] Define 'document' and 'token' consistently and report how tokenization was handled for historical scripts (Idu, mixed Hanja-Hangul) versus modern Hangul.
[Figures in §4] Add error bars, sensitivity analyses, or robustness checks to the frequency plots for Idu usage and script transition timing.
[§3] Include a table summarizing source metadata (time span, script types, approximate token counts) to improve transparency.
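
The error bars requested for the frequency plots could come from a simple bootstrap over documents within each decade. The sketch below assumes each document carries a 0/1 flag for whether it uses the script of interest and computes a percentile interval. `bootstrap_share_ci` and its defaults are hypothetical, and this is not a procedure the paper is known to use.

```python
import random

def bootstrap_share_ci(flags, n_resamples=1000, alpha=0.05, seed=0):
    """flags: list of 0/1 per document. Returns a percentile (lo, hi) interval for the share."""
    if not flags:
        raise ValueError("flags must be non-empty")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(flags)
    # Resample documents with replacement and recompute the share each time.
    shares = sorted(sum(rng.choices(flags, k=n)) / n for _ in range(n_resamples))
    lo = shares[int((alpha / 2) * n_resamples)]
    hi = shares[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Invented decade: 30 of 100 documents flagged as using the script.
toy_flags = [1] * 30 + [0] * 70
lo, hi = bootstrap_share_ci(toy_flags)
print(f"share = 0.30, 95% interval = [{lo:.2f}, {hi:.2f}]")
```

Resampling documents captures sampling noise only; it does not account for dating or transcription errors, which would need a separate sensitivity analysis.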

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We address each of the major comments below and have revised the manuscript to incorporate additional details and clarifications as appropriate.

Referee: [Abstract and §4] Abstract and analysis sections: The three central quantitative claims (Idu peak in the 1860s, Hangul transition around 1890, and 51x North Korean OOV) are presented without any reported details on document dating methods, script classification procedures, data cleaning steps, or estimated transcription/OCR error rates. These omissions directly affect the ability to assess whether the observed trends reflect genuine linguistic change or artifacts of source selection and digitization.

Authors: We agree that providing more details on these methodological aspects is important for reproducibility and validity assessment. In the revised manuscript, we have expanded §4 to include a detailed description of document dating methods, which rely primarily on the metadata provided by the original sources, cross-referenced with historical records where available. We have also added information on script classification procedures, which combine rule-based detection of Idu, Hanja, and Hangul with manual verification on samples. Data cleaning steps, including normalization, deduplication, and removal of low-quality documents, are now explicitly outlined. Regarding error rates, we have included estimates from manual inspection of a random sample of documents, noting that while OCR/transcription errors exist, they are not systematic in a way that would bias the diachronic trends reported. We believe these additions address the concern.
revision: yes
Referee: [§3] Corpus construction section: No per-source or per-period breakdown of document or token counts is provided, nor is there explicit justification or sampling analysis showing that the 19 sources form a representative sample across time, script, and genre. This leaves the diachronic frequency curves vulnerable to bias if preserved official texts or certain scripts are over-represented.

Authors: We acknowledge the importance of transparency in corpus composition. We have added a new table in §3 that provides a per-source breakdown of document counts, token counts, time periods covered, and primary scripts. Additionally, we have included a per-period (by century) aggregate of token counts to allow readers to assess the distribution. Regarding representativeness, we have expanded the justification in the text, explaining the selection criteria for the 19 sources to maximize coverage of different eras, genres (e.g., official documents, literature, personal writings), and scripts, while noting the inherent limitations due to historical preservation biases. We discuss potential over-representation of certain official texts and how it might affect interpretations, but argue that the trends observed are robust across multiple sources.
revision: yes
Referee: [§5] OOV analysis section: The 51x OOV claim for North Korean texts requires clear specification of the tokenizer, reference vocabulary, and verification that modern North Korean texts in the corpus are free from systematic digitization errors; without these, the magnitude of the reported divergence cannot be confidently attributed to lexical change alone.

Authors: We have revised §5 to clearly specify the tokenizer used (the Korean-adapted tokenizer from our baseline experiments) and the reference vocabulary derived from a large modern South Korean corpus. We have added details on verification procedures for the North Korean texts, including cross-referencing with multiple independent sources to check for digitization artifacts. While complete assurance against all possible errors is difficult, our analysis shows that the elevated OOV rates are consistent across different subsets and align with known lexical differences between North and South Korean variants. We maintain that the reported divergence is attributable to lexical change.
revision: yes

Circularity Check

0 steps flagged

No circularity: corpus introduction and descriptive statistics only

The paper compiles a new historical corpus from 19 public-domain sources and reports empirical observations on script usage frequencies, transition timing, and tokenizer OOV rates directly computed from the assembled documents. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text; the reported trends (Idu peak, Hangul transition, North Korean OOV) are simple aggregates and counts over the collected data rather than quantities that reduce to the inputs by construction. Self-citations, if present, are not load-bearing for any central claim, and the work is self-contained as a data resource paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a data-release and descriptive-analysis paper. No free parameters are fitted to produce the central claims, no mathematical axioms are invoked, and no new physical or theoretical entities are postulated. The '6 languages' and script categories are descriptive labels applied to existing historical texts.

pith-pipeline@v0.9.0 · 5805 in / 1355 out tokens · 43303 ms · 2026-05-18T03:01:54.267246+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "We introduce the Open Korean Historical Corpus, a large-scale, openly licensed dataset spanning 1,300 years and 6 languages... 17.7 million documents and 5.1 billion tokens from 19 sources"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "Idu usage peaked in the 1860s before declining sharply; the transition from Hanja to Hangul was a rapid transformation starting around 1890"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.