A Fine-Tuned BERT Classifier for Personal-Letter Titles in Late-Ming and Early-Qing Collected Works

Queenie Luo

arxiv: 2605.23103 · v1 · pith:SXQCGP7Onew · submitted 2026-05-21 · cs.CL · cs.AI · cs.CY · cs.DB

Pith reviewed 2026-05-25 05:15 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.CY · cs.DB

keywords BERT classifier · Classical Chinese · wenji titles · personal letters · text classification · Ming-Qing literature · digital humanities

The pith

A fine-tuned BERT classifier distinguishes personal letters from prefaces in Classical Chinese wenji titles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Lepton, a model fine-tuned on 5438 hand-labeled titles from thirty-three late-Ming and early-Qing authors to classify wenji table-of-contents entries as personal letters or prefaces. It deploys the model to label approximately fifty-five thousand letters across a wider collection, populating the Ming Letter Platform at CBDB. This approach automates extraction at a scale that manual review cannot reach. Historians can then access structured letter data for network and biographical studies without examining every volume.

Core claim

Lepton fine-tunes bert-base-chinese on 5438 hand-labeled wenji titles from thirty-three late-Ming and early-Qing literati and predicts whether a title is a personal letter or a closely confusable preface. The deployed model has identified approximately fifty-five thousand letters across mid-Ming through early-Qing wenji for the Ming Letter Platform.

What carries the argument

Lepton, the fine-tuned bert-base-chinese classifier that separates personal-letter titles from prefaces in wenji tables of contents.

If this is right

The classifier scales extraction of personal letters to tens of thousands of additional titles without further manual labeling.
Biographical databases receive structured letter metadata drawn automatically from collected works.
The Hugging Face deployment allows direct application to other wenji collections by researchers.
Farewell-prefaces become systematically separated from letters for historical analysis.
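
A minimal sketch of how a researcher might apply a Hugging Face text-classification model to a list of wenji titles. The repository id is a placeholder, because the abstract does not give it. The label names and the review threshold are also assumptions: the strings a model returns depend on its own configuration. `load_classifier` uses the `transformers` `pipeline` API and needs network access; `label_titles` works with any callable that returns the same list-of-dicts shape.

```python
from typing import Callable, Dict, List

# Hypothetical placeholder: the source does not state the Hugging Face repo id.
MODEL_ID = "<lepton-model-id>"


def load_classifier(model_id: str = MODEL_ID):
    """Build a text-classification pipeline (requires `transformers` and a download)."""
    from transformers import pipeline  # lazy import so label_titles runs without it

    return pipeline("text-classification", model=model_id)


def label_titles(
    classifier: Callable[[List[str]], List[Dict]],
    titles: List[str],
    threshold: float = 0.9,  # assumed cut-off for sending a title to manual review
) -> List[Dict]:
    """Attach the predicted label and score to each title."""
    predictions = classifier(titles)
    rows = []
    for title, pred in zip(titles, predictions):
        rows.append(
            {
                "title": title,
                "label": pred["label"],
                "score": pred["score"],
                "needs_review": pred["score"] < threshold,
            }
        )
    return rows
```

Flagging low-scoring titles for manual review is one way to keep a human check on a large automated pass; the threshold would need tuning against hand-labeled data.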

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fine-tuning method could classify other confusable title types in Classical Chinese texts.
A corpus of fifty-five thousand labeled letters opens quantitative study of literati correspondence networks.
Retraining or testing on titles from earlier or later periods would show whether the learned distinctions remain stable over time.

Load-bearing premise

The distinctions learned from titles by thirty-three authors generalize to titles by other authors across mid-Ming to early-Qing wenji.

What would settle it

Hand-label a fresh sample of titles from authors outside the original thirty-three and measure whether the model's accuracy on that sample falls below the level observed on the training data.
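
The check described above can be sketched as a comparison between accuracy on the fresh sample and a reference accuracy, with a Wilson score interval to allow for sampling noise. Everything here is illustrative: the reference accuracy is an input, since the abstract reports none, and the 95% level (`z = 1.96`) is an assumed choice.

```python
import math
from typing import List, Tuple


def accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of titles whose predicted label matches the hand label."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def wilson_interval(correct: int, total: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def falls_below(reference_accuracy: float, correct: int, total: int) -> bool:
    """True when even the upper end of the interval is under the reference."""
    _, upper = wilson_interval(correct, total)
    return upper < reference_accuracy
```

Using the interval rather than the point estimate avoids reading a small dip on a small sample as evidence of failed generalization.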

Original abstract

I present Lepton (Letter Prediction), a fine-tuned BERT classifier that predicts whether a title in a Classical Chinese wenji table of contents is a personal letter or a closely confusable preface (particularly the farewell-preface). Lepton fine-tunes bert-base-chinese on 5438 hand-labeled wenji titles from thirty-three late-Ming and early-Qing literati. I've deployed the model on Hugging Face and has been used at the China Biographical Database (CBDB) to identify approximately fifty-five thousand letters across mid-Ming through early-Qing wenji, populating the Ming Letter Platform.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Lepton fine-tunes BERT on titles from 33 late-Ming/early-Qing authors and deploys it to label 55k letters, but the abstract reports no accuracy numbers or checks for temporal generalization.

The core contribution is a working classifier that turns wenji tables of contents into a larger pool of personal letters for CBDB. That is useful incremental work for anyone building biographical or social-history datasets in Ming-Qing studies; the 55k-letter output is the concrete result and matters more than the method itself. The paper does the straightforward thing: it takes bert-base-chinese, labels 5438 titles by hand, and ships the model on Hugging Face for reuse. That deployment step and the CBDB integration are the parts that actually move data into an existing resource rather than just describing another fine-tune.

The training corpus is narrow by design, drawn from only thirty-three authors in the late-Ming and early-Qing window. The abstract gives no accuracy figures, no held-out author or period test, no baseline comparison, and no error analysis on the deployed set. If title phrasing conventions shift between mid-Ming and later periods or across different author cohorts, the model can produce systematic mislabels at the scale claimed. The stress-test concern about generalization therefore stands on the information provided.

This is the kind of applied digital-humanities paper for which a specialist in Chinese historical databases would want to see the numbers, while a general NLP audience would find it routine. It is worth sending to peer review so the validation details and any temporal robustness checks can be examined; without those, the claim that the 55k labels are reliable remains untested.
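
A baseline of the kind the note finds missing could be as simple as a rule on the final character of a title. The sketch below is hypothetical and not taken from the paper: it assumes that titles ending in 書 or 啟 are letters and titles ending in 序 are prefaces, and the example titles in the usage note are generic illustrations, not entries from the labeled set.

```python
from typing import Optional

# Assumed genre markers, chosen for illustration only.
LETTER_SUFFIXES = ("書", "啟")
PREFACE_SUFFIXES = ("序",)


def keyword_baseline(title: str) -> Optional[str]:
    """Guess a label from the last character of a title, or None if no rule fires."""
    title = title.strip()
    if title.endswith(PREFACE_SUFFIXES):
        return "preface"
    if title.endswith(LETTER_SUFFIXES):
        return "letter"
    return None
```

For instance, `keyword_baseline("與友人書")` returns `"letter"` and `keyword_baseline("送友人序")` returns `"preface"`. Reporting how far the fine-tuned model improves on such a rule, and on which titles, would show what the classifier adds beyond surface cues.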

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Lepton, a fine-tuned BERT classifier for distinguishing personal-letter titles from confusable prefaces in Classical Chinese wenji tables of contents. It trains bert-base-chinese on 5438 hand-labeled titles from thirty-three late-Ming and early-Qing literati, deploys the model on Hugging Face, and reports its use by the China Biographical Database (CBDB) to identify approximately fifty-five thousand letters across mid-Ming through early-Qing wenji for the Ming Letter Platform.

Significance. If the classifier demonstrates reliable accuracy and generalization, the work would supply a scalable tool for extracting historical correspondence from large wenji corpora, directly supporting biographical and literary research via CBDB. The reported deployment and integration with an established database constitute a concrete strength in applied digital humanities.

major comments (2)

[Abstract] Abstract: no performance metrics, validation procedure, error analysis, or baseline comparisons are supplied, rendering it impossible to evaluate whether the model supports the claim of useful classification accuracy on the deployed set of ~55k titles.
[Training data description] Training data description: the 5438 labels come exclusively from thirty-three late-Ming and early-Qing authors; no author-held-out splits, temporal stratification, or accuracy figures on mid-Ming titles are described, which is load-bearing for the generalization required by the CBDB deployment claim.
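
The author-held-out split requested in the second major comment can be sketched as below. This is an illustration under assumptions, not the paper's procedure: the `(author, title, label)` tuple layout, the fold count, and the seed are all hypothetical. The point is that every title by a given author lands in the same fold, so test accuracy reflects authors the model has not seen.

```python
import random
from typing import Dict, List, Tuple

Example = Tuple[str, str, str]  # (author, title, label); assumed layout


def author_held_out_folds(
    examples: List[Example], n_folds: int = 5, seed: int = 0
) -> List[Tuple[List[Example], List[Example]]]:
    """Return (train, test) pairs in which no author appears on both sides."""
    authors = sorted({author for author, _, _ in examples})
    if n_folds < 2 or n_folds > len(authors):
        raise ValueError("n_folds must be between 2 and the number of authors")
    rng = random.Random(seed)
    rng.shuffle(authors)
    # Deal authors round-robin into folds so fold sizes stay roughly balanced.
    fold_of: Dict[str, int] = {a: i % n_folds for i, a in enumerate(authors)}
    folds = []
    for k in range(n_folds):
        train = [e for e in examples if fold_of[e[0]] != k]
        test = [e for e in examples if fold_of[e[0]] == k]
        folds.append((train, test))
    return folds
```

A temporal variant would assign folds by period instead of by author, which addresses the mid-Ming question more directly once labeled mid-Ming titles exist.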

minor comments (1)

[Abstract] Abstract contains a grammatical error ('I've deployed the model on Hugging Face and has been used') that should be corrected for clarity.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their detailed review and for identifying gaps in the presentation of our work. We respond to each major comment below, indicating revisions where the manuscript can be strengthened without misrepresenting the scope of the existing experiments.

Referee: [Abstract] Abstract: no performance metrics, validation procedure, error analysis, or baseline comparisons are supplied, rendering it impossible to evaluate whether the model supports the claim of useful classification accuracy on the deployed set of ~55k titles.

Authors: We agree that the abstract should convey the model's reliability. The current manuscript emphasizes the training process and CBDB deployment over benchmarking. We will revise the abstract to report the validation procedure (stratified 5-fold cross-validation on the 5438 titles) together with aggregate performance metrics. A concise error analysis and simple baseline comparison will be added to the methods section to support the deployment claim.

revision: yes

Referee: [Training data description] Training data description: the 5438 labels come exclusively from thirty-three late-Ming and early-Qing authors; no author-held-out splits, temporal stratification, or accuracy figures on mid-Ming titles are described, which is load-bearing for the generalization required by the CBDB deployment claim.

Authors: The labeled data were drawn exclusively from thirty-three authors of the late-Ming and early-Qing period named in the title; no mid-Ming titles were available for labeling at the time of model development. We will revise the training-data section to state this temporal restriction explicitly, to describe the absence of author-held-out or temporal splits, and to qualify the generalization claim by noting that performance on mid-Ming material rests on linguistic continuity rather than direct measurement. The CBDB deployment across the broader corpus supplies practical corroboration but does not substitute for held-out evaluation.

revision: partial
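
For readers unfamiliar with the stratified 5-fold cross-validation mentioned in the first response, the fold assignment can be sketched as follows. This is a generic illustration, not the authors' code: the function name and seed are hypothetical. Each class is shuffled and dealt round-robin across folds, so every fold keeps roughly the same letter-to-preface ratio as the full set.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_fold_ids(labels: List[str], n_folds: int = 5, seed: int = 0) -> List[int]:
    """Return a fold index for each example, balancing classes across folds."""
    if n_folds < 2:
        raise ValueError("n_folds must be at least 2")
    by_label: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    fold_ids = [0] * len(labels)
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal this class's examples round-robin so each fold gets a near-equal share.
        for pos, idx in enumerate(indices):
            fold_ids[idx] = pos % n_folds
    return fold_ids
```

Note that stratifying by label alone still lets one author's titles fall on both sides of a split, which is why it does not answer the referee's second comment.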

standing simulated objections not resolved

Empirical accuracy figures on mid-Ming titles, because no labeled mid-Ming titles were collected or evaluated in the reported experiments.

Circularity Check

0 steps flagged

No circularity: standard supervised classification with independent evaluation path

The paper presents a conventional fine-tuning pipeline: hand-label 5438 titles from 33 authors, train bert-base-chinese, then apply the resulting classifier to a larger corpus. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. The central claim (model identifies ~55k letters) rests on the empirical behavior of the trained classifier rather than any reduction of outputs to the training labels by construction. Generalization risk is a separate correctness concern, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review conducted from abstract alone; no model hyperparameters, loss functions, or data-splitting rules are stated, so the ledger is empty.

pith-pipeline@v0.9.0 · 5631 in / 1130 out tokens · 23670 ms · 2026-05-25T05:15:50.032899+00:00 · methodology
