A Fine-Tuned BERT Classifier for Personal-Letter Titles in Late-Ming and Early-Qing Collected Works
Pith reviewed 2026-05-25 05:15 UTC · model grok-4.3
The pith
A fine-tuned BERT classifier distinguishes personal letters from prefaces in Classical Chinese wenji titles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Lepton fine-tunes bert-base-chinese on 5438 hand-labeled wenji titles from thirty-three late-Ming and early-Qing literati and predicts whether a title is a personal letter or a closely confusable preface. The deployed model has identified approximately fifty-five thousand letters across mid-Ming through early-Qing wenji for the Ming Letter Platform.
What carries the argument
Lepton, the fine-tuned bert-base-chinese classifier that separates personal-letter titles from prefaces in wenji tables of contents.
If this is right
- The classifier scales extraction of personal letters to tens of thousands of additional titles without further manual labeling.
- Biographical databases receive structured letter metadata drawn automatically from collected works.
- The Hugging Face deployment allows direct application to other wenji collections by researchers.
- Farewell-prefaces become systematically separated from letters for historical analysis.
Where Pith is reading between the lines
- The same fine-tuning method could classify other confusable title types in Classical Chinese texts.
- A corpus of fifty-five thousand labeled letters opens quantitative study of literati correspondence networks.
- Retraining or testing on titles from earlier or later periods would show whether the learned distinctions remain stable over time.
Load-bearing premise
The distinctions learned from titles by thirty-three authors generalize to titles by other authors across mid-Ming to early-Qing wenji.
What would settle it
Hand-label a fresh sample of titles from authors outside the original thirty-three and measure whether the model's accuracy on that sample falls below the level observed on the training data.
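One way to make this test concrete is to put a confidence interval around the accuracy on the fresh sample and check whether the whole interval lies below a reference accuracy. The sketch below illustrates that idea under stated assumptions and is not the paper's procedure: the function names `wilson_interval` and `falls_below`, the 95% level, and every number in the usage lines are hypothetical, since the source reports no accuracy figures.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (here, accuracy)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def falls_below(reference_accuracy: float, correct: int, total: int) -> bool:
    """True when the entire interval for the fresh sample lies below the reference."""
    _, upper = wilson_interval(correct, total)
    return upper < reference_accuracy


# Hypothetical numbers for illustration only; the paper reports no accuracy.
low, high = wilson_interval(correct=90, total=100)
print(f"fresh-sample accuracy interval: {low:.3f} to {high:.3f}")
print("below reference:", falls_below(0.97, correct=90, total=100))
```

Comparing against an interval rather than a point estimate avoids reading a small sampling fluctuation on a few hundred fresh titles as a failure to generalize.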
Original abstract
I present Lepton (Letter Prediction), a fine-tuned BERT classifier that predicts whether a title in a Classical Chinese wenji table of contents is a personal letter or a closely confusable preface (particularly the farewell-preface). Lepton fine-tunes bert-base-chinese on 5438 hand-labeled wenji titles from thirty-three late-Ming and early-Qing literati. I've deployed the model on Hugging Face and has been used at the China Biographical Database (CBDB) to identify approximately fifty-five thousand letters across mid-Ming through early-Qing wenji, populating the Ming Letter Platform.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Lepton, a fine-tuned BERT classifier for distinguishing personal-letter titles from confusable prefaces in Classical Chinese wenji tables of contents. It trains bert-base-chinese on 5438 hand-labeled titles from thirty-three late-Ming and early-Qing literati, deploys the model on Hugging Face, and reports its use by the China Biographical Database (CBDB) to identify approximately fifty-five thousand letters across mid-Ming through early-Qing wenji for the Ming Letter Platform.
Significance. If the classifier demonstrates reliable accuracy and generalization, the work would supply a scalable tool for extracting historical correspondence from large wenji corpora, directly supporting biographical and literary research via CBDB. The reported deployment and integration with an established database constitute a concrete strength in applied digital humanities.
major comments (2)
- [Abstract] No performance metrics, validation procedure, error analysis, or baseline comparisons are supplied, rendering it impossible to evaluate whether the model supports the claim of useful classification accuracy on the deployed set of ~55k titles.
- [Training data description] The 5438 labels come exclusively from thirty-three late-Ming and early-Qing authors; no author-held-out splits, temporal stratification, or accuracy figures on mid-Ming titles are described, which is load-bearing for the generalization required by the CBDB deployment claim.
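The author-held-out split raised in the second comment could be built along the following lines. This is a minimal sketch and not the authors' code: it assumes each labeled title is stored as a `(title, label, author)` tuple, and the function name `author_held_out_split`, the 20% test fraction, and the placeholder records are all hypothetical.

```python
import random
from collections import defaultdict


def author_held_out_split(records, test_fraction=0.2, seed=0):
    """Split (title, label, author) records so that no author is on both sides."""
    by_author = defaultdict(list)
    for rec in records:
        by_author[rec[2]].append(rec)
    authors = sorted(by_author)
    if len(authors) < 2:
        raise ValueError("need at least two authors to hold one out")
    random.Random(seed).shuffle(authors)
    # Hold out at least one author, but always leave at least one for training.
    n_test = min(len(authors) - 1, max(1, round(len(authors) * test_fraction)))
    test_authors = set(authors[:n_test])
    train = [r for a in authors[n_test:] for r in by_author[a]]
    test = [r for a in authors[:n_test] for r in by_author[a]]
    return train, test, test_authors


# Placeholder records, not real wenji titles: label 1 = letter, 0 = preface.
records = [(f"title_{a}_{i}", i % 2, f"author_{a}") for a in range(5) for i in range(4)]
train, test, held_out = author_held_out_split(records)
print(len(train), len(test), sorted(held_out))
```

Splitting by author rather than by title keeps an author's characteristic phrasing out of the training set, so the test score speaks to unseen writers rather than unseen titles by familiar ones.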
minor comments (1)
- [Abstract] The abstract contains a grammatical error ('I've deployed the model on Hugging Face and has been used') that should be corrected for clarity.
Simulated Author's Rebuttal
We thank the referee for their detailed review and for identifying gaps in the presentation of our work. We respond to each major comment below, indicating revisions where the manuscript can be strengthened without misrepresenting the scope of the existing experiments.
Point-by-point responses
- Referee: [Abstract] No performance metrics, validation procedure, error analysis, or baseline comparisons are supplied, rendering it impossible to evaluate whether the model supports the claim of useful classification accuracy on the deployed set of ~55k titles.
  Authors: We agree that the abstract should convey the model's reliability. The current manuscript emphasizes the training process and CBDB deployment over benchmarking. We will revise the abstract to report the validation procedure (stratified 5-fold cross-validation on the 5438 titles) together with aggregate performance metrics. A concise error analysis and simple baseline comparison will be added to the methods section to support the deployment claim.
  Revision: yes
- Referee: [Training data description] The 5438 labels come exclusively from thirty-three late-Ming and early-Qing authors; no author-held-out splits, temporal stratification, or accuracy figures on mid-Ming titles are described, which is load-bearing for the generalization required by the CBDB deployment claim.
  Authors: The labeled data were drawn exclusively from the thirty-three late-Ming and early-Qing authors described in the manuscript; no mid-Ming titles were available for labeling at the time of model development. We will revise the training-data section to state this temporal restriction explicitly, to describe the absence of author-held-out or temporal splits, and to qualify the generalization claim by noting that performance on mid-Ming material rests on linguistic continuity rather than direct measurement. The CBDB deployment across the broader corpus supplies practical corroboration but does not substitute for held-out evaluation.
  Revision: partial
Not resolved by the revision: empirical accuracy figures on mid-Ming titles, because no labeled mid-Ming titles were collected or evaluated in the reported experiments.
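The stratified 5-fold cross-validation mentioned in the first response can be sketched with the standard library alone. This is an illustration under assumptions, not the authors' pipeline: the function name `stratified_k_fold` is hypothetical, labels are assumed to be 1 for letters and 0 for prefaces, and the class counts in the usage lines are invented, since the source does not report the label ratio.

```python
import random
from collections import defaultdict


def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs whose folds keep roughly the overall label ratio."""
    if k < 2:
        raise ValueError("k must be at least 2")
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for y in sorted(by_label):
        idx = by_label[y]
        rng.shuffle(idx)
        # Deal each class round-robin so every fold gets a near-equal share of it.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        val = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, val


# Invented class counts for illustration: 30 letters (1) and 20 prefaces (0).
labels = [1] * 30 + [0] * 20
for train_idx, val_idx in stratified_k_fold(labels, k=5):
    letters = sum(labels[i] for i in val_idx)
    print(len(train_idx), len(val_idx), letters)
```

Stratification fixes the class balance per fold but does nothing about author overlap, so it complements rather than replaces an author-held-out evaluation.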
Circularity Check
No circularity: standard supervised classification with independent evaluation path
Full rationale
The paper presents a conventional fine-tuning pipeline: hand-label 5438 titles from 33 authors, train bert-base-chinese, then apply the resulting classifier to a larger corpus. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. The central claim (model identifies ~55k letters) rests on the empirical behavior of the trained classifier rather than any reduction of outputs to the training labels by construction. Generalization risk is a separate correctness concern, not circularity.