
arxiv: 2605.23103 · v1 · pith:SXQCGP7Onew · submitted 2026-05-21 · 💻 cs.CL · cs.AI · cs.CY · cs.DB

A Fine-Tuned BERT Classifier for Personal-Letter Titles in Late-Ming and Early-Qing Collected Works

Pith reviewed 2026-05-25 05:15 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CY · cs.DB
keywords BERT classifier · Classical Chinese · wenji titles · personal letters · text classification · Ming-Qing literature · digital humanities

The pith

A fine-tuned BERT classifier distinguishes personal letters from prefaces in Classical Chinese wenji titles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Lepton, a model fine-tuned on 5438 hand-labeled titles from thirty-three late-Ming and early-Qing authors to classify wenji table-of-contents entries as personal letters or prefaces. It deploys the model to label approximately fifty-five thousand letters across a wider collection, populating the Ming Letter Platform at CBDB. This approach automates extraction at a scale that manual review cannot reach. Historians can then access structured letter data for network and biographical studies without examining every volume.

Core claim

Lepton fine-tunes bert-base-chinese on 5438 hand-labeled wenji titles from thirty-three late-Ming and early-Qing literati and predicts whether a title is a personal letter or a closely confusable preface. The deployed model has identified approximately fifty-five thousand letters across mid-Ming through early-Qing wenji for the Ming Letter Platform.

What carries the argument

Lepton, the fine-tuned bert-base-chinese classifier that separates personal-letter titles from prefaces in wenji tables of contents.

If this is right

  • The classifier scales extraction of personal letters to tens of thousands of additional titles without further manual labeling.
  • Biographical databases receive structured letter metadata drawn automatically from collected works.
  • The Hugging Face deployment allows direct application to other wenji collections by researchers.
  • Farewell-prefaces become systematically separated from letters for historical analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fine-tuning method could classify other confusable title types in Classical Chinese texts.
  • A corpus of fifty-five thousand labeled letters opens quantitative study of literati correspondence networks.
  • Retraining or testing on titles from earlier or later periods would show whether the learned distinctions remain stable over time.

Load-bearing premise

The distinctions learned from titles by thirty-three authors generalize to titles by other authors across mid-Ming to early-Qing wenji.

What would settle it

Hand-label a fresh sample of titles from authors outside the original thirty-three and measure whether the model's accuracy on that sample falls below the level observed on the training data.
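
The comparison described above can be sketched in a few lines. This is a minimal illustration, not the author's evaluation code: the function names, the label strings "letter" and "preface", and the toy label lists are all assumptions made for the example, and the Wilson interval is added only to show how uncertain an accuracy estimate from a small fresh sample would be.

```python
from math import sqrt

def accuracy(gold, pred):
    """Fraction of positions where the predicted label equals the gold label."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def wilson_interval(correct, total, z=1.96):
    """Approximate 95% Wilson score interval for a proportion correct/total."""
    p = correct / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half

# Toy labels for illustration only; these are NOT results from the paper.
seen_gold = ["letter", "letter", "preface", "preface", "letter"]
seen_pred = ["letter", "letter", "preface", "preface", "letter"]

# Hypothetical fresh sample from authors outside the original thirty-three.
fresh_gold = ["letter", "preface", "letter", "preface"]
fresh_pred = ["letter", "letter", "letter", "preface"]

fresh_correct = sum(g == p for g, p in zip(fresh_gold, fresh_pred))
low, high = wilson_interval(fresh_correct, len(fresh_gold))

print("accuracy on seen authors: ", accuracy(seen_gold, seen_pred))
print("accuracy on fresh authors:", accuracy(fresh_gold, fresh_pred))
print("95% interval on fresh accuracy:", (round(low, 3), round(high, 3)))
```

With a real hand-labeled sample, a fresh-author interval that sits clearly below the accuracy on seen authors would count against the load-bearing premise; an overlapping interval would leave the question open until more titles are labeled.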

Original abstract

I present Lepton (Letter Prediction), a fine-tuned BERT classifier that predicts whether a title in a Classical Chinese wenji table of contents is a personal letter or a closely confusable preface (particularly the farewell-preface). Lepton fine-tunes bert-base-chinese on 5438 hand-labeled wenji titles from thirty-three late-Ming and early-Qing literati. I've deployed the model on Hugging Face and has been used at the China Biographical Database (CBDB) to identify approximately fifty-five thousand letters across mid-Ming through early-Qing wenji, populating the Ming Letter Platform.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Lepton, a fine-tuned BERT classifier for distinguishing personal-letter titles from confusable prefaces in Classical Chinese wenji tables of contents. It trains bert-base-chinese on 5438 hand-labeled titles from thirty-three late-Ming and early-Qing literati, deploys the model on Hugging Face, and reports its use by the China Biographical Database (CBDB) to identify approximately fifty-five thousand letters across mid-Ming through early-Qing wenji for the Ming Letter Platform.

Significance. If the classifier demonstrates reliable accuracy and generalization, the work would supply a scalable tool for extracting historical correspondence from large wenji corpora, directly supporting biographical and literary research via CBDB. The reported deployment and integration with an established database constitute a concrete strength in applied digital humanities.

major comments (2)
  1. [Abstract] Abstract: no performance metrics, validation procedure, error analysis, or baseline comparisons are supplied, rendering it impossible to evaluate whether the model supports the claim of useful classification accuracy on the deployed set of ~55k titles.
  2. [Training data description] Training data description: the 5438 labels come exclusively from thirty-three late-Ming and early-Qing authors; no author-held-out splits, temporal stratification, or accuracy figures on mid-Ming titles are described, which is load-bearing for the generalization required by the CBDB deployment claim.
minor comments (1)
  1. [Abstract] Abstract contains a grammatical error ('I've deployed the model on Hugging Face and has been used') that should be corrected for clarity.
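
To make the first major comment concrete, here is one hedged sketch of what a baseline comparison and per-class metrics could look like. Nothing here comes from the paper: the rule "a title ending in 書 is a letter" is an assumed, deliberately naive baseline, the five titles are invented for illustration, and the function names are hypothetical.

```python
def keyword_baseline(title):
    """Naive assumed rule: a title ending in 書 is a letter, anything else a preface."""
    return "letter" if title.endswith("書") else "preface"

def precision_recall_f1(gold, pred, positive="letter"):
    """Precision, recall, and F1 for the class named by `positive`."""
    tp = sum(g == positive and p == positive for g, p in zip(gold, pred))
    fp = sum(g != positive and p == positive for g, p in zip(gold, pred))
    fn = sum(g == positive and p != positive for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Invented titles with invented gold labels, for illustration only.
toy = [
    ("與友人書", "letter"),
    ("答某公書", "letter"),
    ("寄友人", "letter"),    # a letter title without 書: the naive rule misses it
    ("送友人序", "preface"),
    ("贈某公序", "preface"),
]
gold = [label for _, label in toy]
pred = [keyword_baseline(title) for title, _ in toy]

precision, recall, f1 = precision_recall_f1(gold, pred)
print(f"baseline precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Reporting the fine-tuned model's precision, recall, and F1 next to a rule of this kind would show how much of the classification the surface form of the title already decides.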

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their detailed review and for identifying gaps in the presentation of our work. We respond to each major comment below, indicating revisions where the manuscript can be strengthened without misrepresenting the scope of the existing experiments.

Point-by-point responses
  1. Referee: [Abstract] Abstract: no performance metrics, validation procedure, error analysis, or baseline comparisons are supplied, rendering it impossible to evaluate whether the model supports the claim of useful classification accuracy on the deployed set of ~55k titles.

    Authors: We agree that the abstract should convey the model's reliability. The current manuscript emphasizes the training process and CBDB deployment over benchmarking. We will revise the abstract to report the validation procedure (stratified 5-fold cross-validation on the 5438 titles) together with aggregate performance metrics. A concise error analysis and simple baseline comparison will be added to the methods section to support the deployment claim. revision: yes

  2. Referee: [Training data description] Training data description: the 5438 labels come exclusively from thirty-three late-Ming and early-Qing authors; no author-held-out splits, temporal stratification, or accuracy figures on mid-Ming titles are described, which is load-bearing for the generalization required by the CBDB deployment claim.

    Authors: The labeled data were drawn exclusively from thirty-three authors of the late-Ming and early-Qing period named in the title; no mid-Ming titles were available for labeling at the time of model development. We will revise the training-data section to state this temporal restriction explicitly, to describe the absence of author-held-out or temporal splits, and to qualify the generalization claim by noting that performance on mid-Ming material rests on linguistic continuity rather than direct measurement. The CBDB deployment across the broader corpus supplies practical corroboration but does not substitute for held-out evaluation. revision: partial
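
The two splitting schemes discussed in these responses differ in what they hold out. The sketch below contrasts them using only the Python standard library; it is an illustration under stated assumptions, not the procedure used for Lepton. The function names, the fold-assignment strategy, and the toy author codes are all hypothetical.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k=5, seed=0):
    """Assign example indices to k folds so each fold keeps a similar label mix."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for i, y in enumerate(labels):
        by_label[y].append(i)
    folds = [[] for _ in range(k)]
    for y in sorted(by_label):
        idxs = by_label[y]
        rng.shuffle(idxs)
        # Deal each label's examples round-robin across the folds.
        for j, i in enumerate(idxs):
            folds[j % k].append(i)
    return folds

def author_held_out_split(authors, test_fraction=0.2, seed=0):
    """Split example indices so that no author appears in both train and test."""
    rng = random.Random(seed)
    unique = sorted(set(authors))
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test_authors = set(unique[:n_test])
    train = [i for i, a in enumerate(authors) if a not in test_authors]
    test = [i for i, a in enumerate(authors) if a in test_authors]
    return train, test

# Toy data: 15 titles, 10 letters and 5 prefaces, from 5 hypothetical authors.
labels = ["letter"] * 10 + ["preface"] * 5
authors = ["A"] * 4 + ["B"] * 3 + ["C"] * 3 + ["D"] * 3 + ["E"] * 2

folds = stratified_folds(labels, k=5)
train_idx, test_idx = author_held_out_split(authors)

print("fold sizes:", [len(f) for f in folds])
print("held-out authors:", sorted({authors[i] for i in test_idx}))
```

Stratified folds keep the letter-to-preface ratio stable but let the same author appear on both sides of a split, so they measure in-author accuracy. The author-held-out split is the one that speaks to the referee's generalization concern.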

standing simulated objections not resolved
  • Empirical accuracy figures on mid-Ming titles, because no labeled mid-Ming titles were collected or evaluated in the reported experiments.

Circularity Check

0 steps flagged

No circularity: standard supervised classification with independent evaluation path

Full rationale

The paper presents a conventional fine-tuning pipeline: hand-label 5438 titles from 33 authors, train bert-base-chinese, then apply the resulting classifier to a larger corpus. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. The central claim (model identifies ~55k letters) rests on the empirical behavior of the trained classifier rather than any reduction of outputs to the training labels by construction. Generalization risk is a separate correctness concern, not circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review conducted from abstract alone; no model hyperparameters, loss functions, or data-splitting rules are stated, so the ledger is empty.

pith-pipeline@v0.9.0 · 5631 in / 1130 out tokens · 23670 ms · 2026-05-25T05:15:50.032899+00:00 · methodology

