Cleansing Jewel: A Neural Spelling Correction Model Built On Google OCR-ed Tibetan Manuscripts

Queenie Luo; Yung-Sung Chuang

arXiv: 2304.03427 · v2 · pith:PKIS67UInew · submitted 2023-04-07 · cs.CL · cs.AI · cs.CY · cs.LG

Pith reviewed 2026-05-24 09:26 UTC · model grok-4.3

classification: cs.CL, cs.AI, cs.CY, cs.LG

keywords: spelling correction, OCR, Tibetan manuscripts, Transformer, confidence score, character error rate, neural model, sequence correction

The pith

Transformer with confidence scoring corrects OCR errors in Tibetan manuscripts more accurately than LSTM or GRU models

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a spelling correction system for noisy OCR text from ancient Tibetan manuscripts used in historical and religious studies. It first builds paired toy and real datasets through feature engineering of Google OCR output. A Transformer is then modified with a Confidence Score mechanism to perform the corrections. Measurements of loss and character error rate show this architecture outperforms a plain Transformer as well as LSTM-2-LSTM and GRU-2-GRU baselines. The work also includes error analysis and attention visualizations to check model behavior on real manuscript data.

Core claim

The central claim is that adding a Confidence Score mechanism to the Transformer architecture produces lower loss and lower character error rates on spelling correction for Google OCR-ed Tibetan manuscripts than either a standard Transformer or the LSTM-2-LSTM and GRU-2-GRU models when all are trained on the same feature-engineered paired toy and real datasets.
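
Since the claim rests on character error rate, a minimal sketch of how CER is usually computed may help: the Levenshtein distance between reference and hypothesis, divided by the reference length. The function names `edit_distance` and `cer` are illustrative, and the sketch counts Unicode code points; the source does not say which unit (code point, stacked letter, or syllable) the paper uses for Tibetan.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions, substitutions all cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # delete r from the reference
                            curr[j - 1] + 1,       # insert h into the reference
                            prev[j - 1] + cost))   # substitute, or match at no cost
        prev = curr
    return prev[-1]


def cer(refs, hyps):
    """Corpus-level CER: total edits divided by total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars if total_chars else 0.0
```

For example, `cer(["knife"], ["kife"])` gives one edit over five reference characters, that is 0.2. Aggregating edits over the whole corpus rather than averaging per-line rates keeps short lines from dominating the score.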

What carries the argument

Transformer architecture augmented with a Confidence Score mechanism that identifies and corrects erroneous tokens in the OCR output
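
The text reviewed here does not describe how the Confidence Score is computed, so the following is only a hypothetical sketch of one known formulation from the out-of-distribution detection literature: the model emits a scalar confidence alongside its token distribution, the prediction is blended toward the true target in proportion to the missing confidence, and low confidence is penalized. The function name `confidence_loss` and the weight `lam` are assumptions, not the paper's notation.

```python
import math


def confidence_loss(probs, target, confidence, lam=0.1):
    """Per-token loss with a learned-confidence hint (hypothetical sketch).

    probs:      predicted distribution over the vocabulary (sums to 1)
    target:     index of the correct token
    confidence: scalar in (0, 1] emitted by the model for this token
    lam:        assumed weight on the penalty for asking for hints
    """
    # Blend the prediction with the one-hot target: low confidence "buys" a hint.
    blended_target_prob = confidence * probs[target] + (1.0 - confidence)
    task_loss = -math.log(blended_target_prob)
    # Without this penalty the model could always report zero confidence.
    penalty = -math.log(confidence)
    return task_loss + lam * penalty
```

With `confidence=1.0` this reduces to ordinary cross-entropy. Under this formulation the confidence output can also be read at inference time as a per-token flag for likely OCR errors, though whether the paper uses it that way is not stated in the text above.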

If this is right

The augmented model can be applied directly to auto-correct noisy OCR output from Tibetan manuscripts.
Attention and self-attention heatmaps provide a way to examine how the model processes erroneous tokens.
The superiority holds across both toy and real paired data according to the reported loss and CER metrics.
Error token analysis supports claims of robustness on the manuscript data used.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same confidence-augmented approach could be tested on OCR output from other historical scripts that suffer similar fading or staining issues.
Attention visualizations might reveal systematic patterns in OCR mistakes that could guide future preprocessing steps.
If the datasets prove representative, the model offers a concrete route to higher-quality digital versions of these texts for humanities research.

Load-bearing premise

The feature-engineered paired toy and real datasets accurately capture the distribution of real OCR errors in Tibetan manuscripts without introducing selection bias or unrealistic noise patterns.
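
One way this premise could be checked, sketched under assumptions: tally character-level edit types in the synthetic pairs and in the real OCR pairs, then compare the two distributions. The helper names are illustrative, `difflib` alignment is a stand-in for whatever alignment the authors might prefer, and total variation distance is only one of several reasonable comparison measures.

```python
from collections import Counter
from difflib import SequenceMatcher


def error_histogram(pairs):
    """Count character-level edit types over (clean, noisy) string pairs."""
    counts = Counter()
    for clean, noisy in pairs:
        matcher = SequenceMatcher(None, clean, noisy, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                # Weight each operation by the number of characters it spans.
                counts[tag] += max(i2 - i1, j2 - j1)
    return counts


def total_variation(hist_a, hist_b):
    """Total variation distance between two normalized histograms (0 = identical)."""
    total_a, total_b = sum(hist_a.values()), sum(hist_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both histograms need at least one counted error")
    keys = set(hist_a) | set(hist_b)
    return 0.5 * sum(abs(hist_a[k] / total_a - hist_b[k] / total_b) for k in keys)
```

A value near 0 would suggest the injected noise has a similar mix of insertions, deletions, and replacements to the real OCR errors; a finer-grained version could key the histogram on the specific characters involved rather than on the edit type alone.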

What would settle it

Running the trained models on a fresh collection of Google OCR-ed Tibetan manuscript pages excluded from the original dataset construction and verifying whether the Transformer plus confidence model still records the lowest character error rate.
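
A sketch of how such a held-out comparison could be made robust to page-level variation: a paired bootstrap over pages, giving a percentile interval for the CER difference between two models. The function name, the 95% interval, and the resample count are assumptions; the inputs are per-page edit distances and reference lengths computed beforehand.

```python
import random


def bootstrap_cer_diff(edits_a, edits_b, ref_lens, n_resamples=2000, seed=0):
    """Return (observed, low, high) for CER_a - CER_b with a 95% percentile interval.

    edits_a[i], edits_b[i]: edit distance of model A / model B on page i
    ref_lens[i]:            reference length of page i (assumed positive)
    """
    rng = random.Random(seed)
    n = len(ref_lens)

    def diff(indices):
        indices = list(indices)
        chars = sum(ref_lens[i] for i in indices)
        return (sum(edits_a[i] for i in indices)
                - sum(edits_b[i] for i in indices)) / chars

    observed = diff(range(n))
    # Resample pages with replacement, keeping both models paired on each page.
    samples = sorted(diff(rng.randrange(n) for _ in range(n))
                     for _ in range(n_resamples))
    low = samples[int(0.025 * n_resamples)]
    high = samples[int(0.975 * n_resamples) - 1]
    return observed, low, high
```

If the whole interval lies below zero, model A's lower CER on the fresh pages is unlikely to be an artifact of which pages happened to be sampled.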

Figures

Figures reproduced from arXiv: 2304.03427 by Queenie Luo, Yung-Sung Chuang.

**Figure 2.** Model Architecture. The Confidence Score mechanism is incorporated into the Transformer archi…

**Figure 3.** The figure on the left shows tokens that the model generally succeeds in correcting. These tokens are…

**Figure 4.** Attention Heatmaps. The attention heatmaps are generated using the third attention layer, averaged…

**Figure 5.** An example of the model’s short-sightedness, where “kife” is incorrectly corrected to “knife” instead…

Original abstract

Scholars in the humanities rely heavily on ancient manuscripts to study history, religion, and socio-political structures in the past. Many efforts have been devoted to digitizing these precious manuscripts using OCR technology, but most manuscripts were blemished over the centuries so that an Optical Character Recognition (OCR) program cannot be expected to capture faded graphs and stains on pages. This work presents a neural spelling correction model built on Google OCR-ed Tibetan Manuscripts to auto-correct OCR-ed noisy output. This paper is divided into four sections: dataset, model architecture, training and analysis. First, we feature-engineered our raw Tibetan etext corpus into two sets of structured data frames -- a set of paired toy data and a set of paired real data. Then, we implemented a Confidence Score mechanism into the Transformer architecture to perform spelling correction tasks. According to the Loss and Character Error Rate, our Transformer + Confidence score mechanism architecture proves to be superior to Transformer, LSTM-2-LSTM and GRU-2-GRU architectures. Finally, to examine the robustness of our model, we analyzed erroneous tokens, visualized Attention and Self-Attention heatmaps in our model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A practical but lightly evaluated application of Transformer with confidence scoring to Tibetan OCR correction.

The paper builds a spelling correction model for Tibetan OCR output by adding a confidence score to a Transformer and training it on paired datasets feature-engineered from an etext corpus. It claims lower loss and CER than the baselines and includes attention analysis. The contribution is the application to a specific humanities need: correcting OCR output from old manuscripts. The dataset construction and visualizations are clear enough to follow.

The soft spots are in the evaluation. The abstract gives no numbers to back the superiority claim, no details on how the confidence score is integrated or how the data is handled, and no tests showing that the toy-data errors match real ones. The stress-test worry about the synthetic pairs is fair: even with the construction described, the absence of error histograms or overlap statistics leaves room for the results to depend on the setup. If the full paper supplies them, the concern is addressed, but from the information given it is a gap.

The audience is researchers in digital humanities or NLP for historical languages, who could adopt the method for similar projects. The paper deserves peer review because the problem is practical and the approach is transparent, even if the evidence needs more specifics.

Referee Report

3 major / 1 minor

Summary. The paper presents a neural spelling correction model for Google OCR-ed Tibetan manuscripts. It constructs paired toy and real datasets via feature engineering from a raw Tibetan etext corpus, augments the Transformer with a Confidence Score mechanism, and reports that this architecture outperforms plain Transformer, LSTM-2-LSTM, and GRU-2-GRU baselines on loss and character error rate (CER). The work also includes error analysis and attention/self-attention visualizations for robustness assessment.

Significance. If the empirical claims hold after proper validation and reporting, the work would offer a practical contribution to low-resource script OCR post-correction, particularly for Tibetan manuscripts where faded graphs and stains are common. The attention visualizations provide some interpretability value, and the focus on real-world OCR noise is relevant to digital humanities applications.

major comments (3)

[Abstract] The central claim that 'our Transformer + Confidence score mechanism architecture proves to be superior' on loss and CER supplies no numerical values, no data-split details, no statistical tests, and no error bars, rendering the superiority assertion impossible to evaluate.
[Dataset section] The feature-engineered paired toy and real datasets are presented as capturing real OCR error distributions, yet no quantitative validation (error-type histograms, n-gram overlap statistics, or distributional tests) is supplied to confirm that injected noise matches empirical Google OCR errors on actual manuscripts; this assumption is load-bearing for all reported performance gains.
[Model architecture section] The Confidence Score mechanism is introduced without any description of its computation, integration into the Transformer (e.g., how it modifies attention or loss), or hyperparameter details, and there is no evidence that baselines received equivalent tuning.

minor comments (1)

[Abstract] The abstract states the paper is 'divided into four sections' but the actual organization (dataset, model, training, analysis) could be more explicitly mapped to section headings for clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and indicate the revisions that will be made to strengthen the manuscript.

Referee: [Abstract] The central claim that 'our Transformer + Confidence score mechanism architecture proves to be superior' on loss and CER supplies no numerical values, no data-split details, no statistical tests, and no error bars, rendering the superiority assertion impossible to evaluate.

Authors: We agree that the abstract lacks the quantitative details needed to evaluate the claim. The revised abstract will report the specific loss and CER values achieved by each model, the train/validation/test split ratios, and the number of runs performed. Statistical significance tests and error bars were not computed in the original experiments; we will note this limitation explicitly rather than add post-hoc tests without new runs.
Revision: yes

Referee: [Dataset section] The feature-engineered paired toy and real datasets are presented as capturing real OCR error distributions, yet no quantitative validation (error-type histograms, n-gram overlap statistics, or distributional tests) is supplied to confirm that injected noise matches empirical Google OCR errors on actual manuscripts; this assumption is load-bearing for all reported performance gains.

Authors: The datasets were constructed by manually engineering error patterns observed in Google OCR output on Tibetan manuscripts. While the original manuscript did not include quantitative validation, we accept that such checks would strengthen the claim. The revised dataset section will add error-type histograms, n-gram overlap statistics between injected and real OCR errors, and any available distributional comparisons to demonstrate that the synthetic noise approximates the empirical distribution.
Revision: yes

Referee: [Model architecture section] The Confidence Score mechanism is introduced without any description of its computation, integration into the Transformer (e.g., how it modifies attention or loss), or hyperparameter details, and there is no evidence that baselines received equivalent tuning.

Authors: The original submission provided only a high-level description of the Confidence Score mechanism. We will expand the model architecture section to include the precise formula used to compute the score, its exact integration point within the Transformer (as an auxiliary input to the decoder and as a weighting term in the loss), all hyperparameter values, and the hyperparameter search procedure applied uniformly to the Transformer, LSTM-2-LSTM, and GRU-2-GRU baselines.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper is a purely empirical ML study: it constructs paired datasets via feature engineering from a raw corpus, trains several sequence models (Transformer variants, LSTM, GRU), and reports direct performance measurements (loss, CER) on held-out data. No equations, derivations, or predictions are claimed; the central claim is an observed ranking of architectures on the chosen metrics. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the result. The dataset-construction step is an input assumption (as noted by the reader) but does not create circularity because the reported superiority is not forced by construction or by renaming a fitted quantity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; standard supervised sequence-modeling assumptions are implicit but not stated.

axioms (1)

Domain assumption: paired noisy-correct examples can be reliably constructed from OCR output and ground-truth text without introducing systematic bias. Required for the training data construction step described in the abstract.
