Digitizing Nepal's Written Heritage: A Comprehensive HTR Pipeline for Old Nepali Manuscripts
Pith reviewed 2026-05-16 21:04 UTC · model grok-4.3
The pith
An end-to-end pipeline recognizes Old Nepali handwritten text with 4.9% error.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We present the first end-to-end pipeline for Handwritten Text Recognition of Old Nepali. By exploring encoder-decoder architectures and data-centric techniques at the line level, our best model achieves a Character Error Rate of 4.9%. We also evaluate decoding strategies and analyze token-level confusions, and release the training code, model configurations, and evaluation scripts to support further research on HTR for low-resource historical scripts.
What carries the argument
Encoder-decoder model for line-level transcription of Old Nepali script, supported by data-centric training adjustments and token confusion analysis.
If this is right
- Large collections of Old Nepali manuscripts can be transcribed automatically with under five percent character error.
- Token-level confusion analysis identifies recurring mistakes in specific characters or combinations within the script.
- Open training code and scripts allow direct replication and adaptation for other low-resource historical languages.
- Evaluated decoding strategies provide concrete options for balancing speed and accuracy on this script.
Where Pith is reading between the lines
- The same line-level encoder-decoder approach could transfer to neighboring Indic scripts with modest additional labeled lines.
- Integration with existing digital archive platforms would let libraries begin bulk transcription of their Nepali holdings.
- Script-specific rules for correcting the remaining errors could push effective accuracy higher without new model training.
- Publication of even a small held-out public validation set would let independent groups confirm or refine the 4.9% figure.
Load-bearing premise
The confidential evaluation dataset represents the range of styles, conditions, and content found across Old Nepali manuscripts.
What would settle it
Testing the released model on an independent public collection of Old Nepali manuscript pages that yields a character error rate well above 4.9% would show the reported performance does not generalize.
Figures
read the original abstract
This paper presents the first end-to-end pipeline for Handwritten Text Recognition (HTR) for Old Nepali, a historically significant but low-resource language. We adopt a line-level transcription approach and systematically explore encoder-decoder architectures and data-centric techniques to improve recognition accuracy. Our best model achieves a Character Error Rate (CER) of 4.9\%. In addition, we implement and evaluate decoding strategies and analyze token-level confusions to better understand model behavior and error patterns. Although the evaluation dataset is confidential, we release our training code, model configurations, and evaluation scripts to support further research on HTR for low-resource historical scripts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper presents the first end-to-end HTR pipeline for Old Nepali manuscripts using a line-level transcription approach. It systematically explores encoder-decoder architectures and data-centric techniques, reports a best-model CER of 4.9% on a confidential evaluation set, implements decoding strategies, and analyzes token-level confusions. Training code, model configurations, and evaluation scripts are released to support further research on low-resource historical scripts.
Significance. If the 4.9% CER generalizes beyond the undisclosed test set, the work would be a meaningful first step toward digitizing Nepal's historical manuscripts in a low-resource setting. The code release is a concrete strength that enables community extension even if the evaluation data remains private.
major comments (2)
- [Abstract] Abstract: the headline claim of 4.9% CER is evaluated exclusively on a confidential dataset whose cardinality (lines or pages), script-style coverage, temporal range, degradation distribution, and character overlap with training data are never reported. This information is load-bearing for assessing whether the result reflects generalization or a narrow slice.
- [Results] Results section: no baseline comparisons, train/test split statistics, or quantitative error analysis (beyond the single CER number) are supplied, making it impossible to situate the 'first end-to-end pipeline' claim against prior HTR work on related scripts.
minor comments (1)
- [Abstract] The abstract states that token-level confusion analysis was performed, yet no concrete confusion patterns or examples are previewed; adding one illustrative table or figure reference would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below. We are limited by the confidential nature of the evaluation set and cannot disclose protected details, but we will revise the manuscript to improve transparency and add requested analyses where feasible.
read point-by-point responses
-
Referee: [Abstract] Abstract: the headline claim of 4.9% CER is evaluated exclusively on a confidential dataset whose cardinality (lines or pages), script-style coverage, temporal range, degradation distribution, and character overlap with training data are never reported. This information is load-bearing for assessing whether the result reflects generalization or a narrow slice.
Authors: We agree that more context on the evaluation set would aid assessment of generalization. The evaluation dataset is confidential per institutional restrictions, so we cannot report its cardinality, coverage, temporal range, degradation distribution, or character overlap. In revision we will update the abstract to explicitly state these limitations and note that the 4.9% CER applies only to this confidential set. All training code, configurations, and evaluation scripts remain publicly released to support community use on other data. revision: partial
-
Referee: [Results] Results section: no baseline comparisons, train/test split statistics, or quantitative error analysis (beyond the single CER number) are supplied, making it impossible to situate the 'first end-to-end pipeline' claim against prior HTR work on related scripts.
Authors: We accept this point. The revised manuscript will add: (i) explicit train/test split statistics, (ii) baseline comparisons using standard encoder-decoder HTR models (e.g., CRNN and Transformer variants) trained on the same data, and (iii) expanded quantitative error analysis including per-character error rates and confusion matrices. These additions will better situate the work relative to prior HTR results on related Indic scripts. revision: yes
- Specific statistics on the confidential evaluation dataset (cardinality, script-style coverage, temporal range, degradation distribution, and character overlap with training data) cannot be disclosed due to confidentiality constraints.
Circularity Check
No circularity: standard empirical HTR pipeline with held-out evaluation
full rationale
The paper describes an empirical ML pipeline that trains encoder-decoder models on transcribed Old Nepali manuscript lines and reports CER on a held-out evaluation set. No equations, derivations, or self-referential definitions appear that would make any reported result equivalent to its inputs by construction. The central performance number (4.9% CER) is obtained from standard train/eval splitting rather than any fitted parameter renamed as a prediction or any uniqueness theorem imported via self-citation. The confidentiality of the test set raises reproducibility concerns but does not create circularity under the defined criteria, as the metric is externally falsifiable once the data are released.
Axiom & Free-Parameter Ledger
free parameters (1)
- encoder-decoder hyperparameters
axioms (1)
- domain assumption Line-level transcription suffices for accurate manuscript recognition
Reference graph
Works this paper leans on
-
[1]
Adaptive contrastive search: Uncertainty- guided decoding for open-ended text generation. In Findings of the Association for Computational Lin- guistics: EMNLP 2024, pages 15060–15080, Miami, Florida, USA. Association for Computational Lin- guistics. Alex Graves and Jürgen Schmidhuber. 2008. Offline handwriting recognition with multidimensional re- curren...
work page 2024
-
[2]
The Curious Case of Neural Text Degeneration
Synthetic data for text localization in natu- ral images. InEuropean Conference on Computer Vision, pages 231–246. Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration.arXiv preprint arXiv:1904.09751. Internet Archive. 2025. Internet Archive. https:// archive.org. Accessed: 2025-07-10. Benjamin ...
work page internal anchor Pith review Pith/arXiv arXiv 2019
-
[3]
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
Trocr: Transformer-based optical character recognition with pre-trained models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 13094–13102. Ze Liu, Yutong Hu, Yixuan Lin, Zhicheng Lin, Zihang Gao, Ze Han, Xiang Chen, and et al. 2021. Swin transformer: Hierarchical vision transformer using shifted windows.arXiv preprint a...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[4]
The implications of handwritten text recog- nition for accessing the past at scale.Journal of Documentation, 80(7):148–167. Nobuyuki Otsu. 1979. A threshold selection method from gray-level histograms.IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66. Mahes Raj Pant and Philip H. Pierce. 1989.Administra- tive Documents of the Shah Dynasty Con...
-
[5]
Neural Machine Translation of Rare Words with Subword Units
Neural machine translation of rare words with subword units.Preprint, arXiv:1508.07909. Baoguang Shi, Xiang Bai, and Cong Yao. 2017. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition.IEEE Trans. Pattern Anal. Mach. Intell., 39(11):2298–2304. Raymond Smith. 2007. An overview of the tes...
work page internal anchor Pith review Pith/arXiv arXiv 2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.