arxiv: 2512.17111 · v2 · submitted 2025-12-18 · 💻 cs.LG

Digitizing Nepal's Written Heritage: A Comprehensive HTR Pipeline for Old Nepali Manuscripts

Anjali Sarawgi , Esteban Garces Arias , Christof Zotter This is my paper

Pith reviewed 2026-05-16 21:04 UTC · model grok-4.3

classification 💻 cs.LG

keywords HTROld Nepalihandwritten text recognitionlow-resource languagehistorical manuscriptsencoder-decodercharacter error ratedigitization

0 comments

The pith

An end-to-end pipeline recognizes Old Nepali handwritten text with 4.9% error.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the first complete pipeline for transcribing Old Nepali manuscripts through handwritten text recognition. It processes text at the line level and tests encoder-decoder models along with data improvements to lower recognition mistakes. The effort addresses a gap in tools for a low-resource historical script whose documents remain largely inaccessible in digital form. By releasing training code and evaluation scripts while holding the test set private, the work gives others a starting point for similar scripts.

Core claim

We present the first end-to-end pipeline for Handwritten Text Recognition of Old Nepali. By exploring encoder-decoder architectures and data-centric techniques at the line level, our best model achieves a Character Error Rate of 4.9%. We also evaluate decoding strategies and analyze token-level confusions, and release the training code, model configurations, and evaluation scripts to support further research on HTR for low-resource historical scripts.

What carries the argument

Encoder-decoder model for line-level transcription of Old Nepali script, supported by data-centric training adjustments and token confusion analysis.

If this is right

Large collections of Old Nepali manuscripts can be transcribed automatically with under five percent character error.
Token-level confusion analysis identifies recurring mistakes in specific characters or combinations within the script.
Open training code and scripts allow direct replication and adaptation for other low-resource historical languages.
Evaluated decoding strategies provide concrete options for balancing speed and accuracy on this script.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same line-level encoder-decoder approach could transfer to neighboring Indic scripts with modest additional labeled lines.
Integration with existing digital archive platforms would let libraries begin bulk transcription of their Nepali holdings.
Script-specific rules for correcting the remaining errors could push effective accuracy higher without new model training.
Publication of even a small held-out public validation set would let independent groups confirm or refine the 4.9% figure.

Load-bearing premise

The confidential evaluation dataset represents the range of styles, conditions, and content found across Old Nepali manuscripts.

What would settle it

Testing the released model on an independent public collection of Old Nepali manuscript pages that yields a character error rate well above 4.9% would show the reported performance does not generalize.

Figures

Figures reproduced from arXiv: 2512.17111 by Anjali Sarawgi, Christof Zotter, Esteban Garces Arias.

**Figure 2.** Figure 2: Sample of line image after pre-processing. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Sample of processed data for the first stage. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: Sample of processed data for the second stage. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Printed Nagari script sample (top) and model’s [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: Old Nepali script sample (top) and model’s [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗

**Figure 7.** Figure 7: Histogram showing the distribution of the line lengths across all text samples in the dataset. [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗

**Figure 8.** Figure 8: Visualization of the three-stage training pipeline with example images from each dataset. [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: Three samples of synthetic Devanagari line images generated from a text corpus, used in the first stage of [PITH_FULL_IMAGE:figures/full_fig_p018_9.png] view at source ↗

**Figure 10.** Figure 10: Benchmarking examples showing input and outputs from two OCR baselines. The letters highlighted in [PITH_FULL_IMAGE:figures/full_fig_p020_10.png] view at source ↗

**Figure 11.** Figure 11: Distribution of Character Error Rate (CER) for all samples in the test set. [PITH_FULL_IMAGE:figures/full_fig_p021_11.png] view at source ↗

**Figure 12.** Figure 12: Top 30 most frequent character-level confusions between ground truth and predictions. [PITH_FULL_IMAGE:figures/full_fig_p021_12.png] view at source ↗

**Figure 13.** Figure 13: Evaluation of Character Error Rate against the total number of characters in the line. Each bin groups [PITH_FULL_IMAGE:figures/full_fig_p022_13.png] view at source ↗

**Figure 14.** Figure 14: Error share between the top-10 characters with most errors and the remaining 70 characters. The error [PITH_FULL_IMAGE:figures/full_fig_p023_14.png] view at source ↗

**Figure 15.** Figure 15: Breakdown of 941 token-level errors based on relative probability. Out of these, 236 tokens were flagged [PITH_FULL_IMAGE:figures/full_fig_p024_15.png] view at source ↗

**Figure 16.** Figure 16: Stepwise reduction in Character Error Rate (CER) through successive improvements to the HTR pipeline. [PITH_FULL_IMAGE:figures/full_fig_p024_16.png] view at source ↗

**Figure 17.** Figure 17: Manuscript image of NGMPP DNA 14/50 (© National Archives Nepal) excerpted and cropped from (Zotter, 2018). Lines from this paragraph were segmented and used in our OCR dataset, with 80% allocated to training, and 10% each to validation and testing. This figure illustrates a representative sample of our dataset [PITH_FULL_IMAGE:figures/full_fig_p025_17.png] view at source ↗

**Figure 18.** Figure 18: Another sample of a manuscript image (NGMPP DNA 13/59 [PITH_FULL_IMAGE:figures/full_fig_p025_18.png] view at source ↗

**Figure 19.** Figure 19: Line-level segmentation of Figure [PITH_FULL_IMAGE:figures/full_fig_p026_19.png] view at source ↗

**Figure 20.** Figure 20: This image displays the upper dashed line ambiguity, which is normalized to reduce stylistic noise. [PITH_FULL_IMAGE:figures/full_fig_p026_20.png] view at source ↗

read the original abstract

This paper presents the first end-to-end pipeline for Handwritten Text Recognition (HTR) for Old Nepali, a historically significant but low-resource language. We adopt a line-level transcription approach and systematically explore encoder-decoder architectures and data-centric techniques to improve recognition accuracy. Our best model achieves a Character Error Rate (CER) of 4.9\%. In addition, we implement and evaluate decoding strategies and analyze token-level confusions to better understand model behavior and error patterns. Although the evaluation dataset is confidential, we release our training code, model configurations, and evaluation scripts to support further research on HTR for low-resource historical scripts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper builds the first reported end-to-end HTR pipeline for Old Nepali and releases training code, but the 4.9% CER rests on a confidential test set with no size, diversity, or baseline details.

read the letter

The headline takeaway is straightforward: this is the first end-to-end HTR pipeline aimed at Old Nepali manuscripts, and the authors release the training code, configs, and evaluation scripts. That release is the part that actually helps other people working on similar low-resource historical scripts. They walk through encoder-decoder variants, data-centric adjustments, decoding strategies, and token-level error patterns, which gives a practical sense of what works for this script and where the model trips up on confusions. For digital humanities work focused on Nepali heritage preservation, the pipeline and the code drop fill a clear gap that most general HTR papers skip over. The 4.9% CER is presented as the best result, and the approach looks like a standard but careful empirical setup rather than anything circular. The main weakness is the evaluation data. The test set is confidential, and the paper gives no numbers on how many lines or pages it contains, what range of manuscript styles or degradation levels it covers, or how it compares to any baselines. Without those details it is difficult to judge whether the reported error rate reflects real generalization or just performance on a convenient slice. Releasing the evaluation scripts helps a bit, but you still cannot reproduce or challenge the central number. This work is aimed at people who need tools for digitizing specific low-resource scripts rather than readers chasing broad methodological advances. A serious referee would be useful here because the application is concrete and the code release is real, but the review would need to press for more transparent dataset description and baseline numbers before the performance claim can be taken at face value. I would send it to review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. This paper presents the first end-to-end HTR pipeline for Old Nepali manuscripts using a line-level transcription approach. It systematically explores encoder-decoder architectures and data-centric techniques, reports a best-model CER of 4.9% on a confidential evaluation set, implements decoding strategies, and analyzes token-level confusions. Training code, model configurations, and evaluation scripts are released to support further research on low-resource historical scripts.

Significance. If the 4.9% CER generalizes beyond the undisclosed test set, the work would be a meaningful first step toward digitizing Nepal's historical manuscripts in a low-resource setting. The code release is a concrete strength that enables community extension even if the evaluation data remains private.

major comments (2)

[Abstract] Abstract: the headline claim of 4.9% CER is evaluated exclusively on a confidential dataset whose cardinality (lines or pages), script-style coverage, temporal range, degradation distribution, and character overlap with training data are never reported. This information is load-bearing for assessing whether the result reflects generalization or a narrow slice.
[Results] Results section: no baseline comparisons, train/test split statistics, or quantitative error analysis (beyond the single CER number) are supplied, making it impossible to situate the 'first end-to-end pipeline' claim against prior HTR work on related scripts.

minor comments (1)

[Abstract] The abstract states that token-level confusion analysis was performed, yet no concrete confusion patterns or examples are previewed; adding one illustrative table or figure reference would improve clarity.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below. We are limited by the confidential nature of the evaluation set and cannot disclose protected details, but we will revise the manuscript to improve transparency and add requested analyses where feasible.

read point-by-point responses

Referee: [Abstract] Abstract: the headline claim of 4.9% CER is evaluated exclusively on a confidential dataset whose cardinality (lines or pages), script-style coverage, temporal range, degradation distribution, and character overlap with training data are never reported. This information is load-bearing for assessing whether the result reflects generalization or a narrow slice.

Authors: We agree that more context on the evaluation set would aid assessment of generalization. The evaluation dataset is confidential per institutional restrictions, so we cannot report its cardinality, coverage, temporal range, degradation distribution, or character overlap. In revision we will update the abstract to explicitly state these limitations and note that the 4.9% CER applies only to this confidential set. All training code, configurations, and evaluation scripts remain publicly released to support community use on other data. revision: partial
Referee: [Results] Results section: no baseline comparisons, train/test split statistics, or quantitative error analysis (beyond the single CER number) are supplied, making it impossible to situate the 'first end-to-end pipeline' claim against prior HTR work on related scripts.

Authors: We accept this point. The revised manuscript will add: (i) explicit train/test split statistics, (ii) baseline comparisons using standard encoder-decoder HTR models (e.g., CRNN and Transformer variants) trained on the same data, and (iii) expanded quantitative error analysis including per-character error rates and confusion matrices. These additions will better situate the work relative to prior HTR results on related Indic scripts. revision: yes

standing simulated objections not resolved

Specific statistics on the confidential evaluation dataset (cardinality, script-style coverage, temporal range, degradation distribution, and character overlap with training data) cannot be disclosed due to confidentiality constraints.

Circularity Check

0 steps flagged

No circularity: standard empirical HTR pipeline with held-out evaluation

full rationale

The paper describes an empirical ML pipeline that trains encoder-decoder models on transcribed Old Nepali manuscript lines and reports CER on a held-out evaluation set. No equations, derivations, or self-referential definitions appear that would make any reported result equivalent to its inputs by construction. The central performance number (4.9% CER) is obtained from standard train/eval splitting rather than any fitted parameter renamed as a prediction or any uniqueness theorem imported via self-citation. The confidentiality of the test set raises reproducibility concerns but does not create circularity under the defined criteria, as the metric is externally falsifiable once the data are released.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Limited to abstract; standard ML assumptions about data representativeness and model generalization apply without detailed free parameters or invented entities visible.

free parameters (1)

encoder-decoder hyperparameters
Architectural choices and training parameters tuned on the task-specific data.

axioms (1)

domain assumption Line-level transcription suffices for accurate manuscript recognition
Core methodological choice stated in the abstract.

pith-pipeline@v0.9.0 · 5409 in / 1259 out tokens · 23395 ms · 2026-05-16T21:04:27.205733+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

5 extracted references · 5 canonical work pages · 3 internal anchors

[1]

In Findings of the Association for Computational Lin- guistics: EMNLP 2024, pages 15060–15080, Miami, Florida, USA

Adaptive contrastive search: Uncertainty- guided decoding for open-ended text generation. In Findings of the Association for Computational Lin- guistics: EMNLP 2024, pages 15060–15080, Miami, Florida, USA. Association for Computational Lin- guistics. Alex Graves and Jürgen Schmidhuber. 2008. Offline handwriting recognition with multidimensional re- curren...

work page 2024
[2]

The Curious Case of Neural Text Degeneration

Synthetic data for text localization in natu- ral images. InEuropean Conference on Computer Vision, pages 231–246. Ari Holtzman, Jan Buys, Li Du, Maxwell Forbes, and Yejin Choi. 2019. The curious case of neural text degeneration.arXiv preprint arXiv:1904.09751. Internet Archive. 2025. Internet Archive. https:// archive.org. Accessed: 2025-07-10. Benjamin ...

work page internal anchor Pith review Pith/arXiv arXiv 2019
[3]

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Trocr: Transformer-based optical character recognition with pre-trained models. InProceedings of the AAAI Conference on Artificial Intelligence, volume 37, pages 13094–13102. Ze Liu, Yutong Hu, Yixuan Lin, Zhicheng Lin, Zihang Gao, Ze Han, Xiang Chen, and et al. 2021. Swin transformer: Hierarchical vision transformer using shifted windows.arXiv preprint a...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

Nobuyuki Otsu

The implications of handwritten text recog- nition for accessing the past at scale.Journal of Documentation, 80(7):148–167. Nobuyuki Otsu. 1979. A threshold selection method from gray-level histograms.IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66. Mahes Raj Pant and Philip H. Pierce. 1989.Administra- tive Documents of the Shah Dynasty Con...

work page arXiv 1979
[5]

Neural Machine Translation of Rare Words with Subword Units

Neural machine translation of rare words with subword units.Preprint, arXiv:1508.07909. Baoguang Shi, Xiang Bai, and Cong Yao. 2017. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition.IEEE Trans. Pattern Anal. Mach. Intell., 39(11):2298–2304. Raymond Smith. 2007. An overview of the tes...

work page internal anchor Pith review Pith/arXiv arXiv 2017