arxiv: 2604.20813 · v1 · submitted 2026-04-22 · 💻 cs.CV

Adapting TrOCR for Printed Tigrinya Text Recognition: Word-Aware Loss Weighting for Cross-Script Transfer Learning

Pith reviewed 2026-05-10 00:33 UTC · model grok-4.3

classification 💻 cs.CV
keywords Tigrinya · Ge'ez script · TrOCR · OCR adaptation · cross-script transfer · loss weighting · BPE tokenizer · synthetic data

The pith

Extending TrOCR with Word-Aware Loss Weighting yields a 0.22% character error rate on printed Tigrinya text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to adapt a pre-trained TrOCR model to recognize printed Tigrinya written in the Ge'ez script. It extends the byte-level BPE tokenizer to cover 230 Ge'ez characters and introduces Word-Aware Loss Weighting to correct word-boundary errors that arise when Latin-centric tokenization rules meet a new script. The original model produces no usable output, but the adapted version reaches 0.22% character error rate and 97.20% exact match accuracy on 5,000 synthetic test images. Ablation experiments show that the loss weighting supplies most of the gain, reducing character error rate by two orders of magnitude relative to vocabulary extension alone. The full process completes in under three hours on a single 8 GB GPU, with code, weights, and evaluation scripts released publicly.

Core claim

Starting from a pre-trained TrOCR model, we extend its byte-level BPE tokenizer to cover 230 Ge'ez characters and introduce Word-Aware Loss Weighting to resolve systematic word-boundary failures that arise when applying Latin-centric BPE conventions to a new script. The unmodified model produces no usable output on Ge'ez text. After adaptation, the TrOCR-Printed variant achieves 0.22% Character Error Rate and 97.20% exact match accuracy on a held-out test set of 5,000 synthetic images from the GLOCR dataset. An ablation study confirms that Word-Aware Loss Weighting is the critical component, reducing CER by two orders of magnitude compared to vocabulary extension alone.
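One way to see why a Latin-centric byte-level BPE vocabulary needs extending is to look at Ge'ez text at the byte level. The sketch below is illustrative only and is not taken from the paper's released code; the sample word and the helper name `utf8_byte_count` are assumptions. It shows that each Ethiopic code point in the sample occupies three UTF-8 bytes, so a byte-level tokenizer with no learned Ge'ez merges may need several byte tokens per character.

```python
def utf8_byte_count(text: str) -> list[tuple[str, int]]:
    """Return (character, number of UTF-8 bytes) for each character in text."""
    return [(ch, len(ch.encode("utf-8"))) for ch in text]


# Illustrative sample: the word "Tigrinya" written in Ge'ez script.
sample = "ትግርኛ"
counts = utf8_byte_count(sample)
total_bytes = sum(n for _, n in counts)

for ch, n in counts:
    print(f"U+{ord(ch):04X} {ch}: {n} bytes")
print(f"{len(sample)} characters -> {total_bytes} bytes")
```

Adding each Ge'ez character as a single vocabulary entry removes this per-character fragmentation, which is the role the tokenizer extension plays in the claim above.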

What carries the argument

Word-Aware Loss Weighting, a scheme that increases loss weight on word-boundary tokens to stop the model from learning incorrect spacing conventions when Latin BPE is applied to Ge'ez text.
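The exact formula is not given on this page, so the following is only a minimal sketch of the general idea: a per-token cross-entropy in which gold tokens marked as word-boundary tokens receive a larger weight. The function name, the `boundary_ids` set, the default `boundary_weight` of 2.0, and the choice to normalize by the sum of weights are all assumptions made for illustration, not the authors' implementation.

```python
import math


def weighted_cross_entropy(log_probs, targets, boundary_ids, boundary_weight=2.0):
    """Weighted mean negative log-likelihood over one token sequence.

    log_probs: list of dicts mapping token id -> log-probability, one per position
    targets: gold token ids, one per position
    boundary_ids: token ids treated as word-boundary tokens (hypothetical set)
    boundary_weight: hypothetical multiplier applied to boundary tokens
    """
    total, norm = 0.0, 0.0
    for lp, t in zip(log_probs, targets):
        w = boundary_weight if t in boundary_ids else 1.0
        total += -w * lp[t]
        norm += w
    # Normalizing by the weight sum is an assumed design choice.
    return total / norm


# Toy example: token 1 plays the role of a word-boundary token.
log_probs = [
    {0: math.log(0.5), 1: math.log(0.5)},
    {0: math.log(0.75), 1: math.log(0.25)},
]
targets = [0, 1]

plain = weighted_cross_entropy(log_probs, targets, boundary_ids={1}, boundary_weight=1.0)
weighted = weighted_cross_entropy(log_probs, targets, boundary_ids={1}, boundary_weight=2.0)
print(f"unweighted loss: {plain:.4f}, boundary-weighted loss: {weighted:.4f}")
```

In this toy case the poorly predicted boundary token contributes more to the weighted loss than to the unweighted one, so gradient updates would push harder on boundary mistakes.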

If this is right

  • The adapted model reaches 0.22% character error rate and 97.20% exact match accuracy on synthetic Tigrinya images.
  • Word-Aware Loss Weighting drives nearly all performance gains over vocabulary extension alone.
  • The complete adaptation finishes in under three hours on a single consumer GPU.
  • The unmodified TrOCR model produces no usable Ge'ez output before these changes.
  • Code, model weights, and evaluation scripts are released for public reuse.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The weighting technique may transfer to other scripts whose tokenization conventions differ from Latin BPE rules.
  • Heavy reliance on synthetic data leaves open whether real-world font and noise variations will preserve the reported accuracy.
  • The short training time and public release lower the cost of building OCR support for additional low-resource languages.
  • Combining the loss term with layout modeling could extend the method from isolated lines to full document pages.

Load-bearing premise

Results measured on synthetic GLOCR images will transfer to real printed Tigrinya documents that contain varied fonts, scanning artifacts, and layout noise.

What would settle it

Evaluating the adapted model on a collection of real scanned Tigrinya pages from printed sources and checking whether character error rate remains near 0.22 percent.
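Such a check comes down to computing character error rate and exact match on the new pages. The sketch below shows one common way to do that, using Levenshtein edit distance pooled over all reference characters. Whether the paper pools edits this way or averages per line is not stated on this page, so the pooling convention, the function names, and the two sample lines are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def corpus_cer(refs, hyps):
    """Total edit distance divided by total reference characters (assumed convention)."""
    edits = sum(levenshtein(r, h) for r, h in zip(refs, hyps))
    chars = sum(len(r) for r in refs)
    return edits / chars


def exact_match(refs, hyps):
    """Fraction of lines whose prediction equals the reference exactly."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)


# Illustrative lines: the first prediction drops the space between two words.
refs = ["ሰላም ዓለም", "ትግርኛ"]
hyps = ["ሰላምዓለም", "ትግርኛ"]

cer = corpus_cer(refs, hyps)
em = exact_match(refs, hyps)
print(f"CER = {cer:.4f}, exact match = {em:.2f}")
```

A single dropped space costs one edit here but fails the whole line on exact match, which is why the two metrics can diverge sharply on word-boundary errors.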

Figures

Figures reproduced from arXiv: 2604.20813 by Yonatan Haile Medhanie, Yuanhua Ni.

Figure 1: TrOCR architecture overview. The model combines a Vision Transformer encoder for image …
Figure 2: The failure mode observed with standard cross-entropy loss. The model consistently drops the …
Figure 3: Zero-shot failure versus fine-tuned performance, showing the effect of vocabulary extension and …
Figure 4: Training and validation loss curves during fine-tuning, showing smooth convergence of both losses …
Figure 5: Bootstrap 95% confidence intervals for the best checkpoint of the TrOCR-Printed variant ( …
Figure 6: Sample test set error predictions. (a) A labialized character recognition error. (b) Digits and …
Figure 7: Tigrinya Ge'ez fidel matrix showing 33 base consonants with 7 vowel orders (231 syllographs), 4 labialized consonant groups (20 forms), and 8 punctuation marks.
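The Figure 5 caption refers to bootstrap 95% confidence intervals. As a rough illustration of how such an interval can be obtained for a per-line metric, the sketch below applies a percentile bootstrap to a list of exact-match indicators. The resample count, the seed, the percentile method, and the toy data are assumptions; the paper's actual procedure is not described on this page.

```python
import random


def bootstrap_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of values."""
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample n items with replacement and record the resample mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Toy data: 1 means the predicted line matched the reference exactly.
indicators = [1] * 97 + [0] * 3
point = sum(indicators) / len(indicators)
lower, upper = bootstrap_ci(indicators)
print(f"exact match = {point:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

The same function works for per-line character error rates; only the input list changes.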
Original abstract

Transformer-based OCR models have shown strong performance on Latin and CJK scripts, but their application to African syllabic writing systems remains limited. We present the first adaptation of TrOCR for printed Tigrinya using the Ge'ez script. Starting from a pre-trained model, we extend the byte-level BPE tokenizer to cover 230 Ge'ez characters and introduce Word-Aware Loss Weighting to resolve systematic word-boundary failures that arise when applying Latin-centric BPE conventions to a new script. The unmodified model produces no usable output on Ge'ez text. After adaptation, the TrOCR-Printed variant achieves 0.22% Character Error Rate and 97.20% exact match accuracy on a held-out test set of 5,000 synthetic images from the GLOCR dataset. An ablation study confirms that Word-Aware Loss Weighting is the critical component, reducing CER by two orders of magnitude compared to vocabulary extension alone. The full pipeline trains in under three hours on a single 8 GB consumer GPU. All code, model weights, and evaluation scripts are publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript describes the adaptation of the TrOCR model for printed Tigrinya text recognition using the Ge'ez script. Starting from a pre-trained TrOCR, the authors extend the byte-level BPE tokenizer to cover 230 Ge'ez characters and propose Word-Aware Loss Weighting to address word-boundary errors arising from Latin-centric BPE on the new script. On a held-out test set of 5,000 synthetic images from the GLOCR dataset, the adapted model achieves 0.22% Character Error Rate and 97.20% exact match accuracy. An ablation study demonstrates that the proposed loss weighting is critical, reducing CER by two orders of magnitude relative to vocabulary extension alone. The training completes in under three hours on a single 8 GB GPU, and the code, model weights, and evaluation scripts are made publicly available.

Significance. Should the reported performance generalize beyond synthetic data, this would represent a useful contribution to OCR for low-resource scripts, particularly as the first TrOCR adaptation for Tigrinya. The efficient training regime on consumer hardware and full public release of code, weights, and scripts are notable strengths that support reproducibility and further work. The ablation isolating the effect of Word-Aware Loss Weighting provides clear evidence for the proposed component's impact within the synthetic setting.

major comments (2)
  1. [Abstract] Abstract and Experiments section: The performance metrics (0.22% CER and 97.20% exact match) and the ablation results are obtained exclusively on 5,000 synthetic GLOCR images whose generation process matches the training distribution. No evaluation on real scanned or photographed Tigrinya documents (with font variation, ink spread, scanner noise, or layout complexity) is provided, which is load-bearing for the central claim of applicability to printed Tigrinya text recognition as stated in the title and abstract.
  2. [Experiments] Ablation study (Experiments section): The two-order-of-magnitude CER reduction attributed to Word-Aware Loss Weighting is demonstrated only on synthetic data; without a corresponding real-document test set, it remains unclear whether the weighting resolves issues that would arise under actual printing and scanning conditions rather than synthetic word-boundary artifacts alone.
minor comments (2)
  1. [Abstract] Abstract: The claim that the unmodified model 'produces no usable output' would benefit from a quantitative baseline (e.g., CER or exact-match score) rather than a qualitative statement.
  2. [Conclusion] The manuscript would be strengthened by an explicit limitations paragraph discussing the domain gap between synthetic GLOCR images and real printed material.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the strengths in efficiency and reproducibility. We address the two major comments point-by-point below. Both comments correctly identify that all quantitative results are on synthetic data; we will revise the manuscript to clarify this scope, temper claims, and add explicit limitations discussion.

  1. Referee: [Abstract] Abstract and Experiments section: The performance metrics (0.22% CER and 97.20% exact match) and the ablation results are obtained exclusively on 5,000 synthetic GLOCR images whose generation process matches the training distribution. No evaluation on real scanned or photographed Tigrinya documents (with font variation, ink spread, scanner noise, or layout complexity) is provided, which is load-bearing for the central claim of applicability to printed Tigrinya text recognition as stated in the title and abstract.

    Authors: We agree that the absence of real-document evaluation limits the strength of the applicability claim in the title and abstract. The synthetic GLOCR data was chosen to provide a clean, reproducible testbed for isolating the tokenizer extension and Word-Aware Loss Weighting effects on Ge'ez script. In the revised version we will: (1) update the abstract to state results are obtained on synthetic images, (2) revise the title to 'Adapting TrOCR for Synthetic Printed Tigrinya Text Recognition' or add a qualifier, and (3) insert a new Limitations subsection that discusses font/ink/scan variations expected in real data and outlines future work on real corpora. These changes directly address the load-bearing concern without overstating current evidence.

    revision: yes

  2. Referee: [Experiments] Ablation study (Experiments section): The two-order-of-magnitude CER reduction attributed to Word-Aware Loss Weighting is demonstrated only on synthetic data; without a corresponding real-document test set, it remains unclear whether the weighting resolves issues that would arise under actual printing and scanning conditions rather than synthetic word-boundary artifacts alone.

    Authors: We concur that the ablation is confined to synthetic data and that real printing/scanning artifacts could interact differently with the loss weighting. The Word-Aware Loss Weighting targets systematic word-boundary failures caused by Latin-centric BPE on Ge'ez, an issue rooted in the tokenizer itself and therefore likely to appear in real text as well. Nevertheless, we will revise the Experiments section to: (a) explicitly note the synthetic nature of the ablation, (b) add a paragraph explaining why the controlled setting still provides evidence for the component's utility, and (c) include a forward-looking discussion of how ink spread or layout noise might modulate the observed gains. No new real-data experiments are feasible at this stage, but the revisions will prevent over-generalization.

    revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical adaptation with ablation on held-out synthetic data

The paper reports an empirical fine-tuning of TrOCR: tokenizer extension to Ge'ez characters plus a Word-Aware Loss Weighting scheme. Results (0.22% CER, 97.20% exact match) and the ablation (a two-orders-of-magnitude CER reduction) are measured on a held-out test split of 5,000 synthetic GLOCR images. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters or self-citations. The central claim is an experimental outcome, not a derived quantity equivalent to its inputs. No load-bearing self-citations or uniqueness theorems appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The work rests on standard transfer-learning assumptions for vision transformers and the claim that synthetic GLOCR images are representative; no new physical entities or ad-hoc constants are introduced beyond the new loss term itself.

axioms (2)
  • domain assumption: Pre-trained TrOCR weights provide a useful starting point for Ge'ez script after tokenizer extension.
    Invoked in the adaptation procedure described in the abstract.
  • domain assumption: Synthetic printed images from the GLOCR dataset match the distribution of real Tigrinya documents.
    Required for the reported test-set metrics to generalize.
invented entities (1)
  • Word-Aware Loss Weighting (no independent evidence)
    Purpose: to penalize word-boundary errors when applying Latin-centric BPE to Ge'ez script.
    New component introduced to resolve systematic failures; no independent falsifiable prediction outside the reported ablation is given.
