Adapting TrOCR for Printed Tigrinya Text Recognition: Word-Aware Loss Weighting for Cross-Script Transfer Learning
Pith reviewed 2026-05-10 00:33 UTC · model grok-4.3
The pith
Extending TrOCR with Word-Aware Loss Weighting yields a 0.22% character error rate on synthetic images of printed Tigrinya text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Starting from a pre-trained TrOCR model, we extend its byte-level BPE tokenizer to cover 230 Ge'ez characters and introduce Word-Aware Loss Weighting to resolve systematic word-boundary failures that arise when applying Latin-centric BPE conventions to a new script. The unmodified model produces no usable output on Ge'ez text. After adaptation, the TrOCR-Printed variant achieves 0.22% Character Error Rate and 97.20% exact match accuracy on a held-out test set of 5,000 synthetic images from the GLOCR dataset. An ablation study confirms that Word-Aware Loss Weighting is the critical component, reducing CER by two orders of magnitude compared to vocabulary extension alone.
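To see why vocabulary extension is needed at all, note that every character in the Unicode Ethiopic block (U+1200 to U+137F) takes three bytes in UTF-8, so a byte-level tokenizer with no Ge'ez entries has to spell each character as several byte tokens. The sketch below is a toy illustration in plain Python, not the authors' code: the `<0x..>` byte-token format, the `extend_vocab` and `tokenize` helpers, the 256-entry base vocabulary, and the sample word are all hypothetical, and the paper's actual 230-character inventory is not reproduced here.

```python
def byte_tokens(text):
    """Fallback: spell the text as one token per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in text.encode("utf-8")]


def extend_vocab(vocab, new_chars):
    """Return a copy of vocab with each unseen character given the next free id."""
    extended = dict(vocab)
    for ch in new_chars:
        if ch not in extended:
            extended[ch] = len(extended)
    return extended


def tokenize(text, vocab):
    """Use a whole-character token when the vocab has one, else fall back to bytes."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            tokens.extend(byte_tokens(ch))
    return tokens


# Toy base vocabulary: only the 256 byte tokens (an assumption for illustration).
base_vocab = {f"<0x{b:02X}>": b for b in range(256)}
word = "ሰላም"  # three Ge'ez characters, used here only as sample input

before = tokenize(word, base_vocab)
after = tokenize(word, extend_vocab(base_vocab, word))
print(len(before), len(after))
```

In this toy setting the three-character word shrinks from nine byte tokens to three character tokens once the vocabulary is extended. In a real model the new token ids would also need embedding rows, which start untrained and must be learned during fine-tuning.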
What carries the argument
Word-Aware Loss Weighting, a scheme that increases loss weight on word-boundary tokens to stop the model from learning incorrect spacing conventions when Latin BPE is applied to Ge'ez text.
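The summary above only says that word-boundary tokens receive a larger loss weight; the exact formula is not given here. A minimal sketch of one way such a scheme could look, assuming a per-token weighted cross-entropy normalised by the weight sum, is shown below. The function name `word_aware_loss`, the `boundary_ids` set, and the weight values are illustrative assumptions, not the authors' implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def word_aware_loss(logits_seq, targets, boundary_ids, boundary_weight=2.0):
    """Weighted mean of per-token negative log-likelihoods.

    Tokens whose target id is in boundary_ids get weight boundary_weight,
    all others get 1.0. Both the weight value and the normalisation by the
    weight sum are assumptions made for this sketch.
    """
    total, weight_sum = 0.0, 0.0
    for logits, target in zip(logits_seq, targets):
        w = boundary_weight if target in boundary_ids else 1.0
        total += -w * log_softmax(logits)[target]
        weight_sum += w
    return total / weight_sum


# Toy 3-token vocabulary; id 2 stands in for a word-boundary token (hypothetical).
logits_seq = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [1.0, 1.0, 1.0]]
targets = [0, 1, 2]

plain = word_aware_loss(logits_seq, targets, boundary_ids={2}, boundary_weight=1.0)
weighted = word_aware_loss(logits_seq, targets, boundary_ids={2}, boundary_weight=5.0)
print(plain < weighted)
```

Here the boundary token is the one the toy model predicts worst, so raising its weight raises the loss. That is the intended effect: boundary mistakes cost more and attract more gradient than they would under uniform weighting.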
If this is right
- The adapted model reaches 0.22% character error rate and 97.20% exact match accuracy on synthetic Tigrinya images.
- Word-Aware Loss Weighting cuts CER by two orders of magnitude relative to vocabulary extension alone, so it accounts for nearly all of the gain.
- The complete adaptation finishes in under three hours on a single 8 GB consumer GPU.
- The unmodified TrOCR model produces no usable Ge'ez output before these changes.
- Code, model weights, and evaluation scripts are released for public reuse.
Where Pith is reading between the lines
- The weighting technique may transfer to other scripts whose tokenization conventions differ from Latin BPE rules.
- Heavy reliance on synthetic data leaves open whether real-world font and noise variations will preserve the reported accuracy.
- The short training time and public release lower the cost of building OCR support for additional low-resource languages.
- Combining the loss term with layout modeling could extend the method from isolated lines to full document pages.
Load-bearing premise
Results measured on synthetic GLOCR images will transfer to real printed Tigrinya documents that contain varied fonts, scanning artifacts, and layout noise.
What would settle it
Evaluating the adapted model on a collection of real scanned Tigrinya pages from printed sources and checking whether character error rate remains near 0.22 percent.
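Such a check needs only paired predictions and ground-truth transcriptions. The sketch below computes character error rate from Levenshtein edit distance, together with the exact-match rate. It assumes a corpus-level CER (total edits divided by total reference characters); the paper's exact averaging convention is not stated here, and the sample strings are made up for illustration.

```python
def levenshtein(a, b):
    """Edit distance between sequences a and b (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def evaluate(predictions, references):
    """Return (corpus-level CER, exact-match rate) for paired strings."""
    edits = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    ref_chars = sum(len(r) for r in references)
    exact = sum(p == r for p, r in zip(predictions, references))
    return edits / ref_chars, exact / len(references)


# Illustrative data only: the first prediction drops the space between words.
refs = ["ሰላም ዓለም", "ሰላም"]
preds = ["ሰላምዓለም", "ሰላም"]
cer, em = evaluate(preds, refs)
print(f"CER={cer:.2%}  exact match={em:.2%}")
```

One dropped space across ten reference characters gives a CER of 10% and an exact-match rate of 50%. This also shows why word-boundary errors hurt exact match far more than they hurt CER.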
Original abstract
Transformer-based OCR models have shown strong performance on Latin and CJK scripts, but their application to African syllabic writing systems remains limited. We present the first adaptation of TrOCR for printed Tigrinya using the Ge'ez script. Starting from a pre-trained model, we extend the byte-level BPE tokenizer to cover 230 Ge'ez characters and introduce Word-Aware Loss Weighting to resolve systematic word-boundary failures that arise when applying Latin-centric BPE conventions to a new script. The unmodified model produces no usable output on Ge'ez text. After adaptation, the TrOCR-Printed variant achieves 0.22% Character Error Rate and 97.20% exact match accuracy on a held-out test set of 5,000 synthetic images from the GLOCR dataset. An ablation study confirms that Word-Aware Loss Weighting is the critical component, reducing CER by two orders of magnitude compared to vocabulary extension alone. The full pipeline trains in under three hours on a single 8 GB consumer GPU. All code, model weights, and evaluation scripts are publicly released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes the adaptation of the TrOCR model for printed Tigrinya text recognition using the Ge'ez script. Starting from a pre-trained TrOCR model, the authors extend the byte-level BPE tokenizer to cover 230 Ge'ez characters and propose Word-Aware Loss Weighting to address word-boundary errors arising from Latin-centric BPE on the new script. On a held-out test set of 5,000 synthetic images from the GLOCR dataset, the adapted model achieves 0.22% Character Error Rate and 97.20% exact match accuracy. An ablation study demonstrates that the proposed loss weighting is critical, reducing CER by two orders of magnitude relative to vocabulary extension alone. Training completes in under three hours on a single 8 GB GPU, and the code, model weights, and evaluation scripts are made publicly available.
Significance. Should the reported performance generalize beyond synthetic data, this would represent a useful contribution to OCR for low-resource scripts, particularly as the first TrOCR adaptation for Tigrinya. The efficient training regime on consumer hardware and full public release of code, weights, and scripts are notable strengths that support reproducibility and further work. The ablation isolating the effect of Word-Aware Loss Weighting provides clear evidence for the proposed component's impact within the synthetic setting.
major comments (2)
- [Abstract] Abstract and Experiments section: The performance metrics (0.22% CER and 97.20% exact match) and the ablation results are obtained exclusively on 5,000 synthetic GLOCR images whose generation process matches the training distribution. No evaluation on real scanned or photographed Tigrinya documents (with font variation, ink spread, scanner noise, or layout complexity) is provided, which is load-bearing for the central claim of applicability to printed Tigrinya text recognition as stated in the title and abstract.
- [Experiments] Ablation study (Experiments section): The two-order-of-magnitude CER reduction attributed to Word-Aware Loss Weighting is demonstrated only on synthetic data; without a corresponding real-document test set, it remains unclear whether the weighting resolves issues that would arise under actual printing and scanning conditions rather than synthetic word-boundary artifacts alone.
minor comments (2)
- [Abstract] Abstract: The claim that the unmodified model 'produces no usable output' would benefit from a quantitative baseline (e.g., CER or exact-match score) rather than a qualitative statement.
- [Conclusion] The manuscript would be strengthened by an explicit limitations paragraph discussing the domain gap between synthetic GLOCR images and real printed material.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for highlighting the strengths in efficiency and reproducibility. We address the two major comments point-by-point below. Both comments correctly identify that all quantitative results are on synthetic data; we will revise the manuscript to clarify this scope, temper claims, and add explicit limitations discussion.
- Referee: [Abstract] Abstract and Experiments section: The performance metrics (0.22% CER and 97.20% exact match) and the ablation results are obtained exclusively on 5,000 synthetic GLOCR images whose generation process matches the training distribution. No evaluation on real scanned or photographed Tigrinya documents (with font variation, ink spread, scanner noise, or layout complexity) is provided, which is load-bearing for the central claim of applicability to printed Tigrinya text recognition as stated in the title and abstract.
Authors: We agree that the absence of real-document evaluation limits the strength of the applicability claim in the title and abstract. The synthetic GLOCR data was chosen to provide a clean, reproducible testbed for isolating the tokenizer extension and Word-Aware Loss Weighting effects on Ge'ez script. In the revised version we will: (1) update the abstract to state results are obtained on synthetic images, (2) revise the title to 'Adapting TrOCR for Synthetic Printed Tigrinya Text Recognition' or add a qualifier, and (3) insert a new Limitations subsection that discusses font/ink/scan variations expected in real data and outlines future work on real corpora. These changes directly address the load-bearing concern without overstating current evidence.
Revision: yes
- Referee: [Experiments] Ablation study (Experiments section): The two-order-of-magnitude CER reduction attributed to Word-Aware Loss Weighting is demonstrated only on synthetic data; without a corresponding real-document test set, it remains unclear whether the weighting resolves issues that would arise under actual printing and scanning conditions rather than synthetic word-boundary artifacts alone.
Authors: We concur that the ablation is confined to synthetic data and that real printing/scanning artifacts could interact differently with the loss weighting. The Word-Aware Loss Weighting targets systematic word-boundary failures caused by Latin-centric BPE on Ge'ez, an issue rooted in the tokenizer itself and therefore likely to appear in real text as well. Nevertheless, we will revise the Experiments section to: (a) explicitly note the synthetic nature of the ablation, (b) add a paragraph explaining why the controlled setting still provides evidence for the component's utility, and (c) include a forward-looking discussion of how ink spread or layout noise might modulate the observed gains. No new real-data experiments are feasible at this stage, but the revisions will prevent over-generalization.
Revision: yes
Circularity Check
No circularity: purely empirical adaptation with ablation on held-out synthetic data
The paper reports an empirical fine-tuning of TrOCR: tokenizer extension to Ge'ez characters plus a Word-Aware Loss Weighting scheme. Results (0.22% CER, 97.20% exact match) and the ablation (a two-order-of-magnitude CER reduction) are measured on a held-out test split of 5,000 synthetic GLOCR images. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters or self-citations. The central claim is an experimental outcome, not a derived quantity equivalent to its inputs. No load-bearing self-citations or uniqueness theorems appear in the provided text.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Pre-trained TrOCR weights provide a useful starting point for Ge'ez script after tokenizer extension
- domain assumption: Synthetic printed images from GLOCR dataset match the distribution of real Tigrinya documents
invented entities (1)
- Word-Aware Loss Weighting: no independent evidence