Adapting TrOCR for Printed Tigrinya Text Recognition: Word-Aware Loss Weighting for Cross-Script Transfer Learning

Yonatan Haile Medhanie; Yuanhua Ni

arxiv: 2604.20813 · v1 · submitted 2026-04-22 · 💻 cs.CV

Pith reviewed 2026-05-10 00:33 UTC · model grok-4.3

classification 💻 cs.CV

keywords Tigrinya · Ge'ez script · TrOCR · OCR adaptation · cross-script transfer · loss weighting · BPE tokenizer · synthetic data

The pith

Extending TrOCR's tokenizer and adding Word-Aware Loss Weighting yields a 0.22% character error rate on synthetic printed Tigrinya text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to adapt a pre-trained TrOCR model to recognize printed Tigrinya written in the Ge'ez script. It extends the byte-level BPE tokenizer to cover 230 Ge'ez characters and introduces Word-Aware Loss Weighting to correct word-boundary errors that arise when Latin-centric tokenization rules meet a new script. The unmodified model produces no usable output, but the adapted version reaches 0.22% character error rate and 97.20% exact match accuracy on 5,000 held-out synthetic test images. Ablation experiments indicate that the loss weighting supplies most of the gain, reducing character error rate by two orders of magnitude relative to vocabulary extension alone. The full pipeline trains in under three hours on a single 8 GB consumer GPU, and code, weights, and evaluation scripts are publicly released.

Core claim

Starting from a pre-trained TrOCR model, we extend its byte-level BPE tokenizer to cover 230 Ge'ez characters and introduce Word-Aware Loss Weighting to resolve systematic word-boundary failures that arise when applying Latin-centric BPE conventions to a new script. The unmodified model produces no usable output on Ge'ez text. After adaptation, the TrOCR-Printed variant achieves 0.22% Character Error Rate and 97.20% exact match accuracy on a held-out test set of 5,000 synthetic images from the GLOCR dataset. An ablation study confirms that Word-Aware Loss Weighting is the critical component, reducing CER by two orders of magnitude compared to vocabulary extension alone.
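To make the tokenizer problem concrete: every code point in the Unicode Ethiopic block (U+1200 to U+137F) occupies three bytes in UTF-8, so a byte-level vocabulary without dedicated Ge'ez entries has to spell each syllograph out of several byte pieces. The following is a minimal standard-library sketch of that effect, not the paper's released code; the sample word, the toy `latin_vocab`, and the helper names are illustrative assumptions.

```python
# Illustrative sketch only: shows why Ge'ez text is costly for a byte-level
# tokenizer that lacks dedicated Ge'ez entries. Not the authors' implementation.

ETHIOPIC_START, ETHIOPIC_END = 0x1200, 0x137F  # Unicode "Ethiopic" block


def is_ethiopic(ch: str) -> bool:
    """True if the character lies in the main Unicode Ethiopic block."""
    return ETHIOPIC_START <= ord(ch) <= ETHIOPIC_END


def byte_fallback_length(text: str) -> int:
    """Number of byte-level symbols before any BPE merges are applied."""
    return len(text.encode("utf-8"))


def missing_characters(text: str, vocab: set[str]) -> list[str]:
    """Ethiopic characters in `text` that have no single-token vocab entry."""
    return sorted({ch for ch in text if is_ethiopic(ch) and ch not in vocab})


sample = "ትግርኛ"  # the word "Tigrinya" in Ge'ez script (4 characters)
latin_vocab = {"a", "b", "c"}  # hypothetical stand-in for a Latin-centric vocab

print(len(sample), byte_fallback_length(sample))  # 4 characters, 12 bytes
print(missing_characters(sample, latin_vocab))    # candidates for new tokens
```

Characters reported as missing are the ones a vocabulary extension step would add as single tokens, which is the role the 230 added Ge'ez characters play in the paper's description.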

What carries the argument

Word-Aware Loss Weighting, a scheme that increases loss weight on word-boundary tokens to stop the model from learning incorrect spacing conventions when Latin BPE is applied to Ge'ez text.
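This page does not give the paper's exact formula, so the following is only a sketch of what a boundary-weighted cross-entropy could look like. It assumes the GPT-2/RoBERTa byte-level BPE convention in which a token that begins a new word carries a leading "Ġ" space marker; the `boundary_weight` value of 2.0, the sample tokens, and the probabilities are hypothetical.

```python
import math

# "Ġ" (U+0120) is the space marker used by GPT-2 style byte-level BPE.
BOUNDARY_MARK = "\u0120"


def token_weights(tokens, boundary_weight=2.0):
    """Weight 1.0 for ordinary tokens, `boundary_weight` for word-initial ones.

    boundary_weight is a hypothetical hyperparameter, not a value from the paper.
    """
    return [boundary_weight if t.startswith(BOUNDARY_MARK) else 1.0 for t in tokens]


def weighted_cross_entropy(target_probs, weights):
    """Weighted mean of per-token negative log-likelihoods.

    target_probs[i] is the probability the model assigned to the correct token i.
    """
    total = sum(w * -math.log(p) for p, w in zip(target_probs, weights))
    return total / sum(weights)


# Two illustrative words; the second token starts after a space.
tokens = ["ሰላም", BOUNDARY_MARK + "ዓለም"]
probs = [0.9, 0.5]  # toy case: the model is unsure about the boundary token

plain = weighted_cross_entropy(probs, [1.0, 1.0])
aware = weighted_cross_entropy(probs, token_weights(tokens))
print(f"plain={plain:.4f} word-aware={aware:.4f}")
```

In this toy case the weighted loss is larger than the unweighted one because the uncertain prediction sits on a boundary token, so gradient steps are pushed harder toward getting word boundaries right.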

If this is right

The adapted model reaches 0.22% character error rate and 97.20% exact match accuracy on synthetic Tigrinya images.
Word-Aware Loss Weighting drives nearly all performance gains over vocabulary extension alone.
The complete adaptation finishes in under three hours on a single consumer GPU.
The unmodified TrOCR model produces no usable Ge'ez output before these changes.
Code, model weights, and evaluation scripts are released for public reuse.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The weighting technique may transfer to other scripts whose tokenization conventions differ from Latin BPE rules.
Heavy reliance on synthetic data leaves open whether real-world font and noise variations will preserve the reported accuracy.
The short training time and public release lower the cost of building OCR support for additional low-resource languages.
Combining the loss term with layout modeling could extend the method from isolated lines to full document pages.

Load-bearing premise

Results measured on synthetic GLOCR images will transfer to real printed Tigrinya documents that contain varied fonts, scanning artifacts, and layout noise.

What would settle it

Evaluating the adapted model on a collection of real scanned Tigrinya pages from printed sources and checking whether character error rate remains near 0.22 percent.
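A check of this kind needs only the two reported metrics. The sketch below computes a corpus-level character error rate (total Levenshtein edits divided by total reference characters) and exact match accuracy in plain Python. Whether the paper averages CER per corpus or per line is not stated here, so the corpus-level choice is an assumption, and the two sample lines are invented for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions from a to b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def corpus_cer(references, predictions) -> float:
    """Total edit distance divided by total reference length."""
    edits = sum(levenshtein(r, p) for r, p in zip(references, predictions))
    chars = sum(len(r) for r in references)
    return edits / chars


def exact_match(references, predictions) -> float:
    """Fraction of lines predicted with no error at all."""
    hits = sum(r == p for r, p in zip(references, predictions))
    return hits / len(references)


# Invented example: the first prediction drops the space between two words.
refs = ["ሰላም ዓለም", "ትግርኛ"]
preds = ["ሰላምዓለም", "ትግርኛ"]

print(f"CER={corpus_cer(refs, preds):.4f}  exact match={exact_match(refs, preds):.2f}")
```

A single dropped space counts as one character edit but turns the whole line into an exact-match miss, which is why word-boundary errors hurt exact match much more than CER.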

Figures

Figures reproduced from arXiv: 2604.20813 by Yonatan Haile Medhanie, Yuanhua Ni.

**Figure 2.** The failure mode observed with standard cross-entropy loss. The model consistently drops the …

**Figure 3.** Zero-shot failure versus fine-tuned performance, showing the effect of vocabulary extension and …

**Figure 4.** Training and validation loss curves during fine-tuning, showing smooth convergence of both losses …

**Figure 5.** Bootstrap 95% confidence intervals for the best checkpoint of the TrOCR-Printed variant (…

**Figure 6.** Sample test set error predictions. (a) A labialized character recognition error. (b) Digits and …

**Figure 7.** Tigrinya Ge'ez fidel matrix showing 33 base consonants with 7 vowel orders (231 syllographs), 4 labialized consonant groups (20 forms), and 8 punctuation marks.
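Figure 5 refers to bootstrap 95% confidence intervals. The paper's exact resampling procedure is not visible on this page, so the following is a generic percentile-bootstrap sketch over per-image outcomes; the resample count, the seed, and the toy `outcomes` list are assumptions made for illustration, not data from the paper.

```python
import random


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    # Resample with replacement, record each resample's mean, then sort.
    means = sorted(
        sum(rng.choices(values, k=n)) / n for _ in range(n_resamples)
    )
    lo_idx = round((alpha / 2) * n_resamples)
    hi_idx = round((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical per-image outcomes: 1 = exact match, 0 = any error.
outcomes = [1] * 97 + [0] * 3

low, high = bootstrap_ci(outcomes)
mean = sum(outcomes) / len(outcomes)
print(f"mean={mean:.2f}  95% CI=({low:.2f}, {high:.2f})")
```

The same function applies to per-line CER values instead of 0/1 outcomes; only the input list changes.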

Original abstract

Transformer-based OCR models have shown strong performance on Latin and CJK scripts, but their application to African syllabic writing systems remains limited. We present the first adaptation of TrOCR for printed Tigrinya using the Ge'ez script. Starting from a pre-trained model, we extend the byte-level BPE tokenizer to cover 230 Ge'ez characters and introduce Word-Aware Loss Weighting to resolve systematic word-boundary failures that arise when applying Latin-centric BPE conventions to a new script. The unmodified model produces no usable output on Ge'ez text. After adaptation, the TrOCR-Printed variant achieves 0.22% Character Error Rate and 97.20% exact match accuracy on a held-out test set of 5,000 synthetic images from the GLOCR dataset. An ablation study confirms that Word-Aware Loss Weighting is the critical component, reducing CER by two orders of magnitude compared to vocabulary extension alone. The full pipeline trains in under three hours on a single 8 GB consumer GPU. All code, model weights, and evaluation scripts are publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

Solid first TrOCR adaptation for Tigrinya on synthetic data with a useful loss fix, but the real-world transfer claim rests on untested assumptions.

The paper delivers the first reported TrOCR model for printed Tigrinya in Ge'ez script, plus a Word-Aware Loss Weighting term that fixes the word-boundary problems BPE creates on this script. On their 5,000 held-out synthetic GLOCR images the adapted model reaches 0.22% CER and 97.20% exact match, and the ablation shows the new weighting drives most of the gain over plain vocabulary extension. They also release code, weights, and scripts, and the whole thing trains in under three hours on a single 8 GB GPU. That combination of concrete numbers, ablation, and artifacts is the useful part; anyone working on OCR for other low-resource syllabic scripts can pick up the pipeline and try it directly.

The soft spot is straightforward: every metric comes from synthetic images generated the same way as the training data. No scanned or photographed real documents appear in the evaluation, so we have no evidence the 0.22% CER survives the font variation, ink spread, scanner noise, or layout issues that actual printed Tigrinya pages contain. The title and abstract frame the work as a solution for printed text recognition, yet the experiments only demonstrate that the loss term repairs synthetic artifacts. That gap is real and limits how far the central claim travels.

For a reading group this is a maybe: narrow but cleanly executed applied work with public artifacts. I would not cite it in my own papers unless I needed the exact Tigrinya baseline. It still deserves peer review because the adaptation is reproducible, the ablation is honest, and the limitation is easy to flag for revision; an editor can ask for a small real-document test set without rejecting the core contribution.

Referee Report

2 major / 2 minor

Summary. The manuscript describes the adaptation of the TrOCR model for printed Tigrinya text recognition using the Ge'ez script. Starting from a pre-trained TrOCR, the authors extend the byte-level BPE tokenizer to cover 230 Ge'ez characters and propose Word-Aware Loss Weighting to address word-boundary errors arising from Latin-centric BPE on the new script. On a held-out test set of 5,000 synthetic images from the GLOCR dataset, the adapted model achieves 0.22% Character Error Rate and 97.20% exact match accuracy. An ablation study demonstrates that the proposed loss weighting is critical, reducing CER by two orders of magnitude relative to vocabulary extension alone. The training completes in under three hours on a single 8 GB GPU, and the code, model weights, and evaluation scripts are made publicly available.

Significance. Should the reported performance generalize beyond synthetic data, this would represent a useful contribution to OCR for low-resource scripts, particularly as the first TrOCR adaptation for Tigrinya. The efficient training regime on consumer hardware and full public release of code, weights, and scripts are notable strengths that support reproducibility and further work. The ablation isolating the effect of Word-Aware Loss Weighting provides clear evidence for the proposed component's impact within the synthetic setting.

major comments (2)

[Abstract] Abstract and Experiments section: The performance metrics (0.22% CER and 97.20% exact match) and the ablation results are obtained exclusively on 5,000 synthetic GLOCR images whose generation process matches the training distribution. No evaluation on real scanned or photographed Tigrinya documents (with font variation, ink spread, scanner noise, or layout complexity) is provided, which is load-bearing for the central claim of applicability to printed Tigrinya text recognition as stated in the title and abstract.
[Experiments] Ablation study (Experiments section): The two-order-of-magnitude CER reduction attributed to Word-Aware Loss Weighting is demonstrated only on synthetic data; without a corresponding real-document test set, it remains unclear whether the weighting resolves issues that would arise under actual printing and scanning conditions rather than synthetic word-boundary artifacts alone.

minor comments (2)

[Abstract] Abstract: The claim that the unmodified model 'produces no usable output' would benefit from a quantitative baseline (e.g., CER or exact-match score) rather than a qualitative statement.
[Conclusion] The manuscript would be strengthened by an explicit limitations paragraph discussing the domain gap between synthetic GLOCR images and real printed material.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the strengths in efficiency and reproducibility. We address the two major comments point-by-point below. Both comments correctly identify that all quantitative results are on synthetic data; we will revise the manuscript to clarify this scope, temper claims, and add explicit limitations discussion.

Referee: [Abstract] Abstract and Experiments section: The performance metrics (0.22% CER and 97.20% exact match) and the ablation results are obtained exclusively on 5,000 synthetic GLOCR images whose generation process matches the training distribution. No evaluation on real scanned or photographed Tigrinya documents (with font variation, ink spread, scanner noise, or layout complexity) is provided, which is load-bearing for the central claim of applicability to printed Tigrinya text recognition as stated in the title and abstract.

Authors: We agree that the absence of real-document evaluation limits the strength of the applicability claim in the title and abstract. The synthetic GLOCR data was chosen to provide a clean, reproducible testbed for isolating the tokenizer extension and Word-Aware Loss Weighting effects on Ge'ez script. In the revised version we will: (1) update the abstract to state results are obtained on synthetic images, (2) revise the title to 'Adapting TrOCR for Synthetic Printed Tigrinya Text Recognition' or add a qualifier, and (3) insert a new Limitations subsection that discusses font/ink/scan variations expected in real data and outlines future work on real corpora. These changes directly address the load-bearing concern without overstating current evidence.

revision: yes

Referee: [Experiments] Ablation study (Experiments section): The two-order-of-magnitude CER reduction attributed to Word-Aware Loss Weighting is demonstrated only on synthetic data; without a corresponding real-document test set, it remains unclear whether the weighting resolves issues that would arise under actual printing and scanning conditions rather than synthetic word-boundary artifacts alone.

Authors: We concur that the ablation is confined to synthetic data and that real printing/scanning artifacts could interact differently with the loss weighting. The Word-Aware Loss Weighting targets systematic word-boundary failures caused by Latin-centric BPE on Ge'ez, an issue rooted in the tokenizer itself and therefore likely to appear in real text as well. Nevertheless, we will revise the Experiments section to: (a) explicitly note the synthetic nature of the ablation, (b) add a paragraph explaining why the controlled setting still provides evidence for the component's utility, and (c) include a forward-looking discussion of how ink spread or layout noise might modulate the observed gains. No new real-data experiments are feasible at this stage, but the revisions will prevent over-generalization.

revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical adaptation with ablation on held-out synthetic data

The paper reports an empirical fine-tuning of TrOCR: tokenizer extension to Ge'ez characters plus a Word-Aware Loss Weighting scheme. Results (0.22% CER, 97.2% exact match) and the ablation (two-order CER reduction) are measured on a held-out test split of 5,000 synthetic GLOCR images. No equations, first-principles derivations, or predictions are presented that reduce by construction to fitted parameters or self-citations. The central claim is an experimental outcome, not a derived quantity equivalent to its inputs. No load-bearing self-citations or uniqueness theorems appear in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The work rests on standard transfer-learning assumptions for vision transformers and the claim that synthetic GLOCR images are representative; no new physical entities or ad-hoc constants are introduced beyond the new loss term itself.

axioms (2)

Domain assumption: Pre-trained TrOCR weights provide a useful starting point for Ge'ez script after tokenizer extension. Invoked in the adaptation procedure described in the abstract.

Domain assumption: Synthetic printed images from the GLOCR dataset match the distribution of real Tigrinya documents. Required for the reported test-set metrics to generalize.

invented entities (1)

Word-Aware Loss Weighting (no independent evidence)
Purpose: to penalize word-boundary errors when applying Latin-centric BPE to Ge'ez script.
New component introduced to resolve systematic failures; no independent falsifiable prediction outside the reported ablation is given.
