TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
UNK s everywhere: A dapting multilingual language models to new scripts
3 Pith papers cite this work. Polarity classification is still indexing.
3
Pith papers citing it
representative citing papers
First TrOCR adaptation for Tigrinya achieves 0.22% CER and 97.2% exact match using tokenizer extension plus Word-Aware Loss Weighting on 5000 synthetic GLOCR images.
Four MAFT-based PLMs for Angolan languages report 12.3-point gains over AfroXLMR-base and 3.8-point gains over OFA baselines on downstream tasks.
citing papers explorer
-
Adapting TrOCR for Printed Tigrinya Text Recognition: Word-Aware Loss Weighting for Cross-Script Transfer Learning
First TrOCR adaptation for Tigrinya achieves 0.22% CER and 97.2% exact match using tokenizer extension plus Word-Aware Loss Weighting on 5000 synthetic GLOCR images.