TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
UNK s everywhere: A dapting multilingual language models to new scripts
3 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
First TrOCR adaptation for Tigrinya achieves 0.22% CER and 97.2% exact match using tokenizer extension plus Word-Aware Loss Weighting on 5000 synthetic GLOCR images.
Four MAFT-based PLMs for Angolan languages report 12.3-point gains over AfroXLMR-base and 3.8-point gains over OFA baselines on downstream tasks.
citing papers explorer
-
TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment
TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
-
Adapting TrOCR for Printed Tigrinya Text Recognition: Word-Aware Loss Weighting for Cross-Script Transfer Learning
First TrOCR adaptation for Tigrinya achieves 0.22% CER and 97.2% exact match using tokenizer extension plus Word-Aware Loss Weighting on 5000 synthetic GLOCR images.
-
ANGOFA: Leveraging OFA Embedding Initialization and Synthetic Data for Angolan Language Model
Four MAFT-based PLMs for Angolan languages report 12.3-point gains over AfroXLMR-base and 3.8-point gains over OFA baselines on downstream tasks.