LLMs can compose surface-form tokens from base embeddings plus learned transformation vectors, freeing 10-40% of vocabulary slots while expanding coverage and preserving downstream performance across five languages.
Kiho Park, Yo Joong Choe, Yibo Jiang, and Victor Veitch
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CL 3representative citing papers
Systematic experiments demonstrate that multilingual coverage in LLM post-training improves results for all languages and tasks compared to English-only, with low-resource languages gaining most and zero-shot transfer emerging at high diversity.
The paper introduces Language Specific Knowledge (LSK) and shows that selecting an optimal non-English language for a query can improve LLM performance on cultural and social norm datasets.
citing papers explorer
-
English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training
Systematic experiments demonstrate that multilingual coverage in LLM post-training improves results for all languages and tasks compared to English-only, with low-resource languages gaining most and zero-shot transfer emerging at high diversity.