LLMs can compose surface-form tokens from base embeddings plus learned transformation vectors, freeing 10-40% of vocabulary slots while expanding coverage and preserving downstream performance across five languages.
Kiho Park, Yo Joong Choe, Yibo Jiang, and Victor Veitch
4 Pith papers cite this work.
fields
cs.CL 4
citing papers explorer
- English is Not All You Need: Systematically Exploring the Role of Multilinguality in LLM Post-Training
  Systematic experiments demonstrate that multilingual coverage in LLM post-training improves results for all languages and tasks compared to English-only, with low-resource languages gaining most and zero-shot transfer emerging at high diversity.
- Language Specific Knowledge: Do Models Know Better in X than in English?
  The paper introduces Language Specific Knowledge (LSK) and shows that selecting an optimal non-English language for a query can improve LLM performance on cultural and social norm datasets.
- TUDUM: A Turkish-Thinking Reasoning Pipeline for Qwen3.5-27B
  TUDUM applies LoRA-based SFT on 15,991 Turkish reasoning examples followed by GRPO reinforcement learning on Turkish math problems to a 27B Qwen model, producing shorter Turkish reasoning traces with mixed benchmark results.