3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CL (3)
verdicts: UNVERDICTED (3)
citing papers explorer
- TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment
  TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
- The GaoYao Benchmark: A Comprehensive Framework for Evaluating Multilingual and Multicultural Abilities of Large Language Models
  GaoYao supplies a unified three-layer framework and 182k native-quality samples in 26 languages to diagnose LLMs on general multilingual, cross-cultural, and monocultural tasks.
- M-DaQ: Retrieving Samples with Multilingual Diversity and Quality for Instruction Fine-Tuning Datasets
  M-DaQ introduces a diversity-aware sampling framework combining a quality scoring model with maximal marginal relevance selection to build multilingual instruction fine-tuning datasets, yielding models with over 60% average win rates on Alpaca-Eval and MT-Bench across 18 languages.