TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
3 Pith papers cite this work.
fields
cs.CL 3
representative citing papers
citing papers explorer
- TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment
  TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
- Generating Pretraining Tokens from Organic Data for Data-Bound Scaling
  SynPro uses RL-optimized rephrasing and reformatting of organic data to generate synthetic pretraining tokens that deliver 3.7-5.2x the effective learning of simple repetition and can exceed training on unique data at 1.1B scale.
- Scaling Diffusion Language Models via Adaptation from Autoregressive Models
  Adapting autoregressive models via continual pre-training yields diffusion language models from 127M to 7B parameters that outperform prior diffusion models and compete with their autoregressive counterparts on language, reasoning, and commonsense benchmarks.