pith:36DVMUZE
TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment
By learning bilingual token alignments from monolingual representations, TokAlign++ rearranges parameters to adapt LLM vocabularies while preserving performance and boosting compression.
arxiv:2605.13429 v1 · 2026-05-13 · cs.CL
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{36DVMUZE3NW34S5T2DFB56FQE6}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Experimental results on 15 languages show that our method boosts multilingual text compression rates and preserves most of the multilingual ability of the vanilla models. Restoring the vanilla model's performance takes as few as 1k steps. After unifying vocabularies between vanilla models, token-level distillation remarkably improves the base model with only 235M tokens.
The method assumes that a bilingual token alignment lexicon learned solely from monolingual token representations provides mappings accurate enough for parameter rearrangement and progressive fine-tuning to succeed with only minor performance loss.
TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.
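As an illustration of what "rearranging parameters" from a token alignment can look like, here is a minimal sketch in plain Python. It is not the paper's implementation: the function name `rearrange_embeddings`, the alignment format (target token id mapped to source token id), and the mean-embedding fallback for unaligned tokens are all assumptions made for this example.

```python
from typing import Dict, List


def rearrange_embeddings(
    source_embeddings: List[List[float]],
    alignment: Dict[int, int],
    target_vocab_size: int,
) -> List[List[float]]:
    """Build an embedding table for a target vocabulary from a source table.

    `alignment` maps target token id -> source token id (hypothetical format).
    Target tokens without an aligned source token receive the mean source
    embedding; that fallback is an assumption of this sketch.
    """
    dim = len(source_embeddings[0])
    count = len(source_embeddings)
    # Mean of all source rows, used only for unaligned target tokens.
    mean = [sum(row[d] for row in source_embeddings) / count for d in range(dim)]

    target_table = []
    for target_id in range(target_vocab_size):
        source_id = alignment.get(target_id)
        if source_id is None:
            target_table.append(list(mean))
        else:
            # Copy the aligned source row into the target position.
            target_table.append(list(source_embeddings[source_id]))
    return target_table


# Toy example: 3 source tokens, 3 target tokens, target id 2 is unaligned.
source = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
alignment = {0: 2, 1: 0}
new_table = rearrange_embeddings(source, alignment, target_vocab_size=3)
print(new_table)
```

In this toy setup the rows for target ids 0 and 1 are copies of source rows 2 and 0, and target id 2 falls back to the mean row. A real adaptation would presumably apply the same mapping to the output projection as well and then fine-tune, as the claims above describe.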
Receipt and verification
| First computed | 2026-05-18T02:44:47.215788Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
df87565324db6dbe4bb3d0ca1ef8b027970dc678a49d364ea8e7727e14dcfbc3
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/36DVMUZE3NW34S5T2DFB56FQE6 \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: df87565324db6dbe4bb3d0ca1ef8b027970dc678a49d364ea8e7727e14dcfbc3
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "7a6d459d6f44f508c222989d94abf30fa72714945586184c792f92a59b0e697f",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-05-13T12:23:24Z",
    "title_canon_sha256": "c606c48b7c7bd35750f71889f07f8311eec096732906fb2ef2bb049199bbd750"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13429",
    "kind": "arxiv",
    "version": 1
  }
}
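The canonicalization step from the shell one-liner above (sorted keys, compact separators, non-ASCII kept as-is, then SHA-256 over the UTF-8 bytes) can also be run offline against the record shown here. The sketch below is only a restatement of that one-liner; the helper name `canonical_sha256` is hypothetical, and the printed digest should be compared by hand with the canonical hash listed on this page.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record the same way as the one-liner: sorted keys, compact JSON."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode()).hexdigest()


# The canonical record as displayed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "7a6d459d6f44f508c222989d94abf30fa72714945586184c792f92a59b0e697f",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-05-13T12:23:24Z",
        "title_canon_sha256": "c606c48b7c7bd35750f71889f07f8311eec096732906fb2ef2bb049199bbd750",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13429", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)
```

Because keys are sorted before hashing, the order in which the fields are written does not affect the digest.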