Pith Number

pith:36DVMUZE

pith:2026:36DVMUZE3NW34S5T2DFB56FQE6

not attested · not anchored · not stored · refs resolved

TokAlign++: Advancing Vocabulary Adaptation via Better Token Alignment

Chengqing Zong, Chong Li, Jiajun Zhang, Wen Yang, Yingzhuo Deng

By learning bilingual token alignments from monolingual representations, TokAlign++ rearranges parameters to adapt LLM vocabularies while preserving performance and boosting compression.

arxiv:2605.13429 v1 · 2026-05-13 · cs.CL


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{36DVMUZE3NW34S5T2DFB56FQE6}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp
2. Internet Archive
3. Author claim: open
4. Citations: open
5. Replications: open

✓ Portable graph bundle: live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1: strongest claim

Experimental results on 15 languages show that our method boosts the multilingual text compression rates and preserves most of the multilingual ability of vanilla models. It costs as few as 1k steps to restore the performance of the vanilla model. After unifying vocabularies between vanilla models, token-level distillation remarkably improves the base model with only 235M tokens.

C2: weakest assumption

That a bilingual token alignment lexicon learned solely from monolingual token representations provides mappings accurate enough for parameter rearrangement and progressive fine-tuning to succeed with only minor performance loss.

C3: one-line summary

TokAlign++ learns token alignments between LLM vocabularies from monolingual representations to enable faster adaptation, better text compression, and effective token-level distillation across 15 languages with minimal steps.

References

136 extracted · 136 resolved · 6 Pith anchors


Receipt and verification

First computed	2026-05-18T02:44:47.215788Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`)
Schema	pith-number/v1.0

Canonical hash

df87565324db6dbe4bb3d0ca1ef8b027970dc678a49d364ea8e7727e14dcfbc3

Aliases

arxiv: 2605.13429 · arxiv_version: 2605.13429v1 · doi: 10.48550/arxiv.2605.13429 · pith_short_12: 36DVMUZE3NW3 · pith_short_16: 36DVMUZE3NW34S5T · pith_short_8: 36DVMUZE
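The short aliases listed above are plain prefixes of the full identifier. For this record, the full 26-character identifier also coincides with the start of the RFC 4648 base32 encoding of the canonical hash. The Python sketch below checks both observations. The helper names `pith_from_hash` and `short_aliases` are hypothetical, and the base32 relationship is inferred from this one record rather than taken from a published Pith specification.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "df87565324db6dbe4bb3d0ca1ef8b027970dc678a49d364ea8e7727e14dcfbc3"
PITH_NUMBER = "36DVMUZE3NW34S5T2DFB56FQE6"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Assumption: the identifier is the first `length` base32 characters
    of the raw SHA-256 digest (standard RFC 4648 alphabet, no padding)."""
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


def short_aliases(pith_number: str) -> dict:
    """Derive the 8-, 12- and 16-character aliases as simple prefixes."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


if __name__ == "__main__":
    print(pith_from_hash(CANONICAL_HASH) == PITH_NUMBER)
    print(short_aliases(PITH_NUMBER))
```

Because every alias is a prefix, a shorter alias can only be resolved unambiguously while no other record shares the same leading characters; the full identifier is the safe choice for citation.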


Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/36DVMUZE3NW34S5T2DFB56FQE6 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: df87565324db6dbe4bb3d0ca1ef8b027970dc678a49d364ea8e7727e14dcfbc3

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "7a6d459d6f44f508c222989d94abf30fa72714945586184c792f92a59b0e697f",
    "cross_cats_sorted": [],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2026-05-13T12:23:24Z",
    "title_canon_sha256": "c606c48b7c7bd35750f71889f07f8311eec096732906fb2ef2bb049199bbd750"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2605.13429",
    "kind": "arxiv",
    "version": 1
  }
}
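
The same check can be run offline against the record printed above. This is a minimal sketch that assumes the canonical serialization is exactly what the verification one-liner uses: keys sorted, compact separators, no ASCII escaping, UTF-8 bytes. The function name `canonical_hash` is hypothetical. If that assumption holds, the digest should equal the value listed under "Canonical hash".

```python
import hashlib
import json

# The canonical record as printed on this page.
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "7a6d459d6f44f508c222989d94abf30fa72714945586184c792f92a59b0e697f",
        "cross_cats_sorted": [],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2026-05-13T12:23:24Z",
        "title_canon_sha256": "c606c48b7c7bd35750f71889f07f8311eec096732906fb2ef2bb049199bbd750",
    },
    "schema_version": "1.0",
    "source": {"id": "2605.13429", "kind": "arxiv", "version": 1},
}

EXPECTED = "df87565324db6dbe4bb3d0ca1ef8b027970dc678a49d364ea8e7727e14dcfbc3"


def canonical_hash(record: dict) -> str:
    """SHA-256 over a deterministic JSON serialization of the record."""
    canonical = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the digest
        separators=(",", ":"),   # no insignificant whitespace
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    digest = canonical_hash(RECORD)
    print(digest)
    print("matches listed hash:", digest == EXPECTED)
```

Sorting keys and fixing the separators is what makes the digest reproducible: two mirrors that load the same record with different key order or whitespace still serialize it to identical bytes.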