X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation
Pith reviewed 2026-05-22 10:01 UTC · model grok-4.3
The pith
X-Token aligns mismatched tokenizers with a sparse projection matrix so full teacher distributions can guide student training without suppressing critical rare tokens.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
X-Token remedies two failures in full-distribution logit-based cross-tokenizer distillation: uncommon-token suppression when critical tokens fall outside the matched subset, and over-conservative 1-to-1 matching. It does so with two complementary objectives that share a sparse projection matrix W initialized from tokenizer-level string rules. P-KL removes partitioning entirely and aligns student and teacher distributions directly through W. H-KL retains a hybrid form but aligns each student token to its highest-ranked teacher equivalent under W. On Llama-3.2-1B the method outperforms GOLD by 3.82 points with a Qwen3-4B teacher and by 0.5 points with a Phi-4-Mini teacher; a two-teacher setup (Phi-4-mini and Llama-3B) improves over single-teacher distillation by 1.3 points.
What carries the argument
The sparse projection matrix W, initialized from tokenizer-level string rules, that maps between student and teacher token spaces and is used by both the partition-free P-KL loss and the relaxed top-ranked H-KL loss.
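To make the role of W concrete, here is a minimal sketch of how a sparse projection could be initialized from string rules. The two rules used (exact match on a normalized surface form, then a longest-prefix fallback), the helper names `normalize` and `build_projection`, and the toy vocabularies are all assumptions for illustration; the paper's actual rules are not given on this page.

```python
from collections import defaultdict

def normalize(token: str) -> str:
    """Strip common word-boundary markers so surface forms can be compared."""
    return token.replace("Ġ", " ").replace("▁", " ").strip()

def build_projection(student_vocab, teacher_vocab):
    """Return a sparse, row-normalized W as {student_token: {teacher_token: weight}}."""
    teacher_by_form = defaultdict(list)
    for t in teacher_vocab:
        teacher_by_form[normalize(t)].append(t)

    W = {}
    for s in student_vocab:
        form = normalize(s)
        # Rule 1 (assumed): exact match on the normalized surface form.
        matches = list(teacher_by_form.get(form, []))
        # Rule 2 (assumed): otherwise fall back to the teacher tokens whose
        # normalized form is the longest proper prefix of the student token.
        if not matches and form:
            for k in range(len(form) - 1, 0, -1):
                if form[:k] in teacher_by_form:
                    matches = list(teacher_by_form[form[:k]])
                    break
        if matches:
            # Spread the row's mass uniformly over all matched teacher tokens.
            W[s] = {t: 1.0 / len(matches) for t in matches}
    return W

# Toy vocabularies, not taken from any real tokenizer.
student_vocab = ["Ġthe", "123", "cat", "Ġ"]
teacher_vocab = ["▁the", "the", "1", "2", "3", "cat"]
W = build_projection(student_vocab, teacher_vocab)
```

In this toy case the multi-digit token "123" receives a non-empty row (it maps to the teacher's "1"), while the bare boundary marker "Ġ" gets no row at all. Rows like the latter are what a coverage check would need to count.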
If this is right
- On Llama-3.2-1B, X-Token outperforms GOLD by 3.82 average points with a Qwen3-4B teacher.
- The same method improves over GOLD by 0.5 points when the teacher is Phi-4-Mini.
- A two-teacher combination of Phi-4-mini and Llama-3B raises performance 1.3 points above single-teacher distillation.
- Tokens that previously fell into the unmatched subset, such as Llama's multi-digit numerals under digit-splitting Qwen supervision, are now preserved in the aligned distribution.
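The numeral case in the last point can be illustrated with a small partition-free loss. This is a sketch under stated assumptions, not the paper's P-KL: it assumes the student distribution is pushed into teacher space through W and compared with a forward KL, and the names `project` and `kl`, the three-token vocabulary, and all probabilities are made up for illustration.

```python
import math

def project(student_probs, W):
    """Map a student distribution into teacher space: q_proj[t] = sum_s q[s] * W[s][t]."""
    projected = {}
    for s, q in student_probs.items():
        for t, w in W.get(s, {}).items():
            projected[t] = projected.get(t, 0.0) + q * w
    return projected

def kl(p, q, eps=1e-12):
    """KL(p || q) over the support of p, with a small floor on q."""
    return sum(pv * math.log(pv / max(q.get(k, 0.0), eps))
               for k, pv in p.items() if pv > 0)

# Assumed toy W: the student's multi-digit token "123" maps to the teacher's "1".
W = {"123": {"1": 1.0}, "1": {"1": 1.0}, "the": {"the": 1.0}}
teacher = {"1": 0.9, "the": 0.1}  # digit-splitting teacher, illustrative values

student_multi = {"123": 0.8, "1": 0.1, "the": 0.1}  # mass on the multi-digit token
student_split = {"123": 0.0, "1": 0.9, "the": 0.1}  # mass on the single digit
student_wrong = {"123": 0.0, "1": 0.1, "the": 0.9}  # mass on the wrong token

loss_multi = kl(teacher, project(student_multi, W))
loss_split = kl(teacher, project(student_split, W))
loss_wrong = kl(teacher, project(student_wrong, W))
```

Under this sketch, `student_multi` and `student_split` incur essentially the same loss, so probability mass on "123" is not pushed toward zero, whereas `student_wrong` is penalized.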
Where Pith is reading between the lines
- The same projection mechanism could be tested on other model families whose tokenizers split numbers or rare words differently.
- Making the initialization of W learnable rather than purely rule-based might further reduce residual misalignment.
- The two-teacher gains suggest the framework can scale to larger ensembles without requiring identical vocabularies.
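One simple way such an ensemble could be combined is a weighted average of per-teacher losses, each computed with that teacher's own W. The sketch below assumes exactly that; the function name `multi_teacher_loss`, the teacher labels, and the loss values are placeholders, and the page does not say how the paper actually aggregates teachers.

```python
def multi_teacher_loss(per_teacher_losses, weights=None):
    """Weighted average of per-teacher distillation losses (assumed aggregation)."""
    names = list(per_teacher_losses)
    if weights is None:
        # Default to equal weighting across teachers.
        weights = {name: 1.0 for name in names}
    total = sum(weights[name] for name in names)
    return sum(weights[name] * per_teacher_losses[name] for name in names) / total

# Placeholder loss values, not measurements.
losses = {"teacher_a": 0.4, "teacher_b": 0.8}
equal = multi_teacher_loss(losses)
skewed = multi_teacher_loss(losses, weights={"teacher_a": 3.0, "teacher_b": 1.0})
```

Because each teacher only contributes a scalar loss, nothing in this aggregation requires the teachers to share a vocabulary with each other or with the student.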
Load-bearing premise
The sparse projection matrix W initialized from tokenizer-level string rules supplies an alignment accurate enough that critical but uncommon tokens are not suppressed during training.
What would settle it
Running the same distillation setup but replacing the string-rule W with a random projection and finding that GSM8K performance remains near the 2.56 baseline of the partitioned failure case would falsify the claim that the chosen initialization provides sufficient alignment.
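A control of this kind needs a random W that is otherwise comparable to the string-rule one. A minimal sketch, assuming W is stored as a dict of sparse rows and that "comparable" means the same number of non-zero entries per row; `random_projection_like` and the toy inputs are hypothetical.

```python
import random

def random_projection_like(W, teacher_vocab, seed=0):
    """Build a random control with the same row sparsity as W.

    Each student row keeps its number of non-zero entries, but the teacher
    tokens are drawn uniformly at random and weighted uniformly.
    """
    rng = random.Random(seed)
    control = {}
    for s, row in W.items():
        k = min(len(row), len(teacher_vocab))
        picks = rng.sample(teacher_vocab, k)
        control[s] = {t: 1.0 / k for t in picks} if k else {}
    return control

# Toy inputs for illustration only.
W = {"Ġthe": {"▁the": 0.5, "the": 0.5}, "123": {"1": 1.0}}
teacher_vocab = ["▁the", "the", "1", "2", "3", "cat"]
control = random_projection_like(W, teacher_vocab, seed=0)
```

Matching the sparsity pattern keeps the comparison about which tokens are linked rather than about how many.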
Original abstract
Cross-tokenizer knowledge distillation allows a student model to learn from teachers with incompatible vocabularies. Prior work operates on hidden states or logits; the latter is preferred as a drop-in replacement requiring no auxiliary components. Logit-based methods either use only the correct-token probability, missing the full 'dark knowledge' in the teacher's distribution, or operate on the full output distribution, relying on strict token partitioning and/or unprincipled heuristic ranking. We identify two key shortcomings of full-distribution, logit-based methods: (i) an uncommon-token failure, where critical tokens fall into the unmatched subset (e.g., Llama's 1100 multi-digit numerals under digit-splitting Qwen supervision) and are suppressed during training, reducing GSM8k from 12.89 to 2.56 compared to same-tokenizer KD from a weaker teacher; and (ii) over-conservative matching, where strict 1-to-1 matching excludes near-equivalent tokens across surface forms. These failures require distinct remedies: eliminating the partition when critical tokens are misaligned, and refining it when alignment is reliable. We propose X-Token, an approach with two complementary loss formulations targeting these issues. P-KL removes partitioning and aligns the student's distribution with the teacher's via a sparse projection matrix W (initialized from tokenizer-level string rules) to address the uncommon-token failure. H-KL retains the hybrid form while relaxing matching to align each student token with its top-ranked teacher mapping under W. Both objectives share W and extend naturally to multiple teachers. Empirically, on Llama-3.2-1B, X-Token outperforms the current state of the art GOLD by +3.82 average points with a Qwen3-4B teacher and by +0.5 with a Phi-4-Mini teacher. Further, a two-teacher setup (Phi-4-mini + Llama-3B) improves over single-teacher distillation by +1.3 points.
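The relaxed matching described for H-KL can be sketched as picking, for each student token, the teacher token with the largest weight in its row of W. The contrast with "strict" matching below assumes strict means a student token has exactly one candidate; that reading, the function name `top_ranked_mapping`, and the toy W are assumptions, and the remainder of the hybrid loss is not reproduced here.

```python
def top_ranked_mapping(W):
    """For each student token, keep only its highest-weight teacher token under W."""
    mapping = {}
    for s, row in W.items():
        if row:
            # Highest weight wins; ties are broken by token string for determinism.
            mapping[s] = max(row.items(), key=lambda kv: (kv[1], kv[0]))[0]
    return mapping

# Toy W with one ambiguous row, one unambiguous row, and one empty row.
W = {"Ġthe": {"▁the": 0.7, "the": 0.3}, "123": {"1": 1.0}, "Ġ": {}}

relaxed = top_ranked_mapping(W)
# Assumed stand-in for strict 1-to-1 matching: only rows with a single candidate.
strict = {s: next(iter(row)) for s, row in W.items() if len(row) == 1}
```

Here the relaxed mapping keeps "Ġthe" (matched to "▁the") while the strict stand-in drops it, which is the kind of near-equivalent pair the abstract says strict matching excludes.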
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes X-Token for cross-tokenizer knowledge distillation between models with incompatible vocabularies. It diagnoses two failures in prior full-distribution logit-based methods—an uncommon-token failure where critical tokens (e.g., Llama multi-digit numerals) are suppressed under strict partitioning, and over-conservative 1-to-1 matching—and introduces complementary losses: P-KL, which removes partitioning and aligns distributions via a sparse projection matrix W initialized from tokenizer string rules, and H-KL, which relaxes matching to top-ranked alignments under W. Both extend to multi-teacher settings. On Llama-3.2-1B, the method reports outperforming GOLD by +3.82 average points with a Qwen3-4B teacher and +0.5 with Phi-4-Mini, plus +1.3 points from a two-teacher setup.
Significance. If the central claims hold, the work supplies a targeted, drop-in logit-based remedy for vocabulary mismatch in knowledge distillation, a practical bottleneck as tokenizer diversity grows. The explicit diagnosis of failure modes, the shared projection mechanism, and the multi-teacher extension are strengths; the reported gains on GSM8K and aggregate benchmarks indicate potential utility for improving small-model reasoning performance without auxiliary components.
major comments (2)
- [§3 (P-KL and W)] §3 (P-KL formulation and W initialization): the claim that the string-rule-initialized sparse projection W sufficiently aligns critical uncommon tokens (e.g., Llama's ~1100 multi-digit numerals) so that P-KL eliminates suppression is load-bearing for the +3.82 gain and the contrast with the 12.89-to-2.56 GSM8K drop. No coverage statistics, zero-row counts, or ablation on numeral tokens are reported to confirm that W maps these tokens with non-zero fidelity rather than leaving them unmapped.
- [Experiments / abstract] Experimental section / abstract results: the reported +3.82 and +0.5 average-point improvements lack error bars, multiple random seeds, or statistical tests. Without these, it is unclear whether the gains over GOLD are robust or could be explained by variance, especially given the dependence on the unverified W mapping.
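The coverage check requested in the first major comment could look like the following sketch. It assumes W is available as a dict of sparse rows and uses a deliberately coarse category split (multi-digit numerals versus everything else); `coverage_report` and the toy inputs are hypothetical.

```python
import re

def coverage_report(student_vocab, W):
    """Count student tokens with no non-zero entry in W, split by a coarse category."""
    report = {}
    for tok in student_vocab:
        category = "multi_digit" if re.fullmatch(r"\d{2,}", tok) else "other"
        stats = report.setdefault(category, {"total": 0, "zero_rows": 0})
        stats["total"] += 1
        # A zero row means the token is absent from W or all its weights are zero.
        if not any(w > 0 for w in W.get(tok, {}).values()):
            stats["zero_rows"] += 1
    return report

# Toy inputs for illustration only.
student_vocab = ["123", "45", "the", "Ġ"]
W = {"123": {"1": 1.0}, "the": {"the": 1.0}}
report = coverage_report(student_vocab, W)
```

A non-zero `zero_rows` count in the `multi_digit` category would indicate that some numeral tokens remain unmapped.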
minor comments (2)
- [Abstract] Abstract: 'Qwen3-4B' should be checked for consistency with the exact model name used in the experiments and tables.
- [§3] Notation: an explicit equation or short algorithm box for the string-rule construction of W would improve reproducibility and allow readers to assess coverage without re-deriving the rules.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments highlight important aspects of our claims regarding the projection matrix W and the robustness of the reported gains. We address each major comment below and outline revisions to strengthen the manuscript.
- Referee: §3 (P-KL formulation and W initialization): the claim that the string-rule-initialized sparse projection W sufficiently aligns critical uncommon tokens (e.g., Llama's ~1100 multi-digit numerals) so that P-KL eliminates suppression is load-bearing for the +3.82 gain and the contrast with the 12.89-to-2.56 GSM8K drop. No coverage statistics, zero-row counts, or ablation on numeral tokens are reported to confirm that W maps these tokens with non-zero fidelity rather than leaving them unmapped.
  Authors: We agree that explicit verification of W's coverage for uncommon tokens is needed to support the uncommon-token failure diagnosis. In the revised manuscript we will add: (i) zero-row counts for the student vocabulary under the initialized W, (ii) coverage statistics broken down by token category (including numerals), and (iii) a targeted ablation measuring performance when numeral tokens are explicitly masked from W. These additions will directly confirm non-zero fidelity for the ~1100 Llama multi-digit numerals and clarify their contribution to the observed GSM8K recovery.
  Revision: yes
- Referee: Experimental section / abstract results: the reported +3.82 and +0.5 average-point improvements lack error bars, multiple random seeds, or statistical tests. Without these, it is unclear whether the gains over GOLD are robust or could be explained by variance, especially given the dependence on the unverified W mapping.
  Authors: We acknowledge that statistical validation is essential for establishing robustness. In the revision we will rerun the primary Llama-3.2-1B experiments across at least three random seeds, report mean and standard deviation for all metrics, and include paired statistical tests (e.g., t-tests) comparing X-Token against GOLD. These results will be added to both the main tables and the abstract to demonstrate that the gains are not attributable to run-to-run variance.
  Revision: yes
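The seed-level comparison promised in the second response can be sketched with the standard library alone. The function `paired_t_statistic` is a hypothetical helper, and the scores below are placeholders chosen so the arithmetic is easy to follow; they are not results from the paper.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Return (mean difference, sample std of differences, paired t statistic).

    Scores must be paired by seed. Needs at least two pairs, and the
    differences must not all be identical (the std would be zero).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)
    t_stat = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return mean_diff, sd_diff, t_stat

# Placeholder per-seed scores for two methods (three seeds each).
method_a = [3.0, 5.0, 4.0]
method_b = [1.0, 2.0, 3.0]
mean_diff, sd_diff, t_stat = paired_t_statistic(method_a, method_b)
```

With three seeds the test has only two degrees of freedom, so the t statistic has to be large before a difference counts as significant.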
Circularity Check
No circularity: new losses and heuristic W are independent of reported gains
full rationale
The paper identifies two concrete shortcomings of prior full-distribution logit KD (uncommon-token suppression and over-strict matching), then defines P-KL (partition-free alignment via sparse W) and H-KL (relaxed top-rank matching under W). W itself is constructed from external tokenizer string rules rather than fitted to the target task or derived from the loss; the claimed +3.82 / +0.5 / +1.3 point gains are direct empirical comparisons on standard benchmarks against GOLD and single-teacher baselines. No equation reduces to a fitted parameter renamed as prediction, no self-citation supplies a uniqueness theorem, and no ansatz is smuggled in. The derivation chain therefore remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Tokenizer-level string rules produce a useful initial sparse projection matrix W that aligns semantically related tokens across vocabularies
invented entities (1)
- Sparse projection matrix W: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: P-KL removes partitioning and aligns the student's distribution with the teacher's via a sparse projection matrix W (initialized from tokenizer-level string rules)
- IndisputableMonolith/Foundation/BranchSelection.lean, branch_selection
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: H-KL retains the hybrid form while relaxing matching to align each student token with its top-ranked teacher mapping under W
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Carlos Miguel Patiño, Kashif Rasul, Quentin Gallouédec, Ben Burtenshaw, Sergio Paniego, Vaibhav Srivastav, Thibaud Frere, Ed Beeching, Lewis Tunstall, Leandro von Werra, and Thomas Wolf. Unlocking on-policy distillation for any model family, 2025.
- [2] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
- [3] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550, 2014.
- [4] Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. In International Conference on Machine Learning, pages 1607–1616. PMLR, 2018.
- [5] Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3967–3976, 2019.
- [6] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [7] Abdelrahman Abouelenin, Atabak Ashfaq, Adam Atkinson, Hany Awadalla, Nguyen Bach, Jianmin Bao, Alon Benhaim, Martin Cai, Vishrav Chaudhary, Congcong Chen, et al. Phi-4-Mini technical report: Compact yet powerful multimodal language models via mixture-of-LoRAs. arXiv preprint arXiv:2503.01743, 2025.
- [8] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
- [9] Songming Zhang, Xue Zhang, Zengkui Sun, Yufeng Chen, and Jinan Xu. Dual-space knowledge distillation for large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18164–18181, 2024.
- [10] Nicolas Boizard, Kevin El Haddad, Céline Hudelot, and Pierre Colombo. Towards cross-tokenizer distillation: the universal logit distillation loss for LLMs. arXiv preprint arXiv:2402.12030, 2024.
- [11] Benjamin Minixhofer, Ivan Vulić, and Edoardo Maria Ponti. Universal cross-tokenizer distillation via approximate likelihood matching. arXiv preprint arXiv:2503.20083, 2025.
- [12] Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, et al. Nemotron-CLIMB: Clustering-based iterative data mixture bootstrapping for language model pre-training. arXiv preprint arXiv:2504.13161, 2025.
- [13] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
- [14] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- [15] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. arXiv preprint arXiv:2103.03874, 2021.
- [16] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99–106, 2021.
- [17] Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019.
- [18] Fanqi Wan, Xinting Huang, Deng Cai, Xiaojun Quan, Wei Bi, and Shuming Shi. Knowledge fusion of large language models. arXiv preprint arXiv:2401.10491, 2024.
- [19] Benjamin Minixhofer, Edoardo M Ponti, and Ivan Vulić. Zero-shot tokenizer transfer. Advances in Neural Information Processing Systems, 37:46791–46818, 2024.
- [20] Avyav Kumar Singh, Yen-Chen Wu, Alexandru Cioba, Alberto Bernacchia, and Davide Buffelli. Cross-tokenizer LLM distillation through a byte-level interface. arXiv preprint arXiv:2604.07466, 2026.
- [21] Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, 2018.
- [22] Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, 2016.