
arxiv: 2605.21699 · v1 · pith:ENZGID2Unew · submitted 2026-05-20 · 💻 cs.LG · cs.CL

X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

Pith reviewed 2026-05-22 10:01 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords knowledge distillation · cross-tokenizer · logit-based distillation · projection matrix · tokenizer alignment · multi-teacher distillation · language model compression

The pith

X-Token aligns mismatched tokenizers with a sparse projection matrix so full teacher distributions can guide student training without suppressing critical rare tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to show that logit-based knowledge distillation works across incompatible vocabularies when the alignment between token spaces is handled by a projection rather than strict partitions or heuristics. Existing full-distribution methods either drop important tokens into an unmatched set, as with Llama multi-digit numbers under Qwen supervision, or apply overly rigid one-to-one rules that miss near-equivalent surface forms. X-Token introduces P-KL, which drops the partition and aligns the entire distributions through a shared sparse matrix W, plus H-KL, which relaxes matching to the top-ranked mapping under the same W. Both losses extend to multiple teachers and produce measurable gains on Llama-3.2-1B. A sympathetic reader would care because the approach lets stronger teachers with different tokenizers transfer their full output knowledge without extra model components or severe performance loss on tasks like GSM8K.

Core claim

X-Token remedies two failures in full-distribution logit-based cross-tokenizer distillation: uncommon-token suppression when critical tokens fall outside the matched subset, and over-conservative 1-to-1 matching. It does so with two complementary objectives that share a sparse projection matrix W initialized from tokenizer-level string rules. P-KL removes partitioning entirely and aligns student and teacher distributions directly through W. H-KL retains a hybrid form but aligns each student token to its highest-ranked teacher equivalent under W. On Llama-3.2-1B the method outperforms GOLD by 3.82 average points with a Qwen3-4B teacher and by 0.5 points with a Phi-4-Mini teacher; a two-teacher setup (Phi-4-Mini + Llama-3B) improves over single-teacher distillation by 1.3 points.

What carries the argument

The sparse projection matrix W, initialized from tokenizer-level string rules, that maps between student and teacher token spaces and is used by both the partition-free P-KL loss and the relaxed top-ranked H-KL loss.
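
To make the role of W concrete, here is a minimal sketch of a partition-free loss in the spirit of P-KL. The paper's exact formula is not reproduced on this page, so several choices are assumptions: W is stored as a dictionary of sparse rows (student token id to teacher token id and weight), the student distribution is projected into teacher space, the result is renormalized, and the divergence is taken as KL(teacher || projected student). The toy vocabularies and the mapping of "12" to "1" are hypothetical.

```python
import math

def project_student(student_probs, W, teacher_vocab_size):
    """Push student probability mass into teacher token space through sparse W."""
    projected = [0.0] * teacher_vocab_size
    for s, p in enumerate(student_probs):
        for t, w in W.get(s, {}).items():
            projected[t] += p * w
    return projected

def kl(p, q, eps=1e-12):
    """KL(p || q), with a small floor on q to avoid log(0)."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def p_kl(teacher_probs, student_probs, W):
    """Assumed form: KL between the teacher and the W-projected student."""
    projected = project_student(student_probs, W, len(teacher_probs))
    total = sum(projected)
    if total > 0:
        # Renormalize in case the rows of W do not sum to 1 (an assumption).
        projected = [x / total for x in projected]
    return kl(teacher_probs, projected)

# Hypothetical toy vocabularies:
#   student = ["the", "12", "1"], teacher = ["the", "1", "2"]
# Mapping "12" to "1" is an assumed string rule, not one taken from the paper.
W_toy = {0: {0: 1.0}, 1: {1: 1.0}, 2: {1: 1.0}}
teacher = [0.5, 0.5, 0.0]
aligned_student = [0.5, 0.25, 0.25]
misaligned_student = [0.9, 0.05, 0.05]

loss_aligned = p_kl(teacher, aligned_student, W_toy)
loss_misaligned = p_kl(teacher, misaligned_student, W_toy)
```

In this toy case the multi-digit student token "12" contributes its mass to the teacher token "1" instead of being pushed into an unmatched bucket, so a student that splits its mass between "12" and "1" incurs no penalty.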

If this is right

  • On Llama-3.2-1B, X-Token outperforms GOLD by 3.82 average points with a Qwen3-4B teacher.
  • The same method improves over GOLD by 0.5 points when the teacher is Phi-4-Mini.
  • A two-teacher combination of Phi-4-Mini and Llama-3B raises performance 1.3 points above single-teacher distillation.
  • Tokens that previously fell into the unmatched subset, such as Llama's multi-digit numerals under digit-splitting Qwen supervision, are now preserved in the aligned distribution.
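
The uncommon-token failure named in the last bullet can be illustrated with a small sketch of strict exact-match partitioning. The vocabularies below are hypothetical stand-ins: the student list keeps multi-digit numerals as single tokens (as the abstract says Llama does), while the teacher list only has single digits (as under digit-splitting Qwen supervision). The function name is an illustration, not something taken from the paper.

```python
def partition_by_exact_match(student_vocab, teacher_vocab):
    """Split student tokens into those with an identical teacher string and the rest."""
    teacher_set = set(teacher_vocab)
    matched = [tok for tok in student_vocab if tok in teacher_set]
    unmatched = [tok for tok in student_vocab if tok not in teacher_set]
    return matched, unmatched

# Hypothetical toy vocabularies, not the real tokenizers.
student_vocab = ["the", "cat", "7", "42", "1100"]        # keeps multi-digit numerals
teacher_vocab = ["the", "cat", "0", "1", "2", "4", "7"]  # digits only

matched, unmatched = partition_by_exact_match(student_vocab, teacher_vocab)
# matched   -> ["the", "cat", "7"]
# unmatched -> ["42", "1100"]
```

Every multi-digit numeral lands in the unmatched list, which is the subset the abstract describes as being suppressed during training.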

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same projection mechanism could be tested on other model families whose tokenizers split numbers or rare words differently.
  • Making the initialization of W learnable rather than purely rule-based might further reduce residual misalignment.
  • The two-teacher gains suggest the framework can scale to larger ensembles without requiring identical vocabularies.

Load-bearing premise

The sparse projection matrix W initialized from tokenizer-level string rules supplies an alignment accurate enough that critical but uncommon tokens are not suppressed during training.
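
Since the premise hinges on how W is built, a rough sketch of one possible string-rule initialization may help. The page does not list the paper's actual rules, so both rules here are illustrative assumptions: (1) exact match after stripping common whitespace markers, and (2) otherwise, the longest teacher token that is a prefix of the student token. The helper names `normalize` and `build_w` are hypothetical.

```python
def normalize(token):
    """Strip common whitespace markers so 'Ġcat' and '▁cat' compare equal (a simplification)."""
    return token.lstrip("Ġ▁ ")

def build_w(student_vocab, teacher_vocab):
    """Return sparse W as {student_id: {teacher_id: weight}} under two assumed string rules."""
    teacher_index = {}
    for t_id, tok in enumerate(teacher_vocab):
        teacher_index.setdefault(normalize(tok), t_id)

    W = {}
    for s_id, tok in enumerate(student_vocab):
        surface = normalize(tok)
        # Full length is the exact-match rule; shorter prefixes are the fallback rule.
        for end in range(len(surface), 0, -1):
            t_id = teacher_index.get(surface[:end])
            if t_id is not None:
                W[s_id] = {t_id: 1.0}
                break
    return W

# Hypothetical toy vocabularies.
student_vocab = ["Ġthe", "▁cat", "1100", "§"]
teacher_vocab = ["the", "cat", "1", "0"]
W_init = build_w(student_vocab, teacher_vocab)
# "1100" maps to teacher token "1" via the prefix rule; "§" gets no row at all.
```

A token with no row, like "§" here, is exactly the kind of case the premise needs to rule out for critical tokens.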

What would settle it

Running the same distillation setup but replacing the string-rule W with a random projection and finding that GSM8K performance remains near the 2.56 baseline of the partitioned failure case would falsify the claim that the chosen initialization provides sufficient alignment.
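
One way to set up the random-projection control described above is to keep the sparsity of the rule-based W but scramble which teacher tokens each row points to. This is a sketch under assumptions: W is stored as a dictionary of sparse rows, and "random projection" is read as "same number of nonzeros and same weights per row, random columns". The function name is hypothetical.

```python
import random

def random_control_w(W, teacher_vocab_size, seed=0):
    """Keep each row's nonzero count and weights, but choose teacher columns at random."""
    rng = random.Random(seed)
    control = {}
    for s_id, row in W.items():
        cols = rng.sample(range(teacher_vocab_size), len(row))
        control[s_id] = dict(zip(cols, row.values()))
    return control

# Hypothetical rule-based W with two student rows.
W_rule = {0: {0: 1.0}, 1: {1: 0.5, 2: 0.5}}
control = random_control_w(W_rule, teacher_vocab_size=10, seed=0)
```

Matching the row sparsity keeps the comparison about where the mass goes, not about how much of it W passes through.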

Original abstract

Cross-tokenizer knowledge distillation allows a student model to learn from teachers with incompatible vocabularies. Prior work operates on hidden states or logits; the latter is preferred as a drop-in replacement requiring no auxiliary components. Logit-based methods either use only the correct-token probability, missing the full 'dark knowledge' in the teacher's distribution, or operate on the full output distribution, relying on strict token partitioning and/or unprincipled heuristic ranking. We identify two key shortcomings of full-distribution, logit-based methods: (i) an uncommon-token failure, where critical tokens fall into the unmatched subset (e.g., Llama's 1100 multi-digit numerals under digit-splitting Qwen supervision) and are suppressed during training, reducing GSM8k from 12.89 to 2.56 compared to same-tokenizer KD from a weaker teacher; and (ii) over-conservative matching, where strict 1-to-1 matching excludes near-equivalent tokens across surface forms. These failures require distinct remedies: eliminating the partition when critical tokens are misaligned, and refining it when alignment is reliable. We propose X-Token, an approach with two complementary loss formulations targeting these issues. P-KL removes partitioning and aligns the student's distribution with the teacher's via a sparse projection matrix W (initialized from tokenizer-level string rules) to address the uncommon-token failure. H-KL retains the hybrid form while relaxing matching to align each student token with its top-ranked teacher mapping under W. Both objectives share W and extend naturally to multiple teachers. Empirically, on Llama-3.2-1B, X-Token outperforms the current state of the art GOLD by +3.82 average points with a Qwen3-4B teacher and by +0.5 with a Phi-4-Mini teacher. Further, a two-teacher setup (Phi-4-mini + Llama-3B) improves over single-teacher distillation by +1.3 points.
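
The relaxed matching step that the abstract attributes to H-KL can be sketched as follows. Only the matching is shown; the rest of the hybrid loss is not specified on this page and is left out. The weights in `W_toy`, the toy vocabularies, and both function names are hypothetical.

```python
def strict_one_to_one(student_vocab, teacher_vocab):
    """Match a student token only when the teacher has the identical string."""
    teacher_ids = {tok: i for i, tok in enumerate(teacher_vocab)}
    return {s: teacher_ids[tok] for s, tok in enumerate(student_vocab) if tok in teacher_ids}

def top_ranked_mapping(W):
    """For each student token id, pick the teacher id with the largest weight in its row of W."""
    return {s: max(row, key=row.get) for s, row in W.items() if row}

# Hypothetical toy vocabularies with near-equivalent surface forms.
student_vocab = ["Hello", " hello", "cat"]
teacher_vocab = ["hello", "cat"]
W_toy = {0: {0: 0.9}, 1: {0: 0.7, 1: 0.1}, 2: {1: 1.0}}  # assumed weights

strict = strict_one_to_one(student_vocab, teacher_vocab)
relaxed = top_ranked_mapping(W_toy)
# strict  -> {2: 1}: only "cat" has an identical teacher string
# relaxed -> {0: 0, 1: 0, 2: 1}: "Hello" and " hello" are both aligned to "hello"
```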

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes X-Token for cross-tokenizer knowledge distillation between models with incompatible vocabularies. It diagnoses two failures in prior full-distribution logit-based methods—an uncommon-token failure where critical tokens (e.g., Llama multi-digit numerals) are suppressed under strict partitioning, and over-conservative 1-to-1 matching—and introduces complementary losses: P-KL, which removes partitioning and aligns distributions via a sparse projection matrix W initialized from tokenizer string rules, and H-KL, which relaxes matching to top-ranked alignments under W. Both extend to multi-teacher settings. On Llama-3.2-1B, the method reports outperforming GOLD by +3.82 average points with a Qwen3-4B teacher and +0.5 with Phi-4-Mini, plus +1.3 points from a two-teacher setup.

Significance. If the central claims hold, the work supplies a targeted, drop-in logit-based remedy for vocabulary mismatch in knowledge distillation, a practical bottleneck as tokenizer diversity grows. The explicit diagnosis of failure modes, the shared projection mechanism, and the multi-teacher extension are strengths; the reported gains on GSM8K and aggregate benchmarks indicate potential utility for improving small-model reasoning performance without auxiliary components.

major comments (2)
  1. [§3 (P-KL and W)] §3 (P-KL formulation and W initialization): the claim that the string-rule-initialized sparse projection W sufficiently aligns critical uncommon tokens (e.g., Llama's ~1100 multi-digit numerals) so that P-KL eliminates suppression is load-bearing for the +3.82 gain and the contrast with the 12.89-to-2.56 GSM8K drop. No coverage statistics, zero-row counts, or ablation on numeral tokens are reported to confirm that W maps these tokens with non-zero fidelity rather than leaving them unmapped.
  2. [Experiments / abstract] Experimental section / abstract results: the reported +3.82 and +0.5 average-point improvements lack error bars, multiple random seeds, or statistical tests. Without these, it is unclear whether the gains over GOLD are robust or could be explained by variance, especially given the dependence on the unverified W mapping.
minor comments (2)
  1. [Abstract] Abstract: 'Qwen3-4B' should be checked for consistency with the exact model name used in the experiments and tables.
  2. [§3] Notation: an explicit equation or short algorithm box for the string-rule construction of W would improve reproducibility and allow readers to assess coverage without re-deriving the rules.
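
The coverage statistics requested in the first major comment are cheap to compute once W is available. The sketch below assumes W is stored as a dictionary of sparse rows keyed by student token id, and uses a simple "ASCII digits, length greater than 1" test as a stand-in for the multi-digit numeral category; both are assumptions, and the function names are hypothetical.

```python
from collections import defaultdict

def is_multi_digit(token):
    """Assumed category test: an ASCII digit string of length 2 or more."""
    return len(token) > 1 and token.isascii() and token.isdigit()

def coverage_report(student_vocab, W):
    """Return the number of all-zero rows of W and the mapped fraction per token category."""
    totals, mapped = defaultdict(int), defaultdict(int)
    for s_id, tok in enumerate(student_vocab):
        category = "multi_digit" if is_multi_digit(tok) else "other"
        totals[category] += 1
        if any(w != 0 for w in W.get(s_id, {}).values()):
            mapped[category] += 1
    zero_rows = sum(totals.values()) - sum(mapped.values())
    coverage = {c: mapped[c] / totals[c] for c in totals}
    return zero_rows, coverage

# Hypothetical toy input: "1100" and "§" have no row in W.
student_vocab = ["the", "42", "1100", "§"]
W_toy = {0: {0: 1.0}, 1: {3: 1.0}}
zero_rows, coverage = coverage_report(student_vocab, W_toy)
```

A report like this, run on the real Llama vocabulary, would directly answer whether the multi-digit numerals receive nonzero rows.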

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of our claims regarding the projection matrix W and the robustness of the reported gains. We address each major comment below and outline revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: §3 (P-KL formulation and W initialization): the claim that the string-rule-initialized sparse projection W sufficiently aligns critical uncommon tokens (e.g., Llama's ~1100 multi-digit numerals) so that P-KL eliminates suppression is load-bearing for the +3.82 gain and the contrast with the 12.89-to-2.56 GSM8K drop. No coverage statistics, zero-row counts, or ablation on numeral tokens are reported to confirm that W maps these tokens with non-zero fidelity rather than leaving them unmapped.

    Authors: We agree that explicit verification of W's coverage for uncommon tokens is needed to support the uncommon-token failure diagnosis. In the revised manuscript we will add: (i) zero-row counts for the student vocabulary under the initialized W, (ii) coverage statistics broken down by token category (including numerals), and (iii) a targeted ablation measuring performance when numeral tokens are explicitly masked from W. These additions will directly confirm non-zero fidelity for the ~1100 Llama multi-digit numerals and clarify their contribution to the observed GSM8K recovery. revision: yes

  2. Referee: Experimental section / abstract results: the reported +3.82 and +0.5 average-point improvements lack error bars, multiple random seeds, or statistical tests. Without these, it is unclear whether the gains over GOLD are robust or could be explained by variance, especially given the dependence on the unverified W mapping.

    Authors: We acknowledge that statistical validation is essential for establishing robustness. In the revision we will rerun the primary Llama-3.2-1B experiments across at least three random seeds, report mean and standard deviation for all metrics, and include paired statistical tests (e.g., t-tests) comparing X-Token against GOLD. These results will be added to both the main tables and the abstract to demonstrate that the gains are not attributable to run-to-run variance. revision: yes
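
The seed-level comparison promised in the second response could be computed along these lines. This is a sketch using only the Python standard library; the scores are placeholders chosen for easy arithmetic and are not results from the paper. The function name is hypothetical.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for per-seed scores of two methods run with the same seeds."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return mean_diff, sd_diff, t, n - 1  # last value is the degrees of freedom

# Placeholder numbers only, NOT results from the paper.
method_scores = [31.0, 32.0, 33.0]
baseline_scores = [30.0, 30.5, 31.0]
mean_diff, sd_diff, t_stat, dof = paired_t_statistic(method_scores, baseline_scores)
```

With three seeds there are only two degrees of freedom, so the two-sided 5% critical value is about 4.30; small average gains would need very consistent per-seed differences to clear it.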

Circularity Check

0 steps flagged

No circularity: new losses and heuristic W are independent of reported gains

Full rationale

The paper identifies two concrete shortcomings of prior full-distribution logit KD (uncommon-token suppression and over-strict matching), then defines P-KL (partition-free alignment via sparse W) and H-KL (relaxed top-rank matching under W). W itself is constructed from external tokenizer string rules rather than fitted to the target task or derived from the loss; the claimed +3.82 / +0.5 / +1.3 point gains are direct empirical comparisons on standard benchmarks against GOLD and single-teacher baselines. No equation reduces to a fitted parameter renamed as prediction, no self-citation supplies a uniqueness theorem, and no ansatz is smuggled in. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of string-rule initialization for the projection matrix and on the assumption that the resulting alignment preserves dark knowledge for uncommon tokens; no free parameters are explicitly fitted in the abstract description beyond standard distillation hyperparameters.

axioms (1)
  • Domain assumption: tokenizer-level string rules produce a useful initial sparse projection matrix W that aligns semantically related tokens across vocabularies.
    Invoked when describing initialization of W for both P-KL and H-KL
invented entities (1)
  • Sparse projection matrix W (no independent evidence)
    purpose: To map student token probabilities onto teacher token space without requiring exact 1-to-1 token identity
    New component introduced to address uncommon-token failure and over-conservative matching

pith-pipeline@v0.9.0 · 5935 in / 1469 out tokens · 49959 ms · 2026-05-22T10:01:02.160251+00:00 · methodology


