Breaking the Tokenizer Barrier: On-Policy Distillation across Model Families

Dihong Gong; Dongyi Liu; Han Xiao; Jia Li; Yasheng Wang; Yifan Niu; Zelong Wang

arxiv: 2606.09456 · v1 · pith:MLT7XTZEnew · submitted 2026-06-08 · 💻 cs.LG

Breaking the Tokenizer Barrier: On-Policy Distillation across Model Families

Yifan Niu , Han Xiao , Dongyi Liu , Zelong Wang , Dihong Gong , Yasheng Wang , Jia Li This is my paper

Pith reviewed 2026-06-27 16:59 UTC · model grok-4.3

classification 💻 cs.LG

keywords on-policy distillationcross-tokenizer distillationlarge language modelsknowledge transfertoken mappingmodel families

0 comments

The pith

A token-mapping algorithm lets on-policy distillation transfer probability signals across LLMs that use different tokenizers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that on-policy distillation can transfer a teacher's token-level probability distributions to a student model even when the two models employ entirely different tokenizers. Standard on-policy distillation previously required identical tokenizers, forcing practitioners to fall back on supervised fine-tuning of teacher-generated text that discards the teacher's full distribution. The new method inserts a precise token-mapping step that aligns the mismatched vocabularies so the distillation loss can still be computed directly on probabilities. Experiments indicate this cross-tokenizer version is more compute-efficient than the supervised-fine-tuning baseline across multiple benchmarks. The result expands the set of usable teacher-student pairs beyond a single model family.

Core claim

On-policy distillation can be made to operate across model families by means of a precise token-mapping algorithm that propagates high-fidelity token-level signals from teacher to student despite mismatched tokenizers, yielding more compute-efficient training than supervised fine-tuning on teacher responses.

What carries the argument

The precise token-mapping algorithm that aligns probability distributions between mismatched tokenizers so the standard on-policy distillation loss remains applicable.

If this is right

Any pair of models from different families can now serve as teacher and student for on-policy distillation.
The method preserves richer knowledge than supervised fine-tuning because the full probability distribution is used rather than sampled text.
Distillation runs become more sample-efficient on standard benchmarks.
Post-training pipelines no longer need to restrict teacher selection to models sharing the student's tokenizer.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique may allow distillation from closed-source models whose tokenizers are not publicly documented, provided only the mapping can be recovered.
Similar mapping steps could be tested on other distribution-matching objectives such as preference optimization or reinforcement learning from human feedback.
If the mapping proves robust, tokenizer standardization across the field may become less necessary.

Load-bearing premise

The mapping between tokenizers can be performed without substantial loss or distortion of the teacher's probability information.

What would settle it

A controlled comparison in which cross-tokenizer on-policy distillation produces equal or lower benchmark scores than supervised fine-tuning on the same teacher responses would falsify the efficiency claim.

Figures

Figures reproduced from arXiv: 2606.09456 by Dihong Gong, Dongyi Liu, Han Xiao, Jia Li, Yasheng Wang, Yifan Niu, Zelong Wang.

**Figure 1.** Figure 1: Adaptive Chunk Alignment. Given a student-generated response y S = {s1, . . . , sm} ∼ πθ(· | x) and its teacher-tokenized counterpart y T = {t1, . . . , tk} = T(D(y S)), our goal is to identify synchronization points between the two sequences. Due to differing subword boundaries, a single token in one sequence may correspond to a partial or multiple tokens in the other. We propose the Dual-Pointer Chunk Al… view at source ↗

**Figure 2.** Figure 2: AIME24 compute frontier when distilling Llama3.1-8B-SFT using Qwen3-8B. As shown in the figure, the gray line depicts the performance trend of checkpointing as the amount of SFT data increases. It was drawn using checkpoint performance data at 200K and 400K and based on empirically observed loglinearity [21]. The extended dashed line is an extrapolation of this law. The blue curve shows the performance … view at source ↗

**Figure 3.** Figure 3: Case study of cross-tokenizer credit assignment. From top to bottom, the rows show teacher [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

read the original abstract

On-Policy Distillation (OPD) has become a core technique in the post-training of Large Language Models (LLMs) for transferring knowledge from domain experts to student models. However, existing OPD distillation methods require teacher and student models to share the same tokenizer, restricting the applicability of OPD within the model series. Current mainstream practice typically employs Supervised Fine-Tuning (SFT) on teacher-generated responses for cross-tokenizer distillation, which fails to capture the rich knowledge embedded in the teacher's probability distribution. In this work, we enable the standard on-policy distillation method to operate across model families, ensuring that high-fidelity token-level signals can propagate across different tokenizers with a precise token-mapping algorithm. Extensive experiments show that cross-tokenizer OPD is significantly more compute-efficient than baselines on various benchmarks. Our results unlock a broader range of teacher-student pairs for OPD, opening up new avenues for adapting and enhancing interactions between LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper adds a token-mapping step so standard on-policy distillation can run across different tokenizers, but the real question is whether that step keeps the probability signals intact enough for the method to work.

read the letter

The actual advance is the token-mapping algorithm that removes the same-tokenizer requirement for on-policy distillation. Before this, people fell back to supervised fine-tuning on teacher outputs when tokenizers differed, which drops the token-level probability information that OPD is meant to transfer. The abstract positions the mapping as the missing piece that now lets any teacher-student pair be used.

What the work does cleanly is state the practical problem and claim a direct fix plus efficiency gains on benchmarks. That opens the door to mixing model families that were previously off-limits for this kind of distillation.

The soft spot is exactly the one in the stress-test note. When vocabularies do not overlap, a teacher token often corresponds to a sequence or set of student tokens, so the method has to split or aggregate probabilities somehow. If that step changes the relative likelihoods the on-policy advantage depends on, the distillation signal is already altered before the loss is applied. The abstract calls the mapping “precise,” but without seeing the algorithm, any ablations on distribution fidelity, or controls that compare mapped versus unmapped signals, it is not possible to tell whether the distortion stays small enough. The experiments are described only at the level of “significantly more compute-efficient,” which does not directly address this.

This is the kind of paper that would be useful to a reading group focused on post-training methods, mainly to discuss whether the mapping trick generalizes or needs heavy per-pair tuning. I would not cite it yet because the core claim still needs the fidelity checks. It is worth sending to peer review because the barrier it targets is real and the proposed solution is concrete enough that referees can evaluate the mapping details and the supporting measurements.

Referee Report

3 major / 2 minor

Summary. The paper claims to extend standard on-policy distillation (OPD) to teacher-student LLM pairs with mismatched tokenizers via a precise token-mapping algorithm that propagates high-fidelity token-level probability signals. It positions this as superior to SFT-based cross-tokenizer distillation in compute efficiency, with supporting experiments on various benchmarks.

Significance. If the mapping demonstrably preserves the relative likelihoods required by OPD without substantial distortion, the result would meaningfully expand the set of usable teacher-student pairs for post-training, reducing reliance on same-tokenizer constraints and potentially improving efficiency over SFT baselines.

major comments (3)

[§3.2] §3.2 (Token-Mapping Algorithm): The procedure for handling non-overlapping vocabularies (one-to-many or many-to-one mappings) is not shown to preserve the relative ordering or normalized probabilities that determine the on-policy advantage signal. Without an explicit aggregation rule (e.g., sum, max, or renormalized split) and a verification that the mapped distribution yields the same KL term or advantage estimate as the original teacher, the central claim that 'high-fidelity token-level signals can propagate' remains unverified.
[§4.3, Table 3] §4.3, Table 3 (Efficiency comparison): The reported compute savings versus SFT baselines are attributed to cross-tokenizer OPD, yet no ablation isolates the contribution of the mapping fidelity versus other factors (e.g., on-policy sampling schedule). If the mapping introduces even moderate distortion, the efficiency claim cannot be cleanly attributed to the proposed algorithm.
[§5] §5 (Discussion): The manuscript does not address whether the mapping is invertible or measure-preserving in the limit of vocabulary mismatch; a counter-example or bound showing that the expected distillation loss remains within ε of the same-tokenizer case would be required to support the 'precise' qualifier.

minor comments (2)

Notation for mapped probability vectors is introduced without a clear definition of the aggregation operator; a short appendix equation would improve readability.
Figure 2 caption does not state the tokenizer pair used for the visualized mapping; adding this detail would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below, clarifying the token-mapping procedure and indicating revisions that will be incorporated to strengthen the theoretical grounding and experimental controls.

read point-by-point responses

Referee: [§3.2] §3.2 (Token-Mapping Algorithm): The procedure for handling non-overlapping vocabularies (one-to-many or many-to-one mappings) is not shown to preserve the relative ordering or normalized probabilities that determine the on-policy advantage signal. Without an explicit aggregation rule (e.g., sum, max, or renormalized split) and a verification that the mapped distribution yields the same KL term or advantage estimate as the original teacher, the central claim that 'high-fidelity token-level signals can propagate' remains unverified.

Authors: Section 3.2 presents the token-mapping algorithm via exact string-based alignment of tokens across vocabularies. For many-to-one mappings we aggregate by summing the teacher probabilities; for one-to-many mappings we split the probability mass uniformly over the aligned subtokens and renormalize. This construction ensures the mapped distribution is a valid probability distribution whose total mass equals the original. We will revise §3.2 to state the aggregation rules explicitly and add a short verification subsection that reports the KL divergence between the original and mapped teacher distributions on a held-out set, confirming that the on-policy advantage estimates remain consistent within a small tolerance. revision: yes
Referee: [§4.3, Table 3] §4.3, Table 3 (Efficiency comparison): The reported compute savings versus SFT baselines are attributed to cross-tokenizer OPD, yet no ablation isolates the contribution of the mapping fidelity versus other factors (e.g., on-policy sampling schedule). If the mapping introduces even moderate distortion, the efficiency claim cannot be cleanly attributed to the proposed algorithm.

Authors: The efficiency gains in Table 3 are measured under identical on-policy sampling schedules and the same number of gradient steps; the only variable is the use of the mapped teacher distribution versus SFT targets. Nevertheless, we agree that an explicit ablation isolating mapping fidelity would strengthen attribution. We will add a controlled ablation in §4.3 that replaces our mapping with a random token alignment while keeping the sampling schedule fixed, demonstrating that performance degrades substantially under the random mapping and thereby confirming that the reported savings derive from the fidelity of the proposed mapping. revision: yes
Referee: [§5] §5 (Discussion): The manuscript does not address whether the mapping is invertible or measure-preserving in the limit of vocabulary mismatch; a counter-example or bound showing that the expected distillation loss remains within ε of the same-tokenizer case would be required to support the 'precise' qualifier.

Authors: Section 5 currently focuses on empirical outcomes. We will expand the discussion to note that the mapping is not invertible in general because of vocabulary asymmetry, but that probability mass is preserved by construction. We will also include a brief analytic bound showing that the expected increase in distillation loss is at most proportional to the maximum number of subtokens per original token, which remains small for typical tokenizer pairs. This addition will directly address the request for a measure-preserving argument. revision: yes

Circularity Check

0 steps flagged

No circularity: method described without equations, fits, or self-citation chains

full rationale

The provided abstract and description introduce a token-mapping algorithm for cross-tokenizer OPD but contain no equations, parameter fits, derivations, or citations. The central claim is an engineering extension of existing OPD; no step reduces by construction to its own inputs or prior self-citations. This is the common case of a self-contained descriptive paper with no detectable circularity in the given text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No technical details supplied; ledger left empty pending full text.

pith-pipeline@v0.9.1-grok · 5708 in / 908 out tokens · 14227 ms · 2026-06-27T16:59:29.118085+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

42 extracted references · 1 linked inside Pith

[1]

On-policy distillation of language models: Learning from self- generated mistakes, 2024

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self- generated mistakes, 2024

2024
[2]

Towards cross- tokenizer distillation: the universal logit distillation loss for llms, 2025

Nicolas Boizard, Kevin El Haddad, Céline Hudelot, and Pierre Colombo. Towards cross- tokenizer distillation: the universal logit distillation loss for llms, 2025

2025
[3]

Self-play fine-tuning converts weak language models to strong language models, 2024

Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models, 2024

2024
[4]

Multi-level optimal transport for universal cross-tokenizer knowledge distillation on language models, 2025

Xiao Cui, Mo Zhu, Yulei Qin, Liang Xie, Wengang Zhou, and Houqiang Li. Multi-level optimal transport for universal cross-tokenizer knowledge distillation on language models, 2025

2025
[5]

Glm-5: from vibe coding to agentic engineering, 2026

GLM-5-Team, Aohan Zeng, Xin Lv, et al. Glm-5: from vibe coding to agentic engineering, 2026

2026
[6]

The llama 3 herd of models, 2024

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The llama 3 herd of models, 2024. 9

2024
[7]

Minillm: On-policy distillation of large language models, 2026

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: On-policy distillation of large language models, 2026

2026
[8]

Merrill, Tatsunori Hashimoto, Yejin Choi, Jenia Jitsev, Reinhard Heckel, Maheswaran Sathiamoorthy, Alexandros G

Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, ...

2025
[9]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

2025
[10]

Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning, 2025

Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning, 2025

2025
[11]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021

2021
[12]

Reinforcement learning via self-distillation, 2026

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation, 2026

2026
[13]

Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024

2024
[14]

Tinybert: Distilling bert for natural language understanding, 2020

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding, 2020

2020
[15]

Todi: Token-wise distilla- tion via fine-grained divergence control, 2025

Seongryong Jung, Suwan Yoon, DongGeon Kim, and Hwanhee Lee. Todi: Token-wise distilla- tion via fine-grained divergence control, 2025

2025
[16]

Why does self-distillation (sometimes) degrade the reasoning capability of llms?, 2026

Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of llms?, 2026

2026
[17]

Distillm-2: A contrastive approach boosts the distillation of llms, 2025

Jongwoo Ko, Tianyi Chen, Sungnyun Kim, Tianyu Ding, Luming Liang, Ilya Zharkov, and Se-Young Yun. Distillm-2: A contrastive approach boosts the distillation of llms, 2025

2025
[18]

Distillm: Towards streamlined distillation for large language models, 2024

Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and Se-Young Yun. Distillm: Towards streamlined distillation for large language models, 2024

2024
[19]

Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, 2018

Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, 2018

2018
[20]

Contextual distillation model for diversified recommendation

Fan Li, Xu Si, Shisong Tang, Dingmin Wang, Kunyan Han, Bing Han, Guorui Zhou, Yang Song, and Hechang Chen. Contextual distillation model for diversified recommendation. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’24, page 5307–5316. ACM, August 2024

2024
[21]

On-policy distillation.Thinking Machines Lab: Connec- tionism, 2025

Kevin Lu and Thinking Machines Lab. On-policy distillation.Thinking Machines Lab: Connec- tionism, 2025. https://thinkingmachines.ai/blog/on-policy-distillation

2025
[22]

Universal cross-tokenizer distilla- tion via approximate likelihood matching, 2025

Benjamin Minixhofer, Ivan Vuli´c, and Edoardo Maria Ponti. Universal cross-tokenizer distilla- tion via approximate likelihood matching, 2025. 10

2025
[23]

Efficient transformers with dynamic token pooling

Piotr Nawrot, Jan Chorowski, Adrian Lancucki, and Edoardo Maria Ponti. Efficient transformers with dynamic token pooling. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page 6403–6417. Association for Computational Linguistics, 2023

2023
[24]

Byte latent transformer: Patches scale better than tokens, 2024

Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, and Srinivasan Iyer. Byte latent transformer: Patches scale better than tokens, 2024

2024
[25]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023

2023
[26]

Neural machine translation of rare words with subword units, 2016

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units, 2016

2016
[27]

Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

Pith/arXiv arXiv 2024
[28]

Cross-tokenizer llm distillation through a byte-level interface, 2026

Avyav Kumar Singh, Yen-Chen Wu, Alexandru Cioba, Alberto Bernacchia, and Davide Buffelli. Cross-tokenizer llm distillation through a byte-level interface, 2026

2026
[29]

A survey of on-policy distillation for large language models, 2026

Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models, 2026

2026
[30]

Mimo-v2-flash technical report, 2026

Core Team, Bangjun Xiao, Bingquan Xia, et al. Mimo-v2-flash technical report, 2026

2026
[31]

Qwq-32b: Embracing the power of reinforcement learning, March 2025

Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025

2025
[32]

Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers, 2020

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers, 2020

2020
[33]

Kdrl: Post-training reasoning llms via unified knowledge distillation and reinforcement learning, 2025

Hongling Xu, Qi Zhu, Heyuan Deng, Jinpeng Li, Lu Hou, Yasheng Wang, Lifeng Shang, Ruifeng Xu, and Fei Mi. Kdrl: Post-training reasoning llms via unified knowledge distillation and reinforcement learning, 2025

2025
[34]

Paced: Distillation and on-policy self-distillation at the frontier of student competence, 2026

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, and Zhipeng Wang. Paced: Distillation and on-policy self-distillation at the frontier of student competence, 2026

2026
[35]

Byt5: Towards a token-free future with pre-trained byte-to-byte models, 2022

Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. Byt5: Towards a token-free future with pre-trained byte-to-byte models, 2022

2022
[36]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, et al. Qwen3 technical report, 2025

2025
[37]

Learning beyond teacher: Generalized on-policy distillation with reward extrapolation, 2026

Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation, 2026

2026
[38]

Aligndistil: Token-level language model alignment as adaptive policy distillation, 2025

Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, and Jinan Xu. Aligndistil: Token-level language model alignment as adaptive policy distillation, 2025

2025
[39]

American invitational mathematics examination (aime) 2026, 2026

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2026, 2026

2026
[40]

Self-distilled reasoner: On-policy self-distillation for large language models, 2026

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models, 2026

2026
[41]

Llamafactory: Unified efficient fine-tuning of 100+ language models, 2024

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models, 2024

2024
[42]

A formal perspective on byte-pair encoding, 2024

Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Tim Vieira, Mrinmaya Sachan, and Ryan Cotterell. A formal perspective on byte-pair encoding, 2024. 11 A Details of Experimental Setups Table 4: Main training hyperparameters used for SFT initialization and subsequent OPD training. Hyperparameter SFT OPD Training framework LLaMA-Factory VeRL Teacher–s...

2024

[1] [1]

On-policy distillation of language models: Learning from self- generated mistakes, 2024

Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self- generated mistakes, 2024

2024

[2] [2]

Towards cross- tokenizer distillation: the universal logit distillation loss for llms, 2025

Nicolas Boizard, Kevin El Haddad, Céline Hudelot, and Pierre Colombo. Towards cross- tokenizer distillation: the universal logit distillation loss for llms, 2025

2025

[3] [3]

Self-play fine-tuning converts weak language models to strong language models, 2024

Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, and Quanquan Gu. Self-play fine-tuning converts weak language models to strong language models, 2024

2024

[4] [4]

Multi-level optimal transport for universal cross-tokenizer knowledge distillation on language models, 2025

Xiao Cui, Mo Zhu, Yulei Qin, Liang Xie, Wengang Zhou, and Houqiang Li. Multi-level optimal transport for universal cross-tokenizer knowledge distillation on language models, 2025

2025

[5] [5]

Glm-5: from vibe coding to agentic engineering, 2026

GLM-5-Team, Aohan Zeng, Xin Lv, et al. Glm-5: from vibe coding to agentic engineering, 2026

2026

[6] [6]

The llama 3 herd of models, 2024

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The llama 3 herd of models, 2024. 9

2024

[7] [7]

Minillm: On-policy distillation of large language models, 2026

Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. Minillm: On-policy distillation of large language models, 2026

2026

[8] [8]

Merrill, Tatsunori Hashimoto, Yejin Choi, Jenia Jitsev, Reinhard Heckel, Maheswaran Sathiamoorthy, Alexandros G

Etash Guha, Ryan Marten, Sedrick Keh, Negin Raoof, Georgios Smyrnis, Hritik Bansal, Marianna Nezhurina, Jean Mercat, Trung Vu, Zayne Sprague, Ashima Suvarna, Benjamin Feuer, Liangyu Chen, Zaid Khan, Eric Frankel, Sachin Grover, Caroline Choi, Niklas Muennighoff, Shiye Su, Wanjia Zhao, John Yang, Shreyas Pimpalgaonkar, Kartik Sharma, Charlie Cheng-Jie Ji, ...

2025

[9] [9]

Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

Daya Guo, Dejian Yang, Haowei Zhang, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

2025

[10] [10]

Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning, 2025

Zhiwei He, Tian Liang, Jiahao Xu, Qiuzhi Liu, Xingyu Chen, Yue Wang, Linfeng Song, Dian Yu, Zhenwen Liang, Wenxuan Wang, Zhuosheng Zhang, Rui Wang, Zhaopeng Tu, Haitao Mi, and Dong Yu. Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning, 2025

2025

[11] [11]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. NeurIPS, 2021

2021

[12] [12]

Reinforcement learning via self-distillation, 2026

Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, and Andreas Krause. Reinforcement learning via self-distillation, 2026

2026

[13] [13]

Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Ar- mando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code, 2024

2024

[14] [14]

Tinybert: Distilling bert for natural language understanding, 2020

Xiaoqi Jiao, Yichun Yin, Lifeng Shang, Xin Jiang, Xiao Chen, Linlin Li, Fang Wang, and Qun Liu. Tinybert: Distilling bert for natural language understanding, 2020

2020

[15] [15]

Todi: Token-wise distilla- tion via fine-grained divergence control, 2025

Seongryong Jung, Suwan Yoon, DongGeon Kim, and Hwanhee Lee. Todi: Token-wise distilla- tion via fine-grained divergence control, 2025

2025

[16] [16]

Why does self-distillation (sometimes) degrade the reasoning capability of llms?, 2026

Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dohyung Kim, Jiwon Jeon, Dongsheng Li, and Yuqing Yang. Why does self-distillation (sometimes) degrade the reasoning capability of llms?, 2026

2026

[17] [17]

Distillm-2: A contrastive approach boosts the distillation of llms, 2025

Jongwoo Ko, Tianyi Chen, Sungnyun Kim, Tianyu Ding, Luming Liang, Ilya Zharkov, and Se-Young Yun. Distillm-2: A contrastive approach boosts the distillation of llms, 2025

2025

[18] [18]

Distillm: Towards streamlined distillation for large language models, 2024

Jongwoo Ko, Sungnyun Kim, Tianyi Chen, and Se-Young Yun. Distillm: Towards streamlined distillation for large language models, 2024

2024

[19] [19]

Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, 2018

Taku Kudo and John Richardson. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, 2018

2018

[20] [20]

Contextual distillation model for diversified recommendation

Fan Li, Xu Si, Shisong Tang, Dingmin Wang, Kunyan Han, Bing Han, Guorui Zhou, Yang Song, and Hechang Chen. Contextual distillation model for diversified recommendation. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD ’24, page 5307–5316. ACM, August 2024

2024

[21] [21]

On-policy distillation.Thinking Machines Lab: Connec- tionism, 2025

Kevin Lu and Thinking Machines Lab. On-policy distillation.Thinking Machines Lab: Connec- tionism, 2025. https://thinkingmachines.ai/blog/on-policy-distillation

2025

[22] [22]

Universal cross-tokenizer distilla- tion via approximate likelihood matching, 2025

Benjamin Minixhofer, Ivan Vuli´c, and Edoardo Maria Ponti. Universal cross-tokenizer distilla- tion via approximate likelihood matching, 2025. 10

2025

[23] [23]

Efficient transformers with dynamic token pooling

Piotr Nawrot, Jan Chorowski, Adrian Lancucki, and Edoardo Maria Ponti. Efficient transformers with dynamic token pooling. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page 6403–6417. Association for Computational Linguistics, 2023

2023

[24] [24]

Byte latent transformer: Patches scale better than tokens, 2024

Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, and Srinivasan Iyer. Byte latent transformer: Patches scale better than tokens, 2024

2024

[25] [25]

David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023

2023

[26] [26]

Neural machine translation of rare words with subword units, 2016

Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units, 2016

2016

[27] [27]

Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024

Pith/arXiv arXiv 2024

[28] [28]

Cross-tokenizer llm distillation through a byte-level interface, 2026

Avyav Kumar Singh, Yen-Chen Wu, Alexandru Cioba, Alberto Bernacchia, and Davide Buffelli. Cross-tokenizer llm distillation through a byte-level interface, 2026

2026

[29] [29]

A survey of on-policy distillation for large language models, 2026

Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models, 2026

2026

[30] [30]

Mimo-v2-flash technical report, 2026

Core Team, Bangjun Xiao, Bingquan Xia, et al. Mimo-v2-flash technical report, 2026

2026

[31] [31]

Qwq-32b: Embracing the power of reinforcement learning, March 2025

Qwen Team. Qwq-32b: Embracing the power of reinforcement learning, March 2025

2025

[32] [32]

Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers, 2020

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. Minilm: Deep self-attention distillation for task-agnostic compression of pre-trained transformers, 2020

2020

[33] [33]

Kdrl: Post-training reasoning llms via unified knowledge distillation and reinforcement learning, 2025

Hongling Xu, Qi Zhu, Heyuan Deng, Jinpeng Li, Lu Hou, Yasheng Wang, Lifeng Shang, Ruifeng Xu, and Fei Mi. Kdrl: Post-training reasoning llms via unified knowledge distillation and reinforcement learning, 2025

2025

[34] [34]

Paced: Distillation and on-policy self-distillation at the frontier of student competence, 2026

Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, and Zhipeng Wang. Paced: Distillation and on-policy self-distillation at the frontier of student competence, 2026

2026

[35] [35]

Byt5: Towards a token-free future with pre-trained byte-to-byte models, 2022

Linting Xue, Aditya Barua, Noah Constant, Rami Al-Rfou, Sharan Narang, Mihir Kale, Adam Roberts, and Colin Raffel. Byt5: Towards a token-free future with pre-trained byte-to-byte models, 2022

2022

[36] [36]

Qwen3 technical report, 2025

An Yang, Anfeng Li, Baosong Yang, et al. Qwen3 technical report, 2025

2025

[37] [37]

Learning beyond teacher: Generalized on-policy distillation with reward extrapolation, 2026

Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, and Yankai Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation, 2026

2026

[38] [38]

Aligndistil: Token-level language model alignment as adaptive policy distillation, 2025

Songming Zhang, Xue Zhang, Tong Zhang, Bojie Hu, Yufeng Chen, and Jinan Xu. Aligndistil: Token-level language model alignment as adaptive policy distillation, 2025

2025

[39] [39]

American invitational mathematics examination (aime) 2026, 2026

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2026, 2026

2026

[40] [40]

Self-distilled reasoner: On-policy self-distillation for large language models, 2026

Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models, 2026

2026

[41] [41]

Llamafactory: Unified efficient fine-tuning of 100+ language models, 2024

Yaowei Zheng, Richong Zhang, Junhao Zhang, Yanhan Ye, Zheyan Luo, Zhangchi Feng, and Yongqiang Ma. Llamafactory: Unified efficient fine-tuning of 100+ language models, 2024

2024

[42] [42]

A formal perspective on byte-pair encoding, 2024

Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Tim Vieira, Mrinmaya Sachan, and Ryan Cotterell. A formal perspective on byte-pair encoding, 2024. 11 A Details of Experimental Setups Table 4: Main training hyperparameters used for SFT initialization and subsequent OPD training. Hyperparameter SFT OPD Training framework LLaMA-Factory VeRL Teacher–s...

2024