TIP: Token Importance in On-Policy Distillation
Pith reviewed 2026-05-22 10:42 UTC · model grok-4.3
The pith
High-entropy and overconfident low-entropy tokens carry the densest learning signal in on-policy distillation, so selecting under 10 percent of tokens can nearly match full-token baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Informative tokens arise in two regions: high student entropy positions and low student entropy positions that also show high teacher-student divergence. Retaining 50 percent of tokens by entropy-based sampling matches or exceeds all-token training while lowering peak memory by as much as 47 percent. Isolating the low-entropy high-divergence subset lets training on fewer than 10 percent of tokens nearly match full baselines, and Q3-only training on under 20 percent of tokens surpasses full-token on-policy distillation on the DeepPlanning benchmark.
What carries the argument
TIP (Token Importance in on-Policy distillation), a two-axis taxonomy that classifies every token by student entropy and teacher-student divergence to decide which positions to retain.
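The two-axis classification can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names, the three region labels, the thresholds `h_thresh` and `d_thresh`, and the choice of KL(student || teacher) as the divergence are all assumptions made here for the example.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(x * math.log(x) for x in p if x > 0.0)

def kl(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against division by zero in q."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0.0)

def classify_token(student, teacher, h_thresh, d_thresh):
    """Place one token position on the entropy/divergence axes.

    Assumption: divergence is KL(student || teacher); the labels and
    thresholds are hypothetical, chosen only for illustration.
    """
    h = entropy(student)
    d = kl(student, teacher)
    if h >= h_thresh:
        return "high_entropy"      # student is uncertain
    if d >= d_thresh:
        return "overconfident"     # student is confident but disagrees with teacher
    return "confident_agree"       # student is confident and matches teacher

# Usage: a peaked student that disagrees with a peaked teacher.
student = [0.97, 0.01, 0.01, 0.01]
teacher = [0.01, 0.97, 0.01, 0.01]
print(classify_token(student, teacher, h_thresh=0.5, d_thresh=0.5))
```

Under this sketch, the first two labels correspond to the two regions the summary describes as informative; positions with the third label would be the ones a selection rule drops.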
If this is right
- Entropy sampling of 50 percent of tokens reduces peak memory by up to 47 percent while matching or exceeding full-token performance.
- Low-entropy high-divergence tokens supply dense corrective signal, allowing training on fewer than 10 percent of tokens to nearly match full baselines.
- On long-horizon agentic planning (DeepPlanning), Q3-only training on under 20 percent of tokens surpasses full-token on-policy distillation.
- The two-axis view supplies explicit rules that combine uncertainty and disagreement for token selection.
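The first bullet's entropy sampling can be illustrated as follows. The text does not specify the sampling scheme, so this sketch assumes weighted sampling without replacement with weights proportional to student entropy, implemented with the Efraimidis-Spirakis key trick. The function name `sample_by_entropy`, the `floor` parameter, and the example entropies are hypothetical.

```python
import random

def sample_by_entropy(entropies, keep_frac=0.5, seed=0, floor=1e-8):
    """Pick round(keep_frac * n) positions (at least one), favoring high entropy.

    Assumption: selection probability grows with entropy. Each position
    gets a key u ** (1 / w) with u uniform in [0, 1) and w its entropy;
    keeping the k largest keys yields a weighted sample without replacement.
    """
    rng = random.Random(seed)
    k = max(1, round(keep_frac * len(entropies)))
    keys = [
        (rng.random() ** (1.0 / max(h, floor)), i)  # floor avoids dividing by zero
        for i, h in enumerate(entropies)
    ]
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:k])

# Usage: keep half of eight token positions.
token_entropies = [0.1, 2.0, 0.05, 1.5, 0.0, 0.9, 1.2, 0.3]
kept = sample_by_entropy(token_entropies, keep_frac=0.5, seed=0)
print(kept)  # indices of positions that would receive a distillation loss
```

Sampling rather than taking a hard top-k keeps some low-entropy positions in the training signal, although a rule based on entropy alone still cannot target the overconfident region directly.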
Where Pith is reading between the lines
- The same selection logic could be tested in offline distillation or reinforcement-learning-from-human-feedback pipelines to reduce token throughput.
- Dynamic re-weighting that changes the entropy and divergence thresholds as training progresses might further improve sample efficiency.
- Extending the taxonomy with additional uncertainty signals such as token-level loss curvature could refine the active set even more.
Load-bearing premise
The performance gains from entropy-plus-divergence selection will continue to hold for teacher-student pairs and tasks outside the three model families and three benchmarks examined.
What would settle it
Apply the same entropy-plus-divergence token filter to a fourth model family on a new benchmark such as code generation and check whether the reduced-token run still reaches within 1 percent of the full-token baseline accuracy.
Original abstract
On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher--student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining $50\%$ of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to $47\%$. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than $10\%$ of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher--student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher--student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on $<20\%$ of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.
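One way to write down the quantities the abstract refers to is given here. The notation is an assumption introduced for illustration only: $\pi_S$ and $\pi_T$ denote the student and teacher next-token distributions, $\tau_H$ and $\tau_D$ are hypothetical thresholds, and the KL direction is assumed. The paper's own definitions may differ.

```latex
% Per-position student entropy and teacher-student divergence (assumed KL direction)
H_t = -\sum_{v} \pi_S(v \mid x_{<t}) \log \pi_S(v \mid x_{<t}),
\qquad
D_t = \mathrm{KL}\big(\pi_S(\cdot \mid x_{<t}) \,\|\, \pi_T(\cdot \mid x_{<t})\big)

% Selection mask covering both informative regions, and the masked loss
m_t = \mathbb{1}\big[\, H_t \ge \tau_H \ \text{or}\ (H_t < \tau_H \ \text{and}\ D_t \ge \tau_D) \,\big],
\qquad
\mathcal{L} = \frac{1}{\sum_t m_t} \sum_t m_t \, D_t
```

In this notation an entropy-only rule keeps just the first clause of $m_t$, which is why it cannot see positions where $H_t$ is small but $D_t$ is large.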
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces TIP, a two-axis taxonomy for token importance in on-policy distillation (OPD) based on student entropy and teacher-student divergence. It argues that high-entropy positions and low-entropy high-divergence positions (where the student is overconfident and wrong) carry the densest learning signal. Empirically, entropy-based sampling of 50% of tokens matches or exceeds full-token OPD with up to 47% peak memory reduction, while training only on low-entropy high-divergence tokens (fewer than 10% of all tokens) nearly matches full-token baselines. These findings are validated across Qwen3/Llama/Qwen2.5 teacher-student pairs on MATH-500, AIME 2024/2025, and the DeepPlanning benchmark, with an implementation extending the open OPSD repository.
Significance. If the empirical results hold, the work offers a practical, low-overhead method to reduce memory and compute in OPD for large language models without sacrificing performance. The multi-model, multi-benchmark validation and the open-source extension provide concrete evidence of utility under realistic GPU constraints. The taxonomy supplies a useful organizing lens even if the precise selection rules require further tuning.
major comments (2)
- [Experiments section] MATH-500/AIME/DeepPlanning results: The headline claims that 50% entropy sampling matches full training and <10% low-entropy high-divergence tokens nearly match baselines lack reported statistical significance, error bars across random seeds, or explicit controls for total compute and training steps. These details are load-bearing for interpreting the memory savings (up to 47%) as robust rather than potentially confounded by run-to-run variance.
- [Abstract and validation paragraphs] The TIP taxonomy and type-aware rules are motivated and tested only on three model families (Qwen3, Llama, Qwen2.5) and three benchmarks. Nothing in the reported experiments rules out that the second region (overconfident wrong tokens) is less dense or less corrective on other domains, scales, or training regimes; this directly limits treating the <10% token retention result as a general property of OPD.
minor comments (2)
- [Experimental setup] The description of how exact entropy and divergence thresholds are chosen (fixed vs. percentile-based, per-model or global) is not fully specified in the experimental setup, making reproduction of the precise <10% selection rule difficult.
- [Figures] Figure captions and axis labels for the entropy-divergence scatter plots could more clearly indicate the density of selected tokens in each quadrant.
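The first minor comment turns on the difference between fixed and percentile-based thresholds, which the following sketch makes concrete. The helper names and the toy divergence values are hypothetical, and neither variant is claimed to be the rule the paper uses.

```python
def percentile_threshold(values, q):
    """Return the value at quantile q in [0, 1] (simple rank-based rule)."""
    ordered = sorted(values)
    idx = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[idx]

def select(values, threshold):
    """Indices of positions whose score is at least the threshold."""
    return [i for i, v in enumerate(values) if v >= threshold]

# Toy per-token divergence scores for one batch (hypothetical values).
divergences = [i / 10 for i in range(10)]

# Percentile-based: the retained fraction is set in advance and the cutoff adapts.
adaptive_cut = percentile_threshold(divergences, 0.9)
print(len(select(divergences, adaptive_cut)) / len(divergences))

# Fixed: the cutoff is set in advance and the retained fraction adapts.
print(len(select(divergences, 0.5)) / len(divergences))
```

A percentile rule pins the token budget (and hence memory) per batch, while a fixed rule lets the budget drift as the student's distribution changes during training, so the two choices are not interchangeable for reproducing a "<10%" result.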
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation of minor revision. The comments correctly identify areas where additional rigor in reporting and scope clarification will strengthen the manuscript. We respond to each major comment below.
- Referee: Experiments section (MATH-500/AIME/DeepPlanning results): The headline claims that 50% entropy sampling matches full training and <10% low-entropy high-divergence tokens nearly match baselines lack reported statistical significance, error bars across random seeds, or explicit controls for total compute and training steps. These details are load-bearing for interpreting the memory savings (up to 47%) as robust rather than potentially confounded by run-to-run variance.
  Authors: We agree that the absence of multi-seed statistics and explicit compute controls weakens the robustness of the reported memory savings. In the revised manuscript we will add results from at least three independent random seeds for the primary comparisons, include error bars or standard deviations, and explicitly confirm that all methods are trained for the same number of optimization steps with matched batch sizes and optimizer hyperparameters. Peak memory measurements will be reported with the same hardware and sequence-length settings to isolate the effect of token selection.
  Revision: yes
- Referee: Abstract and validation paragraphs: The TIP taxonomy and type-aware rules are motivated and tested only on three model families (Qwen3, Llama, Qwen2.5) and three benchmarks. Nothing in the reported experiments rules out that the second region (overconfident wrong tokens) is less dense or less corrective on other domains, scales, or training regimes; this directly limits treating the <10% token retention result as a general property of OPD.
  Authors: We acknowledge the limited scope of the current empirical validation. The selected models and benchmarks were chosen to cover recent open-source families and both short- and long-horizon reasoning tasks where on-policy distillation is practically relevant. In the revision we will add a limitations paragraph, moderate the language in the abstract and conclusion to present the <10% retention result as strong evidence within the tested regimes rather than a universal property, and outline directions for broader evaluation.
  Revision: yes
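The multi-seed reporting promised in the first response could look like the sketch below. The scores are synthetic placeholders, not results from the paper. The function names and the reading of "within 1 percent" as one absolute percentage point are assumptions made here.

```python
import statistics

def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

def within_margin(reduced, full, margin=1.0):
    """True if the reduced-token mean trails the full-token mean by at most
    `margin` absolute points (assumed interpretation of the criterion)."""
    return statistics.mean(full) - statistics.mean(reduced) <= margin

# Synthetic accuracy values for three seeds per method (placeholders only).
full_token = [71.0, 72.0, 73.0]
reduced_token = [70.0, 71.0, 72.0]

for name, runs in [("full", full_token), ("reduced", reduced_token)]:
    mean, std = summarize(runs)
    print(f"{name}: {mean:.1f} +/- {std:.1f} over {len(runs)} seeds")

print(within_margin(reduced_token, full_token))
```

With only three seeds the standard deviation is itself noisy, so a comparison of this kind supports a claim of "nearly matches" more convincingly when the gap between means is small relative to the reported spread.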
Circularity Check
No circularity: empirical token selection results are independent of inputs
The paper's claims rest on direct empirical comparisons of entropy-based and divergence-based token sampling against full-token baselines across three model families and benchmarks. The TIP taxonomy organizes observed patterns and the theoretical explanation for entropy's incompleteness is motivated by those observations rather than reducing any result to a fitted parameter or self-citation by construction. Implementation extends a prior repository but does not load-bear the performance claims, which are falsifiable via the reported experiments. No equations or derivations in the provided text exhibit self-definitional or fitted-input circularity.
Forward citations
Cited by 9 Pith papers
- TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
  TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preservi...
- The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
  On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.
- Rubric-based On-policy Distillation
  Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...
- Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
  On-policy distillation gains efficiency from early foresight in module focus and update directions, enabling EffOPD to accelerate training 3x with comparable performance.
- Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
  On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.
- SOD: Step-wise On-policy Distillation for Small Language Model Agents
  SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.
- SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
  SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.
- SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
  SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generati...
- Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
  On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.
Reference graph
Works this paper leans on
[19]
Assumption 2(Token-separable approximation).For tractability, we neglect off-diagonal gradient interactions across token positions. Concretely, fort̸=s we treat the centered cross-token covariance E[(gt −¯µt)(gs −¯µs)⊤] as lower-order, so that the quadratic term admits a token-separable approximation. Derivation.ExpandL(θ−ηˆg)via smoothness whereˆg= P t w...
work page 2026
-
[20]
off” (54.4%), restating the problem, while the teacher prefers “written
Best@16 results show the same pattern: overconfident-token training improves the upper tail of performance, not just the mean. Figure 4 complements Table 7 with a finer-grained view. The Avg@16 panels confirm the main- text findings: Q3-only 20% leads for both teacher sizes (12.6 and 13.6 vs. baselines of 11.7 and 12.8), and entropy-only 50% improves over...
work page 2048