TIP: Token Importance in On-Policy Distillation

Alborz Geramifard; Hejian Sang; Ran He; Yuanda Xu; Zhengze Zhou; Zhipeng Wang

arxiv: 2604.14084 · v4 · pith:NG7FBDOSnew · submitted 2026-04-15 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-22 10:42 UTC · model grok-4.3

keywords: on-policy distillation, token importance, knowledge distillation, entropy sampling, teacher-student divergence, memory-efficient training, large language models, selective training

The pith

High-entropy and overconfident low-entropy tokens carry the densest learning signal in on-policy distillation, so selecting under 10 percent of tokens can nearly match full-token baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks which token positions supply the strongest learning signal when a student model trains on its own rollouts under token-level teacher supervision. It identifies two key regions: positions where the student shows high uncertainty, and positions where the student is overconfident yet disagrees with the teacher. Entropy alone already lets the method keep only half the tokens while matching or beating full training and cutting peak memory by up to 47 percent. Adding the second region of low-entropy but high-divergence tokens shrinks the active set to fewer than 10 percent of tokens without losing performance. The authors organize these observations into a two-axis taxonomy they call TIP and test it across three model families on math and long-horizon planning benchmarks.

Core claim

Informative tokens arise in two regions: positions with high student entropy, and positions with low student entropy that also show high teacher-student divergence. Retaining 50 percent of tokens by entropy-based sampling matches or exceeds all-token training while lowering peak memory by as much as 47 percent. Isolating the low-entropy, high-divergence subset lets training on fewer than 10 percent of tokens nearly match full baselines, and training only on that quadrant (Q3 in the TIP taxonomy) with under 20 percent of tokens surpasses full-token on-policy distillation on the DeepPlanning benchmark.

What carries the argument

TIP (Token Importance in on-Policy distillation), a two-axis taxonomy that classifies every token by student entropy and teacher-student divergence to decide which positions to retain.
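
The two-axis classification can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names, the use of KL(student || teacher) as the divergence, and both thresholds are assumptions made for the example, since the page does not specify them.

```python
import math
from typing import Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats; eps guards against a zero probability in q."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0.0)


def classify_token(
    student: Sequence[float],
    teacher: Sequence[float],
    h_thresh: float,  # hypothetical entropy cutoff
    d_thresh: float,  # hypothetical divergence cutoff
) -> str:
    """Place one token position on the entropy/divergence map."""
    h = entropy(student)
    # Assumption: divergence is measured as KL(student || teacher).
    d = kl_divergence(student, teacher)
    if h >= h_thresh:
        return "high-entropy"
    if d >= d_thresh:
        return "low-entropy/high-divergence"  # the overconfident region
    return "low-entropy/low-divergence"


# Toy distributions over a three-token vocabulary (made up for illustration).
confident_wrong = classify_token([0.98, 0.01, 0.01], [0.01, 0.98, 0.01], 0.5, 1.0)
uncertain = classify_token([1 / 3, 1 / 3, 1 / 3], [0.98, 0.01, 0.01], 0.5, 1.0)
confident_right = classify_token([0.98, 0.01, 0.01], [0.98, 0.01, 0.01], 0.5, 1.0)
```

In this toy setup the first position is confidently wrong and lands in the low-entropy, high-divergence region that an entropy-only rule would skip.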

If this is right

Entropy sampling of 50 percent of tokens reduces peak memory by up to 47 percent while matching or exceeding full-token performance.
Low-entropy high-divergence tokens supply dense corrective signal, allowing training on fewer than 10 percent of tokens to nearly match full baselines.
On long-horizon agentic planning (DeepPlanning), Q3-only training on under 20 percent of tokens surpasses full-token on-policy distillation.
The two-axis view supplies explicit rules that combine uncertainty and disagreement for token selection.
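
As a rough sketch of entropy-based retention, the snippet below keeps the highest-entropy fraction of positions and averages a per-token loss over that subset only. It assumes a deterministic top-k rule; the paper refers to "entropy-based sampling", which may be stochastic, and the names `select_by_entropy` and `masked_mean_loss` are hypothetical.

```python
import math
from typing import List


def select_by_entropy(entropies: List[float], retention: float) -> List[int]:
    """Return indices of the top `retention` fraction of positions by entropy."""
    if not 0.0 < retention <= 1.0:
        raise ValueError("retention must be in (0, 1]")
    if not entropies:
        return []
    k = max(1, math.ceil(retention * len(entropies)))
    ranked = sorted(range(len(entropies)), key=lambda i: entropies[i], reverse=True)
    return sorted(ranked[:k])  # restore sequence order


def masked_mean_loss(per_token_loss: List[float], kept: List[int]) -> float:
    """Average the loss over retained positions only."""
    if not kept:
        raise ValueError("no positions retained")
    return sum(per_token_loss[i] for i in kept) / len(kept)


# Made-up per-position values for a four-token rollout.
kept = select_by_entropy([0.1, 2.0, 0.5, 1.5], retention=0.5)
loss = masked_mean_loss([1.0, 2.0, 3.0, 4.0], kept)
```

Whether a selection like this also reduces peak memory depends on how the training framework handles the dropped positions, which this sketch does not model.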

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same selection logic could be tested in offline distillation or reinforcement-learning-from-human-feedback pipelines to reduce token throughput.
Dynamic re-weighting that changes the entropy and divergence thresholds as training progresses might further improve sample efficiency.
Extending the taxonomy with additional uncertainty signals such as token-level loss curvature could refine the active set even more.

Load-bearing premise

The performance gains from entropy-plus-divergence selection will continue to hold for teacher-student pairs and tasks outside the three model families and three benchmarks examined.

What would settle it

Apply the same entropy-plus-divergence token filter to a fourth model family on a new benchmark such as code generation and check whether the reduced-token run still reaches within 1 percent of the full-token baseline accuracy.

Figures

Figures reproduced from arXiv: 2604.14084 by Alborz Geramifard, Hejian Sang, Ran He, Yuanda Xu, Zhengze Zhou, Zhipeng Wang.

**Figure 1.** Cross-task summary: average accuracy by selection method. Each panel shows one benchmark; bar height is the mean accuracy (mean@16) averaged across three teacher–student pairs for mathematical reasoning (Qwen3-8B→4B, Llama-70B→8B, Qwen2.5-14B→1.5B) and across two teacher sizes (14B, 32B) for DeepPlanning. Methods: Base. = all-token OPD (100%); Ent. 50%/20% = entropy-based token selection at the stated rete…

**Figure 2.** TIP taxonomy as a two-axis map. Entropy determines whether the student is uncertain or confident; divergence determines whether the teacher agrees or disagrees. Q1 and Q2 are visible to entropy-based methods, while Q3 is the low-entropy blind spot that requires divergence to detect. Student entropy. ht = H…

**Figure 3.** Entropy sampling across retention ratios. Accuracy (mean@16) on three benchmarks as a function of retention ratio. Retaining 50% of tokens with entropy-based sampling matches or outperforms the all-token baseline across model pairs. At very low retention, entropy-only selection begins to plateau or degrade. B.1 Teacher Entropy Is Uninformative: Teacher entropy is near-zero everywhere (mean 0.031, std 0.055 …

**Figure 4.** Token selection for agentic OPD on 20% held-out travel-planning queries. Top row: Avg@16; Bottom row: Best@16 (Pass@16). Within each row the left panel uses the 14B teacher and the right panel uses the 32B teacher. Q3-only 20% matches or exceeds the full-token baseline in every setting, consistent with…

Original abstract

On-policy knowledge distillation (OPD) trains a student on its own rollouts under token-level supervision from a teacher. Not all token positions matter equally, but existing views of token importance are incomplete. We ask a direct question: which tokens carry the most useful learning signal in OPD? Our answer is that informative tokens come from two regions: positions with high student entropy, and positions with low student entropy plus high teacher-student divergence, where the student is overconfident and wrong. Empirically, student entropy is a strong first-order proxy: retaining $50\%$ of tokens with entropy-based sampling matches or exceeds all-token training while reducing peak memory by up to $47\%$. But entropy alone misses a second important region. When we isolate low-entropy, high-divergence tokens, training on fewer than $10\%$ of all tokens nearly matches full-token baselines, showing that overconfident tokens carry dense corrective signal despite being nearly invisible to entropy-only rules. We organize these findings with TIP (Token Importance in on-Policy distillation), a two-axis taxonomy over student entropy and teacher-student divergence, and give a theoretical explanation for why entropy is useful yet structurally incomplete. This view motivates type-aware token selection rules that combine uncertainty and disagreement. We validate this picture across three teacher-student pairs spanning Qwen3, Llama, and Qwen2.5 on MATH-500 and AIME 2024/2025, and on the DeepPlanning benchmark for long-horizon agentic planning, where Q3-only training on $<20\%$ of tokens surpasses full-token OPD. Our experiments are implemented by extending the OPD repository https://github.com/HJSang/OPSD_OnPolicyDistillation, which supports memory-efficient distillation of larger models under limited GPU budgets.
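
The two regions named in the abstract can be written down compactly. The notation below is an assumption made for illustration: $\pi_S$ and $\pi_T$ denote the student and teacher next-token distributions, $x$ the prompt, $y_{<t}$ the student's rollout prefix, $D$ an unspecified divergence (the abstract does not state which one, or in which direction), and $\tau_h$, $\tau_d$ hypothetical thresholds.

```latex
% Illustrative notation only; not taken from the paper.
\[
h_t = H\big(\pi_S(\cdot \mid x, y_{<t})\big)
    = -\sum_{v \in \mathcal{V}} \pi_S(v \mid x, y_{<t}) \log \pi_S(v \mid x, y_{<t}),
\qquad
d_t = D\big(\pi_T(\cdot \mid x, y_{<t}),\; \pi_S(\cdot \mid x, y_{<t})\big).
\]
\[
\mathcal{S}_{\text{uncertain}} = \{\, t : h_t \ge \tau_h \,\},
\qquad
\mathcal{S}_{\text{overconfident}} = \{\, t : h_t < \tau_h,\; d_t \ge \tau_d \,\}.
\]
```

Under this reading, an entropy-only rule keeps $\mathcal{S}_{\text{uncertain}}$ and cannot see $\mathcal{S}_{\text{overconfident}}$, because membership in the second set depends on $d_t$.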

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The TIP taxonomy adds a useful second axis to token selection in on-policy distillation, showing that low-entropy high-divergence tokens carry dense signal, but the gains are shown only on a narrow set of models and tasks.

The paper's core claim is that on-policy distillation can drop most tokens without losing much by using student entropy as a first cut and then adding teacher-student divergence to catch overconfident mistakes. Retaining 50% of tokens via entropy sampling matches or beats full training with up to 47% lower peak memory, and training on under 10% of tokens drawn from the low-entropy, high-divergence region comes close to baselines on the tested setups. That is the practical takeaway worth noting first.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TIP, a two-axis taxonomy for token importance in on-policy distillation (OPD) based on student entropy and teacher-student divergence. It argues that high-entropy positions and low-entropy, high-divergence positions (where the student is overconfident and wrong) carry the densest learning signal. Empirically, entropy-based sampling of 50% of tokens matches or exceeds full-token OPD with up to 47% peak memory reduction, while selecting under 10% of tokens from the low-entropy, high-divergence region nearly matches baselines. These findings are validated across Qwen3/Llama/Qwen2.5 teacher-student pairs on MATH-500, AIME 2024/2025, and the DeepPlanning benchmark, with an implementation extending the open OPSD repository.

Significance. If the empirical results hold, the work offers a practical, low-overhead method to reduce memory and compute in OPD for large language models without sacrificing performance. The multi-model, multi-benchmark validation and the open-source extension provide concrete evidence of utility under realistic GPU constraints. The taxonomy supplies a useful organizing lens even if the precise selection rules require further tuning.

major comments (2)

[Experiments section] MATH-500/AIME/DeepPlanning results: The headline claims that 50% entropy sampling matches full training and <10% low-entropy high-divergence tokens nearly match baselines lack reported statistical significance, error bars across random seeds, or explicit controls for total compute and training steps. These details are load-bearing for interpreting the memory savings (up to 47%) as robust rather than potentially confounded by run-to-run variance.
[Abstract and validation paragraphs] The TIP taxonomy and type-aware rules are motivated and tested only on three model families (Qwen3, Llama, Qwen2.5) and three benchmarks. Nothing in the reported experiments rules out that the second region (overconfident wrong tokens) is less dense or less corrective on other domains, scales, or training regimes; this directly limits treating the <10% token retention result as a general property of OPD.

minor comments (2)

[Experimental setup] The description of how exact entropy and divergence thresholds are chosen (fixed vs. percentile-based, per-model or global) is not fully specified in the experimental setup, making reproduction of the precise <10% selection rule difficult.
[Figures] Figure captions and axis labels for the entropy-divergence scatter plots could more clearly indicate the density of selected tokens in each quadrant.
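
The fixed-versus-percentile distinction raised in the first minor comment can be made concrete with a small sketch. Everything here is hypothetical: the helper names, the nearest-rank percentile rule, the cutoff values, and the toy scores are illustrative and are not taken from the paper.

```python
import math
from typing import List


def percentile_threshold(values: List[float], q: float) -> float:
    """Nearest-rank q-th percentile of `values`, for 0 < q <= 100."""
    if not values:
        raise ValueError("values must be non-empty")
    if not 0.0 < q <= 100.0:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def select_at_or_above(values: List[float], threshold: float) -> List[int]:
    """Indices of positions whose score meets the threshold."""
    return [i for i, v in enumerate(values) if v >= threshold]


# Made-up divergence scores for ten token positions.
divergences = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]

# Fixed rule: one global cutoff, so the retained fraction varies with the data.
fixed = select_at_or_above(divergences, 0.75)

# Percentile rule: the cutoff is recomputed from the scores at hand.
adaptive = select_at_or_above(divergences, percentile_threshold(divergences, 90))
```

The two rules retain different positions on the same scores, which is why a reproduction needs to know which rule was used and whether it was calibrated per model or globally.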

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation of minor revision. The comments correctly identify areas where additional rigor in reporting and scope clarification will strengthen the manuscript. We respond to each major comment below.

Referee: Experiments section (MATH-500/AIME/DeepPlanning results): The headline claims that 50% entropy sampling matches full training and <10% low-entropy high-divergence tokens nearly match baselines lack reported statistical significance, error bars across random seeds, or explicit controls for total compute and training steps. These details are load-bearing for interpreting the memory savings (up to 47%) as robust rather than potentially confounded by run-to-run variance.

Authors: We agree that the absence of multi-seed statistics and explicit compute controls weakens the robustness of the reported memory savings. In the revised manuscript we will add results from at least three independent random seeds for the primary comparisons, include error bars or standard deviations, and explicitly confirm that all methods are trained for the same number of optimization steps with matched batch sizes and optimizer hyperparameters. Peak memory measurements will be reported with the same hardware and sequence-length settings to isolate the effect of token selection.
Revision: yes

Referee: Abstract and validation paragraphs: The TIP taxonomy and type-aware rules are motivated and tested only on three model families (Qwen3, Llama, Qwen2.5) and three benchmarks. Nothing in the reported experiments rules out that the second region (overconfident wrong tokens) is less dense or less corrective on other domains, scales, or training regimes; this directly limits treating the <10% token retention result as a general property of OPD.

Authors: We acknowledge the limited scope of the current empirical validation. The selected models and benchmarks were chosen to cover recent open-source families and both short- and long-horizon reasoning tasks where on-policy distillation is practically relevant. In the revision we will add a limitations paragraph, moderate the language in the abstract and conclusion to present the <10% retention result as strong evidence within the tested regimes rather than a universal property, and outline directions for broader evaluation.
Revision: yes
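
The multi-seed reporting promised in the first response amounts to a mean and a sample standard deviation per method. A minimal sketch follows, using Python's standard `statistics` module; the method labels and accuracy values are placeholders invented for the example and are not results from the paper.

```python
import statistics
from typing import Dict, List, Tuple


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of one method across seeds."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds for a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder accuracies for three seeds per method; NOT results from the paper.
runs: Dict[str, List[float]] = {
    "full-token": [70.0, 72.0, 74.0],
    "entropy-50%": [71.0, 72.0, 73.0],
}

summary = {name: summarize(scores) for name, scores in runs.items()}

for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.1f} +/- {std:.1f} (n={len(runs[name])})")
```

With only three seeds the standard deviation is itself a noisy estimate, so overlapping intervals are better read as "not distinguishable at this sample size" than as evidence of equivalence.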

Circularity Check

0 steps flagged

No circularity: empirical token selection results are independent of inputs

The paper's claims rest on direct empirical comparisons of entropy-based and divergence-based token sampling against full-token baselines across three model families and benchmarks. The TIP taxonomy organizes observed patterns, and the theoretical explanation for entropy's incompleteness is motivated by those observations rather than reducing any result to a fitted parameter or self-citation by construction. The implementation extends a prior repository, but the performance claims do not depend on it; they are falsifiable via the reported experiments. No equations or derivations in the provided text exhibit self-definitional or fitted-input circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are stated in the abstract; the work is primarily empirical, with implicit thresholds for high/low entropy and divergence.

pith-pipeline@v0.9.0 · 5881 in / 997 out tokens · 31786 ms · 2026-05-22T10:42:43.211653+00:00 · methodology

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment
cs.AI 2026-05 unverdicted novelty 7.0

TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preservi...

The Extrapolation Cliff in On-Policy Distillation of Near-Deterministic Structured Outputs
cs.LG 2026-05 unverdicted novelty 7.0

On-policy distillation has an extrapolation cliff at closed-form lambda*(p,b,c) set by teacher modal probability, warm-start mass, and clip strength, past which training shifts from format-preserving to format-collapsing.

Rubric-based On-policy Distillation
cs.LG 2026-05 unverdicted novelty 7.0

Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

On-policy distillation gains efficiency from early foresight in module focus and update directions, enabling EffOPD to accelerate training 3x with comparable performance.

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

On-policy distillation gains efficiency from early foresight in module allocation and low-rank update directions, enabling EffOPD to accelerate training by 3x via adaptive extrapolation without extra modules or tuning.

SOD: Step-wise On-policy Distillation for Small Language Model Agents
cs.CL 2026-05 unverdicted novelty 6.0

SOD reweights on-policy distillation strength step-by-step using divergence to stabilize tool use in small language model agents, yielding up to 20.86% gains and 26.13% on AIME 2025 for a 0.6B model.

SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

SimCT enlarges the supervision space in cross-tokenizer on-policy distillation using short jointly tokenizable multi-token continuations, producing consistent gains over shared-token baselines on math and code benchmarks.

SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 6.0

SimCT recovers discarded teacher signal in cross-tokenizer on-policy distillation by enlarging supervision to jointly realizable multi-token continuations, yielding consistent gains on math reasoning and code generati...

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation
cs.CL 2026-05 unverdicted novelty 5.0

On-policy distillation gains efficiency from early foresight in module allocation and update directions, which the proposed EffOPD method exploits for 3x faster training with comparable performance.

