Pith · machine review for the scientific record

arxiv: 2605.06387 · v3 · submitted 2026-05-07 · 💻 cs.LG · cs.AI

Recognition: unknown

Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:10 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: on-policy distillation · policy gradient · mathematical reasoning · imitation learning · reinforcement learning · token-level feedback · advantage weighting

The pith

Asymmetric On-Policy Distillation replaces negative reinforcement with localized divergence minimization for non-positive advantages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard on-policy distillation uses advantage-weighted policy gradients but suffers from high-variance updates, vanishing gradients where the advantage is zero, and exploration bottlenecks when corrective signals are weak. The paper introduces Asymmetric On-Policy Distillation (AOPD), which keeps positive reinforcement learning intact while switching to localized divergence minimization against the teacher in regions of non-positive advantage. On mathematical reasoning benchmarks this change yields average gains of 4.09 points from strong initialization and 8.34 points from weak initialization, while sustaining higher policy entropy and better capability retention during later tool-use adaptation.

Core claim

AOPD replaces ineffective negative reinforcement with localized divergence minimization in non-positive advantage regions while preserving positive reinforcement learning, yielding consistent improvements over standard OPD on mathematical reasoning benchmarks.

What carries the argument

Asymmetric handling of advantage regions that applies policy-gradient reinforcement only where advantage is positive and switches to localized teacher divergence minimization elsewhere.
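To make that mechanism concrete, here is a minimal sketch of what such an asymmetric token-level update could look like, assuming per-token advantages and student/teacher logits are already available. The function and tensor names are illustrative, forward KL is chosen as the divergence only for concreteness, and none of this is the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def asymmetric_opd_loss(student_logits, teacher_logits, actions, advantages):
    """Illustrative asymmetric token-level update (not the paper's code).

    student_logits, teacher_logits: [T, V] per-token vocabulary logits
    actions:    [T] token ids sampled from the student's own trajectory
    advantages: [T] per-token advantage estimates
    """
    log_probs = F.log_softmax(student_logits, dim=-1)                    # [T, V]
    act_logp = log_probs.gather(-1, actions.unsqueeze(-1)).squeeze(-1)   # [T]

    positive = advantages > 0

    # Positive-advantage tokens: keep the usual policy-gradient reinforcement.
    pg_loss = -(advantages * act_logp)[positive].sum()

    # Non-positive-advantage tokens: pull toward the teacher distribution
    # instead of pushing probability mass away from the sampled action.
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    kl_to_teacher = (teacher_probs
                     * (teacher_probs.clamp_min(1e-8).log() - log_probs)).sum(-1)
    guide_loss = kl_to_teacher[~positive].sum()

    return (pg_loss + guide_loss) / max(advantages.numel(), 1)
```

Under this reading, tokens with positive advantage retain the exploitation signal, while the remaining tokens receive localized teacher guidance; the paper's actual guidance term additionally involves top-K truncation and a JSD parameter β, as the figure captions below indicate.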

If this is right

  • Student policies reach higher final accuracy on mathematical reasoning tasks.
  • Policy entropy stays elevated throughout training rather than collapsing.
  • Sequential adaptation to tool-use tasks preserves more of the original capability.
  • Performance gains appear under both strong and weak starting checkpoints.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same region-specific switch could be tested in other on-policy RL settings that currently rely on full advantage-weighted gradients.
  • Token-level teacher signals may allow similar asymmetric treatment in non-math domains once the advantage signal is available.
  • Optimal radius or weighting for the localized divergence term remains open for tuning.

Load-bearing premise

That switching to localized divergence minimization in non-positive advantage regions resolves the three listed weaknesses without creating new training instabilities.

What would settle it

Run standard OPD and AOPD side-by-side on the same math-reasoning benchmarks while logging policy entropy, gradient norms, and final accuracy; if the entropy and accuracy gaps disappear or new instabilities appear, the central claim is falsified.
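A hedged sketch of the two diagnostics such a side-by-side run would need to log at every step; the helper names are ours and the training loop they plug into is left abstract.

```python
import math
import torch

@torch.no_grad()
def mean_policy_entropy(logits):
    """Mean per-token entropy (in nats) of the student's next-token distribution."""
    logp = torch.log_softmax(logits, dim=-1)
    return -(logp.exp() * logp).sum(dim=-1).mean().item()

def global_grad_norm(model):
    """Global L2 norm over all parameter gradients, measured after backward()."""
    total = 0.0
    for p in model.parameters():
        if p.grad is not None:
            total += p.grad.norm(2).item() ** 2
    return math.sqrt(total)
```

Logging these two curves for OPD and AOPD on identical prompts and seeds, alongside final accuracy, is enough to check whether the entropy gap and the reported gains persist or whether new instabilities appear.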

Figures

Figures reproduced from arXiv: 2605.06387 by Haojin Yang, Jiesong Lian, Ke Zeng, Nan Jia, Shuailiang Zhang, Weipeng Zhang, Xing Ma, Xunliang Cai, Zequn Sun.

Figure 1. Overview of Asymmetric On-Policy Distillation (AOPD). The asymmetry comes from using two different learning modes on student-generated trajectories: preserving exploitation on aligned positions and invoking teacher guidance on bottleneck positions. Left (Exploitation): When the student's reasoning aligns with the teacher, AOPD reinforces successful exploration. Right (Imitation): When the student encounter…
Figure 2. Observation and analysis of on-policy distillation.
Figure 3. Gradient norm under different values of β.
Figure 4. Training dynamics under different divergence-guidance strategies.
Figure 5. Policy entropy during training.
Figure 6. Average math score training dynamics under …
Figure 7. Ablation study on the JSD parameter β.
Figure 8. Ablation study on top-K and intervention location.
Figure 9. Training dynamics under different β values.
Figure 10. Detailed training dynamics of Qwen3-8B-Base under weak initialization.
Figure 11. Detailed training dynamics of Qwen3-8B-Base under strong initialization.
Figure 11. Training dynamics under different τ values.
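The ablation captions above mention a top-K instantiation of the teacher-guidance term and a JSD parameter β. Purely as an illustration (the exact objective and weighting are defined in the paper, not reproduced here), a guidance term restricted to the teacher's top-K support might look like the following.

```python
import torch
import torch.nn.functional as F

def topk_teacher_guidance(student_logits, teacher_logits, k=16):
    """Illustrative teacher-guidance term evaluated on the teacher's top-K support.

    A full-vocabulary divergence at every intervened position is expensive for
    large vocabularies, so the correction is computed only on the K tokens the
    teacher ranks highest, weighted by the teacher's renormalized probabilities.
    """
    teacher_probs = F.softmax(teacher_logits, dim=-1)          # [T, V]
    topk_p, topk_idx = teacher_probs.topk(k, dim=-1)           # [T, K]
    topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)         # renormalize on the support

    student_logp = F.log_softmax(student_logits, dim=-1)
    student_logp_k = student_logp.gather(-1, topk_idx)         # [T, K]

    # Teacher-weighted cross-entropy over the truncated support.
    return -(topk_p * student_logp_k).sum(dim=-1).mean()
```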
Original abstract

On-policy distillation (OPD) trains a student on its own trajectories with token-level teacher feedback and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its standard advantage weighted policy gradient suffers from three structural weaknesses, including high variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks when corrective signals are insufficient. We therefore propose Asymmetric On-Policy Distillation (AOPD), which replaces ineffective negative reinforcement with localized divergence minimization in non-positive advantage regions while preserving positive reinforcement learning. Experiments on mathematical reasoning benchmarks show that AOPD consistently outperforms standard OPD, with average gains of 4.09 / 8.34 under strong / weak initialization, respectively. AOPD also maintains higher policy entropy during training and better capability retention during sequential tool-use adaptation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies three structural weaknesses in standard on-policy distillation (OPD) — high variance updates, vanishing gradients in zero-advantage regions, and exploration bottlenecks — and proposes Asymmetric On-Policy Distillation (AOPD) to address them by replacing negative reinforcement with localized divergence minimization in non-positive advantage regions while preserving positive reinforcement learning. Experiments on mathematical reasoning benchmarks show AOPD consistently outperforms standard OPD, with average gains of 4.09 under strong initialization and 8.34 under weak initialization, while maintaining higher policy entropy during training and better capability retention during sequential tool-use adaptation.

Significance. If the results hold under rigorous validation, AOPD provides a practical algorithmic refinement to on-policy distillation that better balances exploitation and imitation at the token level. The reported gains, entropy preservation, and improved retention in adaptation scenarios represent concrete empirical strengths for reasoning-focused language model training.

major comments (2)
  1. [Experiments] Experiments section: the central claim of consistent outperformance with gains of 4.09/8.34 rests on benchmark results, yet the manuscript provides no details on statistical significance, variance or standard deviations across runs, number of random seeds, or exact baseline implementations and hyperparameter settings.
  2. [Method] Method and Experiments: no ablation study isolates the localized divergence minimization component from the positive reinforcement term, leaving open whether the three identified weaknesses are resolved without introducing new instabilities or requiring extensive retuning.
minor comments (2)
  1. Specify the exact mathematical reasoning benchmarks (e.g., GSM8K, MATH) and the precise metrics used for capability retention in the tool-use adaptation experiments.
  2. [Introduction] The abstract and introduction would benefit from a brief illustrative example or diagram showing how the asymmetric update differs from standard advantage-weighted gradients in zero- or negative-advantage tokens.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major point below and will revise the manuscript to strengthen the experimental reporting and add the requested ablation analysis.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: the central claim of consistent outperformance with gains of 4.09/8.34 rests on benchmark results, yet the manuscript provides no details on statistical significance, variance or standard deviations across runs, number of random seeds, or exact baseline implementations and hyperparameter settings.

    Authors: We agree that the absence of statistical details, variance measures, seed counts, and precise hyperparameter specifications weakens the empirical claims. In the revised manuscript we will report results over 5 random seeds with means and standard deviations, include paired t-test p-values for the reported gains, and add an appendix with exact baseline implementations, learning rates, and all other hyperparameters used for both strong and weak initialization settings. revision: yes

  2. Referee: [Method] Method and Experiments: no ablation study isolates the localized divergence minimization component from the positive reinforcement term, leaving open whether the three identified weaknesses are resolved without introducing new instabilities or requiring extensive retuning.

    Authors: We concur that an ablation isolating the localized divergence minimization term is necessary to substantiate that the three structural weaknesses are addressed by the asymmetric design. We will add this ablation study in the revision, comparing (i) full AOPD, (ii) standard OPD (positive reinforcement only), and (iii) a symmetric divergence variant applied to all tokens. The new experiments will also monitor entropy and training stability metrics to check for introduced instabilities or retuning requirements. revision: yes
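For concreteness, the three promised variants could share one training loop behind a single loss switch; the variant names are ours, and `asymmetric_opd_loss` refers to the earlier sketch rather than to released code.

```python
import torch.nn.functional as F

def ablation_loss(variant, student_logits, teacher_logits, actions, advantages):
    """Loss selector for the planned ablation: 'aopd', 'opd', or 'symmetric'."""
    if variant == "aopd":
        # (i) Full AOPD: asymmetric update from the earlier sketch.
        return asymmetric_opd_loss(student_logits, teacher_logits, actions, advantages)
    if variant == "opd":
        # (ii) Standard OPD: advantage-weighted policy gradient on every token.
        logp = F.log_softmax(student_logits, dim=-1)
        act_logp = logp.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
        return -(advantages * act_logp).mean()
    if variant == "symmetric":
        # (iii) Symmetric control: teacher divergence applied to all tokens.
        teacher_probs = F.softmax(teacher_logits, dim=-1)
        logp = F.log_softmax(student_logits, dim=-1)
        return (teacher_probs
                * (teacher_probs.clamp_min(1e-8).log() - logp)).sum(-1).mean()
    raise ValueError(f"unknown variant: {variant}")
```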

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper identifies three structural weaknesses in standard on-policy distillation's advantage-weighted policy gradient (high variance, vanishing gradients in zero-advantage regions, exploration bottlenecks) and proposes AOPD as an algorithmic replacement of negative reinforcement with localized divergence minimization in non-positive advantage regions. No load-bearing equations, predictions, or first-principles results reduce by construction to fitted parameters, self-definitions, or self-citation chains. The contribution is framed as an empirical algorithmic change, with performance gains (4.09/8.34 on math benchmarks) and secondary metrics (entropy, capability retention) presented as direct experimental evidence rather than derived outputs that loop back to inputs. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based on abstract only. No explicit free parameters or invented entities are introduced. The approach rests on standard reinforcement learning assumptions about advantage estimation and policy gradients.

axioms (1)
  • [standard math] Standard RL assumptions hold, including valid advantage estimation and policy-gradient applicability to token-level distillation.
    The method extends policy-gradient updates and advantage weighting without re-deriving them; a standard form is sketched below.
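For reference, a standard token-level form of the advantage-weighted policy-gradient objective this axiom presupposes (notation ours, not lifted from the paper):

```latex
% Advantage-weighted policy gradient over student-sampled tokens (illustrative notation)
\nabla_\theta J(\theta)
  = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}
    \left[ \sum_{t} \hat{A}_t \, \nabla_\theta \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right) \right]
```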

pith-pipeline@v0.9.0 · 5460 in / 1201 out tokens · 62961 ms · 2026-05-14T21:10:56.034278+00:00 · methodology

