Blockwise Policy-Drift Gating for On-Policy Distillation

Haiyun Jiang; Liwen Zheng

arxiv: 2606.24084 · v1 · pith:YCUF56C4new · submitted 2026-06-23 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-06-26 00:55 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords on-policy distillation · policy drift · blockwise gating · rollout reuse · math reasoning · pass@8 · student policy · distillation robustness

The pith

Blockwise policy-drift gating raises mean pass@8 from 0.4978 to 0.5160 in sampled-token on-policy distillation on four math benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces blockwise policy-drift gating as a lightweight addition to on-policy distillation. It measures log-probability shifts between an earlier and current version of the student policy along the sampled trajectory, aggregates those shifts inside fixed token blocks, and applies the resulting gates to reweight the distillation losses. The gates are detached and mean-normalized so they do not alter the teacher targets or the rollout policy itself. The authors test the idea inside a uniform 200-step training budget on Qwen3 across AIME24, AIME25, MATH500, and AMC23, using pass@8 as the main metric. A reader would care because on-policy methods become brittle on long reasoning chains when rollouts are reused, and this supplies a student-only control that lifts solve rates without extra teacher computation.
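The gating pipeline described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `block_gates` and `gated_loss`, the `temperature` parameter, and in particular the gate shape `exp(-|mean drift| / temperature)` are assumptions, since the source only states that drift is aggregated per block and that the gates are detached and mean-normalized.

```python
import math

def block_gates(logp_old, logp_cur, block_size=64, temperature=1.0):
    """Per-token gates from blockwise old-vs-current log-prob drift.

    ASSUMPTION: the gate shape exp(-|mean drift| / temperature) is
    illustrative; the source does not specify the mapping from drift to gate.
    """
    if len(logp_old) != len(logp_cur):
        raise ValueError("log-prob sequences must have equal length")
    n = len(logp_old)
    if n == 0:
        return []
    # Per-token drift on the sampled path: log pi_cur(y_t) - log pi_old(y_t).
    drift = [cur - old for old, cur in zip(logp_old, logp_cur)]
    raw = []
    for start in range(0, n, block_size):
        block = drift[start:start + block_size]
        mean_abs = abs(sum(block) / len(block))
        gate = math.exp(-mean_abs / temperature)
        raw.extend([gate] * len(block))  # all tokens in a block share one gate
    # Mean-normalize so the gates average to 1 over the sequence.
    mean_gate = sum(raw) / n
    return [g / mean_gate for g in raw]

def gated_loss(position_losses, gates):
    """Average of per-position losses reweighted by constant gates."""
    if len(position_losses) != len(gates) or not gates:
        raise ValueError("need equal-length, non-empty inputs")
    return sum(g * l for g, l in zip(gates, position_losses)) / len(gates)
```

In an autograd framework the gates would be computed without gradient tracking, which is what "detached" refers to; here they are plain floats, so the reweighting changes only how much each position contributes, not the teacher targets or the sampled tokens.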

Core claim

Fixed 64-token block gating improves sampled-token OPD mean pass@8 from 0.4978 to 0.5160 across the four benchmarks. On the Teacher-TopK/LSM variant the same block size yields the highest four-benchmark mean among all trained students. The method therefore treats local old-current policy drift as a usable control signal for reused rollouts and positions block-level gating as a simple default repair for OPD fragility.
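As a rough aid for reading these numbers, the snippet below shows one common way a pass@8 solve rate is computed and works out the size of the reported lift. Two assumptions are made here that the source does not confirm: that a problem counts as solved when any of its 8 samples is correct (rather than using an unbiased pass@k estimator), and that the four-benchmark mean is unweighted. The helper names are hypothetical.

```python
def pass_at_k(samples_correct):
    """Fraction of problems with at least one correct sample among k attempts.

    samples_correct: list of per-problem lists of booleans (k entries each).
    """
    if not samples_correct:
        raise ValueError("need at least one problem")
    solved = sum(1 for attempts in samples_correct if any(attempts))
    return solved / len(samples_correct)

def mean_over_benchmarks(scores):
    """Unweighted mean of per-benchmark scores (ASSUMPTION: equal weights)."""
    return sum(scores.values()) / len(scores)

# Four-benchmark means quoted from the abstract.
baseline, block64 = 0.4978, 0.5160
absolute_lift = block64 - baseline        # about 0.0182
relative_lift = absolute_lift / baseline  # about 3.7 percent
```

The absolute lift of about 0.0182 corresponds to a relative improvement of roughly 3.7 percent over the sampled-token OPD baseline.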

What carries the argument

Blockwise policy-drift gating, which computes log-probability shifts between the behavior student and the current student on the sampled path, aggregates the shifts over fixed blocks, and uses the detached, mean-normalized values to reweight position losses.
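One way to write this mechanism down, offered as a hedged formalization rather than the paper's own notation: every symbol below is an assumption. Here $\pi_{\theta_{\text{old}}}$ is the behavior student, $\pi_{\theta}$ the current student, $K$ the block size (64 in the headline result), $\phi$ an unspecified map from block drift to a positive gate, $\operatorname{sg}$ a stop-gradient, and $\ell_t$ the OPD loss at position $t$. Whether the block aggregate is a mean or a sum, and whether normalization runs over the sequence or the batch, is not stated in the abstract.

```latex
\begin{aligned}
\delta_t &= \log \pi_{\theta}(y_t \mid x, y_{<t}) - \log \pi_{\theta_{\text{old}}}(y_t \mid x, y_{<t}), \\
d_b &= \frac{1}{|B_b|} \sum_{t \in B_b} \delta_t, \qquad B_b = \{(b-1)K + 1, \dots, \min(bK, T)\}, \\
\tilde g_t &= \phi\bigl(d_{b(t)}\bigr), \qquad
g_t = \operatorname{sg}\!\left[ \frac{\tilde g_t}{\frac{1}{T} \sum_{s=1}^{T} \tilde g_s} \right], \\
\mathcal{L} &= \frac{1}{T} \sum_{t=1}^{T} g_t \, \ell_t .
\end{aligned}
```

Because $g_t$ sits inside the stop-gradient and averages to 1, it rescales the relative contribution of each position without changing $\ell_t$ itself, consistent with the claim that teacher targets and the rollout policy are left untouched.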

If this is right

Local old-current policy drift functions as a practical control signal when rollouts are reused in OPD.
Block-level aggregation of drift yields measurable robustness gains on long-horizon math reasoning under a fixed training budget.
The repair leaves teacher targets, teacher top-K supports, and the rollout policy unchanged.
On the Teacher-TopK/LSM variant, fixed 64-token blocks give the best four-benchmark mean pass@8 among trained students.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same local-drift signal could be tested on tasks outside math reasoning to check whether the benefit is domain-specific.
Combining block gating with other reported OPD repairs such as local teacher-support matching might produce additive effects.
Varying block size or making the block boundaries adaptive to drift magnitude are direct next experiments that stay within the same student-only framework.

Load-bearing premise

The reported pass@8 gains come from the blockwise gating itself rather than from other uncontrolled differences among the six training variants or from the specific 200-step budget and Qwen3 base model.

What would settle it

An ablation that keeps every other training detail identical but removes the block gating (or replaces the gates with random values) and shows that the 0.0182 mean pass@8 lift disappears.
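The random-gate control mentioned above could look something like the following sketch. It produces block-constant gates that are mean-normalized like the real ones but carry no drift information. The function name `random_control_gates` and the `uniform(0.5, 1.5)` range are arbitrary illustrative choices, not anything the paper specifies.

```python
import random

def random_control_gates(num_tokens, block_size=64, seed=0):
    """Block-constant random gates, mean-normalized to 1, ignoring drift."""
    if num_tokens <= 0:
        return []
    rng = random.Random(seed)
    raw = []
    for start in range(0, num_tokens, block_size):
        width = min(block_size, num_tokens - start)
        # ASSUMPTION: uniform(0.5, 1.5) is an arbitrary illustrative range.
        raw.extend([rng.uniform(0.5, 1.5)] * width)
    mean_gate = sum(raw) / num_tokens
    return [g / mean_gate for g in raw]
```

If training with such gates matched the drift-gated run, the lift would more plausibly come from the reweighting noise or normalization than from the drift signal itself.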

Original abstract

On-policy distillation (OPD) trains a student policy using teacher signals computed on trajectories sampled by the student itself. Recent work shows that sampled-token OPD can be fragile on long-horizon reasoning tasks and that local teacher-support matching is a simple and effective repair. This paper introduces blockwise policy-drift gating, a lightweight student-only old-current drift controller for OPD under rollout reuse. The method computes log-probability shifts between the behavior student and the current student on the sampled token path, aggregates these shifts over fixed blocks or spans, and uses the resulting detached, mean-normalized gates to reweight OPD position losses. It does not change teacher targets, teacher top-K supports, or the rollout policy. In a six-variant Qwen3 math reasoning benchmark with a uniform 200-step training budget for all trained variants, we use pass@8 as the primary problem-level solve-rate metric. Fixed 64-token block gating improves sampled-token OPD mean pass@8 from 0.4978 to 0.5160 across AIME24, AIME25, MATH500, and AMC23. On Teacher-TopK/LSM, Block64 gives the best four-benchmark mean pass@8 among trained students. The results identify local old-current policy drift as a practical control signal for reused OPD rollouts and motivate block-level gating as a simple default for improving solve-rate robustness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Blockwise drift gating is a clean student-only tweak for OPD but the 0.0182 pass@8 lift is too small and under-controlled to trust yet.

The paper's main addition is a detached gate computed from the student's own log-prob shift between the behavior policy and the current one, averaged over fixed 64-token blocks and used to reweight the OPD loss. It leaves the teacher targets and rollout policy untouched, which keeps the change minimal.

That construction is new relative to the local teacher-support matching work it cites, and the student-only nature is a practical plus for reused rollouts. The six-variant Qwen3 setup with a locked 200-step budget is at least a consistent test bed.

The results are the weak part. The headline move from 0.4978 to 0.5160 mean pass@8 is modest, and the abstract gives no error bars, seed counts, or explicit statement that optimizer state, data order, and sampling seeds were identical across the six runs. The stress-test concern holds: without those controls the delta cannot be pinned on the gating. No ablation table or per-benchmark breakdown is visible either, so we cannot tell whether the gain is consistent or driven by one dataset.

This is for groups already running on-policy distillation on math or reasoning models and looking for cheap stability knobs. A reader who wants a new control signal to try could pull the idea, but anyone needing reproducible gains should wait for the full methods and variance numbers.

It is coherent enough to send to referees if the full paper supplies the missing experimental controls; otherwise it risks being a minor empirical note.

Referee Report

1 major / 0 minor

Summary. The paper proposes blockwise policy-drift gating for on-policy distillation (OPD), a student-only mechanism that aggregates log-probability shifts between behavior and current student policies over fixed token blocks (e.g., 64 tokens) and uses detached, mean-normalized gates to reweight position losses. It reports that, under a uniform 200-step training budget on Qwen3 across six variants, 64-token block gating raises mean pass@8 from 0.4978 (baseline sampled-token OPD) to 0.5160 on AIME24, AIME25, MATH500, and AMC23, and performs best among trained students on Teacher-TopK/LSM.

Significance. If the numerical gains can be shown to be robustly caused by the gating rather than uncontrolled experimental factors, the method supplies a lightweight, parameter-free control signal for mitigating local policy drift in rollout-reuse OPD settings. The approach preserves teacher targets and rollout policy, which is a practical strength for empirical controllers.

major comments (1)

[Abstract] The central claim that fixed 64-token block gating produces a 0.0182 pass@8 lift is load-bearing for the paper, yet it is presented without error bars, statistical significance tests, ablation tables isolating block size from other variant differences, or an explicit statement that optimizer state, data order, and sampling seeds were locked identically across the six training runs. This prevents attribution of the delta to the gating mechanism.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for highlighting the need for stronger attribution of the reported gains to the gating mechanism. We address the concern point-by-point below.

Referee: [Abstract] The central claim that fixed 64-token block gating produces a 0.0182 pass@8 lift is load-bearing for the paper, yet it is presented without error bars, statistical significance tests, ablation tables isolating block size from other variant differences, or an explicit statement that optimizer state, data order, and sampling seeds were locked identically across the six training runs. This prevents attribution of the delta to the gating mechanism.

Authors: We agree the abstract would benefit from additional controls. In revision we will add an explicit statement confirming that optimizer state, data order, and sampling seeds were held identical across the six variants. We will also expand the experimental section with a table that isolates block size while holding all other factors fixed. Error bars and formal significance tests cannot be added without new multi-seed runs, which exceed our current compute budget; we will instead note the single-run limitation and point to the consistent direction of improvement across all four benchmarks as supporting evidence.

Revision: partial

standing simulated objections not resolved

Error bars and statistical significance tests cannot be provided without repeating the full set of training runs under multiple seeds.
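A partial measure that needs no retraining is a paired bootstrap over evaluation problems, sketched below under stated assumptions. It takes per-problem solved indicators (0 or 1) for two students on the same problems and returns a 95 percent percentile interval for the difference in solve rate. It quantifies evaluation-set noise only and says nothing about training-seed variance, so it does not resolve the objection above. The function name and defaults are hypothetical.

```python
import random

def paired_bootstrap_delta(solved_a, solved_b, num_resamples=2000, seed=0):
    """95% percentile interval for mean(solved_b) - mean(solved_a).

    Problems are resampled with replacement, keeping each pair together.
    """
    if len(solved_a) != len(solved_b) or not solved_a:
        raise ValueError("need two equal-length, non-empty indicator lists")
    rng = random.Random(seed)
    n = len(solved_a)
    deltas = []
    for _ in range(num_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(solved_b[i] - solved_a[i] for i in idx) / n)
    deltas.sort()
    lower = deltas[int(0.025 * num_resamples)]
    upper = deltas[int(0.975 * num_resamples) - 1]
    return lower, upper
```

An interval that comfortably excludes zero would suggest the pass@8 difference is not an artifact of which problems happen to be in the benchmarks, while leaving the multi-seed question open.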

Circularity Check

0 steps flagged

No circularity: purely empirical controller with no derivation chain

The paper introduces blockwise policy-drift gating as an empirical technique for on-policy distillation and reports pass@8 improvements across fixed training variants on math benchmarks. No equations, first-principles derivations, or predictions are presented that could reduce to fitted inputs, self-definitions, or self-citation chains. The central claim is an observed delta between training runs under a uniform 200-step budget; this is an experimental outcome, not a constructed result. The result is self-contained against external benchmarks, with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract; the method is described at the level of an algorithmic modification without explicit mathematical assumptions.

pith-pipeline@v0.9.1-grok · 5784 in / 1090 out tokens · 23438 ms · 2026-06-26T00:55:15.861488+00:00 · methodology

Reference graph

Works this paper leans on

24 extracted references · 17 linked inside Pith

[1] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network. arXiv preprint arXiv:1503.02531, 2015

[2] Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos, Matthieu Geist, and Olivier Bachem. On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes. In International Conference on Learning Representations, 2024

[3] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal Policy Optimization Algorithms. arXiv preprint arXiv:1707.06347, 2017

[4] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, and others. Qwen3 Technical Report. arXiv preprint arXiv:2505.09388, 2025

[5] BytedTsinghua-SIA. DAPO-Math-17k dataset. Hugging Face dataset. Dataset page

[6] OpenAI. MATH-500 split from the PRM800K repository. GitHub repository for Let’s Verify Step by Step. MATH splits

[7] Art of Problem Solving. 2024 AIME I and AIME II problems. AoPS Wiki. AIME problems and solutions

[8] Jasper Dekoninck, Nikola Jovanović, Tim Gehrunger, Kári Rögnvaldsson, Ivo Petrov, Chenhao Sun, and Martin Vechev. Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs. arXiv preprint arXiv:2605.00674, 2026

[9] Art of Problem Solving. 2023 AMC 12A and 2023 AMC 12B problems. AoPS Wiki. AMC problems and solutions

[10] Yuqian Fu, Haohuan Huang, Kaiwen Jiang, Jiacai Liu, Zhuo Jiang, Yuanheng Zhu, and Dongbin Zhao. Revisiting On-Policy Distillation: Empirical Failure Modes and Simple Fixes. arXiv preprint arXiv:2603.25562, 2026

[11] Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, and Ning Ding. Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe. arXiv preprint arXiv:2604.13016, 2026

[12] Mingyang Song and Mao Zheng. A Survey of On-Policy Distillation for Large Language Models. arXiv preprint arXiv:2604.00626, 2026

[13] Feng Luo, Yu-Neng Chuang, Guanchu Wang, Zicheng Xu, Xiaotian Han, Tianyi Zhang, and Vladimir Braverman. Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models. arXiv preprint arXiv:2604.08527, 2026

[14] Woogyeol Jin, Taywon Min, Yongjin Yang, Swanand Ravindra Kadhe, Yi Zhou, Dennis Wei, Nathalie Baracaldo, and Kimin Lee. Entropy-Aware On-Policy Distillation of Language Models. arXiv preprint arXiv:2603.07079, 2026

[15] Yaocheng Zhang, Jiajun Chai, Songjun Tu, Yuqian Fu, Xiaohan Wang, Wei Lin, Guojun Yin, Qichao Zhang, Yuanheng Zhu, and Dongbin Zhao. Are Full Rollouts Necessary for On-Policy Distillation? arXiv preprint arXiv:2605.31490, 2026

[16] Zhicheng Yang, Zhijiang Guo, Yifan Song, Minrui Xu, Yongxin Wang, Yiwei Wang, Xiaodan Liang, and Jing Tang. Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning. arXiv preprint arXiv:2605.07804, 2026

[17] Yuanda Xu, Hejian Sang, Zhengze Zhou, Ran He, Zhipeng Wang, and Alborz Geramifard. TIP: Token Importance in On-Policy Distillation. arXiv preprint arXiv:2604.14084, 2026

[18] Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, and Xunliang Cai. SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting. arXiv preprint arXiv:2604.10688, 2026

[19] Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Xiaofeng Zhang, and Xiaosong Yuan. SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling. arXiv preprint arXiv:2606.09304, 2026

[20] Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li, and Yehui Tang. Trust Region On-Policy Distillation. arXiv preprint arXiv:2606.01249, 2026

[21] Yuying Li, Leqi Zheng, Yongzi Yu, Wenrui Zhou, Xuchang Zhong, Xing Hu, Jing Jin, Huangjie Yuan, and Tao Feng. Filter, Then Reweight: Rethinking Optimization Granularity in On-Policy Distillation. arXiv preprint arXiv:2606.02684, 2026

[22] Xianwei Chen, Shimin Zhang, and Jibin Wu. f-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control. arXiv preprint arXiv:2605.17862, 2026

[23] Wonjun Kang, Kevin Galim, Seunghyuk Oh, Minjun Kang, Sanghyun Park, Donghoon Kim, Minjae Lee, Minseo Kim, Rishabh Tiwari, Yuchen Zeng, Hyung Il Koo, and Kangwook Lee. AsyncOPD: How Stale Can On-Policy Distillation Be? OpenReview, 2026

[24] lllyx. Qwen3-4B-Base-GRPO model card. Hugging Face, 2026. Model page
