
arxiv: 2606.24143 · v1 · pith:RJNX5HCInew · submitted 2026-06-23 · 💻 cs.LG

AsyncOPD: How Stale Can On-Policy Distillation Be?

Pith reviewed 2026-06-26 00:38 UTC · model grok-4.3

classification 💻 cs.LG
keywords on-policy distillation · asynchronous training · stale data · KL divergence · LLM post-training · Monte Carlo estimation · training throughput · reverse KL

The pith

Asynchronous on-policy distillation with finite teacher caches reaches comparable accuracy at 1.6× to 3.8× higher throughput than synchronous training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how stale rollouts affect on-policy distillation when rollout generation is decoupled from learner updates in LLM post-training. It shows that KL direction determines robustness: teacher-weighted forward KL tolerates staleness, while student-weighted reverse KL does not. For the vulnerable reverse-KL case, recomputing the loss signal under the current student at learner time outperforms stabilization methods from asynchronous RL. Finite teacher-score caches introduce a bias-variance tradeoff that multi-sample Monte Carlo estimation reduces while keeping the estimator correctable. These choices enable a fully asynchronous pipeline that raises training speed substantially without accuracy loss.
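
To make "stale" concrete, the toy simulation below tags each rollout batch with the policy version that generated it and counts how many learner updates pass before the batch is consumed. It is only an illustrative sketch: the FIFO queue, the `queue_depth` parameter, and the name `simulate_staleness` are assumptions made here and do not describe the AsyncOPD scheduler.

```python
from collections import deque


def simulate_staleness(num_updates, queue_depth):
    """Toy model: rollouts wait in a FIFO of fixed depth before the learner uses them."""
    queue = deque()
    version = 0  # learner's current policy version
    staleness = []
    # Pre-fill the queue with rollouts generated by the initial policy.
    for _ in range(queue_depth):
        queue.append(version)
    for _ in range(num_updates):
        # The rollout worker produces a batch with the weights it currently sees.
        queue.append(version)
        # The learner consumes the oldest batch and records its version lag.
        generated_at = queue.popleft()
        staleness.append(version - generated_at)
        version += 1  # one learner update
    return staleness


print(simulate_staleness(num_updates=5, queue_depth=2))
print(simulate_staleness(num_updates=3, queue_depth=0))
```

With `queue_depth=0` the learner always consumes a batch produced by its current weights, which corresponds to strict synchronous training. In this toy model a deeper queue gives a steady-state lag equal to the queue depth.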

Core claim

On-policy distillation faces the same on-policy systems bottleneck as RL, because rollouts can dominate training time. Asynchronous pipelines relieve the bottleneck but introduce stale-policy data, and the study finds that teacher-weighted forward KL is more robust to this staleness whereas student-weighted reverse KL is vulnerable. For reverse KL, stabilization techniques borrowed from asynchronous RL do not improve on a simpler surrogate: recomputing the signal under the current student at learner time. Finite teacher-score caches create a bias-variance tradeoff, which motivates multi-sample Monte Carlo: it preserves MC correctability while lowering one-sample variance. The resulting AsyncOPD pipeline improves throughput by 1.6× to 3.8× over strict synchronous training while reaching comparable accuracy.
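
For reference, the two KL directions contrasted above can be written for a single prefix s using the standard definitions. The symbols p_T (teacher), p_θ (student), and V (vocabulary) are notation chosen here; the paper's own objectives may include details, such as restriction to cached tokens, that this summary does not show.

```latex
\begin{aligned}
\text{forward KL (teacher-weighted):}\quad
  & \mathrm{KL}\big(p_T \,\|\, p_\theta\big)(s)
    = \sum_{a \in \mathcal{V}} p_T(a \mid s)\,\log\frac{p_T(a \mid s)}{p_\theta(a \mid s)} \\
\text{reverse KL (student-weighted):}\quad
  & \mathrm{KL}\big(p_\theta \,\|\, p_T\big)(s)
    = \sum_{a \in \mathcal{V}} p_\theta(a \mid s)\,\log\frac{p_\theta(a \mid s)}{p_T(a \mid s)}
\end{aligned}
```

One way to read the robustness gap, assuming a fixed teacher: in the reverse direction the weighting distribution is the student itself, so an expectation estimated from tokens sampled by an older student is weighted by the wrong distribution unless it is corrected. In the forward direction the per-token weights come from the teacher, which does not drift as the student is updated.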

What carries the argument

Recomputed reverse-KL estimators with multi-sample Monte Carlo correction on finite teacher-score caches, which manage staleness-induced bias and variance in local KL losses.
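
The sketch below evaluates a multi-sample surrogate of the form given in the Figure 8 caption, the negated average of ρ_θ · sg(A_θ) over cached tokens, on a three-token toy vocabulary. Two things are assumptions made here rather than statements from the paper: the signal is taken to be A_θ(a, s) = log p_T(a | s) − log p_θ(a | s), and all function names and the toy distributions are hypothetical. Plain Python has no autograd, so the stop-gradient appears only as a comment.

```python
import math


def mc_reverse_kl_surrogate(samples, student_logp, teacher_logp, old_logp):
    """Negated average of rho * A over tokens sampled from the stale policy.

    samples:      token ids drawn from the stale rollout policy p_old
    student_logp: token -> log p_theta(token | s), recomputed at learner time
    teacher_logp: token -> cached teacher log-probability (the teacher score)
    old_logp:     token -> cached rollout log-probability under p_old
    """
    total = 0.0
    for a in samples:
        # Importance ratio from the stale policy to the current student.
        rho = math.exp(student_logp[a] - old_logp[a])
        # ASSUMED signal: teacher-minus-student log-ratio under the current student.
        # In an autograd framework this factor would be wrapped in stop-gradient.
        advantage = teacher_logp[a] - student_logp[a]
        total += rho * advantage
    return -total / len(samples)


def reverse_kl(student_logp, teacher_logp):
    """Exact KL(p_theta || p_T) over a small vocabulary, for comparison."""
    return sum(math.exp(lp) * (lp - teacher_logp[a]) for a, lp in student_logp.items())


def to_logp(probs):
    return {a: math.log(p) for a, p in enumerate(probs)}


# Hypothetical toy distributions over a three-token vocabulary.
student = to_logp([0.5, 0.3, 0.2])
teacher = to_logp([0.6, 0.3, 0.1])
old = to_logp([0.4, 0.4, 0.2])

# Exact expectation of the one-sample surrogate when the token is drawn from p_old.
expected = sum(
    math.exp(old[a]) * mc_reverse_kl_surrogate([a], student, teacher, old)
    for a in old
)
print(expected, reverse_kl(student, teacher))
```

Under the assumed signal, the importance ratio cancels the stale sampling distribution, so the expected surrogate value equals the exact reverse KL of the current student even though the tokens came from `old`.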

If this is right

  • Teacher-weighted forward KL remains accurate even with stale rollouts.
  • Recomputing the reverse-KL signal at learner time stabilizes training better than RL-derived stabilization methods.
  • Multi-sample Monte Carlo reduces variance in sparse and sampled reverse-KL estimators while preserving correctability.
  • Fully asynchronous OPD pipelines become practical without major accuracy penalties.
  • Training throughput rises by 1.6× to 3.8× while accuracy stays comparable to synchronous baselines.
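
The variance claim in the third bullet can be checked on a toy example. For independent samples, averaging m draws divides the one-sample variance by m; the simulation below compares the exact one-sample variance with the simulated variance of a four-sample average. The per-token value −ρ · A uses the same assumed signal A = log p_T − log p_θ as a hypothetical stand-in for the paper's estimator, and the distributions are made up for illustration.

```python
import math
import random


def one_sample_values(student, teacher, old):
    """Per-token value of -rho * A for every token in a small vocabulary."""
    return [
        -(s / o) * (math.log(t) - math.log(s))
        for s, t, o in zip(student, teacher, old)
    ]


def exact_variance(values, old):
    """Variance of the one-sample estimator when the token is drawn from `old`."""
    mean = sum(p * v for p, v in zip(old, values))
    return sum(p * (v - mean) ** 2 for p, v in zip(old, values))


def empirical_variance(values, old, m, trials, seed=0):
    """Variance of the m-sample average, estimated by simulation."""
    rng = random.Random(seed)
    means = []
    for _ in range(trials):
        draws = rng.choices(values, weights=old, k=m)
        means.append(sum(draws) / m)
    grand = sum(means) / trials
    return sum((x - grand) ** 2 for x in means) / (trials - 1)


# Hypothetical toy distributions over a three-token vocabulary.
student = [0.5, 0.3, 0.2]
teacher = [0.6, 0.3, 0.1]
old = [0.4, 0.4, 0.2]

values = one_sample_values(student, teacher, old)
var_1 = exact_variance(values, old)
var_4 = empirical_variance(values, old, m=4, trials=20000)
print(f"one-sample variance: {var_1:.4f}")
print(f"four-sample variance (simulated): {var_4:.4f}")
```

The simulated four-sample variance should land close to a quarter of the one-sample value, while the mean of the estimator is unchanged by averaging.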

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same estimator choices could reduce the rollout bottleneck in other teacher-guided on-policy methods beyond distillation.
  • Adaptive cache sizing or dynamic sample counts might further tune the bias-variance tradeoff as model scale increases.
  • OPD appears to have different staleness sensitivities than standard asynchronous RL, suggesting OPD-specific analysis is needed rather than direct transfer of RL techniques.
  • At even larger scales where rollout time dominates more severely, the relative throughput gains from asynchrony could exceed the reported range.

Load-bearing premise

The practical setting of local KL losses with finite teacher-score caches represents the dominant failure modes of stale reverse-KL estimators in future large-scale OPD.

What would settle it

An experiment on a workload where full-vocabulary teacher logits can be stored and transferred without cost, showing whether accuracy remains comparable when the finite-cache assumption is removed.
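
A back-of-the-envelope calculation shows why the finite-cache premise matters. Every number below is a hypothetical placeholder chosen for illustration (vocabulary size, token count, 2-byte scores, two cached scores per token) and none of them is taken from the paper; the sketch also ignores the token ids and rollout log-probabilities that a real cache would store alongside the teacher scores.

```python
def teacher_feedback_bytes(num_tokens, entries_per_token, bytes_per_entry=2):
    """Bytes needed to keep `entries_per_token` teacher scores for each rollout token."""
    return num_tokens * entries_per_token * bytes_per_entry


# Hypothetical sizes for illustration only; none of these come from the paper.
VOCAB_SIZE = 150_000        # assumed vocabulary size
ROLLOUT_TOKENS = 1_000_000  # assumed number of rollout tokens in one batch
CACHED_SCORES = 2           # assumed finite cache: two sampled tokens per timestep

dense = teacher_feedback_bytes(ROLLOUT_TOKENS, VOCAB_SIZE)
cached = teacher_feedback_bytes(ROLLOUT_TOKENS, CACHED_SCORES)
print(f"dense:  {dense / 1e9:.1f} GB")
print(f"cached: {cached / 1e6:.1f} MB")
```

Under these assumed sizes the dense teacher feedback is larger than the cached feedback by a factor of `VOCAB_SIZE / CACHED_SCORES`, which is the gap the proposed experiment would have to pay for in order to remove the finite-cache assumption.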

Figures

Figures reproduced from arXiv: 2606.24143 by Donghoon Kim, Hyung Il Koo, Kangwook Lee, Kevin Galim, Minjae Lee, Minjun Kang, Minseo Kim, Rishabh Tiwari, Sanghyun Park, Seunghyuk Oh, Wonjun Kang, Yuchen Zeng.

Figure 1: Estimator design for asynchronous OPD. (a) Dense KL is the full-vocabulary reference, but …
Figure 2: Accuracy comparison under staleness for forward- and reverse-KL OPD. Reverse KL starts …
Figure 3: Accuracy comparison under staleness for the advantage-and-clipping ablation. Recomputing …
Figure 4: Accuracy comparison under staleness for advanced asynchronous RL surrogates. Decou…
Figure 5: $A_\theta$ reduces the p99 $\rho_\theta$ tail under no clip …
Figure 6: Accuracy comparison under staleness for sampled MC versus stale top- …
Figure 7: Accuracy comparison under staleness for multi-sample MC. Increasing the number of …
Figure 8: Multi-sample MC ($m = 2$). Concretely, at each visited timestep $t$ with prefix $s_t$, rollout samples $a_{t,1}, \ldots, a_{t,m} \sim p_{\mathrm{old}}(\cdot \mid s_t)$ and caches their rollout log probabilities and teacher scores. For notational simplicity, write $s = s_t$ and $a_i = a_{t,i}$ below. At learner time, we recompute $A_\theta(a_i, s)$ and use the averaged unclipped old-to-current IS surrogate $\widehat{L}^{\mathrm{MC}}_m(\theta; s) = -\frac{1}{m} \sum_{i=1}^{m} \rho_\theta(a_i, s)\, \mathrm{sg}(A_\theta(a_i, s$ …
Figure 9: Scheduler comparison for synchronous OPD, step-off scheduling, and AsyncOPD. Syn…
Figure 10: Accuracy comparison under staleness for MC importance-sampling ablations. Increasing …
Figure 11: Train-time AIME24 Avg@32 for Qwen3-Base students with MC64 and MC1. Lines are …
Figure 12: Train-time AIME24 Avg@32 for Qwen3 1.7B, 4B, and 8B students with thinking disabled, …
Original abstract

On-policy distillation (OPD) trains a student on its own rollouts guided by teacher feedback and is becoming increasingly important for large language model (LLM) post-training. Like reinforcement learning (RL), however, OPD faces an on-policy systems bottleneck, as rollouts can dominate training time for reasoning workloads. Asynchronous training pipelines can alleviate this bottleneck by decoupling rollout generation from learner updates, but doing so introduces stale-policy data. While prior work has studied stale data in asynchronous RL, its effects in OPD remain underexplored. We present the first systematic study of staleness in asynchronous OPD, focusing on a practical setting where teacher feedback is implemented through local KL losses and full-vocabulary teacher logits are too expensive to store or transfer, necessitating finite teacher-score caches. We first show that KL direction changes the stale-data problem: teacher-weighted forward KL is more robust to stale rollouts, whereas student-weighted reverse KL is vulnerable. Second, for this vulnerable reverse-KL case, we study whether methods designed to stabilize asynchronous RL can mitigate OPD staleness. In our experiments, they do not improve over a simpler OPD-specific surrogate: recomputing the reverse-KL signal under the current student at learner time. Third, we analyze how finite teacher-score caches create a bias-variance tradeoff for sparse and sampled reverse-KL OPD estimators. This motivates multi-sample Monte Carlo (MC), which preserves MC correctability while reducing one-sample variance. Finally, we present and open-source AsyncOPD, a fully asynchronous OPD training pipeline built from these estimator choices. Experiments show that AsyncOPD improves training throughput by $1.6\times$ to $3.8\times$ over strict synchronous training while reaching comparable accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents the first systematic study of staleness effects in asynchronous on-policy distillation (OPD) for LLMs. Focusing on the practical setting of local KL losses with finite teacher-score caches (as full logits are too expensive), it shows that teacher-weighted forward KL is more robust to stale rollouts than student-weighted reverse KL; for the latter, a simple OPD-specific surrogate (recomputing reverse KL under the current student) outperforms methods from async RL; it analyzes the bias-variance tradeoff induced by finite caches and motivates multi-sample Monte Carlo estimators; and it introduces and open-sources AsyncOPD, which delivers 1.6×–3.8× throughput gains over synchronous training while reaching comparable accuracy.

Significance. If the empirical results hold, the work is significant for scaling LLM post-training pipelines, where rollout time dominates. The open-sourcing of AsyncOPD and the internally coherent bias-variance analysis of the cache-based reverse-KL estimators are clear strengths that support reproducibility and future work. The finding that KL direction qualitatively changes the stale-data problem is a useful conceptual contribution.

major comments (2)
  1. [Experiments] Experiments (throughput/accuracy tables and figures): the manuscript reports no variance or standard deviation across random seeds for the 1.6×–3.8× throughput and accuracy comparisons. This makes it difficult to assess whether the “comparable accuracy” claim is statistically reliable or sensitive to initialization.
  2. [Finite-cache analysis] Finite-cache analysis and MC surrogate section: while the bias-variance tradeoff for sparse/sampled reverse-KL estimators is derived, the paper provides no additional ablation varying cache size (beyond the main figures) or sampling density. Because the central claim rests on the practical finite-cache regime, this omission leaves open whether the observed robustness and throughput gains are sensitive to cache-size choices that would appear at larger scale.
minor comments (1)
  1. Notation for the MC surrogate and cache estimators could be clarified with an explicit algorithm box or pseudocode to make the bias-correction step easier to follow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and recommendation for minor revision. We address each major comment below.

Point-by-point responses
  1. Referee: [Experiments] Experiments (throughput/accuracy tables and figures): the manuscript reports no variance or standard deviation across random seeds for the 1.6×–3.8× throughput and accuracy comparisons. This makes it difficult to assess whether the “comparable accuracy” claim is statistically reliable or sensitive to initialization.

    Authors: We agree that variance across random seeds would strengthen assessment of the comparable-accuracy claim. Our reported results used single seeds owing to the substantial compute cost of large-scale LLM training; in the revision we will rerun the key throughput/accuracy comparisons with 3–5 seeds and report means plus standard deviations. revision: yes

  2. Referee: [Finite-cache analysis] Finite-cache analysis and MC surrogate section: while the bias-variance tradeoff for sparse/sampled reverse-KL estimators is derived, the paper provides no additional ablation varying cache size (beyond the main figures) or sampling density. Because the central claim rests on the practical finite-cache regime, this omission leaves open whether the observed robustness and throughput gains are sensitive to cache-size choices that would appear at larger scale.

    Authors: The main experimental figures already vary cache size across the asynchronous regimes studied. To further address sensitivity at larger scales we will add an explicit ablation on cache size and sampling density (e.g., in an appendix) in the revised manuscript. revision: yes
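
The multi-seed reporting promised in the first response could be tabulated along the following lines. The accuracy values are placeholders invented purely to make the snippet runnable; they are not results from the paper, and the helper name `summarize` is hypothetical.

```python
import statistics


def summarize(accuracies):
    """Mean and sample standard deviation of per-seed accuracies."""
    return statistics.mean(accuracies), statistics.stdev(accuracies)


# Placeholder numbers for illustration only; these are NOT results from the paper.
sync_runs = [0.50, 0.52, 0.48]
async_runs = [0.49, 0.51, 0.50]

for name, runs in [("sync", sync_runs), ("async", async_runs)]:
    mean, std = summarize(runs)
    print(f"{name}: {mean:.3f} ± {std:.3f} over {len(runs)} seeds")
```

`statistics.stdev` is the sample standard deviation (n − 1 in the denominator), which is the usual choice when only a handful of seeds is available.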

Circularity Check

0 steps flagged

No circularity: empirical throughput claims rest on external synchronous baselines

full rationale

The paper presents an empirical study of staleness effects in asynchronous OPD, proposing estimator choices (local KL, finite caches, MC surrogate for reverse KL) and validating them via direct measurements of throughput (1.6×–3.8×) and accuracy against strict synchronous training. No derivation chain, uniqueness theorem, or fitted parameter is invoked whose output is definitionally equivalent to its input; all performance numbers are obtained from external code baselines and workloads. The bias-variance analysis of cache-based estimators follows standard Monte Carlo principles without self-referential reduction. This is the normal case of a self-contained experimental paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is an empirical systems study with no mathematical free parameters or invented entities in the central claim. The only modeling choices are standard hyperparameters (cache size, number of MC samples) whose values are reported but not fitted to the accuracy result.

pith-pipeline@v0.9.1-grok · 5897 in / 1135 out tokens · 12871 ms · 2026-06-26T00:38:59.185144+00:00 · methodology


Reference graph

Works this paper leans on

34 extracted references · 1 canonical work pages

  1. R. Agarwal, N. Vieillard, Y. Zhou, P. Stanczyk, S. R. Garea, M. Geist, and O. Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=3zKtaqxLhW

  2. DeepSeek-AI. Deepseek-v4: Towards highly efficient million-token context intelligence, 2026

  3. F. Devvrit, L. Madaan, R. Tiwari, R. Bansal, S. S. Duvvuri, M. Zaheer, I. S. Dhillon, D. Brandfonbrener, and R. Agarwal. The art of scaling reinforcement learning compute for LLMs. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=FMjeC9Msws

  4. W. Fu, J. Gao, X. Shen, C. Zhu, Z. Mei, C. He, S. Xu, G. Wei, J. Mei, W. JIASHU, T. Yang, B. Yuan, and Y. Wu. AREAL: A large-scale asynchronous reinforcement learning system for language reasoning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=X9diEuva9R

  5. W. Gao, Y. Zhao, D. An, T. Wu, L. Cao, S. Xiong, J. Huang, W. Wang, S. Yang, W. Su, et al. Rollpacker: Mitigating long-tail rollouts for fast, synchronous rl post-training. arXiv preprint arXiv:2509.21009, 2025

  6. Y. Gu, L. Dong, F. Wei, and M. Huang. MiniLLM: Knowledge distillation of large language models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=5h0qf7IBZZ

  7. D. Guo, D. Yang, H. Zhang, J. Song, P. Wang, Q. Zhu, R. Xu, R. Zhang, S. Ma, X. Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  8. Z. He, T. Liang, J. Xu, Q. Liu, X. Chen, Y. Wang, L. Song, D. Yu, Z. Liang, W. Wang, et al. Deepmath-103k: A large-scale, challenging, decontaminated, and verifiable mathematical dataset for advancing reasoning. arXiv preprint arXiv:2504.11456, 2025

  9. W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023

  10. X. Li, S. Wu, and Z. Shen. A-3po: Accelerating asynchronous llm training with staleness-aware proximal policy approximation. arXiv preprint arXiv:2512.06547, 2025

  11. Y. Li, Y. Zuo, B. He, J. Zhang, C. Xiao, C. Qian, T. Yu, H.-a. Gao, W. Yang, Z. Liu, et al. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016, 2026

  12. K. Lu and T. M. Lab. On-policy distillation. Thinking Machines Lab: Connectionism, 2025. doi: 10.64434/tml.20251026. https://thinkingmachines.ai/blog/on-policy-distillation

  13. Mathematical Association of America. American Mathematics Competitions – AMC. https://maa.org/, 2023. Accessed 2026-04-03

  14. M. Noukhovitch, S. Huang, S. Xhonneux, A. Hosseini, R. Agarwal, and A. Courville. Faster, more efficient RLHF through off-policy asynchronous learning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=FhTAG591Ve

  15. A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. In NIPS-W, 2017

  16. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  17. Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  18. G. Sheng, Y. Tong, B. Wan, W. Zhang, C. Jia, X. Wu, Y. Wu, X. Li, C. Zhang, Y. Peng, et al. Laminar: A scalable asynchronous rl post-training framework. arXiv preprint arXiv:2510.12633, 2025

  19. G. Sheng, C. Zhang, Z. Ye, X. Wu, W. Zhang, R. Zhang, Y. Peng, H. Lin, and C. Wu. Hybridflow: A flexible and efficient rlhf framework. In Proceedings of the Twentieth European Conference on Computer Systems, pages 1279–1297, 2025

  20. M. Song and M. Zheng. A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626, 2026

  21. B. Xiao, B. Xia, B. Yang, B. Gao, B. Shen, C. Zhang, C. He, C. Lou, F. Luo, G. Wang, et al. Mimo-v2-flash technical report. arXiv preprint arXiv:2601.02780, 2026

  22. Y. Xu, H. Sang, Z. Zhou, R. He, Z. Wang, and A. Geramifard. Tip: Token importance in on-policy distillation. arXiv preprint arXiv:2604.14084, 2026

  23. R. Yan, Y. Jiang, T. Wu, J. Gao, Z. Mei, W. Fu, H. Mai, W. Wang, Y. Wu, and B. Yuan. Areal-hex: Accommodating asynchronous rl training over heterogeneous gpus. arXiv preprint arXiv:2511.00796, 2025

  24. A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  25. W. Yang, W. Liu, R. Xie, K. Yang, S. Yang, and Y. Lin. Learning beyond teacher: Generalized on-policy distillation with reward extrapolation. arXiv preprint arXiv:2602.12125, 2026

  26. Q. Yu, Z. Zhang, R. Zhu, Y. Yuan, X. Zuo, Y. Yue, W. Dai, T. Fan, G. Liu, L. Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

  27. A. Zeng, X. Lv, Z. Hou, Z. Du, Q. Zheng, B. Chen, D. Yin, C. Ge, C. Huang, C. Xie, et al. Glm-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026

  28. K. Zhang, Y. Zuo, B. He, Y. Sun, R. Liu, C. Jiang, Y. Fan, K. Tian, G. Jia, P. Li, et al. A survey of reinforcement learning for large reasoning models. arXiv preprint arXiv:2509.08827, 2025

  29. S. Zhang, X. Zhang, T. Zhang, B. Hu, Y. Chen, and J. Xu. Kdflow: A user-friendly and efficient knowledge distillation framework for large language models. arXiv preprint arXiv:2603.01875, 2026

  30. Y. Zhang and T. Math-AI. AIME 2024. https://huggingface.co/datasets/Maxwell-Jia/AIME_2024, 2024. Hugging Face dataset; accessed 2026-04-03

  31. Y. Zhang and T. Math-AI. AIME 2025. https://huggingface.co/datasets/yentinglin/aime_2025, 2025. Hugging Face dataset; accessed 2026-04-03

  32. S. Zhao, Z. Xie, M. Liu, J. Huang, G. Pang, F. Chen, and A. Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026

  33. H. Zheng, J. Zhao, and B. Chen. Prosperity before collapse: How far can off-policy RL reach with stale data on LLMs? In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=IIgl5MWelz

  34. Y. Zhong, Z. Zhang, X. Song, H. Hu, C. Jin, B. Wu, N. Chen, Y. Chen, Y. Zhou, C. Wan, et al. Streamrl: Scalable, heterogeneous, and elastic rl for llms with disaggregated stream generation. arXiv preprint arXiv:2504.15930, 2025