arxiv: 2511.18871 · v7 · submitted 2025-11-24 · 💻 cs.LG · cs.AI

Periodic Asynchrony: An On-Policy Approach for Accelerating LLM Reinforcement Learning

Jian Lu

Pith reviewed 2026-05-17 05:29 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords periodic asynchrony · on-policy RL · LLM reinforcement learning · asynchronous framework · throughput improvement · tri-model architecture · shared-prompt attention


The pith

Synchronizing model weights at the start of each iteration keeps asynchronous RL training on-policy for LLM post-training without changes to standard algorithms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a periodically asynchronous framework that separates inference and training for reinforcement learning in large language models, turning the process into an asynchronous producer-consumer pipeline. By synchronizing model weights only at the beginning of each training iteration and generating all rollouts from the same policy, it stays inherently on-policy. This avoids the off-policy bias of other asynchronous methods while enabling concurrent execution for better efficiency. A unified tri-model architecture and shared-prompt attention mechanism further support this by reducing redundant computations. Experiments show roughly 2x throughput gains on NPU and up to 3x speedups on GPU, with accuracy comparable to synchronous baselines.
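The producer-consumer pattern described above can be sketched in a few lines. This is a toy illustration, not the paper's implementation: `run_periodic_async`, the integer "weight version", and the choice to apply a single optimizer step at the end of each iteration are all assumptions made for the example, and the paper's actual update schedule may differ.

```python
import queue
import threading


def run_periodic_async(num_iterations, prompts):
    """Toy periodically asynchronous loop.

    Returns a list of (rollout_version, trainer_version) pairs, one per
    consumed rollout, so the caller can check that they always match.
    """
    trainer_version = 0  # hypothetical stand-in for the trainer's weights
    log = []
    for _ in range(num_iterations):
        # Periodic sync: the inference side copies the trainer weights
        # once, at the start of the iteration, and then stays frozen.
        rollout_version = trainer_version
        buffer = queue.Queue()

        def producer(version=rollout_version):
            # Inference side: every rollout comes from the same snapshot.
            for prompt in prompts:
                buffer.put((prompt, f"response-to-{prompt}", version))
            buffer.put(None)  # sentinel: no more rollouts this iteration

        worker = threading.Thread(target=producer)
        worker.start()

        # Training side: consume rollouts while the producer is still running.
        while True:
            item = buffer.get()
            if item is None:
                break
            _, _, version = item
            log.append((version, trainer_version))  # gradient accumulation would go here
        worker.join()

        # Assumption: one optimizer step per iteration, applied after all
        # rollouts of that iteration have been consumed.
        trainer_version += 1
    return log
```

In this sketch the consumer starts working as soon as the first rollout arrives, which is where the overlap between inference and training comes from, while the version check shows that no rollout is ever older than the weights being trained.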

Core claim

By synchronising model weights at the beginning of each training iteration and generating all rollouts from the same policy, the proposed framework remains inherently on-policy without any modification to standard RL algorithms, thereby avoiding the off-policy bias introduced by existing asynchronous approaches and enabling faster training through separation of inference and training.

What carries the argument

The periodically asynchronous framework that synchronizes model weights at the start of each iteration, combined with a unified tri-model architecture and shared-prompt attention mechanism for efficient execution.

If this is right

Transforms synchronous RL training into an asynchronous pipeline for concurrent inference and training.
Achieves approximately 2x throughput improvement on NPU platforms from asynchronous execution.
Delivers speedups of up to 3x on GPU platforms compared to mainstream frameworks.
Maintains comparable accuracy while providing substantial end-to-end throughput gains.
Offers an algorithm-agnostic solution applicable to standard RL methods for scalable post-training.
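
A rough way to see where a figure near 2x could come from, under an idealized pipelining model that is an assumption here rather than the paper's own analysis. The symbols $T_{\text{inf}}$, $T_{\text{train}}$, and $T_{\text{overhead}}$ are hypothetical per-iteration times for inference, training, and synchronization or pipeline start-up.

```latex
\begin{align*}
T_{\text{sync}}  &= T_{\text{inf}} + T_{\text{train}}, \\
T_{\text{async}} &\approx \max(T_{\text{inf}},\, T_{\text{train}}) + T_{\text{overhead}}, \qquad T_{\text{overhead}} \ge 0, \\
\text{speedup}   &= \frac{T_{\text{sync}}}{T_{\text{async}}} \le 2 .
\end{align*}
```

The bound is reached only when inference and training take equal time and the overhead is negligible. Under this model, anything beyond 2x has to come from other sources, which is consistent with the abstract attributing additional gains to system-level optimisations.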

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending the periodic synchronization to other RL settings could improve efficiency in non-LLM domains.
The tri-model setup might inspire similar architectures for reducing compute in distributed training systems.
Further research could explore adaptive synchronization intervals based on model convergence rates.

Load-bearing premise

Synchronizing model weights only at the start of each training iteration and generating all rollouts from that fixed policy is sufficient to keep the entire process on-policy without introducing meaningful bias or instability over multiple iterations.

What would settle it

If training with this framework shows degraded final model performance or higher variance in learning curves compared to a synchronous on-policy baseline on standard LLM RL tasks, it would indicate that the periodic approach introduces unacceptable bias.

Figures

Figures reproduced from arXiv: 2511.18871 by Jian Lu.

**Figure 1.** Triple Models with Shared Distribution. The reference network retains the original model weights, while the old policy network holds weights delayed by one training step relative to the policy model.

**Figure 2.** Shared-prompt attention mask. For probability computation, the prompt tokens are discarded and the loss is computed only over the response tokens: $\pi = -\mathrm{CrossEntropy}\left(\hat{y}_{|x_p|:|x|},\, y\right)$ (11), where $\hat{y}_{|x_p|:|x|}$ represents the predicted logits of the response tokens. All other aspects remain the same as in standard training. This grouped micro-batch approach is generally more computationally efficient than…
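
The figure itself is not reproduced here, so the following is only a plausible sketch of what a shared-prompt attention mask could look like. It assumes one prompt followed by several responses packed into a single sequence, where each response token may attend to the prompt and to earlier tokens of its own response but not to other responses. The function name and this layout are assumptions, not the paper's specification.

```python
def shared_prompt_mask(prompt_len, response_lens):
    """Boolean mask where mask[i][j] is True if token i may attend to token j.

    Assumed layout: [prompt | response 1 | response 2 | ...].
    """
    # Segment id per token: 0 for the shared prompt, 1..G for the responses.
    segment = [0] * prompt_len
    for group, length in enumerate(response_lens, start=1):
        segment.extend([group] * length)

    total = len(segment)
    return [
        [
            # Causal, and the key is either in the prompt or in the same response.
            j <= i and (segment[j] == 0 or segment[j] == segment[i])
            for j in range(total)
        ]
        for i in range(total)
    ]


# Two prompt tokens shared by a 2-token response and a 1-token response.
mask = shared_prompt_mask(2, [2, 1])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With this layout the prompt is encoded once per group rather than once per response, and the response-only loss in equation (11) corresponds to keeping positions `prompt_len` onward.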

**Figure 4.** Comparison of synchronous vs. asynchronous training.

**Figure 5.** Training step-wise average reward score comparison.

**Figure 6.** TPS of our framework under different data-parallel scales.

Original abstract

Since the introduction of the GRPO algorithm, reinforcement learning (RL) has attracted increasing attention for LLM post-training, yet training efficiency remains a critical challenge. In mainstream RL frameworks, inference and training are co-located on the same devices, and their synchronous execution prevents concurrent inference and training. In this work, we revisit the strategy of separating inference and training deployment, and propose a periodically asynchronous framework that transforms synchronous RL training into an asynchronous producer-consumer pipeline. By synchronising model weights at the beginning of each training iteration and generating all rollouts from the same policy, the proposed framework remains inherently on-policy -- without any modification to standard RL algorithms -- thereby avoiding the off-policy bias introduced by existing asynchronous approaches. We further introduce a unified tri-model architecture and a shared-prompt attention mechanism to support efficient asynchronous execution and reduce redundant computation. Experiments on NPU platforms show approximately 2x throughput improvement from asynchronous execution, with additional gains from system-level optimisations, substantially outperforming mainstream RL frameworks in end-to-end throughput, with speedups of up to 3x on GPU platforms, further confirming cross-architecture generalisability while maintaining comparable accuracy. The proposed framework thus offers a practical, algorithm-agnostic solution for scalable RL post-training without sacrificing on-policy correctness. Code available at: https://github.com/janelu9/EasyLLM

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Syncing weights at the start of each iteration keeps the async rollout generation on-policy for standard RL algorithms without any changes to them.


This paper's main move is to sync model weights only at the beginning of each training iteration, then generate the full batch of rollouts from that single fixed policy snapshot. The result is an asynchronous producer-consumer pipeline that still qualifies as on-policy for off-the-shelf algorithms like GRPO, avoiding the bias that comes from continuous policy drift in other async setups. They add a tri-model architecture and shared-prompt attention to cut redundant work during generation and training separation. The reported numbers are 2x throughput on NPU hardware and up to 3x on GPUs, with accuracy staying comparable to synchronous baselines, plus a code release on GitHub. That combination of a simple synchronization rule and concrete system tweaks is what is actually new here.

The work does a decent job showing how to turn a synchronous RL loop into an async one while preserving correctness, and the cross-architecture results add some credibility. The soft spots sit in the experimental side. The abstract gives headline speedups but leaves out error bars, component ablations, and checks on long-term stability or sensitivity to iteration length. It is not clear how much small policy shifts might accumulate if rollouts take variable time or if the shared-prompt mechanism introduces any subtle inconsistencies. These gaps are not fatal for an engineering paper, but they make the robustness claims harder to judge from the summary alone.

This is aimed at practitioners scaling RL post-training for LLMs who are hitting inference-training bottlenecks on fixed hardware. Anyone building or tuning async training pipelines will pick up usable ideas from the design and the throughput data. It deserves peer review because the core mechanism is straightforward to test and the efficiency gains address a real, recurring problem in the area.

Referee Report

2 major / 2 minor

Summary. The paper claims that a periodically asynchronous producer-consumer pipeline for LLM RL (e.g., GRPO) can be made inherently on-policy by synchronizing model weights only at the start of each training iteration and generating all rollouts from that fixed policy snapshot. This avoids off-policy bias without algorithm changes. The authors introduce a unified tri-model architecture and shared-prompt attention mechanism to enable efficient asynchronous execution, reporting ~2x throughput gains on NPU platforms and up to 3x on GPU with comparable accuracy to synchronous baselines.

Significance. If the on-policy guarantee and throughput claims hold under rigorous validation, the work provides a practical, algorithm-agnostic engineering solution for scaling RL post-training of LLMs by decoupling inference and training. The open-source code link supports reproducibility, which strengthens the contribution for production-oriented ML systems.

major comments (2)

[Abstract / Experiments] Abstract and experimental results: the reported throughput improvements (2x on NPU, up to 3x on GPU) and comparable accuracy are presented without error bars, ablation studies on the synchronization interval, or detailed comparisons of on-policy bias metrics (e.g., KL divergence or policy shift over iterations). These omissions make it difficult to assess whether the central performance and correctness claims are robust.
[Method / Architecture] The description of the unified tri-model architecture and shared-prompt attention mechanism (introduced to support asynchronous execution) lacks a precise specification of how they interact with the periodic weight synchronization to preserve the fixed-policy guarantee across multiple iterations; a formal argument or pseudocode showing no policy drift would strengthen the on-policy claim.
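
The statistics requested in the first major comment could be computed along the following lines. This is a minimal sketch: the function names are hypothetical, the KL figure is a plain Monte Carlo estimate from per-token log-probabilities of tokens sampled from the rollout policy, and nothing here reflects how the paper measures anything.

```python
import statistics


def kl_estimate(logp_rollout, logp_train):
    """Monte Carlo estimate of KL(rollout || train), averaged per token.

    Both lists hold log-probabilities of the same sampled tokens, which
    are assumed to have been drawn from the rollout policy.
    """
    if not logp_rollout or len(logp_rollout) != len(logp_train):
        raise ValueError("need two non-empty lists of equal length")
    diffs = [a - b for a, b in zip(logp_rollout, logp_train)]
    return sum(diffs) / len(diffs)


def mean_and_std(values):
    """Mean and sample standard deviation over independent runs (needs >= 2 values)."""
    return statistics.mean(values), statistics.stdev(values)
```

An estimate near zero at the start of each iteration would be the expected signature of a rollout policy that matches the training policy, while a value that grows with iteration length would point to the drift the referee asks about.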

minor comments (2)

[Method] Notation for the tri-model components and shared-prompt attention could be clarified with a diagram or explicit equations showing data flow between inference, training, and synchronization steps.
[Introduction] The paper would benefit from additional references to prior asynchronous RL frameworks in the LLM domain to better position the novelty of the periodic synchronization approach.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments highlight important aspects for strengthening the empirical validation and formal clarity of our on-policy claims. We address each major comment below and indicate the revisions we will make.


Referee: [Abstract / Experiments] Abstract and experimental results: the reported throughput improvements (2x on NPU, up to 3x on GPU) and comparable accuracy are presented without error bars, ablation studies on the synchronization interval, or detailed comparisons of on-policy bias metrics (e.g., KL divergence or policy shift over iterations). These omissions make it difficult to assess whether the central performance and correctness claims are robust.

Authors: We agree that additional statistical detail and analysis would improve the robustness assessment. In the revised manuscript we will add error bars (mean ± standard deviation over at least three independent runs) to all throughput and accuracy figures. We will also include a new ablation subsection that varies the synchronization interval and reports its effects on both end-to-end throughput and on-policy metrics. Finally, we will add plots and tables of average KL divergence (and policy-shift statistics) between the fixed rollout policy and the training policy across iterations to empirically support the on-policy guarantee. revision: yes
Referee: [Method / Architecture] The description of the unified tri-model architecture and shared-prompt attention mechanism (introduced to support asynchronous execution) lacks a precise specification of how they interact with the periodic weight synchronization to preserve the fixed-policy guarantee across multiple iterations; a formal argument or pseudocode showing no policy drift would strengthen the on-policy claim.

Authors: We acknowledge that a more explicit formalization would strengthen the presentation. In the revised Method section we will insert a concise formal argument: because weight synchronization occurs exclusively at iteration boundaries and the inference model remains frozen for the entire rollout-generation phase of that iteration, every sample collected within the iteration is generated under an identical policy snapshot; consequently the collected data is on-policy with respect to the policy being optimized in the subsequent training step. We will also add pseudocode that shows the tri-model (policy, old policy, reference) lifecycle, the periodic synchronization call, and the point at which shared-prompt attention is invoked, making the absence of intra-iteration policy drift explicit. revision: yes
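
One way the promised formal argument might be written down is sketched below. The notation ($\theta_k$ for the weights synchronized at the start of iteration $k$, $s_i$ for a staleness delay) is illustrative and not taken from the paper, and the statement covers only the synchronization point itself.

```latex
% Notation is illustrative: \theta_k are the weights synchronized at the start of iteration k.
\begin{align*}
&\text{Rollouts in iteration } k: && y_i \sim \pi_{\theta_k}(\cdot \mid x_i), \quad i = 1, \dots, N, \\
&\text{Importance ratio:} && r_i(\theta) = \frac{\pi_{\theta}(y_i \mid x_i)}{\pi_{\theta_k}(y_i \mid x_i)}, \qquad r_i(\theta_k) = 1, \\
&\text{Stale sampler (delay } s_i \ge 1\text{):} && y_i \sim \pi_{\theta_{k - s_i}}
\;\Rightarrow\; \frac{\pi_{\theta_k}(y_i \mid x_i)}{\pi_{\theta_{k - s_i}}(y_i \mid x_i)} \ne 1 \text{ in general}.
\end{align*}
```

The first two lines express the fixed-snapshot property claimed for the periodic scheme. The third line is the contrast case of a sampler that lags the trainer, which is where off-policy correction would normally be needed.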

Circularity Check

0 steps flagged

No significant circularity; engineering framework is self-contained


The paper proposes a periodically asynchronous RL framework for LLM post-training that synchronizes model weights only at the start of each iteration so that all rollouts in a batch are generated from one fixed policy snapshot. This design directly ensures the data remains on-policy for unmodified standard algorithms such as GRPO, as stated in the abstract, and the claim is supported by throughput experiments rather than any mathematical derivation or equations. No fitted parameters are renamed as predictions, no self-citations serve as load-bearing uniqueness theorems, and no ansatz or renaming of known results occurs. The approach is therefore an independent engineering construction validated empirically against external benchmarks, with no reduction of the central on-policy guarantee to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claims rest on the assumption that periodic weight synchronization suffices for on-policy correctness and on the effectiveness of the newly introduced tri-model and shared-prompt attention components; no free parameters are fitted in the abstract description.

axioms (1)

domain assumption: Hardware platforms allow efficient separate deployment of inference and training without prohibitive communication overhead.
Invoked to enable the producer-consumer pipeline.

invented entities (2)

unified tri-model architecture (no independent evidence)
purpose: Support efficient asynchronous execution and reduce redundant computation.
New component introduced to enable the async framework.
shared-prompt attention mechanism (no independent evidence)
purpose: Reduce redundant computation in the asynchronous setup.
New mechanism introduced alongside the tri-model.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat orbit and 8-tick periodicity · unclear
Paper passage: "By synchronising model weights at the beginning of each training iteration and generating all rollouts from the same policy, the proposed framework remains inherently on-policy"

IndisputableMonolith/Cost/FunctionalEquation.lean · J(x) = ½(x + x⁻¹) − 1 uniqueness · unclear
Paper passage: "periodic asynchrony... producer-consumer pipeline"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
