AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
Pith reviewed 2026-05-15 14:19 UTC · model grok-4.3
The pith
AReaL decouples generation from training in reinforcement learning, achieving up to 2.77× faster training for language models on reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AReaL is a fully asynchronous RL system that completely decouples generation from training. Rollout workers continuously generate outputs while training workers update the model on collected batches, with workload balancing to control data staleness and a staleness-enhanced PPO variant for stability. This leads to up to 2.77× training speedup compared to synchronous systems with the same number of GPUs while matching or improving final performance on reasoning benchmarks.
What carries the argument
The asynchronous decoupling of rollout workers from training workers, combined with workload balancing and staleness-enhanced PPO, which allows continuous generation and model updates without batch synchronization waits.
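The decoupling described above can be sketched as a producer/consumer loop with a bounded-staleness buffer. This is an illustrative sketch only: the class, the `MAX_STALENESS` bound, and the drop-stale policy are assumptions, not AReaL's actual implementation (per the abstract, the system controls staleness by balancing rollout and training workloads rather than by a literal queue filter).

```python
from collections import deque

MAX_STALENESS = 2   # hypothetical bound: max policy-version lag the trainer accepts
BATCH_SIZE = 4

class RolloutBuffer:
    """Minimal sketch of staleness-controlled async RL (illustrative only):
    each rollout is tagged with the policy version that generated it, and the
    trainer skips samples whose version lags the current policy by more than
    MAX_STALENESS updates."""
    def __init__(self):
        self.buf = deque()          # FIFO of (sample, behavior_version)
        self.policy_version = 0     # trainer-side policy version

    def add(self, sample, behavior_version):
        # Rollout workers append continuously, never waiting for the trainer.
        self.buf.append((sample, behavior_version))

    def next_batch(self):
        # Trainer drains the buffer, dropping overly stale samples.
        batch, dropped = [], 0
        while self.buf and len(batch) < BATCH_SIZE:
            sample, v = self.buf.popleft()
            if self.policy_version - v <= MAX_STALENESS:
                batch.append((sample, v))
            else:
                dropped += 1
        return batch, dropped

buf = RolloutBuffer()
# Rollout workers generated trajectories under several old policy versions.
for v in (0, 0, 1, 1, 2, 3):
    buf.add(f"traj-{v}", behavior_version=v)
buf.policy_version = 3              # trainer has since advanced 3 updates
batch, dropped = buf.next_batch()
print(dropped, [v for _, v in batch])   # → 2 [1, 1, 2, 3]
```

The key property the sketch illustrates: generation never blocks on the slowest sequence in a batch, and staleness is bounded at consumption time rather than by synchronization.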
Load-bearing premise
That balancing rollout and training workloads plus the staleness-enhanced PPO variant can maintain training stability and effectiveness even with outdated samples.
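The review does not spell out the variant's algorithmic form. One common construction for training on stale samples, shown here purely as a hedged sketch and not necessarily AReaL's method, decouples the clipping reference ("proximal") policy from the older behavior policy that generated the rollouts, and corrects for the gap with an importance weight:

```python
import numpy as np

def decoupled_ppo_loss(logp_new, logp_prox, logp_behav, adv, eps=0.2):
    """Sketch of a decoupled-PPO-style objective for stale rollouts
    (illustrative; not confirmed to match the paper's exact variant).
    Clipping is applied against a recent proximal policy, while an
    importance weight corrects for the older behavior policy."""
    ratio = np.exp(logp_new - logp_prox)      # trust-region ratio
    iw = np.exp(logp_prox - logp_behav)       # staleness correction weight
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return -np.mean(iw * np.minimum(ratio * adv, clipped * adv))

adv = np.array([1.0, -0.5])
logp = np.array([-1.0, -2.0])
loss = decoupled_ppo_loss(logp, logp, logp, adv)
print(round(loss, 3))   # on-policy case reduces to -mean(adv) = -0.25
```

In the on-policy limit (all three policies equal) the importance weight and ratio are 1 and the objective collapses to the vanilla policy-gradient surrogate, which is the sanity check such a variant should pass.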
What would settle it
Running the same benchmarks on a synchronous system with identical GPUs and observing no speedup or a performance drop would falsify the efficiency claim.
Original abstract
Reinforcement learning (RL) has become a dominant paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous, alternating generation and training in a batch setting where rollouts in each training batch are generated by the same model. This approach stabilizes RL training but suffers from severe system-level inefficiency: generation must wait until the longest output in the batch is completed before model updates, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.77× training speedup compared to synchronous systems with the same number of GPUs and matched or improved final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents AReaL, a fully asynchronous RL system for LLM reasoning that decouples rollout generation from training. Rollout workers generate continuously while training workers update on collected batches; workload balancing controls staleness and a staleness-enhanced PPO variant is used to maintain stability. Experiments on math and code benchmarks report up to 2.77× training speedup versus synchronous baselines with matched or improved final performance.
Significance. If the empirical claims hold under broader validation, AReaL would demonstrate a practical path to higher GPU utilization in large-scale LLM RL without performance loss, addressing a central systems bottleneck as model sizes and reasoning tasks grow.
major comments (3)
- §4 (Staleness-enhanced PPO variant): The paper supplies no description of the precise algorithmic modifications (e.g., adjusted clipping thresholds, importance-sampling corrections, or advantage re-weighting by staleness age), which is load-bearing for the claim that training remains stable and effective with outdated samples.
- §5 (Experiments): No ablation is presented that removes the staleness enhancement while retaining asynchronous execution, nor are there plots or tables of performance versus measured data staleness; without these, the parity claim cannot be isolated from the particular benchmarks or from unstated hyper-parameter retuning.
- §5 (Experiments): The reported speedups and performance parity lack details on exact baseline implementations, number of random seeds, statistical variance, and full hyper-parameter choices, which are needed to verify the central 2.77× claim and to assess reproducibility.
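To make the first comment concrete, one simple instantiation of "advantage re-weighting by staleness age" — hypothetical, not drawn from the paper — decays each sample's advantage exponentially in its policy-version lag:

```python
import numpy as np

def age_weighted_advantage(adv, age, half_life=2.0):
    """Hypothetical illustration (not from the paper): down-weight each
    sample's advantage by its staleness age in policy versions, with an
    exponential decay halving the weight every `half_life` versions."""
    w = 0.5 ** (np.asarray(age, dtype=float) / half_life)
    return w * np.asarray(adv, dtype=float)

# Equal advantages, increasing staleness: weight halves every 2 versions of lag.
print(age_weighted_advantage([1.0, 1.0, 1.0], [0, 2, 4]))
```

An ablation of the kind the second comment requests would sweep such a knob (or disable it) while holding the asynchronous execution fixed.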
minor comments (1)
- Abstract: The abstract refers to 'a collection of system-level optimizations' without enumerating them; a brief list or a pointer to the relevant subsection would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help improve the clarity and reproducibility of our work. We address each major comment below and will revise the manuscript to incorporate the requested details and additional experiments.
Point-by-point responses
-
Referee: §4 (Staleness-enhanced PPO variant): The paper supplies no description of the precise algorithmic modifications (e.g., adjusted clipping thresholds, importance-sampling corrections, or advantage re-weighting by staleness age), which is load-bearing for the claim that training remains stable and effective with outdated samples.
Authors: We agree that §4 lacks a precise description of the modifications in the staleness-enhanced PPO variant. In the revised manuscript we will expand this section with the full algorithmic details, including any changes to clipping thresholds, importance-sampling corrections, and advantage re-weighting as a function of staleness age. These additions will directly support the stability claims for training with outdated samples.
Revision: yes.
-
Referee: §5 (Experiments): No ablation is presented that removes the staleness enhancement while retaining asynchronous execution, nor are there plots or tables of performance versus measured data staleness; without these, the parity claim cannot be isolated from the particular benchmarks or from unstated hyper-parameter retuning.
Authors: We acknowledge that an ablation isolating the staleness enhancement (while retaining asynchronous execution) and plots/tables of performance versus measured staleness would strengthen the experimental section. We will add both in the revised manuscript: an ablation comparing the full AReaL system against an asynchronous baseline without the staleness enhancement, plus figures showing final performance and training curves as functions of average data staleness. These will help isolate the contribution of the enhancement from benchmark-specific effects or hyper-parameter choices.
Revision: yes.
-
Referee: §5 (Experiments): The reported speedups and performance parity lack details on exact baseline implementations, number of random seeds, statistical variance, and full hyper-parameter choices, which are needed to verify the central 2.77× claim and to assess reproducibility.
Authors: We agree that additional implementation and statistical details are necessary for reproducibility. In the revised manuscript we will expand the experimental section with: (i) precise descriptions of the synchronous baseline implementations, (ii) the number of random seeds used for each result, (iii) statistical variance (standard deviations across seeds), and (iv) complete hyper-parameter tables for all methods and benchmarks. This will allow independent verification of the reported speedups and performance parity.
Revision: yes.
Circularity Check
No significant circularity; the empirical systems results rest on direct runtime measurements.
full rationale
The paper describes an asynchronous RL training system (AReaL) whose core claims are measured speedups (up to 2.77×) and matched/improved benchmark performance on math and code tasks. These outcomes are obtained from end-to-end experiments that compare wall-clock training time and final accuracy against synchronous baselines under identical GPU counts. No mathematical derivation chain, fitted-parameter prediction, or self-citation load-bearing uniqueness theorem is present; the staleness-handling mechanisms are presented as engineering choices whose effectiveness is validated by the same runtime data rather than by construction or tautology. The evaluation is therefore externally falsifiable by re-running the open-source code on the stated benchmarks.
Forward citations
Cited by 25 Pith papers
-
AIS: Adaptive Importance Sampling for Quantized RL
AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.
-
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...
-
Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-p...
-
When RL Meets Adaptive Speculative Training: A Unified Training-Serving System
Aurora unifies speculative decoder training and serving via asynchronous RL on inference traces, delivering 1.5x day-0 speedup on frontier models and 1.25x adaptation gains on distribution shifts.
-
RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training
RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.
-
Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.
-
Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search
Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
-
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...
-
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
-
T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.
-
Co-Evolving Policy Distillation
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
-
DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training
DORA's multi-version streaming rollout enables 2-3x higher throughput in asynchronous RL for LLMs while preserving convergence by maintaining policy consistency, data integrity, and bounded staleness.
-
AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5X lower attention latency and 6.9X lower energy than NVIDIA H...
-
JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training
JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.
-
Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale
Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.
-
TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training
TensorHub uses Reference-Oriented Storage to enable scalable weight transfer in LLM RL training by referencing replicated GPU weights, achieving up to 19x reduction in cross-datacenter stall time.
-
OpenClaw-RL: Train Any Agent Simply by Talking
OpenClaw-RL recovers evaluative and directive signals from next-state interactions to enable online RL training of agents across terminal, GUI, SWE, and tool environments via a server-client architecture and hybrid objective.
-
WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces
WebChain supplies the largest open dataset of real human web trajectories with triple-modal alignment and a dual mid-training method that separates grounding from planning to improve web agents.
-
ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
ETS enables direct sampling from the optimal RL policy for language models at inference time by estimating the energy term with online Monte Carlo and acceleration techniques.
-
FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning
FP8-RL delivers up to 44% faster rollouts in LLM RL by using blockwise FP8 quantization, KV-cache recalibration, and importance-sampling corrections while keeping learning behavior close to BF16 baselines.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
Position: Agentic AI System Is a Foreseeable Pathway to AGI
Agentic AI systems with DAG topologies are claimed to deliver exponentially superior generalization and sample efficiency compared to monolithic scaling for achieving AGI.
-
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.
-
StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning
StepPO argues that LLM agents should optimize at the step level rather than token level to better handle delayed rewards and long contexts in agentic RL.
-
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
Safactory combines parallel simulation, trustworthy data management, and asynchronous evolution platforms into a single pipeline claimed to be the first unified framework for trustworthy autonomous agents.