AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning
Pith reviewed 2026-05-15 14:19 UTC · model grok-4.3
The pith
AReaL decouples generation from training in reinforcement learning, achieving up to 2.77× faster training for language models on reasoning tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AReaL is a fully asynchronous RL system that completely decouples generation from training. Rollout workers continuously generate outputs while training workers update the model on collected batches, with workload balancing to control data staleness and a staleness-enhanced PPO variant for stability. This leads to up to 2.77× training speedup compared to synchronous systems with the same number of GPUs while matching or improving final performance on reasoning benchmarks.
What carries the argument
The asynchronous decoupling of rollout workers from training workers, combined with workload balancing and staleness-enhanced PPO, which allows continuous generation and model updates without batch synchronization waits.
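The decoupling described above can be sketched as a producer/consumer loop with a bounded-staleness buffer. This is an illustrative sketch only: the class, the `MAX_STALENESS` bound, and the drop-stale policy are assumptions, not AReaL's actual implementation (per the abstract, the system controls staleness by balancing rollout and training workloads rather than by a literal queue filter).

```python
from collections import deque

MAX_STALENESS = 2   # hypothetical bound: max policy-version lag the trainer accepts
BATCH_SIZE = 4

class RolloutBuffer:
    """Minimal sketch of staleness-controlled async RL (illustrative only):
    each rollout is tagged with the policy version that generated it, and the
    trainer skips samples whose version lags the current policy by more than
    MAX_STALENESS updates."""
    def __init__(self):
        self.buf = deque()          # FIFO of (sample, behavior_version)
        self.policy_version = 0     # trainer-side policy version

    def add(self, sample, behavior_version):
        # Rollout workers append continuously, never waiting for the trainer.
        self.buf.append((sample, behavior_version))

    def next_batch(self):
        # Trainer drains the buffer, dropping overly stale samples.
        batch, dropped = [], 0
        while self.buf and len(batch) < BATCH_SIZE:
            sample, v = self.buf.popleft()
            if self.policy_version - v <= MAX_STALENESS:
                batch.append((sample, v))
            else:
                dropped += 1
        return batch, dropped

buf = RolloutBuffer()
# Rollout workers generated trajectories under several old policy versions.
for v in (0, 0, 1, 1, 2, 3):
    buf.add(f"traj-{v}", behavior_version=v)
buf.policy_version = 3              # trainer has since advanced 3 updates
batch, dropped = buf.next_batch()
print(dropped, [v for _, v in batch])   # → 2 [1, 1, 2, 3]
```

The key property the sketch illustrates: generation never blocks on the slowest sequence in a batch, and staleness is bounded at consumption time rather than by synchronization.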
Load-bearing premise
That balancing rollout and training workloads plus the staleness-enhanced PPO variant can maintain training stability and effectiveness even with outdated samples.
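The review does not spell out the variant's algorithmic form. One common construction for training on stale samples, shown here purely as a hedged sketch and not necessarily AReaL's method, decouples the clipping reference ("proximal") policy from the older behavior policy that generated the rollouts, and corrects for the gap with an importance weight:

```python
import numpy as np

def decoupled_ppo_loss(logp_new, logp_prox, logp_behav, adv, eps=0.2):
    """Sketch of a decoupled-PPO-style objective for stale rollouts
    (illustrative; not confirmed to match the paper's exact variant).
    Clipping is applied against a recent proximal policy, while an
    importance weight corrects for the older behavior policy."""
    ratio = np.exp(logp_new - logp_prox)      # trust-region ratio
    iw = np.exp(logp_prox - logp_behav)       # staleness correction weight
    clipped = np.clip(ratio, 1 - eps, 1 + eps)
    return -np.mean(iw * np.minimum(ratio * adv, clipped * adv))

adv = np.array([1.0, -0.5])
logp = np.array([-1.0, -2.0])
loss = decoupled_ppo_loss(logp, logp, logp, adv)
print(round(loss, 3))   # on-policy case reduces to -mean(adv) = -0.25
```

In the on-policy limit (all three policies equal) the importance weight and ratio are 1 and the objective collapses to the vanilla policy-gradient surrogate, which is the sanity check such a variant should pass.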
What would settle it
Running the same benchmarks on a synchronous system with identical GPUs and observing no speedup or a performance drop would falsify the efficiency claim.
Original abstract
Reinforcement learning (RL) has become a dominant paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous, alternating generation and training in a batch setting where rollouts in each training batch are generated by the same model. This approach stabilizes RL training but suffers from severe system-level inefficiency: generation must wait until the longest output in the batch is completed before model updates, resulting in GPU underutilization. We present AReaL, a fully asynchronous RL system that completely decouples generation from training. Rollout workers in AReaL continuously generate new outputs without waiting, while training workers update the model whenever a batch of data is collected. AReaL also incorporates a collection of system-level optimizations, leading to substantially higher GPU utilization. To stabilize RL training, AReaL balances the workload of rollout and training workers to control data staleness, and adopts a staleness-enhanced PPO variant to better handle outdated training samples. Extensive experiments on math and code reasoning benchmarks show that AReaL achieves up to 2.77× training speedup compared to synchronous systems with the same number of GPUs and matched or improved final performance. The code of AReaL is available at https://github.com/inclusionAI/AReaL/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents AReaL, a fully asynchronous RL system for LLM reasoning that decouples rollout generation from training. Rollout workers generate continuously while training workers update on collected batches; workload balancing controls staleness and a staleness-enhanced PPO variant is used to maintain stability. Experiments on math and code benchmarks report up to 2.77× training speedup versus synchronous baselines with matched or improved final performance.
Significance. If the empirical claims hold under broader validation, AReaL would demonstrate a practical path to higher GPU utilization in large-scale LLM RL without performance loss, addressing a central systems bottleneck as model sizes and reasoning tasks grow.
major comments (3)
- §4 (Staleness-enhanced PPO variant): The paper supplies no description of the precise algorithmic modifications (e.g., adjusted clipping thresholds, importance-sampling corrections, or advantage re-weighting by staleness age), which is load-bearing for the claim that training remains stable and effective with outdated samples.
- §5 (Experiments): No ablation is presented that removes the staleness enhancement while retaining asynchronous execution, nor are there plots or tables of performance versus measured data staleness; without these, the parity claim cannot be isolated from the particular benchmarks or from unstated hyper-parameter retuning.
- §5 (Experiments): The reported speedups and performance parity lack details on exact baseline implementations, number of random seeds, statistical variance, and full hyper-parameter choices, which are needed to verify the central 2.77× claim and to assess reproducibility.
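To make the first comment concrete, one simple instantiation of "advantage re-weighting by staleness age" — hypothetical, not drawn from the paper — decays each sample's advantage exponentially in its policy-version lag:

```python
import numpy as np

def age_weighted_advantage(adv, age, half_life=2.0):
    """Hypothetical illustration (not from the paper): down-weight each
    sample's advantage by its staleness age in policy versions, with an
    exponential decay halving the weight every `half_life` versions."""
    w = 0.5 ** (np.asarray(age, dtype=float) / half_life)
    return w * np.asarray(adv, dtype=float)

# Equal advantages, increasing staleness: weight halves every 2 versions of lag.
print(age_weighted_advantage([1.0, 1.0, 1.0], [0, 2, 4]))
```

An ablation of the kind the second comment requests would sweep such a knob (or disable it) while holding the asynchronous execution fixed.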
minor comments (1)
- Abstract: The abstract refers to 'a collection of system-level optimizations' without enumerating them; a brief list or a pointer to the relevant subsection would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help improve the clarity and reproducibility of our work. We address each major comment below and will revise the manuscript to incorporate the requested details and additional experiments.
Point-by-point responses
-
Referee: §4 (Staleness-enhanced PPO variant): The paper supplies no description of the precise algorithmic modifications (e.g., adjusted clipping thresholds, importance-sampling corrections, or advantage re-weighting by staleness age), which is load-bearing for the claim that training remains stable and effective with outdated samples.
Authors: We agree that §4 lacks a precise description of the modifications in the staleness-enhanced PPO variant. In the revised manuscript we will expand this section with the full algorithmic details, including any changes to clipping thresholds, importance-sampling corrections, and advantage re-weighting as a function of staleness age. These additions will directly support the stability claims for training with outdated samples.
Revision: yes.
-
Referee: §5 (Experiments): No ablation is presented that removes the staleness enhancement while retaining asynchronous execution, nor are there plots or tables of performance versus measured data staleness; without these, the parity claim cannot be isolated from the particular benchmarks or from unstated hyper-parameter retuning.
Authors: We acknowledge that an ablation isolating the staleness enhancement (while retaining asynchronous execution) and plots/tables of performance versus measured staleness would strengthen the experimental section. We will add both in the revised manuscript: an ablation comparing the full AReaL system against an asynchronous baseline without the staleness enhancement, plus figures showing final performance and training curves as functions of average data staleness. These will help isolate the contribution of the enhancement from benchmark-specific effects or hyper-parameter choices.
Revision: yes.
-
Referee: §5 (Experiments): The reported speedups and performance parity lack details on exact baseline implementations, number of random seeds, statistical variance, and full hyper-parameter choices, which are needed to verify the central 2.77× claim and to assess reproducibility.
Authors: We agree that additional implementation and statistical details are necessary for reproducibility. In the revised manuscript we will expand the experimental section with: (i) precise descriptions of the synchronous baseline implementations, (ii) the number of random seeds used for each result, (iii) statistical variance (standard deviations across seeds), and (iv) complete hyper-parameter tables for all methods and benchmarks. This will allow independent verification of the reported speedups and performance parity.
Revision: yes.
Circularity Check
No significant circularity; the empirical systems results rest on direct runtime measurements.
full rationale
The paper describes an asynchronous RL training system (AReaL) whose core claims are measured speedups (up to 2.77×) and matched/improved benchmark performance on math and code tasks. These outcomes are obtained from end-to-end experiments that compare wall-clock training time and final accuracy against synchronous baselines under identical GPU counts. No mathematical derivation chain, fitted-parameter prediction, or self-citation load-bearing uniqueness theorem is present; the staleness-handling mechanisms are presented as engineering choices whose effectiveness is validated by the same runtime data rather than by construction or tautology. The evaluation is therefore externally falsifiable by re-running the open-source code on the stated benchmarks.
Forward citations
Cited by 25 Pith papers
-
AIS: Adaptive Importance Sampling for Quantized RL
AIS adaptively corrects non-stationary policy gradient bias in quantized LLM RL, matching BF16 performance while retaining 1.5-2.76x FP8 rollout speedup.
-
BubbleSpec: Turning Long-Tail Bubbles into Speculative Rollout Drafts for Synchronous Reinforcement Learning
BubbleSpec exploits long-tail bubbles in synchronous RL by using faster ranks' idle time to pre-generate rollout drafts for speculative decoding, reducing steps by 50% and raising throughput up to 1.8x while preservin...
-
Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-p...
-
When RL Meets Adaptive Speculative Training: A Unified Training-Serving System
Aurora unifies speculative decoder training and serving via asynchronous RL on inference traces, delivering 1.5x day-0 speedup on frontier models and 1.25x adaptation gains on distribution shifts.
-
RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training
RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.
-
Diagnosing Training Inference Mismatch in LLM Reinforcement Learning
Training-inference mismatch in separated rollout and optimization stages of LLM RL can independently cause training collapse.
-
Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search
Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
-
FlashEvolve: Accelerating Agent Self-Evolution with Asynchronous Stage Orchestration
FlashEvolve accelerates LLM agent self-evolution via asynchronous stage orchestration and inspectable language-space staleness handling, reporting 3.5-4.9x proposal throughput gains over synchronous baselines on GEPA ...
-
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
-
T$^2$PO: Uncertainty-Guided Exploration Control for Stable Multi-Turn Agentic Reinforcement Learning
T²PO improves stability and performance in multi-turn agentic RL by using uncertainty dynamics at token and turn levels to guide exploration and avoid wasted rollouts.
-
Co-Evolving Policy Distillation
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
-
DORA: A Scalable Asynchronous Reinforcement Learning System for Language Model Training
DORA's multi-version streaming rollout enables 2-3x higher throughput in asynchronous RL for LLMs while preserving convergence by maintaining policy consistency, data integrity, and bounded staleness.
-
AMMA: A Multi-Chiplet Memory-Centric Architecture for Low-Latency 1M Context Attention Serving
AMMA is a memory-centric multi-chiplet architecture using HBM-PNM cubes, custom logic dies, hybrid parallelism, and reordered collectives that delivers 15.5X lower attention latency and 6.9X lower energy than NVIDIA H...
-
JigsawRL: Assembling RL Pipelines for Efficient LLM Post-Training
JigsawRL achieves up to 1.85x higher throughput in LLM RL pipelines via pipeline multiplexing, sub-stage graphs, and look-ahead scheduling compared to prior systems.
-
Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale
Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.
-
TensorHub: Scalable and Elastic Weight Transfer for LLM RL Training
TensorHub uses Reference-Oriented Storage to enable scalable weight transfer in LLM RL training by referencing replicated GPU weights, achieving up to 19x reduction in cross-datacenter stall time.
-
OpenClaw-RL: Train Any Agent Simply by Talking
OpenClaw-RL recovers evaluative and directive signals from next-state interactions to enable online RL training of agents across terminal, GUI, SWE, and tool environments via a server-client architecture and hybrid objective.
-
WebChain: A Large-Scale Human-Annotated Dataset of Real-World Web Interaction Traces
WebChain supplies the largest open dataset of real human web trajectories with triple-modal alignment and a dual mid-training method that separates grounding from planning to improve web agents.
-
ETS: Energy-Guided Test-Time Scaling for Training-Free RL Alignment
ETS enables direct sampling from the optimal RL policy for language models at inference time by estimating the energy term with online Monte Carlo and acceleration techniques.
-
FP8-RL: A Practical and Stable Low-Precision Stack for LLM Reinforcement Learning
FP8-RL delivers up to 44% faster rollouts in LLM RL by using blockwise FP8 quantization, KV-cache recalibration, and importance-sampling corrections while keeping learning behavior close to BF16 baselines.
-
Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
-
Position: Agentic AI System Is a Foreseeable Pathway to AGI
Agentic AI systems with DAG topologies are claimed to deliver exponentially superior generalization and sample efficiency compared to monolithic scaling for achieving AGI.
-
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
Safactory integrates three platforms for simulation, data management, and agent evolution to create a unified pipeline for training trustworthy autonomous AI.
-
StepPO: Step-Aligned Policy Optimization for Agentic Reinforcement Learning
StepPO argues that LLM agents should optimize at the step level rather than token level to better handle delayed rewards and long contexts in agentic RL.
-
Safactory: A Scalable Agentic Infrastructure for Training Trustworthy Autonomous Intelligence
Safactory combines parallel simulation, trustworthy data management, and asynchronous evolution platforms into a single pipeline claimed to be the first unified framework for trustworthy autonomous agents.