{"paper":{"title":"AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning","license":"http://creativecommons.org/licenses/by-nc-nd/4.0/","headline":"AReaL decouples generation from training in reinforcement learning to achieve up to 2.77 times faster training for language models on reasoning tasks.","cross_cats":["cs.AI"],"primary_cat":"cs.LG","authors_text":"Binhang Yuan, Chen Zhu, Chuyi He, Guo Wei, Jiashu Wang, Jiaxuan Gao, Jun Mei, Shusheng Xu, Tongkai Yang, Wei Fu, Xujie Shen, Yi Wu, Zhiyu Mei","submitted_at":"2025-05-30T07:18:25Z","abstract_excerpt":"Reinforcement learning (RL) has become a dominant paradigm for training large language models (LLMs), particularly for reasoning tasks. Effective RL for LLMs requires massive parallelization and poses an urgent need for efficient training systems. Most existing large-scale RL systems for LLMs are synchronous, alternating generation and training in a batch setting where rollouts in each training batch are generated by the same model. 
This approach stabilizes RL training but suffers from severe system-level inefficiency: generation must wait until the longest output in the batch is completed bef"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"AReaL achieves up to 2.77× training speedup compared to synchronous systems with the same number of GPUs and matched or improved final performance.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That workload balancing between rollout and training workers plus the staleness-enhanced PPO variant can keep training stable and effective despite using outdated samples.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"AReaL decouples generation and training in LLM reinforcement learning to achieve up to 2.77x speedup with matched or better performance on math and code benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"AReaL decouples generation from training in reinforcement learning to achieve up to 2.77 times faster training for language models on reasoning tasks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"8f7b78557dc18bf2f5c7402623e5ae37b7da41f48d7a52993445f4dedbc0ba71"},"source":{"id":"2505.24298","kind":"arxiv","version":5},"verdict":{"id":"74e10581-fed9-455c-9d62-a2dedfc2da9e","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T14:19:43.363131Z","strongest_claim":"AReaL achieves up to 2.77× training speedup compared to synchronous systems with the same number of GPUs and matched or improved final performance.","one_line_summary":"AReaL decouples generation and training in LLM reinforcement learning to achieve up to 2.77x speedup with matched or better 
performance on math and code benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That workload balancing between rollout and training workers plus the staleness-enhanced PPO variant can keep training stable and effective despite using outdated samples.","pith_extraction_headline":"AReaL decouples generation from training in reinforcement learning to achieve up to 2.77 times faster training for language models on reasoning tasks."},"references":{"count":50,"sample":[{"doi":"","year":2019,"title":"Dota 2 with Large Scale Deep Reinforcement Learning","work_id":"b047dc18-e9a3-4d11-8ff6-cd59d41a6357","ref_index":2,"cited_arxiv_id":"1912.06680","is_internal_anchor":true},{"doi":"","year":2021,"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","ref_index":3,"cited_arxiv_id":"2107.03374","is_internal_anchor":true},{"doi":"","year":2024,"title":"Z. Chen, A. May, R. Svirschevski, Y. Huang, M. Ryabinin, Z. Jia, and B. Chen. Sequoia: Scalable and robust speculative decoding. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tom","work_id":"6e8f2590-79af-491f-9142-599ba03cbebb","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2021,"title":"Training Verifiers to Solve Math Word Problems","work_id":"acab1aa8-b4d6-40e0-a3ee-25341701dca2","ref_index":5,"cited_arxiv_id":"2110.14168","is_internal_anchor":true},{"doi":"","year":2018,"title":"L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V. Mnih, T. Ward, Y. Doron, V. Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. 
IMPALA: scalable distributed deep-rl with importance weigh","work_id":"c8686f95-7ff2-4fbf-b5bc-f1243774d697","ref_index":7,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":50,"snapshot_sha256":"8333ebd236a17d4e7295c910e2838d186da222050b0e8e43c0b0565ddf934280","internal_anchors":10},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}