{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2025:YWC3IJHJ47CJKCAFEB2EW4RBR2","short_pith_number":"pith:YWC3IJHJ","schema_version":"1.0","canonical_sha256":"c585b424e9e7c495080520744b72218e966160a63d15d1223b48ca4c80d67e12","source":{"kind":"arxiv","id":"2510.13786","version":1},"attestation_state":"computed","paper":{"title":"The Art of Scaling Reinforcement Learning Compute for LLMs","license":"http://creativecommons.org/licenses/by/4.0/","headline":"RL training for LLMs follows predictable sigmoidal scaling curves that enable extrapolation from small-scale runs.","cross_cats":["cs.AI"],"primary_cat":"cs.LG","authors_text":"David Brandfonbrener, Devvrit Khatri, Inderjit S. Dhillon, Lovish Madaan, Manzil Zaheer, Rachit Bansal, Rishabh Agarwal, Rishabh Tiwari, Sai Surya Duvvuri","submitted_at":"2025-10-15T17:43:03Z","abstract_excerpt":"Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training. Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute. We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs. We fit sigmoidal compute-performance curves for RL training and ablate a wide range o"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2510.13786","kind":"arxiv","version":1},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.LG","submitted_at":"2025-10-15T17:43:03Z","cross_cats_sorted":["cs.AI"],"title_canon_sha256":"9487e005a66954e91c32149adfded5424cdd518509be36aa2df7a2394e1bcee8","abstract_canon_sha256":"713ae47eea08fff4bed2b11c38746e1499694d17cccc4516db60778642b19026"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:47.305584Z","signature_b64":"6n2mWpyfTIUSUlXTv4DZHX/DowBwQxU5tvpAhrwEpNhGjjSx3xKef8TfsTbdVzpfw8LvRuvnqD1k9IooBxxLDw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"c585b424e9e7c495080520744b72218e966160a63d15d1223b48ca4c80d67e12","last_reissued_at":"2026-05-17T23:38:47.304966Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:47.304966Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"The Art of Scaling Reinforcement Learning Compute for LLMs","license":"http://creativecommons.org/licenses/by/4.0/","headline":"RL training for LLMs follows predictable sigmoidal scaling curves that enable extrapolation from small-scale runs.","cross_cats":["cs.AI"],"primary_cat":"cs.LG","authors_text":"David Brandfonbrener, Devvrit Khatri, Inderjit S. Dhillon, Lovish Madaan, Manzil Zaheer, Rachit Bansal, Rishabh Agarwal, Rishabh Tiwari, Sai Surya Duvvuri","submitted_at":"2025-10-15T17:43:03Z","abstract_excerpt":"Reinforcement learning (RL) has become central to training large language models (LLMs), yet the field lacks predictive scaling methodologies comparable to those established for pre-training. Despite rapidly rising compute budgets, there is no principled understanding of how to evaluate algorithmic improvements for scaling RL compute. We present the first large-scale systematic study, amounting to more than 400,000 GPU-hours, that defines a principled framework for analyzing and predicting RL scaling in LLMs. We fit sigmoidal compute-performance curves for RL training and ablate a wide range o"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. We demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the sigmoidal functional form fitted to smaller-scale runs will continue to hold and allow accurate extrapolation at scales an order of magnitude larger, and that the ablated design choices capture the dominant factors that determine asymptotic performance versus efficiency.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale predictions via the ScaleRL recipe.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"RL training for LLMs follows predictable sigmoidal scaling curves that enable extrapolation from small-scale runs.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"c0ca91f5f9cc5a78c459ed2a200d816ed97538917799650b118418451c4e97cf"},"source":{"id":"2510.13786","kind":"arxiv","version":1},"verdict":{"id":"7bf96e1c-3e20-4797-b093-3c83d1d7e01b","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T16:24:18.978810Z","strongest_claim":"Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. We demonstrate its effectiveness by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours.","one_line_summary":"A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale predictions via the ScaleRL recipe.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the sigmoidal functional form fitted to smaller-scale runs will continue to hold and allow accurate extrapolation at scales an order of magnitude larger, and that the ablated design choices capture the dominant factors that determine asymptotic performance versus efficiency.","pith_extraction_headline":"RL training for LLMs follows predictable sigmoidal scaling curves that enable extrapolation from small-scale runs."},"references":{"count":36,"sample":[{"doi":"","year":2025,"title":"URLhttps://hkunlp.github.io/blog/2025/Polaris. AoPS. AIME problem set 1983-2025,","work_id":"d27693b0-c6ab-488a-958d-31df012bbe1e","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Cwm: An open-weights llm for research on code generation with world models","work_id":"7a74903a-6383-48ce-97b8-17afa5faeae1","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models","work_id":"d4b4aee4-d20f-4572-886a-4ba9ea6c9b81","ref_index":3,"cited_arxiv_id":"2505.22617","is_internal_anchor":true},{"doi":"","year":null,"title":"GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning","work_id":"366607ba-e4ea-4726-98c3-63356e32351c","ref_index":4,"cited_arxiv_id":"2507.01006","is_internal_anchor":true},{"doi":"10.64434/tml.20250910","year":null,"title":"Measuring Mathematical Problem Solving With the MATH Dataset","work_id":"50652ac6-fb7c-4675-a2c2-159c241feb17","ref_index":5,"cited_arxiv_id":"2103.03874","is_internal_anchor":true}],"resolved_work":36,"snapshot_sha256":"077b8abdcdcf67bd5475ecaf9cf9b17b23e5162896a64c2c0f38393b9d31f7bb","internal_anchors":14},"formal_canon":{"evidence_count":3,"snapshot_sha256":"c7617f062677e65d6fe9c9675b38f1fd1a3100eb80adb2c77e4e88fb29f76510"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2510.13786","created_at":"2026-05-17T23:38:47.305059+00:00"},{"alias_kind":"arxiv_version","alias_value":"2510.13786v1","created_at":"2026-05-17T23:38:47.305059+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2510.13786","created_at":"2026-05-17T23:38:47.305059+00:00"},{"alias_kind":"pith_short_12","alias_value":"YWC3IJHJ47CJ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"YWC3IJHJ47CJKCAF","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"YWC3IJHJ","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":27,"internal_anchor_count":27,"sample":[{"citing_arxiv_id":"2504.12501","citing_title":"Reinforcement Learning from Human Feedback","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2512.18552","citing_title":"Toward Training Superintelligent Software Agents through Self-Play SWE-RL","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2602.04663","citing_title":"Rethinking the Design Space of Reinforcement Learning for Diffusion Models: On the Importance of Likelihood Estimation Beyond Loss Design","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2602.03839","citing_title":"Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06638","citing_title":"Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key","ref_index":79,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15871","citing_title":"Agentic Discovery of Neural Architectures: AIRA-Compose and AIRA-Design","ref_index":21,"is_internal_anchor":true},{"citing_arxiv_id":"2512.11470","citing_title":"Rethinking Expert Trajectory Utilization in LLM Post-training for Mathematical Reasoning","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2602.14868","citing_title":"Goldilocks RL: Tuning Task Difficulty to Escape Sparse Rewards for Reasoning","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12484","citing_title":"Learning, Fast and Slow: Towards LLMs That Adapt Continually","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2603.28507","citing_title":"Continued AI Scaling Requires Repeated Efficiency Doublings","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11739","citing_title":"Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation","ref_index":93,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11739","citing_title":"Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation","ref_index":93,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12484","citing_title":"Learning, Fast and Slow: Towards LLMs That Adapt Continually","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.28020","citing_title":"Cost-Aware Learning","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08472","citing_title":"Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06638","citing_title":"Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key","ref_index":79,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10194","citing_title":"TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06638","citing_title":"Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key","ref_index":76,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05365","citing_title":"ZAYA1-8B Technical Report","ref_index":158,"is_internal_anchor":true},{"citing_arxiv_id":"2604.20209","citing_title":"Scaling Self-Play with Self-Guidance","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.04077","citing_title":"Balanced Aggregation: Understanding and Fixing Aggregation Bias in GRPO","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06761","citing_title":"Weblica: Scalable and Reproducible Training Environments for Visual Web Agents","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07865","citing_title":"KL for a KL: On-Policy Distillation with Control Variate Baseline","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2604.06159","citing_title":"Target Policy Optimization","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18381","citing_title":"Learning from Less: Measuring the Effectiveness of RLVR in Low Data and Compute Regimes","ref_index":8,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":3,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/YWC3IJHJ47CJKCAFEB2EW4RBR2","json":"https://pith.science/pith/YWC3IJHJ47CJKCAFEB2EW4RBR2.json","graph_json":"https://pith.science/api/pith-number/YWC3IJHJ47CJKCAFEB2EW4RBR2/graph.json","events_json":"https://pith.science/api/pith-number/YWC3IJHJ47CJKCAFEB2EW4RBR2/events.json","paper":"https://pith.science/paper/YWC3IJHJ"},"agent_actions":{"view_html":"https://pith.science/pith/YWC3IJHJ47CJKCAFEB2EW4RBR2","download_json":"https://pith.science/pith/YWC3IJHJ47CJKCAFEB2EW4RBR2.json","view_paper":"https://pith.science/paper/YWC3IJHJ","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2510.13786&json=true","fetch_graph":"https://pith.science/api/pith-number/YWC3IJHJ47CJKCAFEB2EW4RBR2/graph.json","fetch_events":"https://pith.science/api/pith-number/YWC3IJHJ47CJKCAFEB2EW4RBR2/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/YWC3IJHJ47CJKCAFEB2EW4RBR2/action/timestamp_anchor","attest_storage":"https://pith.science/pith/YWC3IJHJ47CJKCAFEB2EW4RBR2/action/storage_attestation","attest_author":"https://pith.science/pith/YWC3IJHJ47CJKCAFEB2EW4RBR2/action/author_attestation","sign_citation":"https://pith.science/pith/YWC3IJHJ47CJKCAFEB2EW4RBR2/action/citation_signature","submit_replication":"https://pith.science/pith/YWC3IJHJ47CJKCAFEB2EW4RBR2/action/replication_record"}},"created_at":"2026-05-17T23:38:47.305059+00:00","updated_at":"2026-05-17T23:38:47.305059+00:00"}