Pith Number

pith:YWC3IJHJ

pith:2025:YWC3IJHJ47CJKCAFEB2EW4RBR2
not attested · not anchored · not stored · refs resolved

The Art of Scaling Reinforcement Learning Compute for LLMs

David Brandfonbrener, Devvrit Khatri, Inderjit S. Dhillon, Lovish Madaan, Manzil Zaheer, Rachit Bansal, Rishabh Agarwal, Rishabh Tiwari, Sai Surya Duvvuri

RL training for LLMs follows predictable sigmoidal scaling curves that enable extrapolation from small-scale runs.

arxiv:2510.13786 v1 · 2025-10-15 · cs.LG · cs.AI

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{YWC3IJHJ47CJKCAFEB2EW4RBR2}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Stable, scalable recipes follow predictable scaling trajectories, enabling extrapolation from smaller-scale runs. We demonstrate the effectiveness of the ScaleRL recipe by successfully scaling and predicting validation performance on a single RL run scaled up to 100,000 GPU-hours.

C2 · weakest assumption

That the sigmoidal functional form fitted to smaller-scale runs will continue to hold and allow accurate extrapolation at scales an order of magnitude larger, and that the ablated design choices capture the dominant factors that determine asymptotic performance versus efficiency.

C3 · one-line summary

A 400k+ GPU-hour study shows RL scaling in LLMs follows predictable sigmoidal trajectories, with most design choices affecting efficiency rather than the performance asymptote, enabling accurate large-scale predictions via the ScaleRL recipe.
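
The claims above rest on fitting a sigmoidal curve to small-scale runs and extrapolating it to larger compute. A minimal sketch of that workflow follows, under loud assumptions: this record only says "sigmoidal", so the specific parametrisation (floor `r0`, asymptote `a`, slope exponent `b`, midpoint compute `c_mid`), the helper names `sigmoid_curve` and `fit_grid`, and the synthetic data points are all illustrative and not taken from the paper. The coarse grid search stands in for a proper nonlinear least-squares fit.

```python
import itertools


def sigmoid_curve(compute, r0, a, b, c_mid):
    """Saturating sigmoid in compute (assumed form, not quoted from the paper).

    Starts near r0 at low compute, approaches the asymptote a at high compute,
    and reaches the halfway point (r0 + a) / 2 when compute == c_mid.
    """
    return r0 + (a - r0) / (1.0 + (c_mid / compute) ** b)


def fit_grid(points, r0, a_grid, b_grid, c_mid_grid):
    """Return the (a, b, c_mid) triple with the lowest squared error on points."""
    def sse(params):
        a, b, c_mid = params
        return sum((sigmoid_curve(c, r0, a, b, c_mid) - y) ** 2 for c, y in points)

    return min(itertools.product(a_grid, b_grid, c_mid_grid), key=sse)


# Synthetic "small-scale runs" generated from known parameters (not real data).
TRUE = dict(r0=0.2, a=0.6, b=1.5, c_mid=2000.0)
small_runs = [(c, sigmoid_curve(c, **TRUE)) for c in (250, 500, 1000, 2000, 4000)]

# Hypothetical candidate grids; a real fit would use a continuous optimiser.
a_grid = [0.5, 0.55, 0.6, 0.65, 0.7]
b_grid = [1.0, 1.5, 2.0]
c_mid_grid = [1000.0, 2000.0, 4000.0]

fitted = fit_grid(small_runs, TRUE["r0"], a_grid, b_grid, c_mid_grid)

# Extrapolate to 10x the largest compute budget used for fitting.
predicted = sigmoid_curve(40000, TRUE["r0"], *fitted)
print(fitted, round(predicted, 4))
```

In this parametrisation `a` alone sets the performance ceiling, while `b` and `c_mid` control how quickly the ceiling is approached, which mirrors the asymptote-versus-efficiency distinction drawn in the summary.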

References

36 extracted · 36 resolved · 14 Pith anchors

[1] URL https://hkunlp.github.io/blog/2025/Polaris. AoPS. AIME problem set 1983-2025, 2025
[2] CWM: An open-weights LLM for research on code generation with world models
[3] The Entropy Mechanism of Reinforcement Learning for Reasoning Language Models · arXiv:2505.22617
[4] GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning · arXiv:2507.01006
[5] Measuring Mathematical Problem Solving With the MATH Dataset · doi:10.64434/tml.20250910

Formal links

3 machine-checked theorem links

Cited by

27 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:47.304966Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

c585b424e9e7c495080520744b72218e966160a63d15d1223b48ca4c80d67e12

Aliases

arxiv: 2510.13786 · arxiv_version: 2510.13786v1 · doi: 10.48550/arxiv.2510.13786 · pith_short_12: YWC3IJHJ47CJ · pith_short_16: YWC3IJHJ47CJKCAF · pith_short_8: YWC3IJHJ
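
The Pith aliases listed here appear to be prefixes of the Base32 encoding of the canonical hash. The sketch below shows that relationship using the values on this page. It is inferred from those values rather than from a documented Pith specification, and `pith_number` is a hypothetical helper name.

```python
import base64

# Canonical hash as published on this record.
CANONICAL_HASH = "c585b424e9e7c495080520744b72218e966160a63d15d1223b48ca4c80d67e12"


def pith_number(hex_digest, length=26):
    """Base32-encode the digest bytes (RFC 4648 alphabet) and keep a prefix.

    Assumption: identifiers are plain prefixes of the encoded digest,
    as observed for this record only.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


print(pith_number(CANONICAL_HASH))       # YWC3IJHJ47CJKCAFEB2EW4RBR2
print(pith_number(CANONICAL_HASH, 16))   # YWC3IJHJ47CJKCAF
print(pith_number(CANONICAL_HASH, 12))   # YWC3IJHJ47CJ
print(pith_number(CANONICAL_HASH, 8))    # YWC3IJHJ
```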
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/YWC3IJHJ47CJKCAFEB2EW4RBR2 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: c585b424e9e7c495080520744b72218e966160a63d15d1223b48ca4c80d67e12
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "713ae47eea08fff4bed2b11c38746e1499694d17cccc4516db60778642b19026",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2025-10-15T17:43:03Z",
    "title_canon_sha256": "9487e005a66954e91c32149adfded5424cdd518509be36aa2df7a2394e1bcee8"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2510.13786",
    "kind": "arxiv",
    "version": 1
  }
}
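
The verification pipeline serialises the canonical record with sorted keys and compact separators before hashing it. The sketch below repeats that step offline in Python on the record shown above, using the same `json.dumps` arguments as the one-liner. The function name `canonical_sha256` is illustrative, and the sketch has not been checked against the live endpoint, so no expected digest is asserted here.

```python
import hashlib
import json

# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "713ae47eea08fff4bed2b11c38746e1499694d17cccc4516db60778642b19026",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2025-10-15T17:43:03Z",
        "title_canon_sha256": "9487e005a66954e91c32149adfded5424cdd518509be36aa2df7a2394e1bcee8",
    },
    "schema_version": "1.0",
    "source": {"id": "2510.13786", "kind": "arxiv", "version": 1},
}


def canonical_sha256(obj):
    """Hash a JSON value after a deterministic serialisation.

    Sorted keys and fixed separators make the byte string independent of
    key order and whitespace in the input.
    """
    payload = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash published on this record
```

Because the serialisation sorts keys, two copies of the record that differ only in key order or indentation produce the same digest.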