pith. machine review for the scientific record.
Pith Number

pith:OQMAWUXN

pith:2024:OQMAWUXNEGORB2TU7H6MP4SDXQ
not attested · not anchored · not stored · refs resolved

Training Language Models to Self-Correct via Reinforcement Learning

Aleksandra Faust, Aviral Kumar, Avi Singh, Colton Bishop, Cosmin Paduraru, Disha Shrivastava, Doina Precup, Feryal Behbahani, George Tucker, John D Co-Reyes, Kate Baumli, Kay McKinney, Lei M Zhang, Rebecca Roelofs, Rishabh Agarwal, Shariq Iqbal, Vincent Zhuang, Yi Su

Multi-turn reinforcement learning trains language models to self-correct using only their own generated data.

arxiv:2409.12917 v2 · 2024-09-19 · cs.LG

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.

C2 weakest assumption

That training under the model's own distribution of self-generated correction traces combined with the described regularization will produce effective self-correction behavior at test time rather than fitting to high-reward but non-generalizable patterns.

C3 one-line summary

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

References

282 extracted · 282 resolved · 63 Pith anchors

[1] Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs 2024 · arXiv:2402.14740
[2] arXiv preprint 2023 · arXiv:2305.08844
[3] Program Synthesis with Large Language Models 2021 · arXiv:2108.07732
[5] Teaching Large Language Models to Self-Debug 2023 · arXiv:2304.05128
[7] Teaching large language models to reason with reinforcement learning 2024

Formal links

3 machine-checked theorem links

Cited by

17 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:14.174994Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

74180b52ed219d10ea74f9fcc7f243bc28772308a634fc0eec79f2445249d030

Aliases

arxiv: 2409.12917 · arxiv_version: 2409.12917v2 · doi: 10.48550/arxiv.2409.12917 · pith_short_12: OQMAWUXNEGOR · pith_short_16: OQMAWUXNEGORB2TU · pith_short_8: OQMAWUXN
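The identifiers on this page appear to be related in a simple way. For this record, the 26-character Pith Number matches the unpadded base32 encoding of the first 16 bytes of the canonical hash, and each `pith_short_N` alias is its first N characters. The sketch below checks that observation for this record only. The page does not state this as the construction rule, so treat it as an assumption. The helper names `base32_prefix` and `short_aliases` are hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "74180b52ed219d10ea74f9fcc7f243bc28772308a634fc0eec79f2445249d030"
PITH_ID = "OQMAWUXNEGORB2TU7H6MP4SDXQ"
ALIASES = {
    "pith_short_8": "OQMAWUXN",
    "pith_short_12": "OQMAWUXNEGOR",
    "pith_short_16": "OQMAWUXNEGORB2TU",
}


def base32_prefix(hex_digest: str, n_bytes: int = 16) -> str:
    """Base32-encode the first n_bytes of a hex digest and drop '=' padding.

    Assumption: 16 bytes is inferred from the 26-character ID length,
    not taken from any Pith documentation.
    """
    raw = bytes.fromhex(hex_digest)[:n_bytes]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


def short_aliases(pith_id: str, lengths=(8, 12, 16)) -> dict:
    """Build the pith_short_N aliases as plain prefixes of the full ID."""
    return {f"pith_short_{n}": pith_id[:n] for n in lengths}


print(base32_prefix(CANONICAL_HASH) == PITH_ID)   # True
print(short_aliases(PITH_ID) == ALIASES)          # True
```

If the relationship holds in general, a short alias can be checked against a full canonical hash without contacting the service. That generalization is untested here.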
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/OQMAWUXNEGORB2TU7H6MP4SDXQ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 74180b52ed219d10ea74f9fcc7f243bc28772308a634fc0eec79f2445249d030
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "25112ca15391c914fbcb0c3c54f8027aed431f584ff13d6753715ac632734758",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2024-09-19T17:16:21Z",
    "title_canon_sha256": "8ae34f7e383954bfbcd3c1e6be66a80a8a987cc48e21d4f15ed4522b6724ad23"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2409.12917",
    "kind": "arxiv",
    "version": 2
  }
}
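The verification one-liner above can also be written as a small script, which makes the canonicalization step easier to read: keys sorted, compact separators, non-ASCII kept as-is, UTF-8 bytes hashed with SHA-256. The sketch below applies those same `json.dumps` settings to the record as printed on this page. It assumes the printed JSON is the same value the API returns under `.canonical_record`, which this page does not guarantee, so the final comparison may print False. The function names are hypothetical.

```python
import hashlib
import json

# Expected digest, copied from the "Canonical hash" section of this record.
EXPECTED = "74180b52ed219d10ea74f9fcc7f243bc28772308a634fc0eec79f2445249d030"

# Canonical record as printed on this page (assumed identical to the API value).
RECORD = {
    "metadata": {
        "abstract_canon_sha256": "25112ca15391c914fbcb0c3c54f8027aed431f584ff13d6753715ac632734758",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2024-09-19T17:16:21Z",
        "title_canon_sha256": "8ae34f7e383954bfbcd3c1e6be66a80a8a987cc48e21d4f15ed4522b6724ad23",
    },
    "schema_version": "1.0",
    "source": {"id": "2409.12917", "kind": "arxiv", "version": 2},
}


def canonical_bytes(record: dict) -> bytes:
    """Serialize with the same settings as the one-liner: sorted keys,
    no whitespace between tokens, non-ASCII left unescaped, UTF-8 encoded."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_sha256(record: dict) -> str:
    """Return the SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


digest = canonical_sha256(RECORD)
print(digest)
print("matches expected:", digest == EXPECTED)
```

Sorting keys and fixing the separators is what makes the digest independent of how a mirror happens to order or indent the JSON it stores.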