Pith Number

pith:OQMAWUXN

pith:2024:OQMAWUXNEGORB2TU7H6MP4SDXQ

not attested · not anchored · not stored · refs resolved

Training Language Models to Self-Correct via Reinforcement Learning

Aleksandra Faust, Aviral Kumar, Avi Singh, Colton Bishop, Cosmin Paduraru, Disha Shrivastava, Doina Precup, Feryal Behbahani, George Tucker, John D Co-Reyes, Kate Baumli, Kay McKinney, Lei M Zhang, Rebecca Roelofs, Rishabh Agarwal, Shariq Iqbal, Vincent Zhuang, Yi Su

Multi-turn reinforcement learning trains language models to self-correct using only their own generated data.

arxiv:2409.12917 v2 · 2024-09-19 · cs.LG


Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live · download bundle · merged state

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

With Gemini 1.0 Pro and 1.5 Flash models, we find that SCoRe achieves state-of-the-art self-correction performance, improving the base models' self-correction by 15.6% and 9.1% respectively on MATH and HumanEval.

C2 · weakest assumption

That training under the model's own distribution of self-generated correction traces combined with the described regularization will produce effective self-correction behavior at test time rather than fitting to high-reward but non-generalizable patterns.

C3 · one-line summary

SCoRe uses multi-turn online RL with regularization on self-generated traces to improve LLM self-correction, achieving 15.6% and 9.1% gains on MATH and HumanEval for Gemini models.

References

282 extracted · 282 resolved · 63 Pith anchors

[1] Back to Basics: Revisiting REINFORCE Style Optimization for Learning from Human Feedback in LLMs 2024 · arXiv:2402.14740

[2] arXiv preprint 2023 · arXiv:2305.08844

[3] Program Synthesis with Large Language Models 2021 · arXiv:2108.07732

[5] Teaching Large Language Models to Self-Debug 2023 · arXiv:2304.05128

[7] Teaching large language models to reason with reinforcement learning 2024

Formal links

3 machine-checked theorem links

Cited by

17 papers in Pith

REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

Seg-Zero: Reasoning-Chain Guided Segmentation via Cognitive Reinforcement

Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

HuatuoGPT-o1, Towards Medical Complex Reasoning with LLMs

Receipt and verification

First computed	2026-05-17T23:38:14.174994Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

74180b52ed219d10ea74f9fcc7f243bc28772308a634fc0eec79f2445249d030

Aliases

arxiv: 2409.12917 · arxiv_version: 2409.12917v2 · doi: 10.48550/arxiv.2409.12917 · pith_short_12: OQMAWUXNEGOR · pith_short_16: OQMAWUXNEGORB2TU · pith_short_8: OQMAWUXN
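
The short aliases are prefixes of the full 26-character identifier in pith:2024:OQMAWUXNEGORB2TU7H6MP4SDXQ. For this record, that identifier also coincides with the unpadded Base32 encoding of the first 16 bytes of the canonical hash. The sketch below checks that relationship in Python. It is an observation about this one record, not a documented derivation rule; the function name `pith_id_from_hash` and the 16-byte truncation are assumptions made for illustration.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "74180b52ed219d10ea74f9fcc7f243bc28772308a634fc0eec79f2445249d030"
FULL_ID = "OQMAWUXNEGORB2TU7H6MP4SDXQ"


def pith_id_from_hash(hex_digest: str, n_bytes: int = 16) -> str:
    """Base32-encode the first n_bytes of a hex digest and drop '=' padding.

    Assumption: the 16-byte truncation is inferred from this record only.
    """
    raw = bytes.fromhex(hex_digest)[:n_bytes]
    return base64.b32encode(raw).decode("ascii").rstrip("=")


derived = pith_id_from_hash(CANONICAL_HASH)
print(derived)
print("matches full identifier:", derived == FULL_ID)

# The pith_short_* aliases are plain prefixes of the identifier.
for length in (8, 12, 16):
    print(f"pith_short_{length}:", derived[:length])
```

If the relationship holds in general, a mirror could check that an identifier and a canonical hash belong together without contacting the resolver; treat that as a hypothesis until the schema confirms it.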

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/OQMAWUXNEGORB2TU7H6MP4SDXQ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 74180b52ed219d10ea74f9fcc7f243bc28772308a634fc0eec79f2445249d030

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "25112ca15391c914fbcb0c3c54f8027aed431f584ff13d6753715ac632734758",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2024-09-19T17:16:21Z",
    "title_canon_sha256": "8ae34f7e383954bfbcd3c1e6be66a80a8a987cc48e21d4f15ed4522b6724ad23"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2409.12917",
    "kind": "arxiv",
    "version": 2
  }
}
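
The shell pipeline in the verification section can also be reproduced offline from the record printed above. The following is a minimal Python sketch, assuming the canonical form is exactly what that pipeline's `json.dumps` call produces (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes). The helper names `canonical_bytes` and `record_digest` are hypothetical. The script prints the recomputed digest and reports whether it equals the published canonical hash, without assuming that it does.

```python
import hashlib
import json

# Canonical record as displayed on this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "25112ca15391c914fbcb0c3c54f8027aed431f584ff13d6753715ac632734758",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2024-09-19T17:16:21Z",
    "title_canon_sha256": "8ae34f7e383954bfbcd3c1e6be66a80a8a987cc48e21d4f15ed4522b6724ad23"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2409.12917",
    "kind": "arxiv",
    "version": 2
  }
}
"""

PUBLISHED_HASH = "74180b52ed219d10ea74f9fcc7f243bc28772308a634fc0eec79f2445249d030"


def canonical_bytes(record: dict) -> bytes:
    """Serialize with sorted keys and compact separators, then encode as UTF-8."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def record_digest(record: dict) -> str:
    """Return the SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


record = json.loads(RECORD_JSON)
digest = record_digest(record)
print(digest)
print("matches published hash:", digest == PUBLISHED_HASH)
```

Because the keys are sorted and whitespace is removed before hashing, reordering keys or reindenting the JSON does not change the digest; only a change to a key or value does.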