Pith Number

pith:DHNPYQ5G

pith:2025:DHNPYQ5G2DWUV3T2BAPT6HOBNY

not attested · not anchored · not stored · refs resolved

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Baolin Peng, Hao Cheng, Jianfeng Gao, Kuan Wang, Liliang Ren, Liyuan Liu, Qing Yang, Shuohang Wang, Simon Shaolei Du, Weizhu Chen, Xuehai He, Yelong Shen, Yiping Wang, Zhiyuan Zeng

Reinforcement learning on a single training example lifts Qwen2.5-Math-1.5B's MATH500 score from 36.0% to 73.6%.

arxiv:2504.20571 v3 · 2025-04-29 · cs.LG · cs.AI · cs.CL

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{DHNPYQ5G2DWUV3T2BAPT6HOBNY}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
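To illustrate why a mirror can recompute the same state from the same bundle, here is a minimal Python sketch of an order-independent merge. The event fields (`id`, `ts`, `key`, `value`) and the last-writer-wins rule are assumptions made for this example only; this page does not specify the actual Pith event schema or merge algorithm.

```python
from typing import Any

# Hypothetical event shape for illustration; not the actual Pith schema.
Event = dict[str, Any]


def merge_state(record: dict[str, Any], events: list[Event]) -> dict[str, Any]:
    """Fold events into a state that does not depend on arrival order."""
    # Deduplicate by event id (assumed unique per event) so that the same
    # event received from several mirrors is applied only once.
    unique = {e["id"]: e for e in events}
    # Apply events in a total order: timestamp first, id as the tie-breaker.
    ordered = sorted(unique.values(), key=lambda e: (e["ts"], e["id"]))
    state = dict(record)
    for e in ordered:
        state[e["key"]] = e["value"]  # last writer wins (assumed rule)
    return state
```

Because the events are deduplicated and then sorted before being applied, two mirrors that hold the same set of events in different orders, or with repeats, end up with the same state.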

Claims

C1 · strongest claim

Reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6%.

C2 · weakest assumption

That the single chosen example is not specially selected in a way that inflates generalization, and that the observed gains arise from the RL policy gradient rather than incidental effects of the training setup or prompt format.

C3 · one line summary

One training example via RLVR raises the average score across six math reasoning benchmarks from 17.6% to 35.7%.

References

69 extracted · 69 resolved · 29 Pith anchors

[1] Learning to reason with llms 2024

[2] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning 2025 · arXiv:2501.12948

[3] Kimi k1.5: Scaling Reinforcement Learning with LLMs 2025 · arXiv:2501.12599

[4] On designing effective rl reward at training time for llm reasoning 2024

[5] Tulu 3: Pushing Frontiers in Open Language Model Post-Training 2024 · arXiv:2411.15124

Formal links

1 machine-checked theorem link

Cited by

39 papers in Pith

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning

Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource

Small Generalizable Prompt Predictive Models Can Steer Efficient RL Post-Training of Large Reasoning Models

FBOS-RL: Feedback-Driven Bi-Objective Synergistic Reinforcement Learning

Receipt and verification

First computed	2026-05-17T23:38:50.357211Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`)
Schema	pith-number/v1.0

Canonical hash

19dafc43a6d0ed4aee7a081f3f1dc16e2690bce9e729cb67e4cf0fedc999d5d7

Aliases

arxiv: 2504.20571 · arxiv_version: 2504.20571v3 · doi: 10.48550/arxiv.2504.20571 · pith_short_12: DHNPYQ5G2DWU · pith_short_16: DHNPYQ5G2DWUV3T2 · pith_short_8: DHNPYQ5G
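For this record, the full Pith Number matches the unpadded base32 encoding of the first 16 bytes of the canonical hash, and the `pith_short_*` aliases are prefixes of it. The Python sketch below reproduces that relationship. It is an observation about the values shown on this page, not a documented derivation rule, and the function name `pith_from_hash` is made up for the example.

```python
import base64

# Canonical hash as listed on this page.
canonical_hash = "19dafc43a6d0ed4aee7a081f3f1dc16e2690bce9e729cb67e4cf0fedc999d5d7"


def pith_from_hash(hex_digest: str) -> str:
    """Base32-encode the first 16 bytes of the digest and drop '=' padding.

    Assumption: this truncation and encoding is inferred from this record.
    """
    head = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(head).decode("ascii").rstrip("=")


pith = pith_from_hash(canonical_hash)
# Short aliases are the leading 8, 12, and 16 characters.
aliases = {f"pith_short_{n}": pith[:n] for n in (8, 12, 16)}

print(pith)  # DHNPYQ5G2DWUV3T2BAPT6HOBNY
print(aliases)
```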

Verify this Pith Number yourself

# Fetch the record, extract canonical_record, re-serialize it as canonical JSON
# (sorted keys, compact separators, UTF-8), and print its SHA-256 digest.
curl -sH 'Accept: application/ld+json' https://pith.science/pith/DHNPYQ5G2DWUV3T2BAPT6HOBNY \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 19dafc43a6d0ed4aee7a081f3f1dc16e2690bce9e729cb67e4cf0fedc999d5d7

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "6438bc3f9b0ad57edc0c56bf97ba63f2ed3cd3b4b78e640459a294c532011eea",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2025-04-29T09:24:30Z",
    "title_canon_sha256": "a3ee8a00abb8653ecd09efee5e6e3dff17022749f7ceb3f9a19d8760fc0ac677"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2504.20571",
    "kind": "arxiv",
    "version": 3
  }
}
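The same check can be run offline against the record above. The Python sketch below applies the serialization used in the verification command (sorted keys, compact separators, `ensure_ascii=False`, UTF-8) to a copy of the record. The function name `canonical_sha256` is illustrative. Compare the printed digest with the canonical hash listed on this page.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record with the serialization used in the curl/jq command."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Copy of the canonical record shown on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "6438bc3f9b0ad57edc0c56bf97ba63f2ed3cd3b4b78e640459a294c532011eea",
        "cross_cats_sorted": ["cs.AI", "cs.CL"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2025-04-29T09:24:30Z",
        "title_canon_sha256": "a3ee8a00abb8653ecd09efee5e6e3dff17022749f7ceb3f9a19d8760fc0ac677",
    },
    "schema_version": "1.0",
    "source": {"id": "2504.20571", "kind": "arxiv", "version": 3},
}

digest = canonical_sha256(record)
print(digest)
```

Sorting keys and fixing the separators means the digest depends only on the record's content, not on key order or whitespace in the JSON text.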