Pith Number

pith:DHNPYQ5G

pith:2025:DHNPYQ5G2DWUV3T2BAPT6HOBNY
not attested · not anchored · not stored · refs resolved

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Baolin Peng, Hao Cheng, Jianfeng Gao, Kuan Wang, Liliang Ren, Liyuan Liu, Qing Yang, Shuohang Wang, Simon Shaolei Du, Weizhu Chen, Xuehai He, Yelong Shen, Yiping Wang, Zhiyuan Zeng

Reinforcement learning with a single training example lifts Qwen2.5-Math-1.5B's MATH500 accuracy from 36.0% to 73.6%.

arxiv:2504.20571 v3 · 2025-04-29 · cs.LG · cs.AI · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{DHNPYQ5G2DWUV3T2BAPT6HOBNY}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Reinforcement learning with verifiable reward using one training example (1-shot RLVR) is effective in incentivizing the math reasoning capabilities of large language models (LLMs). Applying RLVR to the base model Qwen2.5-Math-1.5B, we identify a single example that elevates model performance on MATH500 from 36.0% to 73.6%.

C2 · weakest assumption

That the single chosen example is not specially selected in a way that inflates generalization, and that the observed gains arise from the RL policy gradient rather than incidental effects of the training setup or prompt format.

C3 · one-line summary

RLVR with one training example raises average accuracy across six math reasoning benchmarks from 17.6% to 35.7%.

References

69 extracted · 69 resolved · 29 Pith anchors

[1] Learning to reason with LLMs · 2024
[2] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning · 2025 · arXiv:2501.12948
[3] Kimi k1.5: Scaling Reinforcement Learning with LLMs · 2025 · arXiv:2501.12599
[4] On designing effective RL reward at training time for LLM reasoning · 2024
[5] Tulu 3: Pushing Frontiers in Open Language Model Post-Training · 2024 · arXiv:2411.15124

Formal links

1 machine-checked theorem link

Cited by

39 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:50.357211Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

19dafc43a6d0ed4aee7a081f3f1dc16e2690bce9e729cb67e4cf0fedc999d5d7

Aliases

arxiv: 2504.20571 · arxiv_version: 2504.20571v3 · doi: 10.48550/arxiv.2504.20571 · pith_short_12: DHNPYQ5G2DWU · pith_short_16: DHNPYQ5G2DWUV3T2 · pith_short_8: DHNPYQ5G
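
The short aliases above are prefixes of the full 26-character identifier, and the identifier itself is consistent with an unpadded Base32 encoding of the first 16 bytes of the canonical hash. That relationship is inferred from the values on this page, not taken from a published Pith specification, and the function names `pith_number_from_hash` and `short_alias` are hypothetical. A minimal Python sketch under those assumptions:

```python
import base64

# Canonical hash and full identifier copied from this record.
CANONICAL_HASH = "19dafc43a6d0ed4aee7a081f3f1dc16e2690bce9e729cb67e4cf0fedc999d5d7"
PITH_NUMBER = "DHNPYQ5G2DWUV3T2BAPT6HOBNY"


def pith_number_from_hash(hex_digest: str) -> str:
    """Assumed derivation: RFC 4648 Base32 of the first 16 digest bytes, padding stripped."""
    prefix = bytes.fromhex(hex_digest)[:16]
    return base64.b32encode(prefix).decode("ascii").rstrip("=")


def short_alias(pith_number: str, length: int) -> str:
    """Assumed derivation: the first `length` characters of the full identifier."""
    return pith_number[:length]


derived = pith_number_from_hash(CANONICAL_HASH)
print("matches record:", derived == PITH_NUMBER)
for n in (8, 12, 16):
    print(f"pith_short_{n}:", short_alias(derived, n))
```

If the assumption holds for other records too, a client could check an identifier against its canonical hash locally. Treat the sketch as an observation about this one record, not as the builder's algorithm.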
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/DHNPYQ5G2DWUV3T2BAPT6HOBNY \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 19dafc43a6d0ed4aee7a081f3f1dc16e2690bce9e729cb67e4cf0fedc999d5d7
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "6438bc3f9b0ad57edc0c56bf97ba63f2ed3cd3b4b78e640459a294c532011eea",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2025-04-29T09:24:30Z",
    "title_canon_sha256": "a3ee8a00abb8653ecd09efee5e6e3dff17022749f7ceb3f9a19d8760fc0ac677"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2504.20571",
    "kind": "arxiv",
    "version": 3
  }
}
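
The record above can also be hashed offline, without the `curl` and `jq` steps. The sketch below applies the same serialization as the verification one-liner (sorted keys, compact separators, no ASCII escaping, UTF-8) to a copy of the canonical record. It assumes the JSON shown here is exactly what the API returns under `canonical_record`. The helper name `canonical_sha256` is hypothetical.

```python
import hashlib
import json

# Copy of the canonical record as displayed on this page (assumed complete).
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "6438bc3f9b0ad57edc0c56bf97ba63f2ed3cd3b4b78e640459a294c532011eea",
    "cross_cats_sorted": ["cs.AI", "cs.CL"],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2025-04-29T09:24:30Z",
    "title_canon_sha256": "a3ee8a00abb8653ecd09efee5e6e3dff17022749f7ceb3f9a19d8760fc0ac677"
  },
  "schema_version": "1.0",
  "source": {"id": "2504.20571", "kind": "arxiv", "version": 3}
}
"""

# Hash published on this page.
EXPECTED = "19dafc43a6d0ed4aee7a081f3f1dc16e2690bce9e729cb67e4cf0fedc999d5d7"


def canonical_sha256(record: dict) -> str:
    """SHA-256 over the record serialized with sorted keys and compact separators."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print("computed:", digest)
print("matches published hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the key order or whitespace of the input JSON. A mismatch would suggest that the copy differs from the served record or that the canonicalization rules differ from the ones assumed here.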