Pith Number

pith:XP5QRJWO

pith:2025:XP5QRJWORXVGYL5LSNZ3LPVQYR

not attested · not anchored · not stored · refs pending

A Model Can Help Itself: Reward-Free Self-Training for LLM Reasoning

Anthony Man-Cho So, Lei Zhao, Mengqi Li, Ruoyu Sun, Xiao Li

Language models can improve their reasoning by training on responses they generate themselves.

arxiv:2510.18814 v3 · 2025-10-21 · cs.LG · cs.AI


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{XP5QRJWORXVGYL5LSNZ3LPVQYR}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1. Bitcoin timestamp
2. Internet Archive
3. Author claim: open
4. Citations: open
5. Replications: open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

We show that they can. We propose Self-evolving Post-Training (SePT), a simple post-training method that alternates between self-generation and training on self-generated responses. Across six math reasoning benchmarks, SePT improves a strong no-training baseline... and in some settings can even approach the performance of Reinforcement Learning with Verifiable Rewards (RLVR).

C2 · weakest assumption

That self-generated responses supply a net positive training signal rather than reinforcing the model's existing errors or hallucinations, which must hold for the iterative self-training loop to produce sustained gains without external verification or filtering.

C3 · one line summary

SePT enables LLMs to improve math reasoning on multiple benchmarks by iteratively training on their own low-temperature generated responses using an online data refresh mechanism.
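The alternation described in these claims (generate at low temperature, then train on the fresh samples) can be illustrated with a toy model. The sketch below is not the paper's implementation: the "model" is just a softmax over three candidate answers, and the names `self_train`, `samples`, `temperature`, and `lr` are hypothetical choices made for illustration. It only shows why sampling below temperature 1 and fitting those samples tends to sharpen a distribution.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional sampling temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def self_train(logits, rounds=5, samples=2000, temperature=0.5, lr=1.0, seed=0):
    """Toy loop: resample from the current model each round, then fit the samples."""
    rng = random.Random(seed)
    logits = list(logits)
    k = len(logits)
    for _ in range(rounds):
        # Self-generation: draw a fresh batch from the current model at low temperature.
        low_temp_probs = softmax(logits, temperature)
        batch = rng.choices(range(k), weights=low_temp_probs, k=samples)
        freq = [batch.count(i) / samples for i in range(k)]
        # Training: one gradient-ascent step on the batch's mean log-likelihood.
        # For a softmax model that gradient is (empirical frequency - model probability).
        probs = softmax(logits)
        logits = [l + lr * (f - p) for l, f, p in zip(logits, freq, probs)]
    return logits


initial = [1.0, 0.5, 0.0]  # assumed starting preferences over three answers
before = softmax(initial)
after = softmax(self_train(initial))
print("max probability before:", round(max(before), 3))
print("max probability after: ", round(max(after), 3))
```

In this toy setting the most likely answer gains probability every round, whether or not it is correct, which is exactly the risk named in the weakest-assumption claim above.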

Formal links

2 machine-checked theorem links

Cited by

2 papers in Pith

Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients

Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation

Receipt and verification

First computed	2026-05-20T00:00:26.365399Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

bbfb08a6ce8dea6c2fab9373b5beb0c471ac5638b82199949a8de7ac157dbfa3

Aliases

arxiv: 2510.18814 · arxiv_version: 2510.18814v3 · doi: 10.48550/arxiv.2510.18814 · pith_short_12: XP5QRJWORXVG · pith_short_16: XP5QRJWORXVGYL5L · pith_short_8: XP5QRJWO
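The short aliases listed here are prefixes of the full 26-character identifier. The identifier itself also appears to be the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. That second relationship is inferred from the values on this page, not stated by the record, so treat the sketch below as an observation rather than the registry's documented derivation. The helper names `short_aliases` and `id_from_hash` are hypothetical.

```python
import base64

# Values copied from this record.
canonical_hash = "bbfb08a6ce8dea6c2fab9373b5beb0c471ac5638b82199949a8de7ac157dbfa3"
full_id = "XP5QRJWORXVGYL5LSNZ3LPVQYR"


def short_aliases(pith_id, lengths=(8, 12, 16)):
    """Build the pith_short_N aliases as plain prefixes of the full identifier."""
    return {f"pith_short_{n}": pith_id[:n] for n in lengths}


def id_from_hash(hex_digest, length=26):
    """Assumed derivation: Base32-encode the raw digest and keep a prefix."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


aliases = short_aliases(full_id)
derived = id_from_hash(canonical_hash)
print(aliases)
print("derived id matches:", derived == full_id)
```

For this record the derived string agrees with the identifier shown at the top of the page, so a client could check that an identifier and a canonical hash belong together without a network call, provided the inferred rule holds for other records too.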


Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/XP5QRJWORXVGYL5LSNZ3LPVQYR \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: bbfb08a6ce8dea6c2fab9373b5beb0c471ac5638b82199949a8de7ac157dbfa3

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "a2a6d74918e4bf28b16bc80f1d5e27a016ac55827482ebac9d171647432ca76e",
    "cross_cats_sorted": [
      "cs.AI"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2025-10-21T17:15:56Z",
    "title_canon_sha256": "3da41121e66a06f632b807a7bacb8a67fc88c090f8e70453dc0c1151ff1a4d99"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2510.18814",
    "kind": "arxiv",
    "version": 3
  }
}
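The same check can be run offline against the record printed above. This is a minimal sketch that mirrors the serialization settings of the `curl` pipeline (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes). The function name `canonical_sha256` is a hypothetical helper, and the result matches the published canonical hash only if the record shown here is byte-for-byte what the resolver returns.

```python
import hashlib
import json


def canonical_sha256(record):
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Canonical record copied from this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "a2a6d74918e4bf28b16bc80f1d5e27a016ac55827482ebac9d171647432ca76e",
        "cross_cats_sorted": ["cs.AI"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2025-10-21T17:15:56Z",
        "title_canon_sha256": "3da41121e66a06f632b807a7bacb8a67fc88c090f8e70453dc0c1151ff1a4d99",
    },
    "schema_version": "1.0",
    "source": {"id": "2510.18814", "kind": "arxiv", "version": 3},
}

digest = canonical_sha256(record)
print(digest)  # compare with the canonical hash listed on this page
```

Because keys are sorted before hashing, the order in which a mirror stores the fields does not change the digest. Any difference in a value, including whitespace inside a string, does.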