Pith Number

pith:T7VQIBNL

pith:2025:T7VQIBNLLTE2TBPWOMMR5JEKYR

not attested · not anchored · not stored · refs resolved

RL's Razor: Why Online Reinforcement Learning Forgets Less

Idan Shenfeld, Jyothish Pari, Pulkit Agrawal

On-policy RL forgets less than SFT because it selects the minimal-KL solution to new tasks among many possibilities.

arxiv:2509.04259 v1 · 2025-09-04 · cs.LG


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{T7VQIBNLLTE2TBPWOMMR5JEKYR}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live · download bundle · merged state

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
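The page does not spell out the merge algorithm, so the following is only a minimal sketch of what "deterministic" can mean here: events are applied in a fixed total order, so two mirrors that receive the same events in different orders arrive at the same state. The event fields (`id`, `ts`, `field`, `value`), the tie-break rule, and the sample values are all hypothetical, and signature checking is left out.

```python
def merge_events(canonical_record, events):
    """Fold events into a state in a fixed total order, so arrival order does not matter."""
    state = dict(canonical_record)
    # Hypothetical ordering rule: sort by timestamp, break ties by event id.
    for event in sorted(events, key=lambda e: (e["ts"], e["id"])):
        state[event["field"]] = event["value"]
    return state

# Hypothetical record and events, for illustration only.
record = {"source": "arxiv:2509.04259"}
events = [
    {"id": "e2", "ts": 2, "field": "author_claim", "value": "claimed"},
    {"id": "e1", "ts": 1, "field": "author_claim", "value": "open"},
]

state_a = merge_events(record, events)
state_b = merge_events(record, list(reversed(events)))
```

Both calls yield the same state because the sort, not the arrival order, decides which event is applied last.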

Claims

C1 strongest claim

On-policy RL is implicitly biased towards KL-minimal solutions among the many that solve the new task, whereas SFT can converge to distributions arbitrarily far from the base model.

C2 weakest assumption

The observed degree of forgetting is determined by the KL divergence between the fine-tuned and base policies, evaluated on the new task.

C3 one line summary

Online RL fine-tuning forgets less than SFT because it is implicitly biased toward KL-minimal solutions among all policies that solve the new task.
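To make the KL-minimal claim concrete, here is a toy sketch, not the paper's experimental setup. It assumes a single prompt with four possible answers, a hypothetical base policy, and two hypothetical fine-tuned policies that both place all their probability on correct answers. The direction of the divergence, KL(fine-tuned || base), is also an assumption chosen for illustration.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical base policy over four answers; answers 0 and 1 are the correct ones.
base = [0.40, 0.10, 0.30, 0.20]

# Two hypothetical fine-tuned policies that both solve the task.
rescaled = [0.80, 0.20, 0.00, 0.00]   # base policy renormalised over the correct answers
collapsed = [0.00, 1.00, 0.00, 0.00]  # all mass on the correct answer the base liked least

candidates = {"rescaled": rescaled, "collapsed": collapsed}
kl_to_base = {name: kl_divergence(p, base) for name, p in candidates.items()}
closest = min(kl_to_base, key=kl_to_base.get)
```

Both candidates are perfect on the new task, yet `rescaled` sits ln 2 nats from the base while `collapsed` sits ln 10 nats away. In the terms of the claim above, on-policy RL would tend towards the first kind of solution, while SFT could end up at either depending on its training labels.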

References

12 extracted · 12 resolved · 0 Pith anchors

[1] Reinforcement fine-tuning naturally mitigates forgetting in continual post-training 2019

[2] We trained multiple models under a broad sweep of hyperparameters (see Table 2)

[3] For Math and Science Q&A, accuracy was measured by comparing the model’s final answer to the ground truth, ignoring intermediate reasoning chains

[4] We assessed performance on unrelated benchmarks as described in Section 3.1, using the Language Model Evaluation Harness (Gao et al., 2024) 2024

[5] From the trained models, we retained only those lying within 2 accuracy points of the Pareto frontier

Cited by

26 papers in Pith

Towards Long-Lived Robots: Continual Learning VLA Models via Reinforcement Fine-Tuning

Understanding and Exploiting Weight Update Sparsity for Communication-Efficient Distributed RL

Preserving Foundational Capabilities in Flow-Matching VLAs through Conservative SFT

Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

Emergent Slow Thinking in LLMs as Inverse Tree Freezing

Receipt and verification

First computed	2026-05-17T23:38:15.049025Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

9feb0405ab5cc9a985f673191ea48ac45049de8d784429319118892e7c71d92e

Aliases

arxiv: 2509.04259 · arxiv_version: 2509.04259v1 · doi: 10.48550/arxiv.2509.04259 · pith_short_12: T7VQIBNLLTE2 · pith_short_16: T7VQIBNLLTE2TBPW · pith_short_8: T7VQIBNL
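The short aliases are prefixes of the full Pith Number, and for this record the full number coincides with the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash. The sketch below checks that relationship. It is an observation about this one record rather than a documented derivation rule, and `pith_from_hash` is a hypothetical helper name.

```python
import base64

def pith_from_hash(hex_digest, length=26):
    """Base32-encode a hex SHA-256 digest and keep the first `length` characters."""
    return base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")[:length]

canonical_hash = "9feb0405ab5cc9a985f673191ea48ac45049de8d784429319118892e7c71d92e"
pith_number = pith_from_hash(canonical_hash)

# The 8-, 12- and 16-character aliases are simply prefixes of the full number.
short_aliases = {n: pith_number[:n] for n in (8, 12, 16)}
```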

Agent API

Resolver JSON Graph JSON Events JSON Schema Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/T7VQIBNLLTE2TBPWOMMR5JEKYR \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 9feb0405ab5cc9a985f673191ea48ac45049de8d784429319118892e7c71d92e
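The same canonicalisation can be done without `curl` or `jq` when the record is already in memory. This sketch reuses the serialisation settings from the one-liner above (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes). `canonical_hash` is a hypothetical helper name, and the small record passed to it is a cut-down example rather than the full canonical record.

```python
import hashlib
import json

def canonical_hash(record):
    """SHA-256 over JSON serialised with sorted keys, compact separators and UTF-8."""
    payload = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Key order in the input does not affect the digest, because keys are sorted first.
a = canonical_hash({"schema_version": "1.0", "source": {"id": "2509.04259", "kind": "arxiv"}})
b = canonical_hash({"source": {"kind": "arxiv", "id": "2509.04259"}, "schema_version": "1.0"})
```

Sorting keys and fixing the separators is what lets independent verifiers reproduce the same bytes, and therefore the same digest, from logically equal JSON.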

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "06de02bce6ff42333c41e0a972d2169fd6b955543aa8d13f5ef008acba3af663",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2025-09-04T14:38:08Z",
    "title_canon_sha256": "9b5cf8274c7ae4584cabcb5cbf3d95444013f1dba694d01c57bdbdf3abbf1241"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2509.04259",
    "kind": "arxiv",
    "version": 1
  }
}