Pith Number

pith:T7VQIBNL

pith:2025:T7VQIBNLLTE2TBPWOMMR5JEKYR
not attested not anchored not stored refs resolved

RL's Razor: Why Online Reinforcement Learning Forgets Less

Idan Shenfeld, Jyothish Pari, Pulkit Agrawal

On-policy RL forgets less than SFT because it selects the minimal-KL solution to new tasks among many possibilities.

arxiv:2509.04259 v1 · 2025-09-04 · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{T7VQIBNLLTE2TBPWOMMR5JEKYR}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

On-policy RL is implicitly biased toward KL-minimal solutions among the many that solve the new task, whereas SFT can converge to distributions arbitrarily far from the base model.

C2 · weakest assumption

That the observed degree of forgetting is determined by the KL divergence between the fine-tuned and base policies, evaluated on the new task.

C3 · one-line summary

Online RL fine-tuning forgets less than SFT because it is implicitly biased toward KL-minimal solutions among all policies that solve the new task.
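
A toy calculation can make the "KL-minimal among solutions" idea concrete. The sketch below is illustrative only and is not taken from the paper: the three-answer task, the probabilities, and the choice of KL(policy || base) as the direction are all assumptions. It compares two policies that both solve a task in which answers 0 and 1 are correct: the base model renormalized onto the correct answers (the KL-minimal solution under that constraint), and a one-hot policy that commits to a single correct answer.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for discrete distributions; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical base policy over three candidate answers (toy numbers).
base = [0.5, 0.3, 0.2]
# Assumption: answers 0 and 1 both solve the new task, answer 2 does not.
correct = {0, 1}

# Solution A: restrict the base policy to correct answers and renormalize.
# Among all policies supported on the correct set, this one minimizes KL(policy || base).
mass = sum(base[i] for i in correct)
kl_minimal = [base[i] / mass if i in correct else 0.0 for i in range(len(base))]

# Solution B: put all probability on one correct answer (also solves the task).
one_hot = [0.0, 1.0, 0.0]

kl_min_value = kl(kl_minimal, base)  # equals -log(mass of correct answers under base)
kl_one_hot = kl(one_hot, base)       # equals -log(base[1])

print(f"KL-minimal solution: {kl_min_value:.4f} nats")
print(f"one-hot solution:    {kl_one_hot:.4f} nats")
```

Both policies answer correctly with probability 1, yet the one-hot policy sits much farther from the base model. Under assumption C2, that larger divergence is what would be expected to track more forgetting.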

Cited by

26 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:15.049025Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05) · public key
Schema: pith-number/v1.0

Canonical hash

9feb0405ab5cc9a985f673191ea48ac45049de8d784429319118892e7c71d92e

Aliases

arxiv: 2509.04259 · arxiv_version: 2509.04259v1 · doi: 10.48550/arxiv.2509.04259 · pith_short_12: T7VQIBNLLTE2 · pith_short_16: T7VQIBNLLTE2TBPW · pith_short_8: T7VQIBNL
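
The values on this page suggest how the identifiers relate: the Pith Number matches the first 26 characters of the RFC 4648 Base32 encoding of the canonical hash, and each pith_short_N alias is the first N characters of the Pith Number. This relationship is inferred from the values shown here rather than stated by the record, so treat the sketch below as an observation, not a specification. The function name `pith_number_from_hash` is hypothetical.

```python
import base64

# Canonical hash as displayed on this page.
CANONICAL_HASH = "9feb0405ab5cc9a985f673191ea48ac45049de8d784429319118892e7c71d92e"

def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters.

    Assumption: this mirrors how the identifier on this page relates to its hash;
    the actual builder may apply additional rules.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]

number = pith_number_from_hash(CANONICAL_HASH)

# Short aliases appear to be plain prefixes of the full number.
aliases = {f"pith_short_{n}": number[:n] for n in (8, 12, 16)}

print(number)
for name, value in sorted(aliases.items()):
    print(name, value)
```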
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/T7VQIBNLLTE2TBPWOMMR5JEKYR \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 9feb0405ab5cc9a985f673191ea48ac45049de8d784429319118892e7c71d92e
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "06de02bce6ff42333c41e0a972d2169fd6b955543aa8d13f5ef008acba3af663",
    "cross_cats_sorted": [],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2025-09-04T14:38:08Z",
    "title_canon_sha256": "9b5cf8274c7ae4584cabcb5cbf3d95444013f1dba694d01c57bdbdf3abbf1241"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2509.04259",
    "kind": "arxiv",
    "version": 1
  }
}
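
The canonicalization used by the one-liner above (sorted keys, compact separators, non-ASCII characters kept as-is, UTF-8 bytes, then SHA-256) can also be written as a small script, which may be easier to reuse than the shell pipeline. This is a sketch of the same steps applied to the record as displayed on this page; the function name `canonical_sha256` is hypothetical, and whether the printed digest equals the Canonical hash depends on the served record matching the one shown here, which this snippet does not check.

```python
import hashlib
import json

def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash the UTF-8 bytes."""
    payload = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

# Record copied from the "Canonical record JSON" block on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "06de02bce6ff42333c41e0a972d2169fd6b955543aa8d13f5ef008acba3af663",
        "cross_cats_sorted": [],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2025-09-04T14:38:08Z",
        "title_canon_sha256": "9b5cf8274c7ae4584cabcb5cbf3d95444013f1dba694d01c57bdbdf3abbf1241",
    },
    "schema_version": "1.0",
    "source": {"id": "2509.04259", "kind": "arxiv", "version": 1},
}

digest = canonical_sha256(record)
print(digest)  # compare against the Canonical hash shown on this page
```

Because keys are sorted before hashing, the digest does not depend on the key order or whitespace of the JSON a mirror happens to serve.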