Pith Number

pith:QAEGPR6G

pith:2025:QAEGPR6GQVUOWUPBXPUKW3PMDG
not attested · not anchored · not stored · refs resolved

Soft Adaptive Policy Optimization

An Yang, Bowen Yu, Chang Gao, Chujie Zheng, Jingren Zhou, Junyang Lin, Kai Dang, Shixuan Liu, Shuai Bai, Xiong-Hui Chen

A smooth temperature-controlled gate replaces hard clipping to stabilize reinforcement learning updates for language models.

arxiv:2511.20347 v2 · 2025-11-25 · cs.LG · cs.AI · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{QAEGPR6GQVUOWUPBXPUKW3PMDG}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

Empirical results on mathematical reasoning benchmarks indicate that SAPO exhibits improved training stability and higher Pass@1 performance under comparable training budgets. Moreover, we employ SAPO to train the Qwen3-VL model series, demonstrating that SAPO yields consistent performance gains across diverse tasks and different model sizes.

C2 · weakest assumption

That the smooth temperature-controlled gate selectively attenuates only harmful off-policy signals without suppressing useful learning gradients or introducing new instabilities that hard clipping avoided.

C3 · one-line summary

SAPO introduces smooth adaptive gating to replace hard clipping in token- and sequence-level policy optimization for more stable LLM reinforcement learning.
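
To make the contrast in the summary concrete, here is a minimal sketch comparing the per-token gradient weight under PPO-style hard clipping with a smooth sigmoid-shaped gate. The gate form `4 * s * (1 - s)` with `s = sigmoid(tau * (ratio - 1))`, the function names, and the values of `eps` and `tau` are illustrative assumptions; this record does not state the paper's exact formula, so treat it as one plausible shape of a temperature-controlled gate rather than the authors' implementation.

```python
import math


def hard_clip_weight(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """Gradient weight on the importance ratio under PPO-style clipping.

    The weight is 1 while the update stays inside the trust region and
    drops abruptly to 0 once the clipped branch of the objective is active.
    """
    if advantage > 0 and ratio > 1 + eps:
        return 0.0
    if advantage < 0 and ratio < 1 - eps:
        return 0.0
    return 1.0


def soft_gate_weight(ratio: float, tau: float) -> float:
    """Hypothetical smooth gate: equals 1 at ratio == 1 and decays
    continuously as the ratio moves off-policy. A larger tau (temperature)
    makes the decay faster."""
    s = 1.0 / (1.0 + math.exp(-tau * (ratio - 1.0)))
    return 4.0 * s * (1.0 - s)


if __name__ == "__main__":
    # Compare both weights for a positive-advantage token at several ratios.
    for r in (0.7, 0.9, 1.0, 1.1, 1.3):
        hard = hard_clip_weight(r, advantage=1.0)
        soft = soft_gate_weight(r, tau=10.0)
        print(f"ratio={r:.1f}  hard={hard:.1f}  soft={soft:.3f}")
```

The hard-clipped weight is a step function of the ratio, while the gated weight shrinks gradually, so tokens slightly outside the trust region still contribute a reduced gradient instead of none.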

References

12 extracted · 12 resolved · 5 Pith anchors

[1] AIME problems and solutions 2025
[2] The sufficiency of off-policyness and soft clipping: PPO is still insufficient according to an off-policy measure 2023
[3] DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning 2025 · arXiv:2501.12948
[4] HMMT 2025. https://www.hmmt.org 2025
[5] LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code 2024 · arXiv:2403.07974

Formal links

2 machine-checked theorem links

Cited by

40 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:53.164767Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

800867c7c68568eb51e1bbe8ab6dec19ad280eec8e7f7d09d5ee95de3cb8e144

Aliases

arxiv: 2511.20347 · arxiv_version: 2511.20347v2 · doi: 10.48550/arxiv.2511.20347 · pith_short_12: QAEGPR6GQVUO · pith_short_16: QAEGPR6GQVUOWUPB · pith_short_8: QAEGPR6G
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/QAEGPR6GQVUOWUPBXPUKW3PMDG \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 800867c7c68568eb51e1bbe8ab6dec19ad280eec8e7f7d09d5ee95de3cb8e144
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "87ba5eef28f6f15dd14bb0c369fff2172ee7a06436f5bd193fe0f7ecba7898a4",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.LG",
    "submitted_at": "2025-11-25T14:25:19Z",
    "title_canon_sha256": "9dbd102b340aee7e9177a9024622d6d374cd500709201ceaa0298a3715c1ed8d"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2511.20347",
    "kind": "arxiv",
    "version": 2
  }
}
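
The shell recipe above can also be reproduced offline. The sketch below applies the same canonicalization as the `python3 -c` one-liner (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes) to the record shown on this page. The function name `canonical_sha256` is a hypothetical helper, and the record literal is copied by hand from the JSON above, so whether the printed digest matches the canonical hash depends on that copy being exact.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and compact separators, then hash the
    UTF-8 bytes. Mirrors the json.dumps arguments in the curl recipe."""
    payload = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Canonical record as displayed on this page (copied by hand).
record = {
    "metadata": {
        "abstract_canon_sha256": "87ba5eef28f6f15dd14bb0c369fff2172ee7a06436f5bd193fe0f7ecba7898a4",
        "cross_cats_sorted": ["cs.AI", "cs.CL"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.LG",
        "submitted_at": "2025-11-25T14:25:19Z",
        "title_canon_sha256": "9dbd102b340aee7e9177a9024622d6d374cd500709201ceaa0298a3715c1ed8d",
    },
    "schema_version": "1.0",
    "source": {"id": "2511.20347", "kind": "arxiv", "version": 2},
}

digest = canonical_sha256(record)
print(digest)  # compare by eye with the "Canonical hash" value on this page
```

Because keys are sorted before hashing, the digest does not depend on the order in which the fields appear in the source JSON, which is what lets a mirror recompute the same value.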