Pith Number

pith:YLVMLQCG

pith:2023:YLVMLQCGH4CXTYJQVOVYBNMYU2

not attested · not anchored · not stored · refs resolved

Secrets of RLHF in Large Language Models Part I: PPO

Binghai Wang, Cheng Chang, Hang Yan, Haoran Huang, Limao Xiong, Lu Chen, Minghao Zhu, Nuo Xu, Qin Liu, Qi Zhang, Rongxiang Weng, Rui Zheng, Senjie Jin, Shihan Dou, Songyang Gao, Tao Gui, Tianxiang Sun, Wei Shen, Wenbin Lai, Wensen Cheng, Xipeng Qiu, Xuanjing Huang, Yan Liu, Yuan Hua, Yuhao Zhou, Zhangyue Yin, Zhiheng Xi

Policy constraints are the key factor for effective PPO implementation in RLHF for large language models.

arxiv:2307.04964 v2 · 2023-07-11 · cs.CL · cs.AI · cs.LG

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{YLVMLQCGH4CXTYJQVOVYBNMYU2}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

We identify policy constraints as the key factor for the effective implementation of the PPO algorithm. We therefore explore PPO-max, an advanced version of the PPO algorithm, to efficiently improve the training stability of the policy model.

C2 weakest assumption

That the observed training instability in RLHF stems primarily from policy constraint mechanics in PPO rather than from reward model quality, data selection, or other unexamined components of the full pipeline.

C3 one-line summary

Policy constraints are the critical factor for stable PPO training in RLHF, and the proposed PPO-max variant improves stability for large language model alignment.

References

60 extracted · 60 resolved · 14 Pith anchors

[1] LLaMA: Open and Efficient Foundation Language Models 2023 · arXiv:2302.13971

[2] Chiang, W.-L., Z. Li, Z. Lin, et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023

[3] GPT-4 technical report 2023

[4] A Survey of Large Language Models 2023 · arXiv:2303.18223

[5] Brown, T., B. Mann, N. Ryder, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

Formal links

1 machine-checked theorem link

Cited by

18 papers in Pith

BalancedDPO: Adaptive Multi-Metric Alignment

Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

Structure Matters: Evaluating Multi-Agents Orchestration in Generative Therapeutic Chatbots

Receipt and verification

First computed	2026-05-17T23:38:13.976773Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

c2eac5c0463f0579e130abab80b598a6b92b3e4612bc811c174a2165e07240e4

Aliases

arxiv: 2307.04964 · arxiv_version: 2307.04964v2 · doi: 10.48550/arxiv.2307.04964 · pith_short_12: YLVMLQCGH4CX · pith_short_16: YLVMLQCGH4CXTYJQ · pith_short_8: YLVMLQCG
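The values on this page suggest two relationships that the page itself does not state, so treat both as observations rather than documented rules: each `pith_short_N` alias looks like the first N characters of the full identifier, and the 26-character identifier looks like the start of the RFC 4648 base32 encoding of the canonical hash bytes. A minimal Python sketch that checks both against the values listed here (the helper names `short_alias` and `base32_prefix` are hypothetical, introduced only for illustration):

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "c2eac5c0463f0579e130abab80b598a6b92b3e4612bc811c174a2165e07240e4"
PITH_ID = "YLVMLQCGH4CXTYJQVOVYBNMYU2"
ALIASES = {
    "pith_short_8": "YLVMLQCG",
    "pith_short_12": "YLVMLQCGH4CX",
    "pith_short_16": "YLVMLQCGH4CXTYJQ",
}


def short_alias(pith_id: str, length: int) -> str:
    """Assumed rule: a short alias is the first `length` characters of the ID."""
    return pith_id[:length]


def base32_prefix(hex_digest: str, length: int = 26) -> str:
    """Assumed rule: base32 of the raw hash bytes, truncated to `length` chars."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


# The alias length is taken from the numeric suffix of each alias name.
alias_ok = all(
    short_alias(PITH_ID, int(name.rsplit("_", 1)[1])) == value
    for name, value in ALIASES.items()
)
id_ok = base32_prefix(CANONICAL_HASH) == PITH_ID

print(alias_ok, id_ok)  # prints: True True
```

Both checks pass for this record. That is consistent with the assumed derivation but does not confirm it as the builder's actual algorithm.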

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/YLVMLQCGH4CXTYJQVOVYBNMYU2 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: c2eac5c0463f0579e130abab80b598a6b92b3e4612bc811c174a2165e07240e4

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "4759771176e5a57db34b180d670038182053f6e0e0813f34e57d3db6cda5879b",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-07-11T01:55:24Z",
    "title_canon_sha256": "c69253843ad7e306e4dbfca11d0095e5519020508eb9d9fb7ea4417a398f3e42"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2307.04964",
    "kind": "arxiv",
    "version": 2
  }
}
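The shell pipeline under "Verify this Pith Number yourself" can also be mirrored offline. The sketch below applies the same canonicalization as that one-liner (sorted keys, compact separators, `ensure_ascii=False`, UTF-8 bytes) to the record shown above and compares the SHA-256 digest with the published canonical hash. It assumes the JSON printed on this page is identical to what the resolver returns. The names `canonical_bytes` and `canonical_sha256` are hypothetical helpers, not part of any Pith tooling.

```python
import hashlib
import json

# Canonical record as printed on this page (assumed to match the resolver output).
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "4759771176e5a57db34b180d670038182053f6e0e0813f34e57d3db6cda5879b",
    "cross_cats_sorted": ["cs.AI", "cs.LG"],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-07-11T01:55:24Z",
    "title_canon_sha256": "c69253843ad7e306e4dbfca11d0095e5519020508eb9d9fb7ea4417a398f3e42"
  },
  "schema_version": "1.0",
  "source": {"id": "2307.04964", "kind": "arxiv", "version": 2}
}
"""

EXPECTED = "c2eac5c0463f0579e130abab80b598a6b92b3e4612bc811c174a2165e07240e4"


def canonical_bytes(record) -> bytes:
    """Serialize with sorted keys and no insignificant whitespace, as in the one-liner."""
    text = json.dumps(record, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return text.encode("utf-8")


def canonical_sha256(record) -> str:
    """Hex SHA-256 digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(record)).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Because the digest depends on every byte of the serialization, a mismatch here could mean either a changed record or a difference between this page's rendering and the resolver's JSON. The `curl` pipeline remains the authoritative check.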