Pith Number

pith:YLVMLQCG

pith:2023:YLVMLQCGH4CXTYJQVOVYBNMYU2
not attested · not anchored · not stored · refs resolved

Secrets of RLHF in Large Language Models Part I: PPO

Binghai Wang, Cheng Chang, Hang Yan, Haoran Huang, Limao Xiong, Lu Chen, Minghao Zhu, Nuo Xu, Qin Liu, Qi Zhang, Rongxiang Weng, Rui Zheng, Senjie Jin, Shihan Dou, Songyang Gao, Tao Gui, Tianxiang Sun, Wei Shen, Wenbin Lai, Wensen Cheng, Xipeng Qiu, Xuanjing Huang, Yan Liu, Yuan Hua, Yuhao Zhou, Zhangyue Yin, Zhiheng Xi

Policy constraints are the key factor for effective PPO implementation in RLHF for large language models.

arxiv:2307.04964 v2 · 2023-07-11 · cs.CL · cs.AI · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{YLVMLQCGH4CXTYJQVOVYBNMYU2}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

We identify policy constraints being the key factor for the effective implementation of the PPO algorithm. Therefore, we explore the PPO-max, an advanced version of PPO algorithm, to efficiently improve the training stability of the policy model.

C2 weakest assumption

That the observed training instability in RLHF stems primarily from policy constraint mechanics in PPO rather than from reward model quality, data selection, or other unexamined components of the full pipeline.

C3 one-line summary

Policy constraints are the critical factor for stable PPO training in RLHF, and the proposed PPO-max variant improves stability for large language model alignment.

References

60 extracted · 60 resolved · 14 Pith anchors

[1] LLaMA: Open and Efficient Foundation Language Models 2023 · arXiv:2302.13971
[2] Chiang, W.-L., Z. Li, Z. Lin, et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023
[3] GPT-4 technical report 2023
[4] A Survey of Large Language Models 2023 · arXiv:2303.18223
[5] Brown, T., B. Mann, N. Ryder, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020

Formal links

1 machine-checked theorem link

Cited by

18 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:13.976773Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

c2eac5c0463f0579e130abab80b598a6b92b3e4612bc811c174a2165e07240e4

Aliases

arxiv: 2307.04964 · arxiv_version: 2307.04964v2 · doi: 10.48550/arxiv.2307.04964 · pith_short_12: YLVMLQCGH4CX · pith_short_16: YLVMLQCGH4CXTYJQ · pith_short_8: YLVMLQCG
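
The identifiers on this page appear to be related to one another: the three `pith_short_*` aliases are prefixes of the full Pith Number, and the Pith Number itself matches the first 26 characters of the standard base32 encoding of the canonical hash. The sketch below checks both observations against the values shown here. It is an inference from this one record, not a documented derivation rule; the function names `pith_from_hash` and `short_alias` are hypothetical.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "c2eac5c0463f0579e130abab80b598a6b92b3e4612bc811c174a2165e07240e4"
PITH_NUMBER = "YLVMLQCGH4CXTYJQVOVYBNMYU2"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes (RFC 4648 alphabet) and keep a prefix.

    Assumption: truncating to 26 characters is inferred from this record only.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_alias(pith_number: str, n: int) -> str:
    """Return the first n characters of a Pith Number."""
    return pith_number[:n]


derived = pith_from_hash(CANONICAL_HASH)
aliases = {f"pith_short_{n}": short_alias(PITH_NUMBER, n) for n in (8, 12, 16)}

print(derived == PITH_NUMBER)
print(aliases)
```

If the relationship holds for other records as well, a mirror could rebuild every alias from the canonical hash alone, but that should be confirmed against the schema before relying on it.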
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/YLVMLQCGH4CXTYJQVOVYBNMYU2 \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: c2eac5c0463f0579e130abab80b598a6b92b3e4612bc811c174a2165e07240e4
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "4759771176e5a57db34b180d670038182053f6e0e0813f34e57d3db6cda5879b",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-07-11T01:55:24Z",
    "title_canon_sha256": "c69253843ad7e306e4dbfca11d0095e5519020508eb9d9fb7ea4417a398f3e42"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2307.04964",
    "kind": "arxiv",
    "version": 2
  }
}
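
The shell one-liner above can also be written as a small offline Python check. This is a minimal sketch that applies the same serialization as that command (sorted keys, compact separators, non-ASCII kept as-is, UTF-8 bytes) to the record printed on this page. The helper name `canonical_sha256` is hypothetical, and the sketch does not assert that the result equals the canonical hash, since that depends on the served record being byte-for-byte what is displayed here.

```python
import hashlib
import json


def canonical_sha256(record: dict) -> str:
    """Hash a record after serializing it the way the one-liner does."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Record copied from the "Canonical record JSON" section.
record = {
    "metadata": {
        "abstract_canon_sha256": "4759771176e5a57db34b180d670038182053f6e0e0813f34e57d3db6cda5879b",
        "cross_cats_sorted": ["cs.AI", "cs.LG"],
        "license": "http://creativecommons.org/licenses/by/4.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2023-07-11T01:55:24Z",
        "title_canon_sha256": "c69253843ad7e306e4dbfca11d0095e5519020508eb9d9fb7ea4417a398f3e42",
    },
    "schema_version": "1.0",
    "source": {"id": "2307.04964", "kind": "arxiv", "version": 2},
}

# Same content with top-level keys in a different order.
shuffled = {
    "source": record["source"],
    "schema_version": record["schema_version"],
    "metadata": record["metadata"],
}

digest = canonical_sha256(record)
print(digest)
print(canonical_sha256(shuffled) == digest)  # key order does not affect the hash
```

Sorting keys before hashing is what makes the digest independent of how a mirror happens to order the fields; compare the printed digest with the value under "Canonical hash" to complete the check.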