pith:YLVMLQCG
Secrets of RLHF in Large Language Models Part I: PPO
Policy constraints are the key factor for effective PPO implementation in RLHF for large language models.
arxiv:2307.04964 v2 · 2023-07-11 · cs.CL · cs.AI · cs.LG
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{YLVMLQCGH4CXTYJQVOVYBNMYU2}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
We identify policy constraints as the key factor for the effective implementation of the PPO algorithm. We therefore explore PPO-max, an advanced version of the PPO algorithm, to efficiently improve the training stability of the policy model.
This rests on the assumption that the observed training instability in RLHF stems primarily from policy constraint mechanics in PPO rather than from reward model quality, data selection, or other unexamined components of the full pipeline.
Policy constraints are the critical factor for stable PPO training in RLHF, and the proposed PPO-max variant improves stability for large language model alignment.
Receipt and verification
| First computed | 2026-05-17T23:38:13.976773Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
c2eac5c0463f0579e130abab80b598a6b92b3e4612bc811c174a2165e07240e4
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/YLVMLQCGH4CXTYJQVOVYBNMYU2 \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: c2eac5c0463f0579e130abab80b598a6b92b3e4612bc811c174a2165e07240e4
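The same check can be written as a small Python script, which may be easier to reuse than the shell pipeline. This is a minimal sketch that mirrors the one-liner above: it serializes a record with sorted keys, compact separators, and non-ASCII characters left unescaped, then takes the SHA-256 of the UTF-8 bytes. The function names `canonical_hash` and `matches` are hypothetical helpers introduced here for illustration, and the sketch assumes the record has already been fetched and parsed into a `dict`.

```python
import hashlib
import json


def canonical_hash(record: dict) -> str:
    """Return the SHA-256 hex digest of a record's canonical JSON form.

    Canonical form here means: keys sorted, no whitespace between tokens,
    non-ASCII characters kept as-is, encoded as UTF-8. These are the same
    json.dumps options used in the shell one-liner.
    """
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def matches(record: dict, expected_hex: str) -> bool:
    """Compare a record's canonical hash with an expected hex digest."""
    return canonical_hash(record) == expected_hex.lower()


if __name__ == "__main__":
    # Toy record for demonstration only; not the record served by the API.
    toy = {"schema_version": "1.0", "source": {"kind": "arxiv", "version": 2}}
    print(canonical_hash(toy))
```

Because the keys are sorted before hashing, two records with the same content but a different key order produce the same digest. To check a real record, pass the parsed `canonical_record` object and the canonical hash shown on this page to `matches`.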
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "4759771176e5a57db34b180d670038182053f6e0e0813f34e57d3db6cda5879b",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2023-07-11T01:55:24Z",
    "title_canon_sha256": "c69253843ad7e306e4dbfca11d0095e5519020508eb9d9fb7ea4417a398f3e42"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2307.04964",
    "kind": "arxiv",
    "version": 2
  }
}