{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:YLVMLQCGH4CXTYJQVOVYBNMYU2","short_pith_number":"pith:YLVMLQCG","schema_version":"1.0","canonical_sha256":"c2eac5c0463f0579e130abab80b598a6b92b3e4612bc811c174a2165e07240e4","source":{"kind":"arxiv","id":"2307.04964","version":2},"attestation_state":"computed","paper":{"title":"Secrets of RLHF in Large Language Models Part I: PPO","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Policy constraints are the key factor for effective PPO implementation in RLHF for large language models.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Binghai Wang, Cheng Chang, Hang Yan, Haoran Huang, Limao Xiong, Lu Chen, Minghao Zhu, Nuo Xu, Qin Liu, Qi Zhang, Rongxiang Weng, Rui Zheng, Senjie Jin, Shihan Dou, Songyang Gao, Tao Gui, Tianxiang Sun, Wei Shen, Wenbin Lai, Wensen Cheng, Xipeng Qiu, Xuanjing Huang, Yan Liu, Yuan Hua, Yuhao Zhou, Zhangyue Yin, Zhiheng Xi","submitted_at":"2023-07-11T01:55:24Z","abstract_excerpt":"Large language models (LLMs) have formulated a blueprint for the advancement of artificial general intelligence. Its primary objective is to function as a human-centric (helpful, honest, and harmless) assistant. Alignment with humans assumes paramount significance, and reinforcement learning with human feedback (RLHF) emerges as the pivotal technological paradigm underpinning this pursuit. Current technical routes usually include \\textbf{reward models} to measure human preferences, \\textbf{Proximal Policy Optimization} (PPO) to optimize policy model outputs, and \\textbf{process supervision} to"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2307.04964","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2023-07-11T01:55:24Z","cross_cats_sorted":["cs.AI","cs.LG"],"title_canon_sha256":"c69253843ad7e306e4dbfca11d0095e5519020508eb9d9fb7ea4417a398f3e42","abstract_canon_sha256":"4759771176e5a57db34b180d670038182053f6e0e0813f34e57d3db6cda5879b"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.977278Z","signature_b64":"RE6AUkvv45GrXXI53azq+xAHyCkrY/52/UYX9w3JfB30s2V8hImnYlhg9ScQSH9pIu/dioNT/GkpA3tw95TFCQ==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"c2eac5c0463f0579e130abab80b598a6b92b3e4612bc811c174a2165e07240e4","last_reissued_at":"2026-05-17T23:38:13.976773Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.976773Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Secrets of RLHF in Large Language Models Part I: PPO","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Policy constraints are the key factor for effective PPO implementation in RLHF for large language models.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Binghai Wang, Cheng Chang, Hang Yan, Haoran Huang, Limao Xiong, Lu Chen, Minghao Zhu, Nuo Xu, Qin Liu, Qi Zhang, Rongxiang Weng, Rui Zheng, Senjie Jin, Shihan Dou, Songyang Gao, Tao Gui, Tianxiang Sun, Wei Shen, Wenbin Lai, Wensen Cheng, Xipeng Qiu, Xuanjing Huang, Yan Liu, Yuan Hua, Yuhao Zhou, Zhangyue Yin, Zhiheng Xi","submitted_at":"2023-07-11T01:55:24Z","abstract_excerpt":"Large language models (LLMs) have formulated a blueprint for the advancement of artificial general intelligence. Its primary objective is to function as a human-centric (helpful, honest, and harmless) assistant. Alignment with humans assumes paramount significance, and reinforcement learning with human feedback (RLHF) emerges as the pivotal technological paradigm underpinning this pursuit. Current technical routes usually include \\textbf{reward models} to measure human preferences, \\textbf{Proximal Policy Optimization} (PPO) to optimize policy model outputs, and \\textbf{process supervision} to"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We identify policy constraints being the key factor for the effective implementation of the PPO algorithm. Therefore, we explore the PPO-max, an advanced version of PPO algorithm, to efficiently improve the training stability of the policy model.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the observed training instability in RLHF stems primarily from policy constraint mechanics in PPO rather than from reward model quality, data selection, or other unexamined components of the full pipeline.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Policy constraints are the critical factor for stable PPO training in RLHF, and the proposed PPO-max variant improves stability for large language model alignment.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Policy constraints are the key factor for effective PPO implementation in RLHF for large language models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"fc08c09b1a2dc6f204bef33dbfb6cc9735cb28f8a47714ad1df63fba32bb48e1"},"source":{"id":"2307.04964","kind":"arxiv","version":2},"verdict":{"id":"ba43b342-14b3-4d38-a4cf-8265131077c1","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T13:11:43.566728Z","strongest_claim":"We identify policy constraints being the key factor for the effective implementation of the PPO algorithm. Therefore, we explore the PPO-max, an advanced version of PPO algorithm, to efficiently improve the training stability of the policy model.","one_line_summary":"Policy constraints are the critical factor for stable PPO training in RLHF, and the proposed PPO-max variant improves stability for large language model alignment.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the observed training instability in RLHF stems primarily from policy constraint mechanics in PPO rather than from reward model quality, data selection, or other unexamined components of the full pipeline.","pith_extraction_headline":"Policy constraints are the key factor for effective PPO implementation in RLHF for large language models."},"references":{"count":60,"sample":[{"doi":"","year":2023,"title":"LLaMA: Open and Efficient Foundation Language Models","work_id":"c018fc23-6f3f-4035-9d02-28a2173b2b9d","ref_index":1,"cited_arxiv_id":"2302.13971","is_internal_anchor":true},{"doi":"","year":2023,"title":"Chiang, W.-L., Z. Li, Z. Lin, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna. lmsys. org (accessed 14 April 2023) , 2023","work_id":"17fc14bb-5dff-4a72-af61-425c91b07479","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Gpt-4 technical report","work_id":"388f534c-855a-4366-b933-f07bf3e2db5f","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"A Survey of Large Language Models","work_id":"de1b42b5-4a0a-4b1f-8c78-1f7fe21be6c9","ref_index":4,"cited_arxiv_id":"2303.18223","is_internal_anchor":true},{"doi":"","year":1901,"title":"Brown, T., B. Mann, N. Ryder, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020","work_id":"9010afde-4504-4219-a609-1afdc37b81a3","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":60,"snapshot_sha256":"b5b055ecd0e7677ffb9ff2da154f836e84baaecdab77551ebf5395099677c30a","internal_anchors":14},"formal_canon":{"evidence_count":1,"snapshot_sha256":"ca570a855e77390e73535f3b9034a90c5e946dd6731ebfdc3aa7af6b218dfab5"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2307.04964","created_at":"2026-05-17T23:38:13.976854+00:00"},{"alias_kind":"arxiv_version","alias_value":"2307.04964v2","created_at":"2026-05-17T23:38:13.976854+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2307.04964","created_at":"2026-05-17T23:38:13.976854+00:00"},{"alias_kind":"pith_short_12","alias_value":"YLVMLQCGH4CX","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"YLVMLQCGH4CXTYJQ","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"YLVMLQCG","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":18,"internal_anchor_count":18,"sample":[{"citing_arxiv_id":"2503.12575","citing_title":"BalancedDPO: Adaptive Multi-Metric Alignment","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2601.04068","citing_title":"Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models","ref_index":81,"is_internal_anchor":true},{"citing_arxiv_id":"2601.04068","citing_title":"Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models","ref_index":81,"is_internal_anchor":true},{"citing_arxiv_id":"2411.10442","citing_title":"Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization","ref_index":118,"is_internal_anchor":true},{"citing_arxiv_id":"2603.00774","citing_title":"Structure Matters: Evaluating Multi-Agents Orchestration in Generative Therapeutic Chatbots","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2603.12631","citing_title":"Joint Optimization of Multi-agent Memory System","ref_index":41,"is_internal_anchor":true},{"citing_arxiv_id":"2403.17297","citing_title":"InternLM2 Technical Report","ref_index":156,"is_internal_anchor":true},{"citing_arxiv_id":"2605.11235","citing_title":"Internalizing Curriculum Judgment for LLM Reinforcement Fine-Tuning","ref_index":18,"is_internal_anchor":true},{"citing_arxiv_id":"2604.28020","citing_title":"Cost-Aware Learning","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08665","citing_title":"Hint Tuning: Less Data Makes Better Reasoners","ref_index":33,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08378","citing_title":"Reinforcement Learning for Scalable and Trustworthy Intelligent Systems","ref_index":97,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00347","citing_title":"Odysseus: Scaling VLMs to 100+ Turn Decision-Making in Games via Reinforcement Learning","ref_index":76,"is_internal_anchor":true},{"citing_arxiv_id":"2604.19485","citing_title":"EVPO: Explained Variance Policy Optimization for Adaptive Critic Utilization in LLM Post-Training","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11446","citing_title":"Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration","ref_index":43,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07522","citing_title":"WeatherSyn: An Instruction Tuning MLLM For Weather Forecasting Report Generation","ref_index":23,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07941","citing_title":"Large Language Model Post-Training: A Unified View of Off-Policy and On-Policy Learning","ref_index":95,"is_internal_anchor":true},{"citing_arxiv_id":"2508.18265","citing_title":"InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency","ref_index":185,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17396","citing_title":"Representation-Guided Parameter-Efficient LLM Unlearning","ref_index":203,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/YLVMLQCGH4CXTYJQVOVYBNMYU2","json":"https://pith.science/pith/YLVMLQCGH4CXTYJQVOVYBNMYU2.json","graph_json":"https://pith.science/api/pith-number/YLVMLQCGH4CXTYJQVOVYBNMYU2/graph.json","events_json":"https://pith.science/api/pith-number/YLVMLQCGH4CXTYJQVOVYBNMYU2/events.json","paper":"https://pith.science/paper/YLVMLQCG"},"agent_actions":{"view_html":"https://pith.science/pith/YLVMLQCGH4CXTYJQVOVYBNMYU2","download_json":"https://pith.science/pith/YLVMLQCGH4CXTYJQVOVYBNMYU2.json","view_paper":"https://pith.science/paper/YLVMLQCG","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2307.04964&json=true","fetch_graph":"https://pith.science/api/pith-number/YLVMLQCGH4CXTYJQVOVYBNMYU2/graph.json","fetch_events":"https://pith.science/api/pith-number/YLVMLQCGH4CXTYJQVOVYBNMYU2/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/YLVMLQCGH4CXTYJQVOVYBNMYU2/action/timestamp_anchor","attest_storage":"https://pith.science/pith/YLVMLQCGH4CXTYJQVOVYBNMYU2/action/storage_attestation","attest_author":"https://pith.science/pith/YLVMLQCGH4CXTYJQVOVYBNMYU2/action/author_attestation","sign_citation":"https://pith.science/pith/YLVMLQCGH4CXTYJQVOVYBNMYU2/action/citation_signature","submit_replication":"https://pith.science/pith/YLVMLQCGH4CXTYJQVOVYBNMYU2/action/replication_record"}},"created_at":"2026-05-17T23:38:13.976854+00:00","updated_at":"2026-05-17T23:38:13.976854+00:00"}