{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:AD4OU3S57S5Y2GK4FCGKHPM5GD","short_pith_number":"pith:AD4OU3S5","schema_version":"1.0","canonical_sha256":"00f8ea6e5dfcbb8d195c288ca3bd9d30ccec365083e1091ebe19ac2b0a61252f","source":{"kind":"arxiv","id":"2309.00267","version":3},"attestation_state":"computed","paper":{"title":"RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Reinforcement learning from AI feedback matches human feedback performance for aligning large language models.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Abhinav Rastogi, Colton Bishop, Ethan Hall, Harrison Lee, Hassan Mansoor, Johan Ferret, Kellie Lu, Samrat Phatale, Sushant Prakash, Thomas Mesnard, Victor Carbune","submitted_at":"2023-09-01T05:53:33Z","abstract_excerpt":"Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards \"self-improvement\" by demonstrating that "},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2309.00267","kind":"arxiv","version":3},"metadata":{"license":"http://creativecommons.org/licenses/by/4.0/","primary_cat":"cs.CL","submitted_at":"2023-09-01T05:53:33Z","cross_cats_sorted":["cs.AI","cs.LG"],"title_canon_sha256":"30a71bef573f4df6e33a747f2c8824790bde2c0d53c2ef6779f1df973bc3eb36","abstract_canon_sha256":"e06fbc5dabdd57615fbf708dac24235e084d3e237bd3abd328f2bc19edfd90ee"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:50.098761Z","signature_b64":"gdaJTPx4/wzV4nWtoCfQEipK/U0PyCM5yZ9HCCKZEjejR3+EW750giZDiWOjTDtKsFZ322HxChSXhVIof+9zAw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"00f8ea6e5dfcbb8d195c288ca3bd9d30ccec365083e1091ebe19ac2b0a61252f","last_reissued_at":"2026-05-17T23:38:50.098142Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:50.098142Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Reinforcement learning from AI feedback matches human feedback performance for aligning large language models.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Abhinav Rastogi, Colton Bishop, Ethan Hall, Harrison Lee, Hassan Mansoor, Johan Ferret, Kellie Lu, Samrat Phatale, Sushant Prakash, Thomas Mesnard, Victor Carbune","submitted_at":"2023-09-01T05:53:33Z","abstract_excerpt":"Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but gathering high-quality preference labels is expensive. RL from AI Feedback (RLAIF), introduced in Bai et al., offers a promising alternative that trains the reward model (RM) on preferences generated by an off-the-shelf LLM. Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. Furthermore, we take a step towards \"self-improvement\" by demonstrating that "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. ... we introduce direct-RLAIF (d-RLAIF) ... which achieves superior performance to canonical RLAIF.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the preferences generated by an off-the-shelf LLM are high-quality enough to serve as a substitute for human preferences in training the reward model.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Reinforcement learning from AI feedback matches human feedback performance for aligning large language models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"4ddbd0027246b56fe7179d811e1ce883159e5134259fd24be22f93395f94c2b3"},"source":{"id":"2309.00267","kind":"arxiv","version":3},"verdict":{"id":"a3f1a224-083d-4a40-8281-981bb014d035","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T21:29:18.481786Z","strongest_claim":"Across the tasks of summarization, helpful dialogue generation, and harmless dialogue generation, we show that RLAIF achieves comparable performance to RLHF. ... we introduce direct-RLAIF (d-RLAIF) ... which achieves superior performance to canonical RLAIF.","one_line_summary":"RLAIF matches RLHF on summarization and dialogue tasks, with a direct-RLAIF variant achieving superior results by using LLM rewards directly during training.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the preferences generated by an off-the-shelf LLM are high-quality enough to serve as a substitute for human preferences in training the reward model.","pith_extraction_headline":"Reinforcement learning from AI feedback matches human feedback performance for aligning large language models."},"references":{"count":98,"sample":[{"doi":"","year":2022,"title":"E., Fort, S., Lanham, T., Telleen-Lawton, T., Conerly, T., Henighan, T., Hume, T., Bowman, S","work_id":"686a699b-d7b5-4f10-ba6c-e3df30418d80","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1901,"title":"D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al","work_id":"fb006848-e99b-481f-9340-c35dbff67d47","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2017,"title":"F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D","work_id":"463eef82-ca77-4ce1-8588-f7c3f7abe5a4","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"RAFT : Reward ranked finetuning for generative foundation model alignment","work_id":"aa80a8cf-c1b9-4e19-bb61-6c9aea78c183","ref_index":8,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2022,"title":"Understanding dataset difficulty with V -usable information","work_id":"8b66d63b-71d0-445b-8db1-ceaecd3c2e1f","ref_index":9,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":98,"snapshot_sha256":"280efcd0accaebc769d975a017e5a7572be47b99c701fdd06e17e462155ec766","internal_anchors":15},"formal_canon":{"evidence_count":2,"snapshot_sha256":"4ea79e99becc6383a8dc457d14fef873224bf8b76f834a6ee459b0fefd95c6cb"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2309.00267","created_at":"2026-05-17T23:38:50.098239+00:00"},{"alias_kind":"arxiv_version","alias_value":"2309.00267v3","created_at":"2026-05-17T23:38:50.098239+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2309.00267","created_at":"2026-05-17T23:38:50.098239+00:00"},{"alias_kind":"pith_short_12","alias_value":"AD4OU3S57S5Y","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"AD4OU3S57S5Y2GK4","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"AD4OU3S5","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":35,"internal_anchor_count":35,"sample":[{"citing_arxiv_id":"2605.23244","citing_title":"Convex Optimization for Alignment and Preference Learning on a Single GPU","ref_index":39,"is_internal_anchor":true},{"citing_arxiv_id":"2310.14768","citing_title":"Policy Gradient with Kernel Quadrature","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2408.15549","citing_title":"WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2504.13898","citing_title":"Social Human Robot Embodied Conversation (SHREC) Dataset: Benchmarking Foundational Models' Social Reasoning","ref_index":34,"is_internal_anchor":true},{"citing_arxiv_id":"2404.13076","citing_title":"LLM Evaluators Recognize and Favor Their Own Generations","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18851","citing_title":"STRIDE: Learnable Stepwise Language Feedback for LLM Reasoning","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18309","citing_title":"Alignment Dynamics in LLM Fine-Tuning","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.20164","citing_title":"Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15207","citing_title":"TeamTR: Trust-Region Fine-Tuning for Multi-Agent LLM Coordination","ref_index":56,"is_internal_anchor":true},{"citing_arxiv_id":"2605.15300","citing_title":"Deep Pre-Alignment for VLMs","ref_index":73,"is_internal_anchor":true},{"citing_arxiv_id":"2505.23912","citing_title":"LoVeC: Reinforcement Learning for Better Verbalized Confidence in Long-Form Generations","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2506.21834","citing_title":"PrefPaint: Enhancing Medical Image Inpainting through Expert Human Feedback","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2401.05561","citing_title":"TrustLLM: Trustworthiness in Large Language Models","ref_index":51,"is_internal_anchor":true},{"citing_arxiv_id":"2402.13116","citing_title":"A Survey on Knowledge Distillation of Large Language Models","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2402.11411","citing_title":"Aligning Modalities in Vision Large Language Models via Preference Fine-tuning","ref_index":155,"is_internal_anchor":true},{"citing_arxiv_id":"2511.17879","citing_title":"Generative Adversarial Post-Training Mitigates Reward Hacking in Live Human-AI Music Interaction","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2403.07691","citing_title":"ORPO: Monolithic Preference Optimization without Reference Model","ref_index":29,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13875","citing_title":"Common-agency Games for Multi-Objective Test-Time Alignment","ref_index":89,"is_internal_anchor":true},{"citing_arxiv_id":"2401.01335","citing_title":"Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2406.00515","citing_title":"A Survey on Large Language Models for Code Generation","ref_index":144,"is_internal_anchor":true},{"citing_arxiv_id":"2401.10020","citing_title":"Self-Rewarding Language Models","ref_index":103,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09533","citing_title":"Assessment of RAG and Fine-Tuning for Industrial Question-Answering-Applications","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06582","citing_title":"PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2402.06196","citing_title":"Large Language Models: A Survey","ref_index":138,"is_internal_anchor":true},{"citing_arxiv_id":"2604.21811","citing_title":"Probably Approximately Consensus: On the Learning Theory of Finding Common Ground","ref_index":7,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/AD4OU3S57S5Y2GK4FCGKHPM5GD","json":"https://pith.science/pith/AD4OU3S57S5Y2GK4FCGKHPM5GD.json","graph_json":"https://pith.science/api/pith-number/AD4OU3S57S5Y2GK4FCGKHPM5GD/graph.json","events_json":"https://pith.science/api/pith-number/AD4OU3S57S5Y2GK4FCGKHPM5GD/events.json","paper":"https://pith.science/paper/AD4OU3S5"},"agent_actions":{"view_html":"https://pith.science/pith/AD4OU3S57S5Y2GK4FCGKHPM5GD","download_json":"https://pith.science/pith/AD4OU3S57S5Y2GK4FCGKHPM5GD.json","view_paper":"https://pith.science/paper/AD4OU3S5","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2309.00267&json=true","fetch_graph":"https://pith.science/api/pith-number/AD4OU3S57S5Y2GK4FCGKHPM5GD/graph.json","fetch_events":"https://pith.science/api/pith-number/AD4OU3S57S5Y2GK4FCGKHPM5GD/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/AD4OU3S57S5Y2GK4FCGKHPM5GD/action/timestamp_anchor","attest_storage":"https://pith.science/pith/AD4OU3S57S5Y2GK4FCGKHPM5GD/action/storage_attestation","attest_author":"https://pith.science/pith/AD4OU3S57S5Y2GK4FCGKHPM5GD/action/author_attestation","sign_citation":"https://pith.science/pith/AD4OU3S57S5Y2GK4FCGKHPM5GD/action/citation_signature","submit_replication":"https://pith.science/pith/AD4OU3S57S5Y2GK4FCGKHPM5GD/action/replication_record"}},"created_at":"2026-05-17T23:38:50.098239+00:00","updated_at":"2026-05-17T23:38:50.098239+00:00"}