{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2023:3T7I6SRLYYQSMS2HDBIITTL7EY","short_pith_number":"pith:3T7I6SRL","schema_version":"1.0","canonical_sha256":"dcfe8f4a2bc621264b47185089cd7f26248b30c7e9609908cb419894e397dff4","source":{"kind":"arxiv","id":"2310.01377","version":2},"attestation_state":"computed","paper":{"title":"UltraFeedback: Boosting Language Models with Scaled AI Feedback","license":"http://creativecommons.org/licenses/by-sa/4.0/","headline":"A dataset of over one million GPT-4 feedbacks enables effective alignment of LLaMA-based chat models.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Bingxiang He, Ganqu Cui, Guanming Yao, Guotong Xie, Lifan Yuan, Maosong Sun, Ning Ding, Ruobing Xie, Wei Zhu, Yankai Lin, Yuan Ni, Zhiyuan Liu","submitted_at":"2023-10-02T17:40:01Z","abstract_excerpt":"Learning from human feedback has become a pivot technique in aligning large language models (LLMs) with human preferences. However, acquiring vast and premium human feedback is bottlenecked by time, labor, and human capability, resulting in small sizes or limited topics of current datasets. This further hinders feedback learning as well as alignment research within the open-source community. To address this issue, we explore how to go beyond human feedback and collect high-quality \\textit{AI feedback} automatically for a scalable alternative. Specifically, we identify \\textbf{scale and diversi"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2310.01377","kind":"arxiv","version":2},"metadata":{"license":"http://creativecommons.org/licenses/by-sa/4.0/","primary_cat":"cs.CL","submitted_at":"2023-10-02T17:40:01Z","cross_cats_sorted":["cs.AI","cs.LG"],"title_canon_sha256":"b7b8be285286f3dd7d47544a7033add9fc57876b36c4cf43b92d8ac8f1cd2f66","abstract_canon_sha256":"1d36aa47da97202909f564bcf2fd99c5f68f7c70f1de301a52bfcd55c832cdff"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.587095Z","signature_b64":"049o1knKGsFqThP+wGkEo0Fq76L3edqexuZ5UYUOAEePmNJ0iEctiq3GzGa4fWqvQO+MvR5HFDw/X1ALYa6NDA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"dcfe8f4a2bc621264b47185089cd7f26248b30c7e9609908cb419894e397dff4","last_reissued_at":"2026-05-17T23:38:13.586464Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.586464Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"UltraFeedback: Boosting Language Models with Scaled AI Feedback","license":"http://creativecommons.org/licenses/by-sa/4.0/","headline":"A dataset of over one million GPT-4 feedbacks enables effective alignment of LLaMA-based chat models.","cross_cats":["cs.AI","cs.LG"],"primary_cat":"cs.CL","authors_text":"Bingxiang He, Ganqu Cui, Guanming Yao, Guotong Xie, Lifan Yuan, Maosong Sun, Ning Ding, Ruobing Xie, Wei Zhu, Yankai Lin, Yuan Ni, Zhiyuan Liu","submitted_at":"2023-10-02T17:40:01Z","abstract_excerpt":"Learning from human feedback has become a pivot technique in aligning large language models (LLMs) with human preferences. However, acquiring vast and premium human feedback is bottlenecked by time, labor, and human capability, resulting in small sizes or limited topics of current datasets. This further hinders feedback learning as well as alignment research within the open-source community. To address this issue, we explore how to go beyond human feedback and collect high-quality \\textit{AI feedback} automatically for a scalable alternative. Specifically, we identify \\textbf{scale and diversi"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Built upon UltraFeedback, we align a LLaMA-based model by best-of-n sampling and reinforcement learning, demonstrating its exceptional performance on chat benchmarks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the series of techniques applied to mitigate annotation biases in GPT-4 feedback produces sufficiently reliable and unbiased signals for effective model alignment.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"UltraFeedback is a large-scale AI feedback dataset that enables effective alignment of open-source language models, yielding strong results on chat benchmarks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A dataset of over one million GPT-4 feedbacks enables effective alignment of LLaMA-based chat models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"651a10cc1350ba03067efd47bcb55e9e95127e7d55fc592ec68511628725edf1"},"source":{"id":"2310.01377","kind":"arxiv","version":2},"verdict":{"id":"252a8e97-63b0-4ad7-9667-1cd978ace386","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T16:26:16.522516Z","strongest_claim":"Built upon UltraFeedback, we align a LLaMA-based model by best-of-n sampling and reinforcement learning, demonstrating its exceptional performance on chat benchmarks.","one_line_summary":"UltraFeedback is a large-scale AI feedback dataset that enables effective alignment of open-source language models, yielding strong results on chat benchmarks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the series of techniques applied to mitigate annotation biases in GPT-4 feedback produces sufficiently reliable and unbiased signals for effective model alignment.","pith_extraction_headline":"A dataset of over one million GPT-4 feedbacks enables effective alignment of LLaMA-based chat models."},"references":{"count":14,"sample":[{"doi":"10.5281/zenodo.5371628","year":2021,"title":"Evaluating Large Language Models Trained on Code","work_id":"042493e9-b26f-4b4e-bbde-382072ca9b08","ref_index":1,"cited_arxiv_id":"2107.03374","is_internal_anchor":true},{"doi":"10.18653/v1/","year":2023,"title":"doi: 10.18653/v1/ 2024.findings-acl.586","work_id":"8d675bdd-79ca-48d6-9163-fc17ce0e8ece","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.48550/arxiv","year":2022,"title":"Self-critiquing models for assisting human evaluators","work_id":"3fcefdd1-22ab-4648-a683-cb1555e7a50e","ref_index":3,"cited_arxiv_id":"2206.05802","is_internal_anchor":true},{"doi":"","year":null,"title":"This may be particularly helpful if you have a busy schedule and may not have time to take them later in the day","work_id":"3761cc93-810c-498a-b8f7-6fbb54a50451","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Taking a vitamin D supplement after spending time outdoors can help boost your levels and ensure you’re getting enough","work_id":"dad4fd18-cbc0-46a0-866d-afcab590a1a9","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":14,"snapshot_sha256":"576c9171a7604250df5469674777be4ed6c66a9eead0820ce47ec4fd283f263d","internal_anchors":2},"formal_canon":{"evidence_count":2,"snapshot_sha256":"d3e53c5bd066182d0d71c7229b7c10558aaf01828949e86336fb1134216b3905"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2310.01377","created_at":"2026-05-17T23:38:13.586563+00:00"},{"alias_kind":"arxiv_version","alias_value":"2310.01377v2","created_at":"2026-05-17T23:38:13.586563+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2310.01377","created_at":"2026-05-17T23:38:13.586563+00:00"},{"alias_kind":"pith_short_12","alias_value":"3T7I6SRLYYQS","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"3T7I6SRLYYQSMS2H","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"3T7I6SRL","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":22,"internal_anchor_count":22,"sample":[{"citing_arxiv_id":"2509.20265","citing_title":"Failure Modes of Maximum Entropy RLHF","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2509.23102","citing_title":"Multiplayer Nash Preference Optimization","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2510.04595","citing_title":"SpikingMamba: Towards Energy-Efficient Large Language Models via Knowledge Distillation from Mamba","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2402.13116","citing_title":"A Survey on Knowledge Distillation of Large Language Models","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2410.18451","citing_title":"Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2310.16944","citing_title":"Zephyr: Direct Distillation of LM Alignment","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2406.08464","citing_title":"Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing","ref_index":100,"is_internal_anchor":true},{"citing_arxiv_id":"2602.08813","citing_title":"Robust Policy Optimization to Prevent Catastrophic Forgetting","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2603.18113","citing_title":"VC-Soup: Value-Consistency Guided Multi-Value Alignment for Large Language Models","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2604.04120","citing_title":"Shorter, but Still Trustworthy? An Empirical Study of Chain-of-Thought Compression","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.12000","citing_title":"Split the Differences, Pool the Rest: Provably Efficient Multi-Objective Imitation","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2604.27733","citing_title":"Mind the Gap: Structure-Aware Consistency in Preference Learning","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2605.09227","citing_title":"Two Ways to De-Bias an LLM-as-a-Judge: A Continuous-Score Comparison of Hierarchical Bayesian Calibration and Neural-ODE Score Transport","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.10716","citing_title":"What should post-training optimize? A test-time scaling law perspective","ref_index":7,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00195","citing_title":"Diversity in Large Language Models under Supervised Fine-Tuning","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23543","citing_title":"Pref-CTRL: Preference Driven LLM Alignment using Representation Editing","ref_index":5,"is_internal_anchor":true},{"citing_arxiv_id":"2605.06036","citing_title":"Optimal Transport for LLM Reward Modeling from Noisy Preference","ref_index":244,"is_internal_anchor":true},{"citing_arxiv_id":"2412.05271","citing_title":"Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling","ref_index":48,"is_internal_anchor":true},{"citing_arxiv_id":"2604.15602","citing_title":"GroupDPO: Memory efficient Group-wise Direct Preference Optimization","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.20685","citing_title":"MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment","ref_index":63,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00195","citing_title":"Diversity in Large Language Models under Supervised Fine-Tuning","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02626","citing_title":"Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models","ref_index":5,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/3T7I6SRLYYQSMS2HDBIITTL7EY","json":"https://pith.science/pith/3T7I6SRLYYQSMS2HDBIITTL7EY.json","graph_json":"https://pith.science/api/pith-number/3T7I6SRLYYQSMS2HDBIITTL7EY/graph.json","events_json":"https://pith.science/api/pith-number/3T7I6SRLYYQSMS2HDBIITTL7EY/events.json","paper":"https://pith.science/paper/3T7I6SRL"},"agent_actions":{"view_html":"https://pith.science/pith/3T7I6SRLYYQSMS2HDBIITTL7EY","download_json":"https://pith.science/pith/3T7I6SRLYYQSMS2HDBIITTL7EY.json","view_paper":"https://pith.science/paper/3T7I6SRL","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2310.01377&json=true","fetch_graph":"https://pith.science/api/pith-number/3T7I6SRLYYQSMS2HDBIITTL7EY/graph.json","fetch_events":"https://pith.science/api/pith-number/3T7I6SRLYYQSMS2HDBIITTL7EY/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/3T7I6SRLYYQSMS2HDBIITTL7EY/action/timestamp_anchor","attest_storage":"https://pith.science/pith/3T7I6SRLYYQSMS2HDBIITTL7EY/action/storage_attestation","attest_author":"https://pith.science/pith/3T7I6SRLYYQSMS2HDBIITTL7EY/action/author_attestation","sign_citation":"https://pith.science/pith/3T7I6SRLYYQSMS2HDBIITTL7EY/action/citation_signature","submit_replication":"https://pith.science/pith/3T7I6SRLYYQSMS2HDBIITTL7EY/action/replication_record"}},"created_at":"2026-05-17T23:38:13.586563+00:00","updated_at":"2026-05-17T23:38:13.586563+00:00"}