UltraFeedback: Boosting Language Models with Scaled AI Feedback
Pith reviewed 2026-05-17 16:26 UTC · model grok-4.3
The pith
A dataset of over one million GPT-4 feedback annotations enables effective alignment of LLaMA-based chat models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UltraFeedback is a large-scale, high-quality, and diversified AI feedback dataset containing over 1 million GPT-4 feedback annotations for 250k user-assistant conversations. When used to align a LLaMA-based model via best-of-n sampling and reinforcement learning, it yields what the authors describe as exceptional performance on chat benchmarks, which they take as validation of scaled AI feedback as an effective foundation for open-source alignment.
What carries the argument
The UltraFeedback dataset, built by broadening instructions and responses then applying bias-mitigation techniques to GPT-4 annotations, which supplies the training signal for best-of-n sampling and reinforcement learning.
If this is right
- Open-source chat models can reach strong benchmark performance using only AI feedback instead of human feedback.
- Best-of-n sampling combined with reinforcement learning on the feedback data improves alignment quality.
- The dataset and approach serve as a foundation for further feedback-learning research.
- Scaling both the amount and diversity of feedback data is what drives the alignment gains.
Where Pith is reading between the lines
- The same scaling approach could be tested on alignment tasks beyond chat, such as instruction following or safety.
- Hybrid pipelines that mix UltraFeedback with limited human data might close remaining gaps with proprietary models.
- The bias-mitigation steps could be reused or refined when other large models serve as feedback providers.
Load-bearing premise
The series of techniques applied to mitigate annotation biases in GPT-4 feedback produces sufficiently reliable and unbiased signals for effective model alignment.
What would settle it
If models trained on UltraFeedback show no measurable gain over baselines trained on smaller human-feedback datasets across multiple chat benchmarks, the claim that scaled AI feedback is effective would be falsified.
Original abstract
Learning from human feedback has become a pivot technique in aligning large language models (LLMs) with human preferences. However, acquiring vast and premium human feedback is bottlenecked by time, labor, and human capability, resulting in small sizes or limited topics of current datasets. This further hinders feedback learning as well as alignment research within the open-source community. To address this issue, we explore how to go beyond human feedback and collect high-quality AI feedback automatically for a scalable alternative. Specifically, we identify scale and diversity as the key factors for feedback data to take effect. Accordingly, we first broaden instructions and responses in both amount and breadth to encompass a wider range of user-assistant interactions. Then, we meticulously apply a series of techniques to mitigate annotation biases for more reliable AI feedback. We finally present UltraFeedback, a large-scale, high-quality, and diversified AI feedback dataset, which contains over 1 million GPT-4 feedback for 250k user-assistant conversations from various aspects. Built upon UltraFeedback, we align a LLaMA-based model by best-of-n sampling and reinforcement learning, demonstrating its exceptional performance on chat benchmarks. Our work validates the effectiveness of scaled AI feedback data in constructing strong open-source chat language models, serving as a solid foundation for future feedback learning research. Our data and models are available at https://github.com/thunlp/UltraFeedback.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces UltraFeedback, a large-scale dataset containing over 1 million GPT-4 feedback annotations across 250k diverse user-assistant conversations. The authors broaden the scope of instructions and responses and apply a series of techniques to mitigate annotation biases in the GPT-4 signals. They then align a LLaMA-based model using best-of-n sampling and reinforcement learning on this dataset, reporting strong results on chat benchmarks and positioning the work as a scalable open-source alternative to human feedback for alignment research.
Significance. If the central empirical claims hold, the work supplies a publicly released, high-volume AI feedback resource that could meaningfully accelerate open-source LLM alignment experiments. The explicit focus on scale, diversity, and bias mitigation, together with the release of both data and models, constitutes a concrete contribution to the feedback-learning literature.
major comments (2)
- [Dataset construction and bias-mitigation subsection] The manuscript describes a series of techniques to reduce GPT-4 annotation biases but provides no controlled comparison (e.g., agreement rates or win rates) of the resulting preference signals against human labels on an overlapping instruction set. Without such validation, it remains unclear whether residual GPT-4 biases (verbosity, sycophancy) are sufficiently suppressed for the subsequent RL stage to be reliable.
- [Alignment experiments section] The headline claim of 'exceptional performance' on chat benchmarks is presented without reported standard deviations across multiple runs, without explicit baseline numbers for models trained on comparable human-feedback datasets, and without ablation results isolating the contribution of the bias-mitigation steps. These omissions make it difficult to determine whether the observed gains are statistically robust and attributable to UltraFeedback quality.
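The kind of controlled comparison the first major comment asks for can be sketched in a few lines. The following is a minimal illustration, not the authors' evaluation code: the function names, the "A"/"B" label convention, and the character-length proxy for verbosity are all assumptions introduced here.

```python
from typing import Sequence, Tuple


def agreement_rate(ai_prefs: Sequence[str], human_prefs: Sequence[str]) -> float:
    """Fraction of pairs on which the AI judge and the human pick the same response.

    Labels are assumed to be "A" or "B" (a hypothetical convention).
    """
    if len(ai_prefs) != len(human_prefs):
        raise ValueError("label lists must have the same length")
    if not ai_prefs:
        raise ValueError("need at least one labelled pair")
    matches = sum(a == h for a, h in zip(ai_prefs, human_prefs))
    return matches / len(ai_prefs)


def longer_preferred_rate(pairs: Sequence[Tuple[str, str]],
                          prefs: Sequence[str]) -> float:
    """Fraction of pairs in which the preferred response is strictly longer.

    Character length is a crude, assumed proxy for verbosity; a rate well
    above the human annotators' rate on the same pairs would hint at a
    residual verbosity bias in the judge.
    """
    if len(pairs) != len(prefs):
        raise ValueError("pairs and prefs must have the same length")
    if not pairs:
        raise ValueError("need at least one labelled pair")
    longer = 0
    for (resp_a, resp_b), pref in zip(pairs, prefs):
        chosen, rejected = (resp_a, resp_b) if pref == "A" else (resp_b, resp_a)
        if len(chosen) > len(rejected):
            longer += 1
    return longer / len(pairs)
```

Running both functions on the same overlapping instruction set, once with GPT-4 labels and once with human labels, would give the agreement figure and a first check on verbosity that the comment finds missing.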
minor comments (2)
- [Abstract] Abstract: the phrase 'exceptional performance' is used without any numeric benchmark scores or direct comparisons, reducing immediate readability.
- [Alignment experiments section] Notation: the description of best-of-n sampling and the RL objective would benefit from an explicit equation or pseudocode block to clarify the exact training procedure.
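For readers unfamiliar with the procedure named in the last minor comment, best-of-n sampling in its generic form can be sketched as below. This is a textbook-style illustration under stated assumptions, not the paper's training or inference code: `generate` and `reward` are hypothetical callables standing in for a policy model and a reward model, and the default `n` is arbitrary.

```python
from typing import Callable, Tuple


def best_of_n(prompt: str,
              generate: Callable[[str], str],
              reward: Callable[[str, str], float],
              n: int = 16) -> Tuple[str, float]:
    """Sample n candidate responses and return the highest-scoring one.

    generate(prompt) -> one sampled response (hypothetical policy model).
    reward(prompt, response) -> scalar score (hypothetical reward model).
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Draw n independent samples from the policy.
    candidates = [generate(prompt) for _ in range(n)]
    # Score every candidate with the reward model.
    scored = [(reward(prompt, c), c) for c in candidates]
    # Keep the candidate with the highest reward (first one wins on ties).
    best_score, best_response = max(scored, key=lambda pair: pair[0])
    return best_response, best_score
```

The policy itself is left unchanged here; only the selection step uses the reward signal, which is why the quality of the preference data behind the reward model matters for the result.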
Simulated Author's Rebuttal
We are grateful to the referee for the positive assessment of the work's significance and for the constructive major comments. We respond to each point below, acknowledging where the current manuscript is limited and describing the revisions we will make.
Point-by-point responses
- Referee: [Dataset construction and bias-mitigation subsection] The manuscript describes a series of techniques to reduce GPT-4 annotation biases but provides no controlled comparison (e.g., agreement rates or win rates) of the resulting preference signals against human labels on an overlapping instruction set. Without such validation, it remains unclear whether residual GPT-4 biases (verbosity, sycophancy) are sufficiently suppressed for the subsequent RL stage to be reliable.
  Authors: We agree that a direct controlled comparison against human labels on an overlapping set would provide valuable additional validation. The manuscript does not contain such a comparison, as collecting human annotations at the scale of 250k conversations was not feasible and is precisely the bottleneck our work seeks to address. We will revise the bias-mitigation subsection to explicitly acknowledge this limitation, discuss the known properties of GPT-4 as a judge (including residual risks of verbosity and sycophancy), and cite relevant studies on LLM-judge reliability. We will also note that downstream benchmark gains serve as an indirect indicator of signal quality.
  Revision: yes
- Referee: [Alignment experiments section] The headline claim of 'exceptional performance' on chat benchmarks is presented without reported standard deviations across multiple runs, without explicit baseline numbers for models trained on comparable human-feedback datasets, and without ablation results isolating the contribution of the bias-mitigation steps. These omissions make it difficult to determine whether the observed gains are statistically robust and attributable to UltraFeedback quality.
  Authors: We acknowledge that the experimental section would be strengthened by these elements. The current manuscript reports single-run results for the primary models and does not include explicit human-feedback baselines or full ablations on bias mitigation. We will revise the alignment experiments section to report standard deviations from any available multi-seed runs, add direct comparisons against models trained on established human-feedback datasets (e.g., HH-RLHF), and include targeted ablations isolating the bias-mitigation techniques. Due to computational constraints, the scope of new experiments will be limited to feasible re-runs and smaller-scale ablations.
  Revision: partial
Circularity Check
No circularity: empirical dataset construction and external benchmark validation
Full rationale
The paper presents an empirical pipeline: broadening instructions/responses, applying bias-mitigation techniques to GPT-4 annotations, releasing the resulting UltraFeedback dataset of 1M+ feedback annotations, and then performing best-of-n sampling plus RL alignment on a LLaMA model whose chat-benchmark scores are reported as external evidence. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the derivation chain; the performance claims rest on measured outcomes against independent benchmarks rather than reducing to the input data or prior author results by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: GPT-4 can generate reliable preference feedback when annotation biases are mitigated by the described techniques
Lean theorems connected to this paper
- Foundation.HierarchyEmergence / hierarchy_emergence_forces_phi
  Tag: unclear (relation between the paper passage and the cited Recognition theorem).
  Passage: "Built upon UltraFeedback, we align a LLaMA-based model by best-of-n sampling and reinforcement learning, demonstrating its exceptional performance on chat benchmarks"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
- Split the Differences, Pool the Rest: Provably Efficient Multi-Objective Imitation
  MA-BC partitions divergent expert data while pooling non-conflicting pairs in MOMDPs, converging faster to Pareto-optimal policies than independent learners and matching a new minimax lower bound.
- Mind the Gap: Structure-Aware Consistency in Preference Learning
  Standard DPO surrogates are inconsistent for equicontinuous neural nets; SA-DPO provides structure-aware H-consistency bounds by adapting margins to semantic distance and shows heavy-tailed losses yield superior guara...
- Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
  Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, Ar...
- What should post-training optimize? A test-time scaling law perspective
  Tail-extrapolated estimators approximate best-of-N policy gradients from limited training rollouts by leveraging upper-tail reward statistics under structural assumptions.
- Optimal Transport for LLM Reward Modeling from Noisy Preference
  SelectiveRM applies optimal transport with a joint consistency discrepancy and partial mass relaxation to produce reward models that optimize a tighter upper bound on clean risk while autonomously dropping noisy prefe...
- Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models
  Gate-DPO attenuates gradients on low-probability rejected responses to reduce probability collapse and improve chosen-response likelihood during preference optimization.
- Diversity in Large Language Models under Supervised Fine-Tuning
  TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
- Pref-CTRL: Preference Driven LLM Alignment using Representation Editing
  Pref-CTRL trains a multi-objective value function on preferences to guide representation editing for LLM alignment, outperforming RE-Control on benchmarks with better out-of-domain generalization.
- MGDA-Decoupled: Geometry-Aware Multi-Objective Optimisation for DPO-based LLM Alignment
  MGDA-Decoupled applies geometry-based multi-objective optimization within the DPO framework to find shared descent directions that account for each objective's convergence dynamics, yielding higher win rates on UltraFeedback.
- GroupDPO: Memory efficient Group-wise Direct Preference Optimization
  GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.
- Shorter, but Still Trustworthy? An Empirical Study of Chain-of-Thought Compression
  CoT compression frequently introduces trustworthiness regressions with method-specific degradation profiles; a proposed normalized efficiency score and alignment-aware DPO variant reduce length by 19.3% with smaller t...
- VC-Soup: Value-Consistency Guided Multi-Value Alignment for Large Language Models
  VC-Soup uses a cosine-similarity consistency metric to filter data, trains value-consistent policies, and applies linear merging with Pareto filtering to improve multi-value LLM alignment trade-offs.
- Robust Policy Optimization to Prevent Catastrophic Forgetting
  FRPO applies a max-min robust optimization over KL-bounded policy neighborhoods during RLHF to reduce catastrophic forgetting of safety and accuracy under subsequent SFT or RL fine-tuning.
- Multiplayer Nash Preference Optimization
  MNPO extends NLHF to multiplayer Nash games, inheriting equilibrium guarantees while showing empirical gains on instruction-following benchmarks under diverse preferences.
- Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
  InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
- Zephyr: Direct Distillation of LM Alignment
  Zephyr-7B achieves state-of-the-art chat benchmark results among 7B models by distilling alignment via dDPO on AI feedback preferences, surpassing the 70B Llama-2-Chat model on MT-Bench with no human data required.
- Two Ways to De-Bias an LLM-as-a-Judge: A Continuous-Score Comparison of Hierarchical Bayesian Calibration and Neural-ODE Score Transport
  With 100 anchors the Bayesian linear corrector matches or beats the Neural-ODE flow on distribution recovery while both fix mean offset; with 1500 anchors the flow wins on MAE, Pearson correlation, and KL divergence.
- Diversity in Large Language Models under Supervised Fine-Tuning
  Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.
- SpikingMamba: Towards Energy-Efficient Large Language Models via Knowledge Distillation from Mamba
  SpikingMamba distills Mamba into an SNN LLM achieving 4.76x energy savings with a 4.78% zero-shot accuracy gap that narrows to 2.23% after RL.
- Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs
  Data-centric filtering yields an 80K preference dataset and reward models that lead RewardBench while boosting other top entries.
- A Survey on Knowledge Distillation of Large Language Models
  A comprehensive survey of knowledge distillation for LLMs structured around algorithms, skill enhancement, and vertical applications, highlighting data augmentation as a key enabler.