Recognition: 2 theorem links · Lean Theorem
Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
Pith reviewed 2026-05-16 22:13 UTC · model grok-4.3
The pith
Human-AI synergy curates 40 million preference pairs to train state-of-the-art reward models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Skywork-Reward-V2 suite consists of eight reward models ranging from 0.6B to 8B parameters, trained on 26 million carefully curated preference pairs drawn from the SynPref-40M dataset. This dataset is assembled through a two-stage human-AI pipeline in which humans supply verified annotations and large language models perform scalable curation guided by those annotations. The resulting models set new performance standards on seven major reward model benchmarks, outperform generative reward models, and exhibit robust behavior in downstream applications such as preference alignment, objective correctness, safety evaluation, and best-of-N scaling.
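The suite's discriminative reward models are trained on (chosen, rejected) preference pairs. The paper's exact loss and architecture are not given in this summary, so the following is a minimal sketch of the standard Bradley-Terry pairwise objective that such training typically uses, with dummy tensors standing in for reward-model outputs.

```python
# Minimal sketch of the standard Bradley-Terry pairwise loss used to train
# discriminative reward models on (chosen, rejected) preference pairs.
# The tensors below are placeholders; the paper's exact objective and
# architecture are not specified in this summary.
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood that the chosen response outranks the rejected one."""
    # P(chosen > rejected) = sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Example with dummy scalar rewards emitted by a reward-model head:
r_chosen = torch.tensor([1.2, 0.4, 2.0])
r_rejected = torch.tensor([0.3, 0.9, -0.5])
loss = bradley_terry_loss(r_chosen, r_rejected)
print(float(loss))  # lower when chosen rewards exceed rejected rewards
```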
What carries the argument
The human-AI synergistic two-stage pipeline for preference data curation, where human verification directs AI-driven scaling to produce the SynPref-40M dataset.
If this is right
- Improved reward models enhance the effectiveness of RLHF for better human alignment.
- Models show greater resistance to stylistic biases and stronger safety performance.
- Performance holds across a range of model sizes from small to 8B parameters.
- Curation quality proves more critical than dataset size alone, per ablation results.
- Strong results in best-of-N scaling suggest better utility in generation tasks (a minimal best-of-N sketch follows this list).
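Best-of-N scaling samples N candidate responses and keeps the one the reward model scores highest. The sketch below illustrates only that selection step; the scoring function is a hypothetical stand-in, not Skywork-Reward-V2's interface.

```python
# Minimal sketch of best-of-N (BoN) selection with a reward model: sample N
# candidate responses, score each, and keep the highest-scoring one. The
# `score` callable here is a hypothetical stand-in for a real reward model.
from typing import Callable, List

def best_of_n(prompt: str,
              candidates: List[str],
              score: Callable[[str, str], float]) -> str:
    """Return the candidate with the highest reward score for the prompt."""
    return max(candidates, key=lambda response: score(prompt, response))

# Toy usage with a trivial length-based "reward" in place of a real reward model.
picked = best_of_n("Explain RLHF briefly.",
                   ["RLHF is...", "Reinforcement Learning from Human Feedback is..."],
                   score=lambda p, r: float(len(r)))
print(picked)
```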
Where Pith is reading between the lines
- Similar hybrid curation pipelines could improve data quality for other machine learning tasks involving human feedback.
- Applying this method to even larger scales might yield further gains if quality controls remain consistent.
- Open release of these models and data could accelerate community progress in safe AI alignment.
- Addressing dataset limitations this way may reduce reliance on purely synthetic data generation.
Load-bearing premise
The reported benchmark improvements arise mainly from the superior quality of data produced by the human-AI curation process rather than from differences in training methods or model designs.
What would settle it
Training equivalent models on 26 million pairs selected without the human-AI quality controls and finding no significant improvement over prior reward models on the same benchmarks would disprove that the curation method is the key driver.
Original abstract
Despite the critical role of reward models (RMs) in Reinforcement Learning from Human Feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture nuanced human preferences. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present SynPref-40M, a large-scale preference dataset comprising 40 million preference pairs. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while LLMs perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling. These reward models achieve state-of-the-art performance across seven major reward model benchmarks, outperform generative reward models, and demonstrate strong downstream performance. Ablation studies confirm that effectiveness stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, demonstrating how human-AI curation synergy can unlock significantly higher data quality.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SynPref-40M, a 40M-pair preference dataset curated via a two-stage human-AI pipeline, and trains the Skywork-Reward-V2 family (0.6B–8B) on a 26M subset. It claims these models achieve SOTA on seven major reward-model benchmarks, outperform generative RMs, and show strong downstream performance in alignment, safety, and best-of-N scaling, with ablations attributing gains to curation quality beyond scale.
Significance. If the empirical claims hold with rigorous controls, the work would meaningfully advance open reward modeling by demonstrating a scalable human-AI curation method that addresses brittleness in preference data. The approach could influence future RLHF pipelines and data-collection practices.
major comments (2)
- [Abstract and §5, Ablations] The statement that 'ablations confirm that effectiveness stems not only from data scale but also from high-quality curation' is load-bearing for the central claim, yet no details are given on whether the low-quality baseline matches the 26M curated subset in topic distribution, response length, preference-margin statistics, or label-noise rate. Without these controls the performance delta cannot be attributed to the pipeline.
- [§4, Experiments] The SOTA claims across seven benchmarks and the assertion of outperforming generative reward models require explicit tables with all baselines, exact scores, error bars or statistical tests, and the precise 26M-subset selection criteria. These are absent from the supplied description and are necessary to substantiate the performance gains.
minor comments (2)
- [§3] Clarify the exact human-guidance rules passed to the LLM curation stage and whether the 26M subset is obtained by a deterministic filter whose thresholds are fully specified.
- Add a table or appendix listing the seven benchmarks with references, metric definitions, and the exact Skywork-Reward-V2 scores versus prior open and generative RMs.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We have revised the manuscript to supply the requested controls and tables, which strengthen the attribution of gains to curation quality and the substantiation of our empirical claims.
Point-by-point responses
Referee: [Abstract and §5, Ablations] The statement that 'ablations confirm that effectiveness stems not only from data scale but also from high-quality curation' is load-bearing for the central claim, yet no details are given on whether the low-quality baseline matches the 26M curated subset in topic distribution, response length, preference-margin statistics, or label-noise rate. Without these controls the performance delta cannot be attributed to the pipeline.
Authors: We agree that explicit matching controls are necessary to support the attribution. In the revised §5 and appendix we now include a table comparing the low-quality baseline against the 26M curated subset on topic distribution (via k-means clustering on embeddings), response length (mean, std, and distribution), preference-margin statistics (mean and variance of chosen-rejected score differences), and label-noise rate (estimated via a 1k human-verified sample). The baseline was constructed via random sampling from the same source pool followed by length and topic re-balancing to match these statistics within 5%. The performance gap remains statistically significant under these matched conditions, supporting the claim that curation quality contributes beyond scale. Revision: yes.
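The matching described in this response amounts to comparing a handful of distributional statistics between the baseline pool and the curated subset. A minimal sketch of such diagnostics follows; the field names (`chosen`, `rejected`, `margin`) are illustrative assumptions, and only the 5% tolerance comes from the response above.

```python
# Sketch of the matching diagnostics described in the rebuttal: compare a
# baseline pool and the curated subset on response length and preference-margin
# statistics. Field names (`chosen`, `rejected`, `margin`) are hypothetical.
from statistics import mean, pstdev
from typing import Dict, List

def summarize(pairs: List[Dict]) -> Dict[str, float]:
    lengths = [len(p["chosen"]) + len(p["rejected"]) for p in pairs]
    margins = [p["margin"] for p in pairs]  # chosen-minus-rejected score difference
    return {
        "mean_length": mean(lengths),
        "std_length": pstdev(lengths),
        "mean_margin": mean(margins),
        "std_margin": pstdev(margins),
    }

def within_tolerance(a: Dict[str, float], b: Dict[str, float], tol: float = 0.05) -> bool:
    """True if every summary statistic of b is within tol (relative) of a."""
    return all(abs(a[k] - b[k]) <= tol * max(abs(a[k]), 1e-8) for k in a)

# Usage (hypothetical variables): within_tolerance(summarize(curated_subset), summarize(baseline_pool))
```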
Referee: [§4, Experiments] The SOTA claims across seven benchmarks and the assertion of outperforming generative reward models require explicit tables with all baselines, exact scores, error bars or statistical tests, and the precise 26M-subset selection criteria. These are absent from the supplied description and are necessary to substantiate the performance gains.
Authors: We have expanded §4 with complete tables listing every baseline (including all open and generative RMs), exact scores on each of the seven benchmarks, and error bars computed from three independent training runs. Paired t-tests with p-values are reported in the appendix. The 26M-subset selection criteria are now detailed in §3.2: starting from SynPref-40M, we apply (1) human-AI agreement filtering (threshold 0.8), (2) diversity sampling via determinantal point processes on embedding clusters, and (3) preference-margin thresholding (>0.3) to retain 26M pairs while preserving topic coverage. Revision: yes.
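The three-step selection described in this response can be sketched as a filter-then-sample pipeline. In the sketch below, greedy farthest-point sampling stands in for the determinantal point process, and the field names (`agreement`, `margin`, `embedding`) are hypothetical; only the 0.8 and 0.3 thresholds come from the response above.

```python
# Sketch of the three-step subset selection described in the rebuttal:
# (1) human-AI agreement filter, (2) diversity sampling, (3) margin threshold.
# Greedy farthest-point sampling stands in for the determinantal point process;
# field names (`agreement`, `margin`, `embedding`) are hypothetical.
import numpy as np
from typing import Dict, List

def select_subset(pairs: List[Dict], k: int,
                  agreement_min: float = 0.8, margin_min: float = 0.3) -> List[Dict]:
    kept = [p for p in pairs
            if p["agreement"] >= agreement_min and p["margin"] > margin_min]
    if len(kept) <= k:
        return kept
    emb = np.stack([p["embedding"] for p in kept])
    chosen = [0]  # start from an arbitrary seed point
    dists = np.linalg.norm(emb - emb[0], axis=1)
    while len(chosen) < k:
        nxt = int(dists.argmax())  # farthest remaining point from the current selection
        chosen.append(nxt)
        dists = np.minimum(dists, np.linalg.norm(emb - emb[nxt], axis=1))
    return [kept[i] for i in chosen]
```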
Circularity Check
No circularity: empirical training and external benchmark evaluation
Full rationale
The paper trains reward models on a curated preference dataset (SynPref-40M subset) and reports performance on seven independent external benchmarks. No equations, derivations, or fitted parameters are defined in terms of the target metrics themselves. Ablations are invoked to attribute gains to curation quality, but these are standard controlled experiments rather than self-referential reductions. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The central claims are falsifiable empirical outcomes on held-out benchmarks, satisfying the criteria for a self-contained non-circular result.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: pairwise preference annotations from humans accurately reflect nuanced human preferences.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "We design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability... Training on this preference mixture, we introduce Skywork-Reward-V2..."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "Ablation studies confirm that effectiveness stems not only from data scale but also from high-quality curation."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 17 Pith papers
- Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance
  Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.
- Beyond Semantic Manipulation: Token-Space Attacks on Reward Models
  TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing gibberish text.
- ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation
  ModeX selects the modal semantic output from multiple LLM generations via a similarity graph and recursive spectral clustering without needing reward models or evaluators.
- LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling
  A metacognitive harness uses LLMs' pre- and post-solution self-monitoring signals to control test-time reasoning, raising pooled accuracy from 48.3% to 56.9% on text, code, and multimodal benchmarks.
- Pause and Reflect: Conformal Aggregation for Chain-of-Thought Reasoning
  A conformal procedure for CoT replaces majority voting with weighted aggregation and calibrates abstention to guarantee low confident-error rates, achieving 90.1% selective accuracy on GSM8K by abstaining on under 5% ...
- ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization
  ODRPO decomposes discrete rewards into ordinal binary indicators to compute independent advantages and reduce noise corruption in RLAIF policy optimization.
- When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling
  A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.
- When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient
  Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward des...
- QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning
  QuantumQA dataset and verification-aware RL with adaptive reward fusion enable an 8B LLM to achieve performance competitive with proprietary models on quantum mechanics tasks.
- CoAct: Co-Active LLM Preference Learning with Human-AI Synergy
  CoAct synergistically merges self-rewarding and active learning via self-consistency to select reliable AI labels and oracle-needed samples, delivering 8-13% gains on GSM8K, MATH, and WebInstruct.
- AgentV-RL: Scaling Reward Modeling with Agentic Verifier
  AgentV-RL introduces bidirectional forward-backward agents and RL-driven tool use to improve LLM verifiers, with a 4B model beating prior outcome reward models by 25.2%.
- GroupDPO: Memory efficient Group-wise Direct Preference Optimization
  GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.
- Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
  Generalized on-policy distillation with reward scaling above one (ExOPD) lets student models surpass teacher performance when merging domain experts on math and code tasks.
- Memory in the Age of AI Agents
  The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
- Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
  Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
- DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling
  DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLH...
- Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs
  Random sampling matches active preference learning on win-rate gains in online DPO yet both degrade benchmark performance, making active selection's overhead hard to justify.
Reference graph
Works this paper leans on
[1] (2023) introduced the first taxonomy of reward models, categorizing them into (1) sequence classifiers, (2) direct preference optimization (DPO) models with implicit rewards, (3) generative models, and (4) custom classifiers. Most BT-based models fall under the sequence classifier category, while generative models primarily include LLM-as-a-Judge approaches. DPO...
[2] Stage 1 metadata isolates "risky" regions. Every pair in the unverified pool receives preference attributes from LLMs: task category, objectivity, controversiality, desired attributes, and annotation guideline. This stratification identifies objective/low-controversial versus subjective/high-controversial regions, where intransitivity is more common. In...
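Excerpt [2] lists the attributes Stage 1 attaches to every pair. A minimal schema sketch of that metadata follows; the field names, types, and thresholds are illustrative assumptions rather than the paper's exact definitions.

```python
# Minimal schema sketch for the per-pair attributes that excerpt [2] says
# LLMs attach in Stage 1. Field names, types, and thresholds are hypothetical.
from dataclasses import dataclass

@dataclass
class PairAttributes:
    task_category: str         # e.g. "coding", "open-ended chat"
    objectivity: float         # 0 = fully subjective, 1 = fully objective
    controversiality: float    # higher values flag contested topics
    desired_attributes: str    # free-text description of what a good answer needs
    annotation_guideline: str  # instructions later reused to steer curation

def is_risky(attrs: PairAttributes,
             objectivity_max: float = 0.3, controversy_min: float = 0.7) -> bool:
    """Flag subjective, high-controversy pairs, where intransitivity is more common."""
    return attrs.objectivity <= objectivity_max and attrs.controversiality >= controversy_min
```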
[3] Error-driven adaptive retrieval focuses on "unstable" regions. In Stage 1, we repeatedly train an RM, evaluate it on human-verified gold data, and use error-driven adaptive retrieval to pull in new examples similar (in prompt + attribute space) to misclassified or low-confidence pairs. This concentrates labeling effort where the current BT model finds the ...
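One way to read excerpt [3] is as a nearest-neighbor retrieval step keyed on the current model's errors. The sketch below assumes precomputed embeddings and a per-pair `rm_margin` field (chosen-minus-rejected score); all of these interfaces are hypothetical.

```python
# Sketch of the error-driven adaptive retrieval loop in excerpt [3]: find gold
# pairs the current RM gets wrong or is unsure about, then pull unverified pairs
# that are nearby in embedding space. All interfaces here are hypothetical.
import numpy as np
from typing import Dict, List

def retrieve_near_errors(gold: List[Dict], pool: List[Dict],
                         margin_conf: float = 0.2, per_error: int = 5) -> List[Dict]:
    # A gold pair counts as an "error" if the RM's chosen-minus-rejected margin is
    # negative (misclassified) or smaller than margin_conf (low confidence).
    errors = [g for g in gold if g["rm_margin"] < margin_conf]
    pool_emb = np.stack([p["embedding"] for p in pool])
    selected, seen = [], set()
    for e in errors:
        dists = np.linalg.norm(pool_emb - np.asarray(e["embedding"]), axis=1)
        for idx in np.argsort(dists)[:per_error]:
            if int(idx) not in seen:
                seen.add(int(idx))
                selected.append(pool[int(idx)])
    return selected
```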
[4] (2024) Stage 2 dual-RM consistency filtering targets contradictory signals. Stage 2 introduces a consistency filter: we train a gold RM on cumulative human-verified samples and use it together with the Stage-1 best RM to decide which in-the-wild pairs to keep or flip. We retain pairs whose original chosen/rejected labels agree with the gold RM and either the Stag...
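Excerpt [4] is truncated, so the exact keep/flip rule is not fully visible. The sketch below assumes a simple reading: keep a pair when both reward models agree with its original label, flip it when both disagree, and drop it otherwise. The `Scorer` interface and field names are hypothetical.

```python
# Sketch of the Stage 2 dual-RM consistency filter from excerpt [4]. The excerpt
# is truncated, so the exact decision rule and the scorer interface are assumptions.
from typing import Callable, Dict, Optional

Scorer = Callable[[str, str], float]  # (prompt, response) -> reward score

def consistency_filter(pair: Dict, gold_rm: Scorer, stage1_rm: Scorer) -> Optional[Dict]:
    prompt, chosen, rejected = pair["prompt"], pair["chosen"], pair["rejected"]
    gold_agrees = gold_rm(prompt, chosen) > gold_rm(prompt, rejected)
    stage1_agrees = stage1_rm(prompt, chosen) > stage1_rm(prompt, rejected)
    if gold_agrees and stage1_agrees:
        return pair                                              # keep as-is
    if not gold_agrees and not stage1_agrees:
        return {**pair, "chosen": rejected, "rejected": chosen}  # flip the label
    return None                                                  # contradictory: drop
```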
[5] LLMs can effectively automate certain types of annotation. For conversations involving reasoning tasks such as math problems or coding questions, LLMs are more efficient and reliable than human annotators. Human annotators may not be experts in all types of math and coding problems. We emphasize using cutting-edge models for this purpose, particularly thos...
[6] (2025) Human preferences are complicated, even for humans. During annotation, we consistently encountered preference pairs that were ambiguous, subjective, or context-dependent, making it difficult even for trained annotators to confidently determine which response was better. Factors like subtle tone differences, varying expectations around informativeness or...
[7] Learning clear and aligned preferences significantly enhances reward models. Our experiments demonstrate that when reward models are trained on preference data that is well-structured, verified, and guided by clear annotation protocols, their performance improves substantially across all evaluation benchmarks. We hypothesize that this may be due to the s...
[8] Define target performance. Let the desired average score across the six main benchmarks (excluding RewardBench v2) be S_target. Our scaling curve in Figure 5 shows the relationship between fraction of curated data and average RM score.
[9] Estimate required curated pairs. From Figure 5, practitioners can read off a conservative fraction f of the full curated mixture needed to reach S_target. For example, f ≈ 0.018 (290K pairs) already exceeds previous open SOTA at 8B. Higher targets correspond to larger f, but with diminishing returns. 3. Decompose costs by stage. Our pipeline separates labeling...
[10] Allocation strategy. Given a dollar budget B_max, allocate a gold budget B_H ≤ B_max to determine |D_gold|. Our results suggest that roughly O(10^5) carefully-selected gold pairs suffice to train strong gold and Stage-1 RMs. Allocate the remaining budget B_max - B_H to scaling Stage-2 curation, trading off total curated volume versus LLM quality (e.g., using chea...
[11] Validation and stopping criteria. Monitor (1) RM benchmark scores as in Figure 5, and (2) downstream BoN curves (e.g., PPE Correctness and RMB) where we observe monotonic scaling with N. Once incremental gains per additional curated million pairs fall below a user-defined threshold (e.g., <0.3 points on average benchmark score), it is reasonable to stop ...
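Excerpts [10] and [11] read as an allocate-then-monitor recipe: spend enough of the budget on roughly O(10^5) gold pairs, put the remainder into Stage 2 curation, and stop once the marginal benchmark gain per extra million curated pairs drops below a threshold. The sketch below is one plausible rendering; only the O(10^5) gold-pair count and the 0.3-point threshold come from the excerpts, and the cost constants and scores in the example are placeholders.

```python
# Sketch combining the budget split from excerpt [10] with the stopping rule
# from excerpt [11]. Cost constants and score values are placeholders, not
# numbers from the paper.
from typing import List, Tuple

def split_budget(b_max: float, gold_pair_cost: float,
                 gold_pairs: int = 100_000) -> Tuple[float, float]:
    """Spend enough on human-verified gold pairs (roughly O(1e5)), rest on Stage 2 curation."""
    b_gold = min(b_max, gold_pairs * gold_pair_cost)
    return b_gold, b_max - b_gold

def should_stop(scores: List[float], pairs_millions: List[float],
                min_gain: float = 0.3) -> bool:
    """Stop once the benchmark gain per additional curated million pairs drops below min_gain."""
    if len(scores) < 2:
        return False
    gain = scores[-1] - scores[-2]
    added = pairs_millions[-1] - pairs_millions[-2]
    return added > 0 and gain / added < min_gain

# Example: scores observed at 10M and 14M curated pairs.
print(split_budget(b_max=500_000.0, gold_pair_cost=2.0))
print(should_stop([72.0, 72.8], [10.0, 14.0]))  # 0.2 points per million < 0.3 -> True
```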
[12] Mechanism design as an inner loop in Stage 1. Our error-driven adaptive retrieval already behaves like a "targeted querying mechanism" over the unverified pool. Future work could replace simple similarity-based retrieval with a mechanism-design-inspired query selection rule, e.g., selecting pairs that maximally reduce posterior uncertainty in a generalized...
[13] (2024) Using mechanism-design insights to set budgets and stopping rules. Zhang et al.'s framework (Zhang et al., 2024a) suggests principled criteria for which pairwise comparisons are most information-efficient. Combined with our Stage-wise cost decomposition, this could inform better allocation of gold human labels across task types and controversiality levels...