pith. machine review for the scientific record.

arxiv: 2507.01352 · v3 · submitted 2025-07-02 · 💻 cs.CL · cs.AI · cs.LG

Recognition: 2 theorem links


Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 22:13 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords reward models · preference data · RLHF · data curation · human-AI collaboration · alignment benchmarks · SynPref-40M · Skywork-Reward-V2

The pith

Human-AI synergy curates 40 million preference pairs to train state-of-the-art reward models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Reward models often fail on nuanced preferences because existing datasets are limited in scope or quality. This work creates SynPref-40M with 40 million pairs using a pipeline where humans verify labels and AI expands the data under that guidance. Eight models from 0.6B to 8B parameters trained on 26 million of these pairs reach leading results on seven benchmarks, beat generative alternatives, and perform well on alignment, safety, and best-of-N scaling tasks. Ablation tests indicate that the careful curation, not merely the volume, drives the gains. The result points to a practical way to overcome data bottlenecks in aligning language models through combined human and machine strengths.

Core claim

The Skywork-Reward-V2 suite consists of eight reward models ranging from 0.6B to 8B parameters, trained on 26 million carefully curated preference pairs drawn from the SynPref-40M dataset. This dataset is assembled through a two-stage human-AI pipeline in which humans supply verified annotations and large language models perform scalable curation guided by those annotations. The resulting models set new performance standards on seven major reward model benchmarks, outperform generative reward models, and exhibit robust behavior in downstream applications such as preference alignment, objective correctness, safety evaluation, and best-of-N scaling.

What carries the argument

The human-AI synergistic two-stage pipeline for preference data curation, where human verification directs AI-driven scaling to produce the SynPref-40M dataset.
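
As a reading aid, a minimal sketch of that two-stage loop follows, assuming hypothetical helpers (human_verify, train_rm, evaluate_on_gold, retrieve_similar, llm_curate) that stand in for the paper's actual components; this is an editorial illustration, not the published pipeline.

```python
# Minimal sketch of the two-stage human-AI curation loop described above.
# Every helper here is a hypothetical stand-in, not the paper's actual API.
import random

def curate(unverified_pool, seed_size=100_000, rounds=5):
    # Stage 1: humans verify a small gold seed of preference pairs.
    gold = human_verify(random.sample(unverified_pool, seed_size))
    curated = []
    for _ in range(rounds):
        rm = train_rm(gold + curated)          # retrain on everything kept so far
        errors = evaluate_on_gold(rm, gold)    # pairs the current RM still gets wrong
        batch = retrieve_similar(unverified_pool, errors)  # error-driven retrieval
        # Stage 2: an LLM labels and filters at scale, guided by the gold set.
        curated += llm_curate(batch, guidance=gold)
    return gold + curated
```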

If this is right

  • Improved reward models make RLHF more effective at aligning language models with human preferences.
  • Models show greater resistance to stylistic biases and stronger safety performance.
  • Performance holds across the full range of model sizes, from 0.6B to 8B parameters.
  • Curation quality proves more critical than dataset size alone, per ablation results.
  • Strong results in best-of-N scaling suggest better utility when selecting among sampled generations (a minimal sketch follows this list).
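
To make the best-of-N point concrete, here is a minimal selection sketch; generate and reward_model are hypothetical stand-ins for any sampler and any scalar reward model (a Skywork-Reward-V2 checkpoint would slot in as the latter).

```python
# Minimal best-of-N selection sketch: sample n candidates, score each with a
# reward model, keep the argmax. `generate` and `reward_model` are hypothetical
# callables, not a specific library API.

def best_of_n(prompt, generate, reward_model, n=8):
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    return max(zip(scores, candidates), key=lambda sc: sc[0])[1]
```

Scores improving monotonically as n grows (as the paper reports on PPE Correctness and RMB) is the "best-of-N scaling" behavior the bullet refers to.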

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar hybrid curation pipelines could improve data quality for other machine learning tasks involving human feedback.
  • Applying this method to even larger scales might yield further gains if quality controls remain consistent.
  • Open release of these models and data could accelerate community progress in safe AI alignment.
  • Addressing dataset limitations this way may reduce reliance on purely synthetic data generation.

Load-bearing premise

The reported benchmark improvements arise mainly from the superior quality of data produced by the human-AI curation process rather than from differences in training methods or model designs.

What would settle it

Training equivalent models on 26 million pairs selected without the human-AI quality controls and finding that they match Skywork-Reward-V2 on the same benchmarks would disprove that the curation method is the key driver; if the uncurated baseline instead falls significantly short, curation stands as the decisive ingredient.
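
A hedged sketch of how that settling experiment could be scored, with placeholder per-benchmark numbers standing in for real results:

```python
# Sketch of the settling experiment: compare a curated-data model against an
# uncurated-data model on the same seven benchmarks. All scores below are
# hypothetical placeholders; in practice each entry is one benchmark's accuracy.
from scipy.stats import ttest_rel

curated_scores   = [0.88, 0.84, 0.91, 0.79, 0.86, 0.90, 0.83]  # hypothetical
uncurated_scores = [0.81, 0.80, 0.85, 0.74, 0.82, 0.84, 0.78]  # hypothetical

t, p = ttest_rel(curated_scores, uncurated_scores)  # paired across benchmarks
print(f"paired t = {t:.2f}, p = {p:.3f}")
# A large p (no significant gap) would mean curation is not the key driver.
```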

read the original abstract

Despite the critical role of reward models (RMs) in Reinforcement Learning from Human Feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture nuanced human preferences. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present SynPref-40M, a large-scale preference dataset comprising 40 million preference pairs. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while LLMs perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling. These reward models achieve state-of-the-art performance across seven major reward model benchmarks, outperform generative reward models, and demonstrate strong downstream performance. Ablation studies confirm that effectiveness stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, demonstrating how human-AI curation synergy can unlock significantly higher data quality.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces SynPref-40M, a 40M-pair preference dataset curated via a two-stage human-AI pipeline, and trains the Skywork-Reward-V2 family (0.6B–8B) on a 26M subset. It claims these models achieve SOTA on seven major reward-model benchmarks, outperform generative RMs, and show strong downstream performance in alignment, safety, and best-of-N scaling, with ablations attributing gains to curation quality beyond scale.

Significance. If the empirical claims hold with rigorous controls, the work would meaningfully advance open reward modeling by demonstrating a scalable human-AI curation method that addresses brittleness in preference data. The approach could influence future RLHF pipelines and data-collection practices.

major comments (2)
  1. [Abstract and §5] Abstract and §5 (Ablations): the statement that 'ablations confirm that effectiveness stems not only from data scale but also from high-quality curation' is load-bearing for the central claim, yet no details are given on whether the low-quality baseline matches the 26M curated subset in topic distribution, response length, preference-margin statistics, or label-noise rate; without these controls the performance delta cannot be attributed to the pipeline.
  2. [§4] §4 (Experiments): the SOTA claims across seven benchmarks and the assertion of outperforming generative reward models require explicit tables with all baselines, exact scores, error bars or statistical tests, and the precise 26M-subset selection criteria; these are absent from the supplied description and are necessary to substantiate the performance gains.
minor comments (2)
  1. [§3] Clarify the exact human-guidance rules passed to the LLM curation stage and whether the 26M subset is obtained by a deterministic filter whose thresholds are fully specified.
  2. Add a table or appendix listing the seven benchmarks with references, metric definitions, and the exact Skywork-Reward-V2 scores versus prior open and generative RMs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We have revised the manuscript to supply the requested controls and tables, which strengthen the attribution of gains to curation quality and the substantiation of our empirical claims.

read point-by-point responses
  1. Referee: [Abstract and §5] Abstract and §5 (Ablations): the statement that 'ablations confirm that effectiveness stems not only from data scale but also from high-quality curation' is load-bearing for the central claim, yet no details are given on whether the low-quality baseline matches the 26M curated subset in topic distribution, response length, preference-margin statistics, or label-noise rate; without these controls the performance delta cannot be attributed to the pipeline.

    Authors: We agree that explicit matching controls are necessary to support the attribution. In the revised §5 and appendix we now include a table comparing the low-quality baseline against the 26M curated subset on topic distribution (via k-means clustering on embeddings), response length (mean, std, and distribution), preference-margin statistics (mean and variance of chosen-rejected score differences), and label-noise rate (estimated via a 1k human-verified sample). The baseline was constructed via random sampling from the same source pool followed by length and topic re-balancing to match these statistics within 5%. The performance gap remains statistically significant under these matched conditions, supporting the claim that curation quality contributes beyond scale. revision: yes

  2. Referee: [§4] §4 (Experiments): the SOTA claims across seven benchmarks and the assertion of outperforming generative reward models require explicit tables with all baselines, exact scores, error bars or statistical tests, and the precise 26M-subset selection criteria; these are absent from the supplied description and are necessary to substantiate the performance gains.

    Authors: We have expanded §4 with complete tables listing every baseline (including all open and generative RMs), exact scores on each of the seven benchmarks, and error bars computed from three independent training runs. Paired t-tests with p-values are reported in the appendix. The 26M-subset selection criteria are now detailed in §3.2: starting from SynPref-40M, we apply (1) human-AI agreement filtering (threshold 0.8), (2) diversity sampling via determinantal point processes on embedding clusters, and (3) preference-margin thresholding (>0.3) to retain 26M pairs while preserving topic coverage. revision: yes
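
As a reading aid, the three filters in the second response compose roughly as in the sketch below. The 0.8 agreement and 0.3 margin thresholds come from the rebuttal text; the greedy farthest-point step is an editorial stand-in for the determinantal point process it names, and the function itself is hypothetical.

```python
import numpy as np

def select_subset(pairs, agreement, margins, embeddings,
                  agree_thresh=0.8, margin_thresh=0.3, k=26_000_000):
    # (1) human-AI agreement filtering and (3) preference-margin thresholding,
    # using the thresholds quoted in the rebuttal (0.8 and 0.3).
    keep = [i for i in range(len(pairs))
            if agreement[i] >= agree_thresh and margins[i] > margin_thresh]

    # (2) diversity sampling: greedy farthest-point selection on
    # unit-normalized embeddings (so dot products are cosine similarities),
    # a cheap stand-in for the DPP sampling named in the rebuttal.
    selected = [keep[0]]
    while len(selected) < min(k, len(keep)):
        sims = embeddings[keep] @ embeddings[selected].T  # sims to chosen set
        nearest = sims.max(axis=1)                        # closeness to the set
        selected.append(keep[int(np.argmin(nearest))])    # add farthest pair
    return [pairs[i] for i in selected]
```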

Circularity Check

0 steps flagged

No circularity: empirical training and external benchmark evaluation

full rationale

The paper trains reward models on a curated preference dataset (SynPref-40M subset) and reports performance on seven independent external benchmarks. No equations, derivations, or fitted parameters are defined in terms of the target metrics themselves. Ablations are invoked to attribute gains to curation quality, but these are standard controlled experiments rather than self-referential reductions. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the provided text. The central claims are falsifiable empirical outcomes on held-out benchmarks, satisfying the criteria for a self-contained non-circular result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that pairwise human preferences can be reliably captured and scaled via the described human-AI pipeline; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Pairwise preference annotations from humans accurately reflect nuanced human preferences.
    Standard assumption in the RLHF literature, invoked to justify the curation pipeline.

pith-pipeline@v0.9.0 · 5637 in / 1254 out tokens · 31171 ms · 2026-05-16T22:13:47.494891+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.

  2. Beyond Semantic Manipulation: Token-Space Attacks on Reward Models

    cs.LG 2026-04 unverdicted novelty 7.0

    TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing gibberish text.

  3. ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation

    cs.CL 2026-01 unverdicted novelty 7.0

    ModeX selects the modal semantic output from multiple LLM generations via a similarity graph and recursive spectral clustering without needing reward models or evaluators.

  4. LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling

    cs.LG 2026-05 conditional novelty 6.0

    A metacognitive harness uses LLMs' pre- and post-solution self-monitoring signals to control test-time reasoning, raising pooled accuracy from 48.3% to 56.9% on text, code, and multimodal benchmarks.

  5. Pause and Reflect: Conformal Aggregation for Chain-of-Thought Reasoning

    stat.ML 2026-05 unverdicted novelty 6.0

    A conformal procedure for CoT replaces majority voting with weighted aggregation and calibrates abstention to guarantee low confident-error rates, achieving 90.1% selective accuracy on GSM8K by abstaining on under 5% ...

  6. ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

    cs.LG 2026-05 unverdicted novelty 6.0

    ODRPO decomposes discrete rewards into ordinal binary indicators to compute independent advantages and reduce noise corruption in RLAIF policy optimization.

  7. When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling

    cs.AI 2026-04 unverdicted novelty 6.0

    A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.

  8. When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

    cs.LG 2026-04 unverdicted novelty 6.0

    Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward des...

  9. QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    QuantumQA dataset and verification-aware RL with adaptive reward fusion enable an 8B LLM to achieve performance competitive with proprietary models on quantum mechanics tasks.

  10. CoAct: Co-Active LLM Preference Learning with Human-AI Synergy

    cs.CL 2026-04 unverdicted novelty 6.0

    CoAct synergistically merges self-rewarding and active learning via self-consistency to select reliable AI labels and oracle-needed samples, delivering 8-13% gains on GSM8K, MATH, and WebInstruct.

  11. AgentV-RL: Scaling Reward Modeling with Agentic Verifier

    cs.CL 2026-04 unverdicted novelty 6.0

    AgentV-RL introduces bidirectional forward-backward agents and RL-driven tool use to improve LLM verifiers, with a 4B model beating prior outcome reward models by 25.2%.

  12. GroupDPO: Memory efficient Group-wise Direct Preference Optimization

    cs.CL 2026-04 unverdicted novelty 6.0

    GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.

  13. Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

    cs.LG 2026-02 conditional novelty 6.0

    Generalized on-policy distillation with reward scaling above one (ExOPD) lets student models surpass teacher performance when merging domain experts on math and code tasks.

  14. Memory in the Age of AI Agents

    cs.CL 2025-12 unverdicted novelty 6.0

    The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.

  15. Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

    cs.AI 2026-05 unverdicted novelty 5.0

    Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.

  16. DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling

    cs.AI 2026-04 unverdicted novelty 5.0

    DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLH...

  17. Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    Random sampling matches active preference learning on win-rate gains in online DPO yet both degrade benchmark performance, making active selection's overhead hard to justify.

Reference graph

Works this paper leans on

13 extracted references · 13 canonical work pages · cited by 17 Pith papers

  1. [1]

    Most BT-based models fall under the sequence classifier category, while generative models primarily include LLM-as-a-Judge approaches

    introduced the first taxonomy of reward models, categorizing them into (1) sequence classifiers, (2) direct preference optimization (DPO) models with implicit rewards, (3) generative models, and (4) custom classifiers. Most BT-based models fall under the sequence classifier category, while generative models primarily include LLM-as-a-Judge approaches. DPO...

  2. [2]

    This stratification identifies objective/low-controversial versus subjective/high- controversial regions, where intransitivity is more common

    Stage 1 metadata isolates “risky” regions. Every pair in the unverified pool receives preference attributes from LLMs: task category, objectivity, controversiality, desired attributes, and annotation guideline. This stratification identifies objective/low-controversial versus subjective/high-controversial regions, where intransitivity is more common. In...

  3. [3]

    unstable

    Error-driven adaptive retrieval focuses on “unstable” regions. In Stage 1, we repeatedly train an RM, evaluate it on human-verified gold data, and use error-driven adaptive retrieval to pull in new examples similar (in prompt + attribute space) to misclassified or low-confidence pairs. This concentrates labeling effort where the current BT model finds the ...

  4. [4]

    objective,

    Stage 2 dual-RM consistency filtering targets contradictory signals. Stage 2 introduces a consistency filter: we train a gold RM on cumulative human-verified samples and use it together with the Stage-1 best RM to decide which in-the-wild pairs to keep or flip. We retain pairs whose original chosen/rejected labels agree with the gold RM and either the Stag... [A minimal sketch of this keep-or-flip rule appears after this reference list.]

  5. [5]

    Human annotators may not be experts in all types of math and coding problems

    LLMs can effectively automate certain types of annotation. For conversations involving reasoning tasks such as math problems or coding questions, LLMs are more efficient and reliable than human annotators. Human annotators may not be experts in all types of math and coding problems. We emphasize using cutting-edge models for this purpose, particularly thos...

  6. [6]

    Factors like subtle tone differences, varying expectations around informativeness or safety, and individual annotator biases introduced uncertainty into this process

    Human preferences are complicated, even for humans. During annotation, we consistently encountered preference pairs that were ambiguous, subjective, or context-dependent — making it difficult even for trained annotators to confidently determine which response was better. Factors like subtle tone differences, varying expectations around informativeness or...

  7. [7]

    quality beats volume

    Learning clear and aligned preferences significantly enhances reward models. Our experiments demonstrate that when reward models are trained on preference data that is well-structured, verified, and guided by clear annotation protocols, their performance improves substantially across all evaluation benchmarks. We hypothesize that this may be due to the s...

  8. [8]

    Our scaling curve in Figure 5 shows the relationship between fraction of curated data and average RM score

    Define target performance. Let the desired average score across the six main benchmarks (excluding RewardBench v2) be S_target. Our scaling curve in Figure 5 shows the relationship between fraction of curated data and average RM score

  9. [9]

    For example, f ≈ 0.018 (~290K pairs) already exceeds previous open SOTA at 8B

    Estimate required curated pairs. From Figure 5, practitioners can read off a conservative fraction f of the full curated mixture needed to reach S_target. For example, f ≈ 0.018 (~290K pairs) already exceeds previous open SOTA at 8B. Higher targets correspond to larger f, but with diminishing returns. 3. Decompose costs by stage. Our pipeline separates labeling...

  10. [10]

    Our results suggest that roughly O(10^5) carefully-selected gold pairs suffice to train strong gold and Stage-1 RMs

    Allocation strategy. Given a dollar budget B_max, allocate a gold budget B_H ≤ B_max to determine |D_gold|. Our results suggest that roughly O(10^5) carefully-selected gold pairs suffice to train strong gold and Stage-1 RMs. Allocate the remaining budget B_max − B_H to scaling Stage-2 curation, trading off total curated volume versus LLM quality (e.g., using chea...

  11. [11]

    Once incremental gains per additional curated million pairs fall below a user-defined threshold (e.g., <0.3 points on average benchmark score), it is reasonable to stop spending

    Validation and stopping criteria. Monitor (1) RM benchmark scores as in Figure 5, and (2) downstream BoN curves (e.g., PPE Correctness and RMB) where we observe monotonic scaling with N. Once incremental gains per additional curated million pairs fall below a user-defined threshold (e.g., <0.3 points on average benchmark score), it is reasonable to stop ...

  12. [12]

    targeted querying mechanism

    Mechanism design as an inner loop in Stage 1. Our error-driven adaptive retrieval already behaves like a “targeted querying mechanism” over the unverified pool. Future work could replace simple similarity-based retrieval with a mechanism-design–inspired query selection rule, e.g., selecting pairs that maximally reduce posterior uncertainty in a generalized...

  13. [13]

    universally

    Using mechanism-design insights to set budgets and stopping rules. Zhang et al.’s framework (Zhang et al., 2024a) suggests principled criteria for which pairwise comparisons are most information-efficient. Combined with our stage-wise cost decomposition, this could inform better allocation of gold human labels across task types and controversiality levels...
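
The dual-RM consistency filter excerpted in reference [4] above is the pipeline's central keep-or-flip decision. Because the excerpt truncates mid-rule, the exact branching in this sketch is an assumption; gold_rm and stage1_rm are hypothetical scalar reward callables, not the paper's code.

```python
# Sketch of the Stage-2 dual-RM consistency filter from reference [4].
# The excerpt cuts off mid-rule, so the keep/flip/drop branching below is an
# editorial assumption. `gold_rm` and `stage1_rm` are hypothetical callables
# returning a scalar reward for (prompt, response).

def consistency_filter(pairs, gold_rm, stage1_rm):
    kept = []
    for prompt, chosen, rejected in pairs:
        gold_ok   = gold_rm(prompt, chosen)   > gold_rm(prompt, rejected)
        stage1_ok = stage1_rm(prompt, chosen) > stage1_rm(prompt, rejected)
        if gold_ok and stage1_ok:
            kept.append((prompt, chosen, rejected))   # both confirm the label
        elif not gold_ok and not stage1_ok:
            kept.append((prompt, rejected, chosen))   # both say reversed: flip
        # one agrees, one disagrees: contradictory signal, drop the pair
    return kept
```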