pith. sign in

hub Mixed citations

Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy

Mixed citation behavior. Most common role is background (60%).

33 Pith papers citing it
Background 60% of classified citations
abstract

Despite the critical role of reward models (RMs) in Reinforcement Learning from Human Feedback (RLHF), current state-of-the-art open RMs perform poorly on most existing evaluation benchmarks, failing to capture nuanced human preferences. We hypothesize that this brittleness stems primarily from limitations in preference datasets, which are often narrowly scoped, synthetically labeled, or lack rigorous quality control. To address these challenges, we present SynPref-40M, a large-scale preference dataset comprising 40 million preference pairs. To enable data curation at scale, we design a human-AI synergistic two-stage pipeline that leverages the complementary strengths of human annotation quality and AI scalability. In this pipeline, humans provide verified annotations, while LLMs perform automatic curation based on human guidance. Training on this preference mixture, we introduce Skywork-Reward-V2, a suite of eight reward models ranging from 0.6B to 8B parameters, trained on a carefully curated subset of 26 million preference pairs from SynPref-40M. We demonstrate that Skywork-Reward-V2 is versatile across a wide range of capabilities, including alignment with human preferences, objective correctness, safety, resistance to stylistic biases, and best-of-N scaling. These reward models achieve state-of-the-art performance across seven major reward model benchmarks, outperform generative reward models, and demonstrate strong downstream performance. Ablation studies confirm that effectiveness stems not only from data scale but also from high-quality curation. The Skywork-Reward-V2 series represents substantial progress in open reward models, demonstrating how human-AI curation synergy can unlock significantly higher data quality.

hub tools

citation-role summary

background 3 method 1 other 1

citation-polarity summary

years

2026 31 2025 2

clear filters

representative citing papers

ATLAS: Agentic Test-time Learning-to-Allocate Scaling

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

ATLAS introduces an LLM-orchestrated agentic framework for dynamic test-time scaling via extensible 'explore' actions, achieving higher accuracy with fewer API calls than fixed-workflow baselines on four benchmarks.

Code Generation by Differential Test Time Scaling

cs.SE · 2026-05-19 · unverdicted · novelty 7.0

DiffCodeGen clusters code candidates by behavioral similarity from fuzzing-synthesized inputs and selects the largest cluster's medoid, matching or exceeding prior test-time scaling methods with far less token and time cost.

On the Position Bias of On-Policy Distillation

cs.LG · 2026-06-21 · unverdicted · novelty 6.0

Position bias in on-policy distillation degrades later-token supervision; IW-OPD weights tokens by accumulated discrepancy, yielding faster convergence and up to 6.9 point gains on AIME-2025.

HARVE: Hacking-Aware Reward-Head Vector Editing for Robust Reward Models

cs.LG · 2026-06-02 · unverdicted · novelty 6.0

HARVE removes the component of the reward-head vector aligned with a multi-directional hacking subspace from residual streams using a small set of contrastive examples, improving robustness on RewardHackBench across eight models without fine-tuning while preserving general capability.

CoAct: Co-Active LLM Preference Learning with Human-AI Synergy

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

CoAct synergistically merges self-rewarding and active learning via self-consistency to select reliable AI labels and oracle-needed samples, delivering 8-13% gains on GSM8K, MATH, and WebInstruct.

citing papers explorer

Showing 10 of 10 citing papers after filters.

  • Support Vector Rubrics: Closing the Gap Between Self-Generated and Human Rubrics cs.CL · 2026-06-06 · unverdicted · none · ref 45 · internal anchor

    SVR learns a bank of contrastive rubrics from preference data via max-margin boundaries and prompt-conditioned selection, narrowing the gap to human rubrics on RubricBench from 24.1 to 0.3 points.

  • Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance cs.CL · 2026-05-08 · unverdicted · none · ref 5 · internal anchor

    Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.

  • ModeX: Evaluator-Free Best-of-N Selection for Open-Ended Generation cs.CL · 2026-01-05 · unverdicted · none · ref 43 · internal anchor

    ModeX selects the modal semantic output from multiple LLM generations via a similarity graph and recursive spectral clustering without needing reward models or evaluators.

  • From Trainee to Trainer: LLM-Designed Training Environment for RL with Multi-Agent Reasoning cs.CL · 2026-06-16 · unverdicted · none · ref 40 · internal anchor

    The LLM-as-Environment-Engineer framework lets the policy model redesign its own RL environments on the new MAPF-FrozenLake testbed, outperforming larger models and fixed baselines with Qwen3-4B.

  • Gradient-Guided Reward Optimization for Inference-time Alignment cs.CL · 2026-06-08 · unverdicted · none · ref 26 · internal anchor

    GGRO monitors token entropy to trigger gradient-guided token injection from reward models, improving LLM alignment on safety, helpfulness, and reasoning tasks at inference time.

  • CoAct: Co-Active LLM Preference Learning with Human-AI Synergy cs.CL · 2026-04-19 · unverdicted · none · ref 4 · internal anchor

    CoAct synergistically merges self-rewarding and active learning via self-consistency to select reliable AI labels and oracle-needed samples, delivering 8-13% gains on GSM8K, MATH, and WebInstruct.

  • AgentV-RL: Scaling Reward Modeling with Agentic Verifier cs.CL · 2026-04-17 · unverdicted · none · ref 4 · internal anchor

    AgentV-RL introduces bidirectional forward-backward agents and RL-driven tool use to improve LLM verifiers, with a 4B model beating prior outcome reward models by 25.2%.

  • GroupDPO: Memory efficient Group-wise Direct Preference Optimization cs.CL · 2026-04-17 · unverdicted · none · ref 21 · internal anchor

    GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.

  • Memory in the Age of AI Agents cs.CL · 2025-12-15 · unverdicted · none · ref 124 · internal anchor

    The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.

  • CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations cs.CL · 2026-05-25 · unverdicted · none · ref 28 · internal anchor

    CroCo applies English-reward-ranked self-generations for contrastive preference tuning that improves two LLMs on structured and open-ended tasks across 14 languages without language-specific annotations.