hub

Skywork-reward-v2: Scaling preference data curation via human-ai synergy.arXiv preprint arXiv:2507.01352

Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, et al · 2025 · arXiv 2507.01352

16 Pith papers cite this work. Polarity classification is still indexing.

16 Pith papers citing it

read on arXiv browse 16 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

representative citing papers

Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance

cs.CL · 2026-05-08 · unverdicted · novelty 7.0

Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.

Beyond Semantic Manipulation: Token-Space Attacks on Reward Models

cs.LG · 2026-04-03 · unverdicted · novelty 7.0

TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing gibberish text.

LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling

cs.LG · 2026-05-13 · conditional · novelty 6.0

A metacognitive harness uses LLMs' pre- and post-solution self-monitoring signals to control test-time reasoning, raising pooled accuracy from 48.3% to 56.9% on text, code, and multimodal benchmarks.

Pause and Reflect: Conformal Aggregation for Chain-of-Thought Reasoning

stat.ML · 2026-05-13 · unverdicted · novelty 6.0

A conformal procedure for CoT replaces majority voting with weighted aggregation and calibrates abstention to guarantee low confident-error rates, achieving 90.1% selective accuracy on GSM8K by abstaining on under 5% of cases.

ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

ODRPO decomposes discrete rewards into ordinal binary indicators to compute independent advantages and reduce noise corruption in RLAIF policy optimization.

When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling

cs.AI · 2026-04-29 · unverdicted · novelty 6.0

A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.

When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient

cs.LG · 2026-04-28 · unverdicted · novelty 6.0

Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward design insights.

QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning

cs.AI · 2026-04-20 · unverdicted · novelty 6.0

QuantumQA dataset and verification-aware RL with adaptive reward fusion enable an 8B LLM to achieve performance competitive with proprietary models on quantum mechanics tasks.

CoAct: Co-Active LLM Preference Learning with Human-AI Synergy

cs.CL · 2026-04-19 · unverdicted · novelty 6.0

CoAct synergistically merges self-rewarding and active learning via self-consistency to select reliable AI labels and oracle-needed samples, delivering 8-13% gains on GSM8K, MATH, and WebInstruct.

AgentV-RL: Scaling Reward Modeling with Agentic Verifier

cs.CL · 2026-04-17 · unverdicted · novelty 6.0

AgentV-RL introduces bidirectional forward-backward agents and RL-driven tool use to improve LLM verifiers, with a 4B model beating prior outcome reward models by 25.2%.

GroupDPO: Memory efficient Group-wise Direct Preference Optimization

cs.CL · 2026-04-17 · unverdicted · novelty 6.0

GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.

Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation

cs.LG · 2026-02-12 · conditional · novelty 6.0

Generalized on-policy distillation with reward scaling above one (ExOPD) lets student models surpass teacher performance when merging domain experts on math and code tasks.

Memory in the Age of AI Agents

cs.CL · 2025-12-15 · unverdicted · novelty 6.0

The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.

Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models

cs.AI · 2026-05-08 · unverdicted · novelty 5.0

Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.

DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling

cs.AI · 2026-04-21 · unverdicted · novelty 5.0

DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.

Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs

cs.LG · 2026-04-03 · unverdicted · novelty 5.0

Random sampling matches active preference learning on win-rate gains in online DPO yet both degrade benchmark performance, making active selection's overhead hard to justify.

citing papers explorer

Showing 16 of 16 citing papers.

Think-with-Rubrics: From External Evaluator to Internal Reasoning Guidance cs.CL · 2026-05-08 · unverdicted · none · ref 5
Think-with-Rubrics has LLMs generate rubrics internally before responding, outperforming external rubric-as-reward baselines by 3.87 points on average across benchmarks.
Beyond Semantic Manipulation: Token-Space Attacks on Reward Models cs.LG · 2026-04-03 · unverdicted · none · ref 7
TOMPA performs black-box adversarial optimization in token space to discover non-linguistic patterns that nearly double the reward scores of GPT-5 answers on Skywork-Reward-V2 while producing gibberish text.
LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling cs.LG · 2026-05-13 · conditional · none · ref 33
A metacognitive harness uses LLMs' pre- and post-solution self-monitoring signals to control test-time reasoning, raising pooled accuracy from 48.3% to 56.9% on text, code, and multimodal benchmarks.
Pause and Reflect: Conformal Aggregation for Chain-of-Thought Reasoning stat.ML · 2026-05-13 · unverdicted · none · ref 60
A conformal procedure for CoT replaces majority voting with weighted aggregation and calibrates abstention to guarantee low confident-error rates, achieving 90.1% selective accuracy on GSM8K by abstaining on under 5% of cases.
ODRPO: Ordinal Decompositions of Discrete Rewards for Robust Policy Optimization cs.LG · 2026-05-12 · unverdicted · none · ref 25
ODRPO decomposes discrete rewards into ordinal binary indicators to compute independent advantages and reduce noise corruption in RLAIF policy optimization.
When to Vote, When to Rewrite: Disagreement-Guided Strategy Routing for Test-Time Scaling cs.AI · 2026-04-29 · unverdicted · none · ref 25
A disagreement-guided routing framework dynamically selects among resolution, voting, and rewriting strategies for test-time scaling, delivering 3-7% accuracy gains with lower sampling cost on mathematical benchmarks.
When Errors Can Be Beneficial: A Categorization of Imperfect Rewards for Policy Gradient cs.LG · 2026-04-28 · unverdicted · none · ref 43
Certain errors in proxy rewards for policy gradient methods can be benign or beneficial by preventing policies from stalling on outputs with mediocre ground truth rewards, enabling improved RLHF metrics and reward design insights.
QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning cs.AI · 2026-04-20 · unverdicted · none · ref 5
QuantumQA dataset and verification-aware RL with adaptive reward fusion enable an 8B LLM to achieve performance competitive with proprietary models on quantum mechanics tasks.
CoAct: Co-Active LLM Preference Learning with Human-AI Synergy cs.CL · 2026-04-19 · unverdicted · none · ref 4
CoAct synergistically merges self-rewarding and active learning via self-consistency to select reliable AI labels and oracle-needed samples, delivering 8-13% gains on GSM8K, MATH, and WebInstruct.
AgentV-RL: Scaling Reward Modeling with Agentic Verifier cs.CL · 2026-04-17 · unverdicted · none · ref 4
AgentV-RL introduces bidirectional forward-backward agents and RL-driven tool use to improve LLM verifiers, with a 4B model beating prior outcome reward models by 25.2%.
GroupDPO: Memory efficient Group-wise Direct Preference Optimization cs.CL · 2026-04-17 · unverdicted · none · ref 21
GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.
Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation cs.LG · 2026-02-12 · conditional · none · ref 14
Generalized on-policy distillation with reward scaling above one (ExOPD) lets student models surpass teacher performance when merging domain experts on math and code tasks.
Memory in the Age of AI Agents cs.CL · 2025-12-15 · unverdicted · none · ref 124
The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models cs.AI · 2026-05-08 · unverdicted · none · ref 29
Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling cs.AI · 2026-04-21 · unverdicted · none · ref 24
DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.
Random Is Hard to Beat: Active Selection in online DPO with Modern LLMs cs.LG · 2026-04-03 · unverdicted · none · ref 24
Random sampling matches active preference learning on win-rate gains in online DPO yet both degrade benchmark performance, making active selection's overhead hard to justify.

Skywork-reward-v2: Scaling preference data curation via human-ai synergy.arXiv preprint arXiv:2507.01352

hub tools

fields

years

verdicts

representative citing papers

citing papers explorer