Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs
Pith reviewed 2026-05-17 16:13 UTC · model grok-4.3
The pith
Strategic selection and filtering of open-source preference pairs yield top-ranked reward models from just 80K examples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By developing data selection and filtering strategies for open-source preference datasets, the authors assemble the Skywork-Reward collection of only 80K pairs. Two models are trained on this data, Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B; the former holds the top position on RewardBench at the time of the report. The authors also state that their techniques and datasets have directly improved the performance of many other top-ranked models.
What carries the argument
Data selection and filtering strategies that curate the Skywork-Reward collection of high-quality preference pairs.
If this is right
- Smaller, carefully filtered preference datasets can match or exceed larger unfiltered collections in reward model performance.
- The curation techniques and the resulting dataset can be reused by other teams to raise the scores of their own reward models.
- Focus on data quality reduces the computational cost of preference learning for LLM alignment.
- Open-source data, once refined, can support leading results on public leaderboards.
Where Pith is reading between the lines
- The same filtering approach might be tested on datasets for other alignment methods such as direct preference optimization to check for similar size reductions.
- One could measure whether the selected pairs reduce specific biases common in raw web-scale preference data.
- Extending the curation pipeline to new model families or languages would test whether the gains hold beyond the current English-centric RewardBench setup.
Load-bearing premise
The data selection and filtering strategies produce generalizable improvements rather than leaderboard-specific gains tied to the particular open-source sources and evaluation distribution.
What would settle it
Evaluating models trained on the Skywork-Reward dataset on a new preference benchmark built from sources and domains entirely outside the original open-source pool used for curation.
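One way to make this test concrete is pairwise accuracy: the fraction of held-out pairs on which the reward model scores the preferred response above the rejected one. The sketch below assumes the scores have already been produced by some reward model on a disjoint benchmark. The function name `pairwise_accuracy`, the tie-handling convention, and the toy scores are illustrative assumptions, not taken from the paper.

```python
from typing import Sequence


def pairwise_accuracy(chosen_scores: Sequence[float],
                      rejected_scores: Sequence[float]) -> float:
    """Fraction of pairs where the chosen response outscores the rejected one.

    Ties count as errors; this is an assumed, conservative convention.
    """
    if len(chosen_scores) != len(rejected_scores):
        raise ValueError("score lists must have the same length")
    if not chosen_scores:
        raise ValueError("need at least one pair")
    correct = sum(c > r for c, r in zip(chosen_scores, rejected_scores))
    return correct / len(chosen_scores)


# Toy scores for four held-out pairs (hypothetical numbers).
chosen = [2.1, 0.3, -0.5, 1.0]
rejected = [1.4, 0.9, -1.2, 1.0]
accuracy = pairwise_accuracy(chosen, rejected)
print(accuracy)  # 0.5
```

Comparing this number between the leaderboard and the disjoint benchmark would indicate whether the gains carry over or shrink once the evaluation distribution changes.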
Original abstract
In this report, we introduce a collection of methods to enhance reward modeling for LLMs, focusing specifically on data-centric techniques. We propose effective data selection and filtering strategies for curating high-quality open-source preference datasets, culminating in the Skywork-Reward data collection, which contains only 80K preference pairs -- significantly smaller than existing datasets. Using this curated dataset, we developed the Skywork-Reward model series -- Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B -- with the former currently holding the top position on the RewardBench leaderboard. Notably, our techniques and datasets have directly enhanced the performance of many top-ranked models on RewardBench, highlighting the practical impact of our contributions in real-world preference learning applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a set of data-centric techniques for reward modeling in LLMs, centered on data selection and filtering strategies applied to open-source preference datasets. These yield the compact Skywork-Reward collection of 80K preference pairs. Models trained on this data, including Skywork-Reward-Gemma-27B (currently top-ranked on RewardBench) and Skywork-Reward-Llama-3.1-8B, are presented, along with the claim that the techniques and dataset have directly improved performance of multiple leading models on the benchmark.
Significance. If the curation methods isolate transferable preference signals rather than benchmark-specific artifacts, the work offers a practical demonstration that substantially smaller, high-quality datasets can drive state-of-the-art reward model performance. The reported adoption by other top models provides concrete evidence of real-world utility and supports the value of data-centric approaches in preference learning.
major comments (2)
- [Experiments / Results] Experiments / Results section: The central claim that the selection and filtering strategies produce the observed leaderboard gains rests on post-curation performance numbers, yet no ablation is reported that trains identical base models on the unfiltered source pools or on random subsets of equal size (80K) and measures the performance delta. Without this control, it remains possible that gains arise from distributional alignment between the chosen open-source sources and RewardBench rather than from the proposed tricks.
- [Data curation and evaluation] Data curation and evaluation sections: To substantiate generalizability, results on at least one disjoint preference benchmark (distinct from RewardBench in both construction and source distribution) should be included; current evidence is confined to a single leaderboard whose test distribution may correlate with the curation heuristics.
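The size-matched control requested in the first comment could be organised roughly as follows. This is a minimal sketch under stated assumptions: `random_subset` and `ablation_delta` are hypothetical helper names, `train_and_eval` stands for whatever routine trains a reward model on a set of pairs and returns a benchmark score, and the toy stand-in at the bottom simply averages a made-up per-pair quality value so the example runs without any training.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence


def random_subset(pool: Sequence, size: int, seed: int) -> List:
    """Draw a size-matched random subset of the pool without replacement."""
    if size > len(pool):
        raise ValueError("subset size exceeds pool size")
    return random.Random(seed).sample(list(pool), size)


def ablation_delta(curated: Sequence,
                   pool: Sequence,
                   train_and_eval: Callable[[Sequence], float],
                   seeds: Sequence[int] = (0, 1, 2)) -> float:
    """Curated-set score minus the mean score of random same-size subsets."""
    curated_score = train_and_eval(curated)
    baseline_scores = [
        train_and_eval(random_subset(pool, len(curated), seed))
        for seed in seeds
    ]
    return curated_score - mean(baseline_scores)


# Toy stand-in: each "pair" is just a hypothetical quality value in [0, 1],
# and "training + evaluation" is replaced by averaging those values.
pool = [i / 99 for i in range(100)]
curated = pool[-20:]  # pretend curation kept the 20 highest-quality pairs
delta = ablation_delta(curated, pool, train_and_eval=mean)
print(delta > 0)
```

Averaging over several seeds matters because a single random draw can be lucky or unlucky; a positive delta that persists across seeds would be attributable to curation rather than to dataset size alone.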
minor comments (2)
- [Abstract] Abstract: The phrase 'many top-ranked models' is vague; specifying the models, the exact manner in which the dataset or tricks were applied, and quantitative improvements would improve clarity.
- [Throughout] Throughout: Ensure consistent terminology for 'preference pairs' versus 'preference data' and provide explicit definitions or references for any filtering heuristics introduced in the methods.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We provide point-by-point responses to the major comments below.
- Referee: [Experiments / Results] Experiments / Results section: The central claim that the selection and filtering strategies produce the observed leaderboard gains rests on post-curation performance numbers, yet no ablation is reported that trains identical base models on the unfiltered source pools or on random subsets of equal size (80K) and measures the performance delta. Without this control, it remains possible that gains arise from distributional alignment between the chosen open-source sources and RewardBench rather than from the proposed tricks.
  Authors: We agree that explicit ablations against unfiltered source pools and random 80K subsets would more directly isolate the contribution of our curation strategies. The manuscript currently supports the value of the curated data through the top leaderboard performance of Skywork-Reward models and, importantly, through documented adoption and gains by multiple independent leading entries on RewardBench. This real-world usage by other teams provides evidence of transferable signals. Nevertheless, we will add the requested ablations on random subsets in the revised manuscript to strengthen the experimental section.
  Revision: yes
- Referee: [Data curation and evaluation] Data curation and evaluation sections: To substantiate generalizability, results on at least one disjoint preference benchmark (distinct from RewardBench in both construction and source distribution) should be included; current evidence is confined to a single leaderboard whose test distribution may correlate with the curation heuristics.
  Authors: We acknowledge that evaluation on a single benchmark leaves open the possibility of distribution-specific effects. Our primary focus was RewardBench as the established standard for reward model assessment. To address generalizability, we will add results on at least one additional, disjoint preference benchmark in the revised manuscript.
  Revision: yes
Circularity Check
No circularity: empirical data curation evaluated on external benchmarks
The paper describes data selection, filtering, and curation of an 80K preference dataset from open-source sources, followed by training reward models and reporting leaderboard results on RewardBench. No derivation chain, equations, or predictions are present that reduce to self-defined inputs or fitted parameters by construction. All performance claims rest on external public benchmarks and open-source data pools rather than internal re-use of fitted quantities as 'predictions.' The approach is self-contained against verifiable external leaderboards and does not invoke self-citations for load-bearing uniqueness theorems or ansatzes.
Forward citations
Cited by 20 Pith papers
- SLAM: Structural Linguistic Activation Marking for Language Models
  SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.
- SLAM: Structural Linguistic Activation Marking for Language Models
  SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
- Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents
  Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipp...
- StoryAlign: Evaluating and Training Reward Models for Story Generation
  StoryReward, trained on a new 100k story preference dataset, sets state-of-the-art performance on the introduced StoryRMB benchmark for aligning LLM stories with human preferences.
- You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass
  A multi-response discriminative reward model scores N candidates in one pass via concatenation and cross-entropy, achieving SOTA on multimodal benchmarks and improving RL policies over single-response baselines.
- Many Preferences, Few Policies: Towards Scalable Language Model Personalization
  PALM produces a small portfolio of LLMs that contains a near-optimal model for any user preference weight vector, with theoretical bounds on portfolio size and approximation quality.
- Scalable Token-Level Hallucination Detection in Large Language Models
  TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reaso...
- Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
  MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...
- Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
  MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...
- Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
  RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.
- Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty
  E-GRM triggers CoT reasoning in generative reward models only when parallel generations show high uncertainty, reducing inference cost and raising accuracy on reasoning benchmarks via a hybrid regression-ranking scorer.
- Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
  Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.
- Mitigating Reward Hacking in RLHF via Advantage Sign Robustness
  SignCert-PO mitigates reward hacking in RLHF by down-weighting completions whose advantage signs are not robust to small reward-model perturbations, using a certified preservation radius derived at the policy optimiza...
- Unifying Ontology Construction and Semantic Alignment for Deterministic Enterprise Reasoning at Scale
  LOM unifies ontology construction, semantic alignment, and deterministic reasoning in one architecture, reporting 88.8% accuracy on ontology completion and 94% on complex graph reasoning tasks.
- MoCo: A One-Stop Shop for Model Collaboration Research
  MoCo supplies a unified library of 26 collaboration strategies and benchmarks demonstrating average outperformance over single models in 61 percent of (model, data) pairs.
- Memory in the Age of AI Agents
  The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
- VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
  VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
- Visual-RFT: Visual Reinforcement Fine-Tuning
  Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.
- Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis
  A reasoning-driven problem generator plans synthesis directions with CoT and uses solver performance feedback to adapt difficulty, producing complementary problems that yield a 3.4% average improvement across 10 reaso...
- Users as Annotators: LLM Preference Learning from Comparison Mode
  Introduces a latent user quality model and EM algorithm to infer and filter noisy user-provided pairwise preferences for improved LLM alignment.