Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

Chaojie Wang; Chris Yuhao Liu; Jiacai Liu; Jujie He; Liang Zeng; Rui Yan; Shuicheng Yan; Yahui Zhou; Yang Liu

arxiv: 2410.18451 · v1 · pith:TGZJHWJEnew · submitted 2024-10-24 · 💻 cs.AI · cs.CL

Pith reviewed 2026-05-17 16:13 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords: reward modeling · preference datasets · data curation · LLM alignment · RewardBench · data filtering · open-source data

The pith

Strategic data selection and filtering from open-source pairs yields top-ranked reward models with just 80K examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that targeted selection and filtering of open-source preference data can produce a compact 80K-pair training set that supports state-of-the-art LLM reward models. Models trained on this Skywork-Reward collection reach the top of the RewardBench leaderboard. The same curation steps also raise the scores of many other leading reward models when applied to them. A sympathetic reader would conclude that careful data quality work can matter more than raw dataset scale for preference learning.

Core claim

By developing effective data selection and filtering strategies for open-source preference datasets, the authors assemble the Skywork-Reward collection of only 80K pairs. Two models are trained on this data, Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B. The Gemma-27B model held the top position on RewardBench at the time of the report, and the authors state that their techniques and datasets directly improved the performance of many other top-ranked models.

What carries the argument

Data selection and filtering strategies that curate the Skywork-Reward collection of high-quality preference pairs.

If this is right

Smaller, carefully filtered preference datasets can match or exceed larger unfiltered collections in reward model performance.
The curation techniques and dataset transfer to other reward models: the abstract reports that they directly improved many top-ranked RewardBench entries.
Focus on data quality reduces the computational cost of preference learning for LLM alignment.
Open-source data, once refined, can support leading results on public leaderboards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same filtering approach might be tested on datasets for other alignment methods such as direct preference optimization to check for similar size reductions.
One could measure whether the selected pairs reduce specific biases common in raw web-scale preference data.
Extending the curation pipeline to new model families or languages would test whether the gains hold beyond the current English-centric RewardBench setup.

Load-bearing premise

The data selection and filtering strategies produce generalizable improvements rather than leaderboard-specific gains tied to the particular open-source sources and evaluation distribution.

What would settle it

Evaluating models trained on the Skywork-Reward dataset on a new preference benchmark built from sources and domains entirely outside the original open-source pool used for curation.
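
The check described above comes down to a pairwise accuracy on held-out preference pairs. The sketch below is a minimal illustration, assuming a benchmark that counts a pair as correct when the reward model scores the chosen response above the rejected one. The `Pair` layout, `toy_score`, and the two example pairs are hypothetical stand-ins, not taken from the paper or from any benchmark.

```python
from typing import Callable, Iterable, Tuple

# (prompt, chosen_response, rejected_response); this layout is an assumption.
Pair = Tuple[str, str, str]


def pairwise_accuracy(score: Callable[[str, str], float], pairs: Iterable[Pair]) -> float:
    """Fraction of pairs where the chosen response outscores the rejected one."""
    total = 0
    correct = 0
    for prompt, chosen, rejected in pairs:
        total += 1
        if score(prompt, chosen) > score(prompt, rejected):
            correct += 1
    if total == 0:
        raise ValueError("no pairs to evaluate")
    return correct / total


def toy_score(prompt: str, response: str) -> float:
    # Stand-in scorer for illustration only: longer responses score higher.
    return float(len(response))


# Hypothetical held-out pairs from a source outside the curation pool.
held_out = [
    ("q1", "a detailed answer", "short"),
    ("q2", "ok", "a long but wrong answer"),
]
accuracy = pairwise_accuracy(toy_score, held_out)
```

In practice `toy_score` would be replaced by a call into the trained reward model, and the same function would be run on both the original benchmark and the disjoint one so the two accuracies can be compared.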

Original abstract

In this report, we introduce a collection of methods to enhance reward modeling for LLMs, focusing specifically on data-centric techniques. We propose effective data selection and filtering strategies for curating high-quality open-source preference datasets, culminating in the Skywork-Reward data collection, which contains only 80K preference pairs -- significantly smaller than existing datasets. Using this curated dataset, we developed the Skywork-Reward model series -- Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B -- with the former currently holding the top position on the RewardBench leaderboard. Notably, our techniques and datasets have directly enhanced the performance of many top-ranked models on RewardBench, highlighting the practical impact of our contributions in real-world preference learning applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Data curation tricks produce a compact 80K preference set that tops RewardBench, but the gains need ablations to rule out benchmark-specific fitting.

The main point is that straightforward selection and filtering steps applied to public preference data can shrink the training set to 80K pairs while still training reward models that reach the top of RewardBench. Their Gemma-27B version leads the board, and they report that the same curation steps lifted several other strong models as well. That is the concrete result worth noting first.

The work is mostly an engineering report that spells out the pipeline they used: rules for removing low-quality pairs, balancing sources, and keeping only high-signal examples. They end up with a smaller dataset than the usual hundreds of thousands of pairs, which matters for anyone training reward models on limited compute. Sharing the curated collection is also useful; other groups can test the same data directly.

The practical angle is the strongest part. Reward modeling sits at the center of current alignment pipelines, and any method that reliably improves quality without scaling data volume is worth trying. The paper stays grounded in open benchmarks and existing sources, so there is no hidden circularity in the setup.

The softer area is the strength of the causal claim. The leaderboard numbers show improvement after their filtering, yet the text does not appear to include direct comparisons of identical base models trained on the raw source pools versus the filtered 80K set. Without those controls, or results on a second preference benchmark that differs in construction, it remains possible that the gains partly reflect alignment with RewardBench's own distribution rather than a broadly better signal. Statistical details on the deltas are also light in the sections I checked.

This paper is aimed at people who build or fine-tune reward models for LLMs. A practitioner who needs to curate preference data quickly will pick up usable steps. A reader focused on theoretical advances in preference learning will find less to engage with.

It is solid enough for peer review: the methods are described at a level that can be reproduced, the leaderboard result is externally checkable, and the data-centric focus is timely even if more controls would make the conclusions tighter. I would send it to referees rather than desk-reject.
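
To make the shape of such a pipeline concrete, here is a rough sketch of the two steps the letter names: dropping low-quality pairs and balancing sources. It is not the authors' procedure. The `quality` and `source` fields, the threshold, and the per-source cap are all hypothetical, chosen only to show how a filter followed by a cap shrinks a pool.

```python
from collections import defaultdict
from typing import Dict, List


def curate(pairs: List[Dict], min_quality: float, cap_per_source: int) -> List[Dict]:
    """Drop low-quality pairs, then keep the best pairs per source up to a cap."""
    by_source = defaultdict(list)
    for pair in pairs:
        # "quality" is a hypothetical per-pair score, not a field from the paper.
        if pair["quality"] >= min_quality:
            by_source[pair["source"]].append(pair)
    kept = []
    for source in sorted(by_source):
        ranked = sorted(by_source[source], key=lambda p: p["quality"], reverse=True)
        kept.extend(ranked[:cap_per_source])  # balance sources with a simple cap
    return kept


# Toy pool: three pairs from source A, one from source B.
pool = [
    {"source": "A", "quality": 0.9},
    {"source": "A", "quality": 0.8},
    {"source": "A", "quality": 0.2},
    {"source": "B", "quality": 0.7},
]
curated = curate(pool, min_quality=0.5, cap_per_source=1)
```

With these toy values the cap keeps one pair per source, so a dominant source cannot crowd out the others. Whether a hard cap or a softer reweighting is the better balancing choice is not something this page settles.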

Referee Report

2 major / 2 minor

Summary. The paper introduces a set of data-centric techniques for reward modeling in LLMs, centered on data selection and filtering strategies applied to open-source preference datasets. These yield the compact Skywork-Reward collection of 80K preference pairs. Models trained on this data, including Skywork-Reward-Gemma-27B (currently top-ranked on RewardBench) and Skywork-Reward-Llama-3.1-8B, are presented, along with the claim that the techniques and dataset have directly improved performance of multiple leading models on the benchmark.

Significance. If the curation methods isolate transferable preference signals rather than benchmark-specific artifacts, the work offers a practical demonstration that substantially smaller, high-quality datasets can drive state-of-the-art reward model performance. The reported adoption by other top models provides concrete evidence of real-world utility and supports the value of data-centric approaches in preference learning.

major comments (2)

[Experiments / Results] Experiments / Results section: The central claim that the selection and filtering strategies produce the observed leaderboard gains rests on post-curation performance numbers, yet no ablation is reported that trains identical base models on the unfiltered source pools or on random subsets of equal size (80K) and measures the performance delta. Without this control, it remains possible that gains arise from distributional alignment between the chosen open-source sources and RewardBench rather than from the proposed tricks.
[Data curation and evaluation] Data curation and evaluation sections: To substantiate generalizability, results on at least one disjoint preference benchmark (distinct from RewardBench in both construction and source distribution) should be included; current evidence is confined to a single leaderboard whose test distribution may correlate with the curation heuristics.
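
The control requested in the first major comment could be set up roughly as follows. This is a sketch under stated assumptions, not a procedure from the manuscript: `random_subset` draws a size-matched control from the unfiltered pool, and `bootstrap_delta` puts a paired-bootstrap interval on the accuracy difference between the filtered-data model and the control model. Both function names and the per-item 0/1 "hit" encoding are hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def random_subset(pool: Sequence, size: int, seed: int = 0) -> List:
    """Size-matched random control drawn without replacement from the unfiltered pool."""
    return random.Random(seed).sample(list(pool), size)


def bootstrap_delta(
    filtered_hits: Sequence[int],
    control_hits: Sequence[int],
    n_resamples: int = 1000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Paired bootstrap over benchmark items.

    Each input holds 1 (pair ranked correctly) or 0 per benchmark item,
    in the same item order for both models.
    Returns (mean accuracy delta, 2.5th percentile, 97.5th percentile).
    """
    if len(filtered_hits) != len(control_hits) or not filtered_hits:
        raise ValueError("inputs must be non-empty and the same length")
    n = len(filtered_hits)
    rng = random.Random(seed)
    diffs = [f - c for f, c in zip(filtered_hits, control_hits)]
    point = sum(diffs) / n
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[min(int(0.975 * n_resamples), n_resamples - 1)]
    return point, lower, upper


# Toy usage: a 10-item control subset and a delta over four benchmark items.
control_ids = random_subset(range(100), 10)
delta, low, high = bootstrap_delta([1, 1, 1, 0], [1, 0, 0, 0])
```

An interval that excludes zero across several seeds of the random control would support attributing the gain to the filtering itself rather than to the choice of source pools.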

minor comments (2)

[Abstract] Abstract: The phrase 'many top-ranked models' is vague; specifying the models, the exact manner in which the dataset or tricks were applied, and quantitative improvements would improve clarity.
[Throughout] Throughout: Ensure consistent terminology for 'preference pairs' versus 'preference data' and provide explicit definitions or references for any filtering heuristics introduced in the methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We provide point-by-point responses to the major comments below.

Referee: [Experiments / Results] Experiments / Results section: The central claim that the selection and filtering strategies produce the observed leaderboard gains rests on post-curation performance numbers, yet no ablation is reported that trains identical base models on the unfiltered source pools or on random subsets of equal size (80K) and measures the performance delta. Without this control, it remains possible that gains arise from distributional alignment between the chosen open-source sources and RewardBench rather than from the proposed tricks.

Authors: We agree that explicit ablations against unfiltered source pools and random 80K subsets would more directly isolate the contribution of our curation strategies. The manuscript currently supports the value of the curated data through the top leaderboard performance of Skywork-Reward models and, importantly, through documented adoption and gains by multiple independent leading entries on RewardBench. This real-world usage by other teams provides evidence of transferable signals. Nevertheless, we will add the requested ablations on random subsets in the revised manuscript to strengthen the experimental section.
Revision: yes

Referee: [Data curation and evaluation] Data curation and evaluation sections: To substantiate generalizability, results on at least one disjoint preference benchmark (distinct from RewardBench in both construction and source distribution) should be included; current evidence is confined to a single leaderboard whose test distribution may correlate with the curation heuristics.

Authors: We acknowledge that evaluation on a single benchmark leaves open the possibility of distribution-specific effects. Our primary focus was RewardBench as the established standard for reward model assessment. To address generalizability, we will add results on at least one additional, disjoint preference benchmark in the revised manuscript.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical data curation evaluated on external benchmarks

The paper describes data selection, filtering, and curation of an 80K preference dataset from open-source sources, followed by training reward models and reporting leaderboard results on RewardBench. No derivation chain, equations, or predictions are present that reduce to self-defined inputs or fitted parameters by construction. All performance claims rest on external public benchmarks and open-source data pools rather than internal re-use of fitted quantities as 'predictions.' The approach is self-contained against verifiable external leaderboards and does not invoke self-citations for load-bearing uniqueness theorems or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described. Standard preference-learning assumptions (e.g., that human preferences can be modeled as pairwise comparisons) are implicitly used but not stated as novel.
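
The pairwise-comparison assumption mentioned above is commonly written in Bradley-Terry form, where the probability that one response is preferred depends only on the difference of two scalar rewards. The sketch below assumes that standard formulation; this page does not state which loss the authors actually use, and the function names are illustrative.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred."""
    margin = reward_chosen - reward_rejected
    # Numerically stable logistic function of the reward margin.
    if margin >= 0:
        return 1.0 / (1.0 + math.exp(-margin))
    e = math.exp(margin)
    return e / (1.0 + e)


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the observed preference under Bradley-Terry."""
    margin = reward_chosen - reward_rejected
    # Equals log(1 + exp(-margin)), rearranged to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

Equal rewards give a preference probability of 0.5 and a loss of log 2, and the loss shrinks as the chosen response's reward pulls ahead of the rejected one.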

Forward citations

Cited by 43 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SLAM: Structural Linguistic Activation Marking for Language Models
cs.CL 2026-05 unverdicted novelty 8.0

SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.
SLAM: Structural Linguistic Activation Marking for Language Models
cs.CL 2026-05 unverdicted novelty 8.0

SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
Support Vector Rubrics: Closing the Gap Between Self-Generated and Human Rubrics
cs.CL 2026-06 unverdicted novelty 7.0

SVR learns a bank of contrastive rubrics from preference data via max-margin boundaries and prompt-conditioned selection, narrowing the gap to human rubrics on RubricBench from 24.1 to 0.3 points.
Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion
cs.LG 2026-05 unverdicted novelty 7.0

CDM amortizes SMC inference for reward-tilted discrete diffusion by training a parameterized twist function on contrastive samples with closed-form kernels.
Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents
cs.CR 2026-05 unverdicted novelty 7.0

Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipp...
StoryAlign: Evaluating and Training Reward Models for Story Generation
cs.CL 2026-05 unverdicted novelty 7.0

StoryReward, trained on a new 100k story preference dataset, sets state-of-the-art performance on the introduced StoryRMB benchmark for aligning LLM stories with human preferences.
You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass
cs.CV 2026-04 unverdicted novelty 7.0

A multi-response discriminative reward model scores N candidates in one pass via concatenation and cross-entropy, achieving SOTA on multimodal benchmarks and improving RL policies over single-response baselines.
Many Preferences, Few Policies: Towards Scalable Language Model Personalization
cs.CL 2026-04 unverdicted novelty 7.0

PALM produces a small portfolio of LLMs that contains a near-optimal model for any user preference weight vector, with theoretical bounds on portfolio size and approximation quality.
Bayesian Preference Learning for Test-Time Steerable Reward Models
cs.LG 2026-02 unverdicted novelty 7.0

ICRM casts reward modeling as amortized variational inference over a latent preference probability with a Beta prior, enabling test-time adaptation to unseen preferences and improving benchmark performance.
Incentivizing High-Quality Human Annotations with Golden Questions
cs.GT 2025-05 unverdicted novelty 7.0

The paper derives a Θ(1/√(n log n)) hypothesis testing rate under strategic annotator behavior and shows that high-certainty, format-similar golden questions better reveal annotation quality than standard checks.
Test-Time Verification for Text-to-SQL via Outcome Reward Models
cs.CL 2026-06 unverdicted novelty 6.0

ORM-based test-time verification improves Text-to-SQL accuracy over heuristic selection by up to 4.33% on BIRD and 2.10% on Spider using automated labeling.
PEBS: Per-rater Empirical-Bayes Shrinkage for RLHF Reward-Model Calibration
cs.LG 2026-06 unverdicted novelty 6.0

PEBS applies Morris-James-Stein empirical-Bayes shrinkage to per-rater affine calibrators in RLHF, cutting within-user held-out RMSE by 8.58% on PRISM and 9.66% on PluriHarms versus pooled baselines.
ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval
cs.IR 2026-06 unverdicted novelty 6.0

ELVA applies ranking-driven RLVR to multimodal retrieval to reduce grain blindness in contrastive learning, reporting SOTA results and a 13.1% gain on the new MRBench benchmark.
RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains
cs.LG 2026-05 unverdicted novelty 6.0

RUBRIC-ARROW is an alternating rubric generator and judge framework that uses probability-based scoring and pairwise preferences to improve pointwise reward modeling accuracy for LLM post-training in non-verifiable domains.
General Preference Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

GPRL applies a k-dimensional preference model with per-dimension normalized advantages and a drift monitor to LLM post-training, reporting 56.51% length-controlled win rate on AlpacaEval 2.0 and gains on other benchma...
General Preference Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

GPRL carries a k-dimensional skew-symmetric preference structure into policy updates with per-dimension advantages and a drift monitor, yielding 56.51% length-controlled win rate on AlpacaEval 2.0 from Llama-3-8B-Inst...
General Preference Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

GPRL carries k-dimensional skew-symmetric preference structure into policy updates via per-dimension advantages and context-dependent eigenvalues, yielding 56.51% length-controlled win rate on AlpacaEval 2.0 from Llam...
Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment
cs.CL 2026-05 unverdicted novelty 6.0

Introduces HRC model for game-theoretic decomposition of preferences into orthogonal transitive and cyclic components, paired with DSPPO for dynamic Nash-seeking alignment, reporting gains over BT and GPM baselines on...
Scalable Token-Level Hallucination Detection in Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reaso...
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
cs.AI 2026-05 unverdicted novelty 6.0

MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
cs.AI 2026-05 unverdicted novelty 6.0

MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...
Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
cs.AI 2026-05 unverdicted novelty 6.0

RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.
Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty
cs.CL 2026-04 unverdicted novelty 6.0

E-GRM triggers CoT reasoning in generative reward models only when parallel generations show high uncertainty, reducing inference cost and raising accuracy on reasoning benchmarks via a hybrid regression-ranking scorer.
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
cs.CL 2026-04 unverdicted novelty 6.0

Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.
Mitigating Reward Hacking in RLHF via Advantage Sign Robustness
cs.LG 2026-04 unverdicted novelty 6.0

SignCert-PO mitigates reward hacking in RLHF by down-weighting completions whose advantage signs are not robust to small reward-model perturbations, using a certified preservation radius derived at the policy optimiza...
Unifying Ontology Construction and Semantic Alignment for Deterministic Enterprise Reasoning at Scale
cs.AI 2026-03 unverdicted novelty 6.0

LOM unifies ontology construction, semantic alignment, and deterministic reasoning in one architecture, reporting 88.8% accuracy on ontology completion and 94% on complex graph reasoning tasks.
MoCo: A One-Stop Shop for Model Collaboration Research
cs.CL 2026-01 accept novelty 6.0

MoCo supplies a unified library of 26 collaboration strategies and benchmarks demonstrating average outperformance over single models in 61 percent of (model, data) pairs.
Memory in the Age of AI Agents
cs.CL 2025-12 unverdicted novelty 6.0

The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
Learning to Refine: Self-Refinement of Parallel Reasoning in LLMs
cs.LG 2025-08 conditional novelty 6.0

GSR jointly trains LLMs to generate candidate solutions and refine a superior final answer from them, achieving state-of-the-art performance on five mathematical benchmarks while transferring across model scales.
RewardBench 2: Advancing Reward Model Evaluation
cs.CL 2025-06 unverdicted novelty 6.0

RewardBench 2 is a new benchmark that supplies challenging fresh human prompts for reward model evaluation, yielding lower average scores but higher correlation with downstream best-of-N sampling and RLHF training per...
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
cs.CV 2025-04 unverdicted novelty 6.0

VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
Visual-RFT: Visual Reinforcement Fine-Tuning
cs.CV 2025-03 conditional novelty 6.0

Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.
How Humans Help LLMs: Assessing and Incentivizing Human Preference Annotators
cs.LG 2025-02 unverdicted novelty 6.0

Develops self-consistency monitoring for preference annotators and derives sample-complexity bounds showing linear contracts achieve near-ideal performance faster than binary ones under continuous actions.
REAR: Test-time Preference Realignment through Reward Decomposition
cs.CL 2026-06 unverdicted novelty 5.0

REAR decomposes the reward into question and preference components, rescales their balance, and expresses the result as a linear combination of token log-probabilities for efficient integration with best-of-N and tree search.
Reward Modeling for Multi-Agent Orchestration
cs.AI 2026-06 unverdicted novelty 5.0

OrchRM uses intermediate artifacts from multi-agent runs to create training pairs for a reward model that guides orchestrator training and test-time scaling, reporting up to 10x token efficiency and 8% accuracy gains ...
DynaCF: Mitigating Shortcut Learning in Reward Models via Dynamic Counterfactual Sensitivity
cs.LG 2026-06 unverdicted novelty 5.0

DynaCF dynamically downweights shortcut-sensitive samples in reward model training by tracking margin shifts under online counterfactual perturbations within the Bradley-Terry loss.
Predicting Inference-Time Scaling Gains from Labeled Validation-Set Output Statistics
cs.CL 2026-06 unverdicted novelty 5.0

A ridge predictor using prompt-level agreement spread, label-assisted first-correct position, completion-length variance, and entropy reaches Spearman ρ=0.90 with observed best-of-N gains across three model families a...
CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations
cs.CL 2026-05 unverdicted novelty 5.0

CroCo applies English-reward-ranked self-generations for contrastive preference tuning that improves two LLMs on structured and open-ended tasks across 14 languages without language-specific annotations.
What Is Preference Optimization Doing, and Why?
cs.LG 2025-11 unverdicted novelty 5.0

Gradient analysis and ablations show DPO and PPO have different target directions and component roles in preference optimization for LLMs.
Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis
cs.AI 2025-11 unverdicted novelty 5.0

A reasoning-driven problem generator plans synthesis directions with CoT and uses solver performance feedback to adapt difficulty, producing complementary problems that yield a 3.4% average improvement across 10 reaso...
Users as Annotators: LLM Preference Learning from Comparison Mode
cs.CL 2025-10 unverdicted novelty 5.0

Introduces a latent user quality model and EM algorithm to infer and filter noisy user-provided pairwise preferences for improved LLM alignment.
Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap
cs.CL 2025-08 unverdicted novelty 5.0

Selecting preference pairs whose DPO implicit reward gap is small yields better LLM alignment than random or baseline selection while using only 10% of the data.
Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization
cs.AI 2026-06 unverdicted novelty 4.0

Proxy RL produces a staged proxy-internalization capability that emerges before and predicts reward hacking in coding environments.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · cited by 39 Pith papers · 13 internal anchors

[1]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

H., Bhattacharya, P., Brundyn, A., Casper, J., Catanzaro, B., Clay, S., Cohen, J., et al

B. Adler, N. Agarwal, A. Aithal, D. H. Anh, P . Bhattacharya, A. Brundyn, J. Casper, B. Catanzaro, S. Clay, J. Cohen, et al. Nemotron-4 340b technical report. arXiv preprint arXiv:2406.11704,

work page arXiv
[3]

Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

arXiv preprint arXiv:2402.17834 (2024) 34

M. Bellagente, J. Tow, D. Mahan, D. Phung, M. Zhuravinskyi, R. Adithyan, J. Baicoianu, B. Brooks, N. Cooper, A. Datta, et al. Stable lm 2 1.6 b technical report. arXiv preprint arXiv:2402.17834,

work page arXiv
[5]

Z. Cai, M. Cao, H. Chen, K. Chen, K. Chen, X. Chen, X. Chen, Z. Chen, Z. Chen, P . Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lind- ner, P . Freire, et al. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

G. Cui, L. Yuan, N. Ding, G. Yao, W. Zhu, Y. Ni, G. Xie, Z. Liu, and M. Sun. Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377,

work page internal anchor Pith review arXiv
[8]

URL https://huggingface.co/datasets/LDJnr/Capybara. H. Dong, W. Xiong, B. Pang, H. Wang, H. Zhao, Y. Zhou, N. Jiang, D. Sahoo, C. Xiong, and T. Zhang. Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863,

work page internal anchor Pith review arXiv
[9]

The Llama 3 Herd of Models

A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

15 S. Han, K. Rao, A. Ettinger, L. Jiang, B. Y. Lin, N. Lambert, Y. Choi, and N. Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.arXiv preprint arXiv:2406.18495,

work page internal anchor Pith review arXiv
[11]

arXiv preprint arXiv:2306.02561 (2023)

D. Jiang, X. Ren, and B. Y. Lin. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561,

work page arXiv
[12]

arXiv preprint arXiv:2406.18510 (2024)

L. Jiang, K. Rao, S. Han, A. Ettinger, F. Brahman, S. Kumar, N. Mireshghallah, X. Lu, M. Sap, Y. Choi, et al. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. arXiv preprint arXiv:2406.18510,

work page arXiv
[13]

arXiv preprint arXiv:2403.13787 , year=

N. Lambert, V . Pyatkin, J. Morrison, L. Miranda, B. Y. Lin, K. Chandu, N. Dziri, S. Kumar, T. Zick, Y. Choi, et al. Rewardbench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787,

work page arXiv
[14]

T. Lin. Focal loss for dense object detection. arXiv preprint arXiv:1708.02002,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

X. Lou, D. Yan, W. Shen, Y. Yan, J. Xie, and J. Zhang. Uncertainty-aware reward model: Teaching reward models to know what is unknown. arXiv preprint arXiv:2410.00847,

work page arXiv
[16]

J. Park, S. Jwa, M. Ren, D. Kim, and S. Choi. Offsetbias: Leveraging debiased data for tuning evaluators. arXiv preprint arXiv:2407.06551,

work page arXiv
[17]

Fromrtoq∗: Your language model is secretly a q-function.arXiv preprint arXiv:2404.12358,

R. Rafailov, J. Hejna, R. Park, and C. Finn. From 𝑟 to 𝑞∗: Your language model is secretly a q-function. arXiv preprint arXiv:2404.12358, 2024a. R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference opti- mization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, ...

work page arXiv
[18]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P . Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Gemini: A Family of Highly Capable Multimodal Models

doi: 10.34740/KAGGLE/M/3301. URL https://www.kaggle.com /m/3301. G. Team, R. Anil, S. Borgeaud, Y. Wu, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.34740/kaggle/m/3301
[20]

G. Team, M. Reid, N. Savinov, D. Teplyashin, L. Dmitry, T. Lillicrap, J. Alayrac, R. Soricut, A. Lazaridou, O. Firat, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. in arxiv [cs. cl]. arxiv, 2024a. G. Team, M. Riviere, S. Pathak, P . G. Sessa, C. Hardin, S. Bhupatiraju, L. Hussenot, T. Mesnard, B. Shahriari, A....

work page internal anchor Pith review Pith/arXiv arXiv
[21]

H. Wang, Y. Lin, W. Xiong, R. Yang, S. Diao, S. Qiu, H. Zhao, and T. Zhang. Arithmetic control of llms for diverse user preferences: Directional preference alignment with multi-objective rewards. In ACL, 2024a. H. Wang, W. Xiong, T. Xie, H. Zhao, and T. Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts. In EMNLP, ...

[22]

Z. Wang, Y. Dong, O. Delalleau, J. Zeng, G. Shen, D. Egert, J. J. Zhang, M. N. Sreedhar, and O. Kuchaiev. Helpsteer2: Open-source dataset for training top-performing reward models. arXiv preprint arXiv:2406.08673, 2024e. G. I. Winata, D. Anugraha, L. Susanto, G. Kuwanto, and D. T. Wijaya. Metametrics: Calibrating metrics for generation tasks using human p...

[23]

Z. Xu, F. Jiang, L. Niu, Y. Deng, R. Poovendran, Y. Choi, and B. Y. Lin. Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing. arXiv preprint arXiv:2406.08464,

[24]

R. Yang, R. Ding, Y. Lin, H. Zhang, and T. Zhang. Regularizing hidden states enables learning generalizable reward model for llms. arXiv preprint arXiv:2406.10216,

[25]

L. Yuan, G. Cui, H. Wang, N. Ding, X. Wang, J. Deng, B. Shan, H. Chen, R. Xie, Y. Lin, et al. Advancing llm reasoning generalists with preference trees. arXiv preprint arXiv:2404.02078,

[26]

Z. Zeng, J. Yu, T. Gao, Y. Meng, T. Goyal, and D. Chen. Evaluating large language models at evaluating instruction following. arXiv preprint arXiv:2310.07641,

[27]

Y. Zhang, G. Zhang, Y. Wu, K. Xu, and Q. Gu. General preference modeling with preference representations for aligning language models. arXiv preprint arXiv:2410.02197,

[1]

GPT-4 Technical Report

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

[2]

B. Adler, N. Agarwal, A. Aithal, D. H. Anh, P. Bhattacharya, A. Brundyn, J. Casper, B. Catanzaro, S. Clay, J. Cohen, et al. Nemotron-4 340b technical report. arXiv preprint arXiv:2406.11704,

[3]

Y. Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862,

[4]

M. Bellagente, J. Tow, D. Mahan, D. Phung, M. Zhuravinskyi, R. Adithyan, J. Baicoianu, B. Brooks, N. Cooper, A. Datta, et al. Stable lm 2 1.6b technical report. arXiv preprint arXiv:2402.17834,

[5]

Z. Cai, M. Cao, H. Chen, K. Chen, K. Chen, X. Chen, X. Chen, Z. Chen, Z. Chen, P. Chu, et al. Internlm2 technical report. arXiv preprint arXiv:2403.17297,

[6]

Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

S. Casper, X. Davies, C. Shi, T. K. Gilbert, J. Scheurer, J. Rando, R. Freedman, T. Korbak, D. Lindner, P. Freire, et al. Open problems and fundamental limitations of reinforcement learning from human feedback. arXiv preprint arXiv:2307.15217,

[7]

G. Cui, L. Yuan, N. Ding, G. Yao, W. Zhu, Y. Ni, G. Xie, Z. Liu, and M. Sun. Ultrafeedback: Boosting language models with high-quality feedback. arXiv preprint arXiv:2310.01377,

[8]

URL https://huggingface.co/datasets/LDJnr/Capybara. H. Dong, W. Xiong, B. Pang, H. Wang, H. Zhao, Y. Zhou, N. Jiang, D. Sahoo, C. Xiong, and T. Zhang. Rlhf workflow: From reward modeling to online rlhf. arXiv preprint arXiv:2405.07863,

[9]

The Llama 3 Herd of Models

A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

[10]

S. Han, K. Rao, A. Ettinger, L. Jiang, B. Y. Lin, N. Lambert, Y. Choi, and N. Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms. arXiv preprint arXiv:2406.18495,

[11]

D. Jiang, X. Ren, and B. Y. Lin. Llm-blender: Ensembling large language models with pairwise ranking and generative fusion. arXiv preprint arXiv:2306.02561,

[12]

L. Jiang, K. Rao, S. Han, A. Ettinger, F. Brahman, S. Kumar, N. Mireshghallah, X. Lu, M. Sap, Y. Choi, et al. Wildteaming at scale: From in-the-wild jailbreaks to (adversarially) safer language models. arXiv preprint arXiv:2406.18510,

[13]

N. Lambert, V. Pyatkin, J. Morrison, L. Miranda, B. Y. Lin, K. Chandu, N. Dziri, S. Kumar, T. Zick, Y. Choi, et al. Rewardbench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787,
