arxiv: 2410.18451 · v1 · pith:TGZJHWJEnew · submitted 2024-10-24 · 💻 cs.AI · cs.CL

Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, Yahui Zhou

Pith reviewed 2026-05-17 16:13 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords reward modeling · preference datasets · data curation · LLM alignment · RewardBench · data filtering · open-source data

The pith

Strategic data selection and filtering from open-source pairs yields top-ranked reward models with just 80K examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that targeted selection and filtering of open-source preference data can produce a compact 80K-pair training set that supports state-of-the-art LLM reward models. Models trained on this Skywork-Reward collection reach the top of the RewardBench leaderboard. The same curation steps also raise the scores of many other leading reward models when applied to them. A sympathetic reader would conclude that careful data quality work can matter more than raw dataset scale for preference learning.

Core claim

By developing effective data selection and filtering strategies for open-source preference datasets, the authors assemble the Skywork-Reward collection of only 80K pairs. Two models are trained on this data, Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B. Per the abstract, the former held the top position on RewardBench at the time of the report, and the techniques and datasets directly improved the performance of many other top-ranked models.

What carries the argument

Data selection and filtering strategies that curate the Skywork-Reward collection of high-quality preference pairs.

If this is right

Smaller, carefully filtered preference datasets can match or exceed larger unfiltered collections in reward model performance.
The curation techniques transfer directly to raise scores on existing reward models without retraining from scratch.
Focus on data quality reduces the computational cost of preference learning for LLM alignment.
Open-source data, once refined, can support leading results on public leaderboards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same filtering approach might be tested on datasets for other alignment methods such as direct preference optimization to check for similar size reductions.
One could measure whether the selected pairs reduce specific biases common in raw web-scale preference data.
Extending the curation pipeline to new model families or languages would test whether the gains hold beyond the current English-centric RewardBench setup.

Load-bearing premise

The data selection and filtering strategies produce generalizable improvements rather than leaderboard-specific gains tied to the particular open-source sources and evaluation distribution.

What would settle it

Evaluating models trained on the Skywork-Reward dataset on a new preference benchmark built from sources and domains entirely outside the original open-source pool used for curation.
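
A test of this kind usually comes down to pairwise accuracy: the fraction of held-out pairs in which the reward model scores the chosen response above the rejected one. The sketch below is illustrative only. The names `pairwise_accuracy`, `Pair`, and the toy `length_score` scorer are hypothetical and come from neither the paper nor any benchmark's tooling; a real run would substitute a trained reward model for the scorer.

```python
from typing import Callable, Iterable, Tuple

# Hypothetical record layout: (prompt, chosen_response, rejected_response).
Pair = Tuple[str, str, str]


def pairwise_accuracy(score: Callable[[str, str], float],
                      pairs: Iterable[Pair]) -> float:
    """Fraction of pairs where the chosen response outscores the rejected one."""
    total = 0
    correct = 0
    for prompt, chosen, rejected in pairs:
        total += 1
        if score(prompt, chosen) > score(prompt, rejected):
            correct += 1
    if total == 0:
        raise ValueError("no pairs to evaluate")
    return correct / total


def length_score(prompt: str, response: str) -> float:
    """Toy stand-in for a reward model: longer responses score higher."""
    return float(len(response))


# Toy held-out set; a real check would use pairs from unseen sources and domains.
held_out = [
    ("q1", "a detailed answer", "short"),
    ("q2", "ok", "a long but wrong answer"),
]
print(pairwise_accuracy(length_score, held_out))
```

Comparing this number for the same base model trained on curated versus uncurated data, on pairs drawn from outside the curation pool, is one way the question above could be settled.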

Original abstract

In this report, we introduce a collection of methods to enhance reward modeling for LLMs, focusing specifically on data-centric techniques. We propose effective data selection and filtering strategies for curating high-quality open-source preference datasets, culminating in the Skywork-Reward data collection, which contains only 80K preference pairs -- significantly smaller than existing datasets. Using this curated dataset, we developed the Skywork-Reward model series -- Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B -- with the former currently holding the top position on the RewardBench leaderboard. Notably, our techniques and datasets have directly enhanced the performance of many top-ranked models on RewardBench, highlighting the practical impact of our contributions in real-world preference learning applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Data curation tricks produce a compact 80K preference set that tops RewardBench, but the gains need ablations to rule out benchmark-specific fitting.

The main point is that straightforward selection and filtering steps applied to public preference data can shrink the training set to 80K pairs while still training reward models that reach the top of RewardBench. Their Gemma-27B version leads the board, and they report that the same curation steps lifted several other strong models as well. That is the concrete result worth noting first.

The work is mostly an engineering report that spells out the pipeline they used: rules for removing low-quality pairs, balancing sources, and keeping only high-signal examples. They end up with a smaller dataset than the usual hundreds of thousands of pairs, which matters for anyone training reward models on limited compute. Sharing the curated collection is also useful; other groups can test the same data directly.

The practical angle is the strongest part. Reward modeling sits at the center of current alignment pipelines, and any method that reliably improves quality without scaling data volume is worth trying. The paper stays grounded in open benchmarks and existing sources, so there is no hidden circularity in the setup.

The softer area is the strength of the causal claim. The leaderboard numbers show improvement after their filtering, yet the text does not appear to include direct comparisons of identical base models trained on the raw source pools versus the filtered 80K set. Without those controls, or results on a second preference benchmark that differs in construction, it remains possible that the gains partly reflect alignment with RewardBench’s own distribution rather than a broadly better signal. Statistical details on the deltas are also light in the sections I checked.

This paper is aimed at people who build or fine-tune reward models for LLMs. A practitioner who needs to curate preference data quickly will pick up usable steps. A reader focused on theoretical advances in preference learning will find less to engage with.

It is solid enough for peer review: the methods are described at a level that can be reproduced, the leaderboard result is externally checkable, and the data-centric focus is timely even if more controls would make the conclusions tighter. I would send it to referees rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The paper introduces a set of data-centric techniques for reward modeling in LLMs, centered on data selection and filtering strategies applied to open-source preference datasets. These yield the compact Skywork-Reward collection of 80K preference pairs. Models trained on this data, including Skywork-Reward-Gemma-27B (currently top-ranked on RewardBench) and Skywork-Reward-Llama-3.1-8B, are presented, along with the claim that the techniques and dataset have directly improved performance of multiple leading models on the benchmark.

Significance. If the curation methods isolate transferable preference signals rather than benchmark-specific artifacts, the work offers a practical demonstration that substantially smaller, high-quality datasets can drive state-of-the-art reward model performance. The reported adoption by other top models provides concrete evidence of real-world utility and supports the value of data-centric approaches in preference learning.

major comments (2)

[Experiments / Results] Experiments / Results section: The central claim that the selection and filtering strategies produce the observed leaderboard gains rests on post-curation performance numbers, yet no ablation is reported that trains identical base models on the unfiltered source pools or on random subsets of equal size (80K) and measures the performance delta. Without this control, it remains possible that gains arise from distributional alignment between the chosen open-source sources and RewardBench rather than from the proposed tricks.
[Data curation and evaluation] Data curation and evaluation sections: To substantiate generalizability, results on at least one disjoint preference benchmark (distinct from RewardBench in both construction and source distribution) should be included; current evidence is confined to a single leaderboard whose test distribution may correlate with the curation heuristics.
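
The control arms requested in the first major comment could be set up along the following lines. This is a minimal sketch under stated assumptions: `random_control_subset` and `build_ablation_arms` are hypothetical helpers, the pools are plain Python lists, and the modulo "filter" in the toy data is a placeholder rather than the paper's curation rule. Training and scoring are left out because the paper's setup is not described here.

```python
import random
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def random_control_subset(pool: Sequence[T], size: int, seed: int = 0) -> List[T]:
    """Draw `size` items uniformly without replacement, reproducibly."""
    if size > len(pool):
        raise ValueError("control subset cannot be larger than the source pool")
    return random.Random(seed).sample(list(pool), size)


def build_ablation_arms(raw_pool: Sequence[T], curated: Sequence[T],
                        seed: int = 0) -> Dict[str, List[T]]:
    """Return three training sets to compare using an identical base model."""
    return {
        "raw_pool": list(raw_pool),
        "random_equal_size": random_control_subset(raw_pool, len(curated), seed),
        "curated": list(curated),
    }


# Toy data: integers stand in for preference pairs.
raw = list(range(1000))
curated = [i for i in raw if i % 10 == 0]  # placeholder filter, not the paper's
arms = build_ablation_arms(raw, curated, seed=42)
print({name: len(items) for name, items in arms.items()})
```

Fixing the seed keeps the random arm reproducible. Repeating the draw over several seeds would give a spread against which the curated arm's delta can be judged.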

minor comments (2)

[Abstract] Abstract: The phrase 'many top-ranked models' is vague; specifying the models, the exact manner in which the dataset or tricks were applied, and quantitative improvements would improve clarity.
[Throughout] Throughout: Ensure consistent terminology for 'preference pairs' versus 'preference data' and provide explicit definitions or references for any filtering heuristics introduced in the methods.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We provide point-by-point responses to the major comments below.

Referee: [Experiments / Results] Experiments / Results section: The central claim that the selection and filtering strategies produce the observed leaderboard gains rests on post-curation performance numbers, yet no ablation is reported that trains identical base models on the unfiltered source pools or on random subsets of equal size (80K) and measures the performance delta. Without this control, it remains possible that gains arise from distributional alignment between the chosen open-source sources and RewardBench rather than from the proposed tricks.

Authors: We agree that explicit ablations against unfiltered source pools and random 80K subsets would more directly isolate the contribution of our curation strategies. The manuscript currently supports the value of the curated data through the top leaderboard performance of Skywork-Reward models and, importantly, through documented adoption and gains by multiple independent leading entries on RewardBench. This real-world usage by other teams provides evidence of transferable signals. Nevertheless, we will add the requested ablations on random subsets in the revised manuscript to strengthen the experimental section. revision: yes
Referee: [Data curation and evaluation] Data curation and evaluation sections: To substantiate generalizability, results on at least one disjoint preference benchmark (distinct from RewardBench in both construction and source distribution) should be included; current evidence is confined to a single leaderboard whose test distribution may correlate with the curation heuristics.

Authors: We acknowledge that evaluation on a single benchmark leaves open the possibility of distribution-specific effects. Our primary focus was RewardBench as the established standard for reward model assessment. To address generalizability, we will add results on at least one additional, disjoint preference benchmark in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical data curation evaluated on external benchmarks

The paper describes data selection, filtering, and curation of an 80K preference dataset from open-source sources, followed by training reward models and reporting leaderboard results on RewardBench. No derivation chain, equations, or predictions are present that reduce to self-defined inputs or fitted parameters by construction. All performance claims rest on external public benchmarks and open-source data pools rather than internal re-use of fitted quantities as 'predictions.' The approach is self-contained against verifiable external leaderboards and does not invoke self-citations for load-bearing uniqueness theorems or ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are described. Standard preference-learning assumptions (e.g., that human preferences can be modeled as pairwise comparisons) are implicitly used but not stated as novel.
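
For readers unfamiliar with the pairwise-comparison assumption, one common way to make it concrete is the Bradley-Terry formulation, in which the probability that the chosen response beats the rejected one is the sigmoid of their reward difference. The sketch below assumes that formulation for illustration only. The abstract does not state which training objective the paper uses, and `bt_loss` is a hypothetical helper name.

```python
import math


def bt_loss(r_chosen: float, r_rejected: float) -> float:
    """Negative log-likelihood of one pair under a Bradley-Terry model.

    Equals -log(sigmoid(r_chosen - r_rejected)), written in a numerically
    stable form: log(1 + exp(-d)) = max(-d, 0) + log1p(exp(-|d|)).
    """
    d = r_chosen - r_rejected
    return max(-d, 0.0) + math.log1p(math.exp(-abs(d)))


# Equal rewards give log(2); a larger margin for the chosen response lowers the loss.
print(bt_loss(0.0, 0.0))
print(bt_loss(2.0, 0.0))
print(bt_loss(0.0, 2.0))
```

Under this assumed objective, a mislabeled or low-signal pair pulls the rewards in the wrong direction, which is one reason data filtering could plausibly matter more than raw volume.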

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SLAM: Structural Linguistic Activation Marking for Language Models
cs.CL 2026-05 unverdicted novelty 8.0

SLAM achieves 100% detection accuracy on Gemma-2 models with only 1-2 points of quality loss by causally steering SAE-identified structural directions while preserving lexical sampling and semantics.
SLAM: Structural Linguistic Activation Marking for Language Models
cs.CL 2026-05 unverdicted novelty 8.0

SLAM achieves 100% detection on Gemma-2 models with only 1-2 point quality cost by causally steering SAE-identified residual-stream directions for linguistic structure.
Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents
cs.CR 2026-05 unverdicted novelty 7.0

Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipp...
StoryAlign: Evaluating and Training Reward Models for Story Generation
cs.CL 2026-05 unverdicted novelty 7.0

StoryReward, trained on a new 100k story preference dataset, sets state-of-the-art performance on the introduced StoryRMB benchmark for aligning LLM stories with human preferences.
You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass
cs.CV 2026-04 unverdicted novelty 7.0

A multi-response discriminative reward model scores N candidates in one pass via concatenation and cross-entropy, achieving SOTA on multimodal benchmarks and improving RL policies over single-response baselines.
Many Preferences, Few Policies: Towards Scalable Language Model Personalization
cs.CL 2026-04 unverdicted novelty 7.0

PALM produces a small portfolio of LLMs that contains a near-optimal model for any user preference weight vector, with theoretical bounds on portfolio size and approximation quality.
Scalable Token-Level Hallucination Detection in Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

TokenHD uses a scalable data synthesis engine and importance-weighted training to create token-level hallucination detectors that work on free-form text and scale from 0.6B to 8B parameters, outperforming larger reaso...
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
cs.AI 2026-05 unverdicted novelty 6.0

MORA breaks the safety-helpfulness trade-off in LLM alignment by pre-sampling single-reward prompts and rewriting them to expand multi-dimensional reward diversity, yielding 5-12.4% single-preference gains in sequenti...
Explaining and Breaking the Safety-Helpfulness Ceiling via Preference Dimensional Expansion
cs.AI 2026-05 unverdicted novelty 6.0

MORA breaks the safety-helpfulness ceiling in LLMs by pre-sampling single-reward prompts and rewriting them to incorporate multi-dimensional intents, delivering 5-12.4% gains in sequential alignment and 4.6% overall i...
Reasoning Is Not Free: Robust Adaptive Cost-Efficient Routing for LLM-as-a-Judge
cs.AI 2026-05 unverdicted novelty 6.0

RACER routes between reasoning and non-reasoning LLM judges via constrained distributionally robust optimization to achieve better accuracy-cost trade-offs under distribution shift.
Reason Only When Needed: Efficient Generative Reward Modeling via Model-Internal Uncertainty
cs.CL 2026-04 unverdicted novelty 6.0

E-GRM triggers CoT reasoning in generative reward models only when parallel generations show high uncertainty, reducing inference cost and raising accuracy on reasoning benchmarks via a hybrid regression-ranking scorer.
Personalized RewardBench: Evaluating Reward Models with Human Aligned Personalization
cs.CL 2026-04 unverdicted novelty 6.0

Personalized RewardBench reveals that state-of-the-art reward models reach only 75.94% accuracy on personalized preferences and shows stronger correlation with downstream BoN and PPO performance than prior benchmarks.
Mitigating Reward Hacking in RLHF via Advantage Sign Robustness
cs.LG 2026-04 unverdicted novelty 6.0

SignCert-PO mitigates reward hacking in RLHF by down-weighting completions whose advantage signs are not robust to small reward-model perturbations, using a certified preservation radius derived at the policy optimiza...
Unifying Ontology Construction and Semantic Alignment for Deterministic Enterprise Reasoning at Scale
cs.AI 2026-03 unverdicted novelty 6.0

LOM unifies ontology construction, semantic alignment, and deterministic reasoning in one architecture, reporting 88.8% accuracy on ontology completion and 94% on complex graph reasoning tasks.
MoCo: A One-Stop Shop for Model Collaboration Research
cs.CL 2026-01 accept novelty 6.0

MoCo supplies a unified library of 26 collaboration strategies and benchmarks demonstrating average outperformance over single models in 61 percent of (model, data) pairs.
Memory in the Age of AI Agents
cs.CL 2025-12 unverdicted novelty 6.0

The paper maps agent memory research via three forms (token-level, parametric, latent), three functions (factual, experiential, working), and dynamics of formation/evolution/retrieval, plus benchmarks and future directions.
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
cs.CV 2025-04 unverdicted novelty 6.0

VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
Visual-RFT: Visual Reinforcement Fine-Tuning
cs.CV 2025-03 conditional novelty 6.0

Visual-RFT applies reinforcement learning with verifiable perception rewards to improve large vision-language models on fine-grained classification, few-shot detection, and grounding tasks.
Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis
cs.AI 2025-11 unverdicted novelty 5.0

A reasoning-driven problem generator plans synthesis directions with CoT and uses solver performance feedback to adapt difficulty, producing complementary problems that yield a 3.4% average improvement across 10 reaso...
Users as Annotators: LLM Preference Learning from Comparison Mode
cs.CL 2025-10 unverdicted novelty 5.0

Introduces a latent user quality model and EM algorithm to infer and filter noisy user-provided pairwise preferences for improved LLM alignment.
