arxiv: 2402.13228 · v2 · pith:NUOAWJTFnew · submitted 2024-02-20 · 💻 cs.CL · cs.AI · cs.LG

Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, Colin White

Pith reviewed 2026-05-17 22:59 UTC · model grok-4.3

keywords: Direct Preference Optimization, DPO, preference fine-tuning, LLM alignment, DPOP, Smaug, language model training

The pith

Standard DPO can lower the absolute likelihood of preferred responses while still raising their relative odds; a modified loss called DPOP prevents the drop and improves results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that the usual DPO objective can decrease the model's probability of generating its preferred completions, provided the ratio to dispreferred ones grows. This effect appears on typical training sets, especially when the two responses differ by only a few tokens. The authors introduce DPO-Positive, which adds a direct term that raises the likelihood of the preferred side. Models trained this way outperform standard DPO on reasoning, summarization, and alignment benchmarks, including those with larger differences between completions, and they also score higher on unrelated tests such as MT-Bench. The same procedure produces Smaug-72B, the first open-source model to clear 80 percent average accuracy on the Hugging Face Open LLM Leaderboard.

Core claim

The standard DPO loss can reduce the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. This occurs in practice on common fine-tuning datasets. DPO-Positive avoids the reduction by including an extra positive term that explicitly encourages higher likelihood for the preferred completions. The resulting models show stronger performance across a range of datasets and tasks, including independent benchmarks.
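A small numeric check makes this claim concrete. The sketch below evaluates the standard DPO loss for one preference pair, -log σ(β[(log π(y_w) - log π_ref(y_w)) - (log π(y_l) - log π_ref(y_l))]), before and after a hypothetical update. The log-probabilities and β = 0.1 are made-up illustrative values, not numbers from the paper.

```python
import math


def neg_log_sigmoid(z: float) -> float:
    """Numerically stable -log(sigmoid(z)), i.e. softplus(-z)."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return neg_log_sigmoid(beta * margin)


# Illustrative (made-up) sequence log-probabilities under the reference model.
ref_w, ref_l = -10.0, -10.0

# Before training the policy equals the reference.
loss_before = dpo_loss(-10.0, -10.0, ref_w, ref_l)

# After a hypothetical update both completions became LESS likely,
# but the dispreferred one fell further, so the margin grew.
loss_after = dpo_loss(-12.0, -20.0, ref_w, ref_l)

print(f"loss before: {loss_before:.4f}, loss after: {loss_after:.4f}")
```

The loss goes down even though log π(y_w) fell from -10 to -12, which is the behavior the core claim describes: the objective only sees the margin, not the absolute likelihood of the preferred response.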

What carries the argument

DPO-Positive (DPOP), a loss function that augments standard DPO with a term ensuring the preferred response receives higher absolute likelihood rather than only higher relative likelihood.
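As a rough illustration of what such an augmented loss could look like, the sketch below adds a penalty inside the DPO sigmoid that is active only when the policy gives the preferred completion less log-probability than the reference does. The exact placement of the penalty, the name `lam`, and its default value are assumptions made for this example; the summary above does not specify the paper's formula.

```python
import math


def neg_log_sigmoid(z: float) -> float:
    """Numerically stable -log(sigmoid(z)), i.e. softplus(-z)."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpop_style_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1, lam=50.0):
    """DPO margin minus a hinge penalty on the preferred log-probability.

    `lam` is a hypothetical penalty weight chosen for illustration.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    # Assumed penalty: zero unless the policy assigns y_w a lower
    # log-probability than the reference model does.
    penalty = max(0.0, ref_w - pol_w)
    return neg_log_sigmoid(beta * (margin - lam * penalty))


# Made-up numbers: both completions lose likelihood, y_l loses more.
at_reference = dpop_style_loss(-10.0, -10.0, -10.0, -10.0)
after_drift = dpop_style_loss(-12.0, -20.0, -10.0, -10.0)

print(f"at reference: {at_reference:.4f}, after drift: {after_drift:.4f}")
```

With the penalty in place, the drifted state is scored as worse than the starting point, so an optimizer is no longer rewarded for widening the margin by pushing both likelihoods down. When log π(y_w) stays at or above its reference value, the penalty vanishes and the expression reduces to plain DPO.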

If this is right

DPOP produces higher scores on reasoning, summarization, and alignment tasks than standard DPO.
The gains hold for both low- and high-edit-distance preference pairs.
DPOP-tuned models outperform DPO-tuned models on benchmarks that do not share data with the fine-tuning set, such as MT-Bench.
The method yields Smaug-72B, the first open-source LLM above 80 percent average on the Hugging Face Open LLM Leaderboard.
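The low versus high edit-distance distinction in the list above can be made concrete with a token-level Levenshtein distance between the two completions of a pair. The sketch below is one way to compute it; whitespace tokenization and normalization by the longer sequence are assumptions for illustration, and the paper may measure edit distance differently.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (tokens or characters)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(chosen: str, rejected: str) -> float:
    """Edit distance over whitespace tokens, scaled to [0, 1]."""
    a, b = chosen.split(), rejected.split()
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# Hypothetical pair that differs in a single token.
chosen = "so the answer is 42"
rejected = "so the answer is 41"
print(normalized_edit_distance(chosen, rejected))
```

A pair like this one, where only the final token differs, sits at the low edit-distance end, which is where the summary says the likelihood drop is most pronounced.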

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Many existing models trained with standard DPO could be improved by a second pass using the modified loss on the same data.
The absolute-likelihood behavior may matter more than relative odds alone in other alignment methods that use preference pairs.
The same failure mode could appear in any loss that optimizes only a ratio of probabilities without an anchoring term for the numerator.

Load-bearing premise

That the observed drop in preferred-example likelihood is the main performance bottleneck and that the added term fixes it without creating new side effects on model behavior.

What would settle it

Train the same base model on the same preference pairs with both losses and measure whether the DPOP version assigns higher likelihood to the preferred completions and whether downstream accuracy gains disappear when the likelihood term is removed.
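The likelihood half of this test reduces to scoring the same preferred completions under each trained model and comparing the averages. The sketch below shows that bookkeeping, assuming per-token log-probabilities and a prompt/completion mask are already available from some scoring step. The function names and the placeholder numbers are hypothetical and do not represent results from the paper.

```python
def completion_logprob(token_logps, completion_mask):
    """Sum per-token log-probabilities over completion positions only."""
    if len(token_logps) != len(completion_mask):
        raise ValueError("token_logps and completion_mask must align")
    return sum(lp for lp, keep in zip(token_logps, completion_mask) if keep)


def mean_preferred_logprob(examples):
    """Average completion log-likelihood over (token_logps, mask) pairs."""
    totals = [completion_logprob(lps, mask) for lps, mask in examples]
    return sum(totals) / len(totals)


# Placeholder inputs: two prompt tokens (mask 0), then three completion tokens.
# In a real comparison these would come from scoring the same held-out
# preferred completions under each trained model.
mask = [0, 0, 1, 1, 1]
run_a = [([-0.5, -0.7, -1.0, -2.0, -1.5], mask)]
run_b = [([-0.5, -0.7, -0.8, -1.2, -1.0], mask)]

gap = mean_preferred_logprob(run_b) - mean_preferred_logprob(run_a)
print(f"mean log-likelihood gap (run_b - run_a): {gap:.2f}")
```

Masking out the prompt matters here: the quantity of interest is log π(y_w | x), so prompt tokens should not contribute to the sum.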

Original abstract

Direct Preference Optimisation (DPO) is effective at significantly improving the performance of large language models (LLMs) on downstream tasks such as reasoning, summarisation, and alignment. Using pairs of preferred and dispreferred data, DPO models the relative probability of picking one response over another. In this work, first we show theoretically that the standard DPO loss can lead to a reduction of the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. We then show empirically that this phenomenon occurs when fine-tuning LLMs on common datasets, especially datasets in which the edit distance between pairs of completions is low. Using these insights, we design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode. Surprisingly, we find that DPOP outperforms DPO and other fine-tuning procedures across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions. Furthermore, we find that the DPOP-tuned model outperforms the DPO-tuned model (all else equal) on benchmarks independent of the fine-tuning data, such as MT-Bench. Finally, using DPOP, we create and open-source Smaug-34B and Smaug-72B, with the latter becoming the first open-source LLM to surpass an average accuracy of 80% on the HuggingFace Open LLM Leaderboard.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DPOP is a straightforward tweak on DPO that delivers measurable gains and strong open models, but the paper leaves the causal link to the identified failure mode untested.

The main thing to know is that this paper flags a real issue with standard DPO: it can lower the absolute log probability of the preferred response as long as the ratio to the dispreferred one improves. The authors add an explicit positive term on the preferred completions to create DPOP and report that this version beats DPO and other baselines on a range of datasets and tasks, including MT-Bench, which is held out from the fine-tuning data. They also release Smaug-34B and Smaug-72B, with the larger model crossing 80% on the HuggingFace Open LLM Leaderboard for the first time among open models. That last part is concrete and useful on its own.

The theoretical section walks through the math showing how the DPO loss permits the preferred likelihood to drop, and the authors note that this shows up more on low edit-distance pairs. The empirical comparisons look broad enough to be worth attention.

The soft spot is exactly the one in the stress-test note. The authors show end-to-end wins for DPOP but do not run an ablation that isolates whether the gains come from restoring preferred likelihood or from incidental changes in gradient behavior or effective regularization strength. Without that, it is hard to know whether the identified failure mode is the main limiter or whether almost any modest boost to preferred probability would produce similar downstream effects. The paper does not appear to introduce circular fitting or invented entities to force the result.

This work is aimed at people doing preference optimization and LLM fine-tuning. Readers who want a new loss to try or who follow open model releases will get direct value from the experiments and the released checkpoints. It has enough practical substance and a clear, testable idea to deserve a full referee report rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper claims that the standard DPO loss can reduce the log-likelihood of preferred completions (y_w) while still increasing the log-ratio between preferred and dispreferred pairs, as derived in the theoretical section (Eq. 3–5). It reports empirical evidence of this phenomenon on common preference datasets, especially those with low edit distances between completions. To address it, the authors introduce DPO-Positive (DPOP), which augments the loss with an explicit positive term on preferred examples. End-to-end experiments show DPOP outperforming DPO and other fine-tuning baselines across datasets and tasks (including high edit-distance cases), as well as on independent benchmarks such as MT-Bench. The work also releases Smaug-34B and Smaug-72B, with the latter claimed as the first open-source model exceeding 80% average accuracy on the HuggingFace Open LLM Leaderboard.

Significance. If the central claims hold, the work identifies a concrete and previously under-discussed limitation of DPO and supplies a lightweight, interpretable fix that yields measurable gains on both in-distribution and out-of-distribution evaluations. The open release of Smaug-72B, which reaches a new milestone on a widely used public leaderboard, would constitute a tangible community contribution beyond the algorithmic insight.

major comments (2)

[Experiments] Experiments section: end-to-end comparisons establish that DPOP outperforms DPO, yet no ablation isolates whether the gains arise specifically from restoring/increasing log π(y_w) versus from incidental changes in gradient magnitude, effective β, or optimization trajectory. Without such a controlled comparison (e.g., a regularizer that boosts preferred likelihood by a different mechanism), the causal link between the identified failure mode and the observed improvements remains unestablished.
[Theoretical analysis] Theoretical section, Eq. (3–5): the derivation correctly shows that the DPO objective can decrease log π(y_w) while increasing the log-ratio, but the manuscript does not quantify how frequently or severely this occurs under the exact training regimes, learning rates, and β values used in the empirical sections. A short analysis or plot of log π(y_w) trajectories on the actual training runs would strengthen the claim that this is the primary limiter rather than a theoretical edge case.
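The control requested in the first major comment could take the form of plain DPO plus a separate negative log-likelihood term on the preferred completion, which raises log π(y_w) by a different mechanism than a penalty inside the sigmoid. The sketch below is a hypothetical control of that kind; the function name and the weight `alpha` are illustrative assumptions, not something the paper reports running.

```python
import math


def neg_log_sigmoid(z: float) -> float:
    """Numerically stable -log(sigmoid(z)), i.e. softplus(-z)."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_plus_nll_loss(pol_w, pol_l, ref_w, ref_l, beta=0.1, alpha=0.01):
    """DPO loss plus alpha times the NLL of the preferred completion.

    `alpha` is a hypothetical weight; the NLL here is the unnormalized
    sequence-level value (-log pi(y_w)), with no length normalization.
    """
    margin = (pol_w - ref_w) - (pol_l - ref_l)
    return neg_log_sigmoid(beta * margin) + alpha * (-pol_w)


# Two made-up states with the SAME margin (8 nats) but different log pi(y_w).
same_margin_high_w = dpo_plus_nll_loss(-10.0, -18.0, -10.0, -10.0)
same_margin_low_w = dpo_plus_nll_loss(-12.0, -20.0, -10.0, -10.0)

print(f"{same_margin_high_w:.4f} vs {same_margin_low_w:.4f}")
```

Plain DPO cannot tell these two states apart because their margins are equal; the added term prefers the one with higher preferred likelihood. If such a control matched DPOP on downstream tasks, that would suggest the benefit comes from preferred likelihood in general rather than from the specific form of the DPOP term.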

minor comments (2)

[Abstract] Abstract and §5: the claim that Smaug-72B is 'the first open-source LLM to surpass an average accuracy of 80%' should include the precise leaderboard snapshot date and version number to allow independent verification.
[Figures] Figure captions and axis labels in the empirical results could be expanded to explicitly state whether the plotted metrics are averaged over multiple seeds and what error bars represent.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight opportunities to strengthen the causal evidence and empirical quantification in the manuscript. We address each major comment below and commit to revisions that directly incorporate the suggested analyses.

Referee: [Experiments] Experiments section: end-to-end comparisons establish that DPOP outperforms DPO, yet no ablation isolates whether the gains arise specifically from restoring/increasing log π(y_w) versus from incidental changes in gradient magnitude, effective β, or optimization trajectory. Without such a controlled comparison (e.g., a regularizer that boosts preferred likelihood by a different mechanism), the causal link between the identified failure mode and the observed improvements remains unestablished.

Authors: We agree that the current end-to-end results do not fully isolate the contribution of restoring log π(y_w) from other optimization effects. In the revised manuscript we will add a controlled ablation that introduces an auxiliary positive regularizer on preferred responses (independent of the preference ratio term) and compare its performance and likelihood trajectories directly against DPOP. This will clarify whether the observed gains are attributable to addressing the specific failure mode identified in the theoretical analysis. revision: yes
Referee: [Theoretical analysis] Theoretical section, Eq. (3–5): the derivation correctly shows that the DPO objective can decrease log π(y_w) while increasing the log-ratio, but the manuscript does not quantify how frequently or severely this occurs under the exact training regimes, learning rates, and β values used in the empirical sections. A short analysis or plot of log π(y_w) trajectories on the actual training runs would strengthen the claim that this is the primary limiter rather than a theoretical edge case.

Authors: We acknowledge that while the manuscript demonstrates the phenomenon on common preference datasets, a more precise quantification under the exact training settings would be valuable. We will add plots of log π(y_w) trajectories for both DPO and DPOP runs using the precise learning rates, β values, and datasets from the empirical sections. These plots will be included in the theoretical analysis section to show the frequency and magnitude of likelihood reduction during actual training. revision: yes
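Alongside plots, the frequency and magnitude of likelihood reduction could be reported as a few summary numbers per run. The sketch below assumes the mean log π(y_w) has been logged at a series of training steps, in increasing step order, and compares it to the reference model's value. The function name, the returned keys, and the example trajectory are hypothetical placeholders, not measurements.

```python
def summarize_trajectory(ref_logp, logp_by_step):
    """Summarize how far and how often log pi(y_w) falls below the reference.

    logp_by_step: list of (step, mean_logp) pairs, sorted by step.
    """
    below = [(step, lp) for step, lp in logp_by_step if lp < ref_logp]
    return {
        "first_step_below_ref": below[0][0] if below else None,
        "max_drop": max((ref_logp - lp for _, lp in below), default=0.0),
        "fraction_below_ref": len(below) / len(logp_by_step),
    }


# Placeholder trajectory (made-up values for illustration only).
trajectory = [(0, -10.0), (100, -10.5), (200, -11.0), (300, -9.5)]
summary = summarize_trajectory(ref_logp=-10.0, logp_by_step=trajectory)
print(summary)
```

Reporting these figures for DPO and DPOP runs under identical learning rates and β would address the referee's question of whether the reduction is routine or an edge case.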

Circularity Check

0 steps flagged

No significant circularity; theoretical derivation of DPO failure mode is independent of paper inputs

The paper's core chain begins with the standard DPO loss (Eq. 1-2), derives the possibility of decreasing log π(y_w) while increasing the log-ratio (Eqs. 3-5), observes this empirically on common datasets, and introduces DPOP as an explicit additive term on preferred completions. This derivation uses the externally defined DPO objective without redefining terms in terms of the new loss or fitting parameters to force the outcome. Performance gains are reported via end-to-end benchmarks including MT-Bench and the Open LLM Leaderboard, with no load-bearing self-citations, uniqueness theorems, or ansatzes from prior author work. The central claims remain self-contained against external DPO formulations and do not reduce to the paper's own fitted values or definitions by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on the theoretical analysis of DPO failure modes and empirical observations on common datasets. The main addition is the DPOP method introduced to address the failure mode.

axioms (1)

Domain assumption: Standard DPO loss can reduce the model's likelihood of preferred examples as long as the relative probability between preferred and dispreferred increases.
This is the core theoretical result used to motivate the new method.

invented entities (1)

DPO-Positive (DPOP) loss function (no independent evidence)
purpose: Avoids the identified failure mode in standard DPO by ensuring increases in likelihood of preferred examples.
New method introduced in the paper without prior independent validation outside the reported experiments.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.LawOfExistence defect_zero_iff_one (echoes)

Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

we design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits
cs.LG 2026-05 unverdicted novelty 7.0

The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...
IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning
cs.LG 2026-04 unverdicted novelty 7.0

IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of...
Visual Preference Optimization with Rubric Rewards
cs.CV 2026-04 unverdicted novelty 7.0

rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.
DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs
cs.CL 2026-01 unverdicted novelty 7.0

DiffCoT applies diffusion-style iterative denoising to chain-of-thought steps with a causal noise schedule, outperforming standard CoT optimization methods on multi-step reasoning benchmarks.
Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination
cs.MM 2026-05 unverdicted novelty 6.0

LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.
Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs
cs.CL 2026-05 unverdicted novelty 6.0

TPAW uses teams of current and historical model checkpoints that collaborate and compete, plus adaptive weightings for responses and players, to improve self-supervised LLM alignment and outperform baselines.
Segment-Aligned Policy Optimization for Multi-Modal Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.
Towards Disentangled Preference Optimization Dynamics: Suppress the Loser, Preserve the Winner
cs.LG 2026-04 unverdicted novelty 6.0

A unified incentive-score decomposition of preference optimization reveals the disentanglement band condition and reward calibration method that enables suppressing losers while preserving winners in LLM training.
GroupDPO: Memory efficient Group-wise Direct Preference Optimization
cs.CL 2026-04 unverdicted novelty 6.0

GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.
Future Policy Approximation for Offline Reinforcement Learning Improves Mathematical Reasoning
cs.CL 2025-09 unverdicted novelty 6.0

Future Policy Approximation (FPA) improves offline RL for LLM mathematical reasoning by extrapolating future policies in logit space to proactively reweight gradients, yielding consistent gains over DPO, RPO, KTO and ...
Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization
cs.CL 2024-11 conditional novelty 6.0

Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
cs.CL 2024-06 conditional novelty 6.0

MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.
Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance
cs.LG 2026-05 unverdicted novelty 5.0

FEST improves RLVR sample efficiency on math and coding benchmarks by combining supervised signals, on-policy signals, and decaying weights on just 128 randomly chosen demonstrations, matching full-dataset baselines.
Generating Place-Based Compromises Between Two Points of View
cs.CL 2026-04 unverdicted novelty 5.0

Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.
From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents
cs.AI 2026-04 unverdicted novelty 5.0

AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.
InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output
cs.CV 2024-07 conditional novelty 5.0

InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.
Reinforcement Learning for Scalable and Trustworthy Intelligent Systems
cs.LG 2026-05 unverdicted novelty 3.0

Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.

Reference graph

Works this paper leans on

288 extracted references · 288 canonical work pages · cited by 17 Pith papers · 33 internal anchors

[1]

Yi-34b-200k, 2024

01.AI. Yi-34b-200k, 2024. URL https://huggingface.co/01-ai/Yi-34B-200K

work page 2024
[2]

Ultrafeedback binarized clean, 2024

AllenAI. Ultrafeedback binarized clean, 2024. URL https://huggingface.co/datasets/allenai/ultrafeedback_binarized_cleaned

work page 2024
[3]

Learning from mistakes makes llm better reasoner

Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou, and Weizhu Chen. Learning from mistakes makes llm better reasoner. arXiv preprint arXiv:2310.20689, 2023

work page arXiv 2023
[4]

A general theoretical paradigm to understand learning from human preferences, 2023

Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. A general theoretical paradigm to understand learning from human preferences, 2023

work page 2023
[6]

Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022

Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

work page 2022
[7]

Open llm leaderboard

Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023

work page 2023
[8]

Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023

BIG bench authors. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023

work page 2023
[9]

R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39 0 (3/4): 0 324--345, 1952. doi:10.2307/2334029

work page doi:10.2307/2334029 1952
[10]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

work page 1901
[12]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597--1607. PMLR, 2020

work page 2020
[13]

Gonzalez, Ion Stoica, and Eric P

Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90\ URL https://lmsys.org/blog/2023-03-30-vicuna/

work page 2023
[14]

Palm: Scaling language modeling with pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24 0 (240): 0 1--113, 2023

work page 2023
[15]

Deep reinforcement learning from human preferences

Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017

work page 2017
[16]

Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[17]

Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

work page 2018
[18]

Training verifiers to solve math word problems, 2021

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021

work page 2021
[19]

Bagel-34b-v0.2, 2024 a

Jon Durbin. Bagel-34b-v0.2, 2024 a . URL https://huggingface.co/jondurbin/bagel-34b-v0.2

work page 2024
[20]

Truthy dpo, 2024 b

Jon Durbin. Truthy dpo, 2024 b . URL https://huggingface.co/datasets/jondurbin/truthy-dpo-v0.1

work page 2024
[21]

Orca-chat, 2024

Shahul Es. Orca-chat, 2024. URL https://huggingface.co/datasets/shahules786/orca-chat

work page 2024
[22]

Human-centered loss functions (halos)

Kawin Ethayarajh, Winnie Xu, Dan Jurafsky, and Douwe Kiela. Human-centered loss functions (halos). Technical report, Contextual AI, 2023

work page 2023
[24]

A framework for few-shot language model evaluation, September 2021

Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, September 2021

work page 2021
[25]

Hadsell, S

R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 1735--1742, 2006. doi:10.1109/CVPR.2006.100

work page doi:10.1109/cvpr.2006.100 2006
[26]

Momentum contrast for unsupervised visual representation learning

Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729--9738, 2020

work page 2020
[28]

Orca dpo pairs, 2024

Intel. Orca dpo pairs, 2024. URL https://huggingface.co/datasets/Intel/orca_dpo_pairs

work page 2024
[29]

Camels in a changing climate: Enhancing lm adaptation with tulu 2

Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A Smith, Iz Beltagy, et al. Camels in a changing climate: Enhancing lm adaptation with tulu 2. arXiv preprint arXiv:2311.10702, 2023

work page arXiv 2023
[30]

LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[31]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023

work page 2023
[32]

Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning

Julia Kreutzer, Joshua Uyheng, and Stefan Riezler. Reliability and learnability of human bandit feedback for sequence-to-sequence reinforcement learning. arXiv preprint arXiv:1805.10627, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[33]

Gonzalez, and Ion Stoica

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From live data to high-quality benchmarks: The arena-hard pipeline, April 2024. URL https://lmsys.org/blog/2024-04-19-arena-hard/

work page 2024
[34]

Truthfulqa: Measuring how models mimic human falsehoods

Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022

work page 2022
[35]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[36]

Individual choice behavior: A theoretical analysis

R Duncan Luce. Individual choice behavior: A theoretical analysis. Courier Corporation, 2005

work page 2005
[37]

Cross-Task Generalization via Natural Language Crowdsourcing Instructions

Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773, 2021

work page internal anchor Pith review arXiv 2021
[38]

Momo-72b-lora-1.8.7-dpo, 2024

Moreh. Momo-72b-lora-1.8.7-dpo, 2024. URL https://huggingface.co/moreh/MoMo-72B-lora-1.8.7-DPO

work page 2024
[40]

Gpt-4 technical report

OpenAI. Gpt-4 technical report. Technical Report, 2023

work page 2023
[41]

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022

work page 2022
[43]

The analysis of permutations

Robin L Plackett. The analysis of permutations. Journal of the Royal Statistical Society Series C: Applied Statistics, 24 0 (2): 0 193--202, 1975

work page 1975
[44]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019

work page 2019
[45]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 2023

[47]

To the cutoff... and beyond? A longitudinal perspective on LLM data contamination

Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White, and Samuel Dooley. To the cutoff... and beyond? a longitudinal perspective on llm data contamination. In Proceedings of the International Conference on Learning Representations (ICLR), 2024

[48]

Winogrande: An adversarial winograd schema challenge at scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Proceedings of the AAAI Conference on Artificial Intelligence, 34, 2020

[49]

A theoretical analysis of contrastive unsupervised representation learning

Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In International Conference on Machine Learning, pages 5628--5637. PMLR, 2019

[50]

Facenet: A unified embedding for face recognition and clustering

Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815--823, 2015. doi:10.1109/CVPR.2015.7298682

[51]

Detect pretrain code contamination

Weijia Shi. Detect pretrain code contamination. https://github.com/swj0419/detect-pretrain-code-contamination, 2023

[52]

Learning to summarize with human feedback

Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 2020

[53]

Alpaca: A strong, replicable instruction-following model

Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm.stanford.edu/2023/03/13/alpaca.html, 3(6): 7, 2023

[56]

Zephyr: Direct distillation of LM alignment

Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf. Zephyr: Direct distillation of lm alignment, 2023

[57]

Advances in prospect theory: Cumulative representation of uncertainty

Amos Tversky and Daniel Kahneman. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5(4): 297--323, 1992

[58]

Trl: Transformer reinforcement learning

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, and Shengyi Huang. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020

[59]

Understanding the behaviour of contrastive loss

Feng Wang and Huaping Liu. Understanding the behaviour of contrastive loss. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2495--2504, 2021

[61]

Understanding contrastive representation learning through alignment and uniformity on the hypersphere

Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pages 9929--9939. PMLR, 2020

[62]

Chain-of-thought prompting elicits reasoning in large language models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023

[63]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...

work page 2020
[64]

WizardLM: Empowering large pre-trained language models to follow complex instructions

Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023

[66]

Metamath: Bootstrap your own mathematical questions for large language models

Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models, 2023

[67]

Sharegpt_vicuna_unfiltered

Z. Sharegpt_vicuna_unfiltered, 2024. URL https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered

[68]

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791--4800, Florence, Italy, July 2019. Association for Computational Linguistics. doi:10.18653/v1/P19-1472. URL https://www.aclwe...

[69]

Yao Zhao, Misha Khalman, Rishabh Joshi, Shashi Narayan, Mohammad Saleh, and Peter J. Liu. Calibrating sequence likelihood improves conditional language generation, 2022

[70]

Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J. Liu. Slic-hf: Sequence likelihood calibration with human feedback, 2023

[72]

Fine-tuning language models from human preferences

Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences, 2020

[73]

From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline

Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline

[74]

FaceNet: A unified embedding for face recognition and clustering

Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering

[75]

Dimensionality Reduction by Learning an Invariant Mapping

R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality Reduction by Learning an Invariant Mapping

[76]

Direct preference optimization: Your language model is secretly a reward model

[77]

Making large language models better reasoners with alignment

Making large language models better reasoners with alignment. arXiv preprint arXiv:2309.02144

[78]

Kawin Ethayarajh, Winnie Xu, Dan Jurafsky, and Douwe Kiela

[79]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

[80]

A General Theoretical Paradigm to Understand Learning from Human Preferences

A General Theoretical Paradigm to Understand Learning from Human Preferences, 2023

[81]

Calibrating Sequence Likelihood Improves Conditional Language Generation

Calibrating Sequence Likelihood Improves Conditional Language Generation, 2022

[82]

SLiC-HF: Sequence Likelihood Calibration with Human Feedback

SLiC-HF: Sequence Likelihood Calibration with Human Feedback, 2023

[83]

Advances in Prospect Theory: Cumulative Representation of Uncertainty

Advances in Prospect Theory: Cumulative Representation of Uncertainty. Journal of Risk and Uncertainty, 1992

[84]

Deep reinforcement learning from human preferences

Deep reinforcement learning from human preferences. Advances in neural information processing systems

[85]

Fine-Tuning Language Models from Human Preferences

Fine-Tuning Language Models from Human Preferences, 2020

[86]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, 2022

[87]

Training language models to follow instructions with human feedback

Training language models to follow instructions with human feedback, 2022

[88]

Rank analysis of incomplete block designs: I. The method of paired comparisons

Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 1952

[89]

Zephyr: Direct Distillation of LM Alignment

Zephyr: Direct Distillation of LM Alignment, 2023

[90]

Individual choice behavior: A theoretical analysis

Individual choice behavior: A theoretical analysis, 2005

[91]

The analysis of permutations

The analysis of permutations. Journal of the Royal Statistical Society Series C: Applied Statistics, 1975

[92]

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models, 2023

