pith. machine review for the scientific record.

arXiv: 2402.13228 · v2 · pith:NUOAWJTFnew · submitted 2024-02-20 · 💻 cs.CL · cs.AI · cs.LG

Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

Pith reviewed 2026-05-17 22:59 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords Direct Preference Optimization · DPO · preference fine-tuning · LLM alignment · DPOP · Smaug · language model training

The pith

Standard DPO can lower the absolute likelihood of preferred responses while still raising their relative odds; a modified loss called DPOP prevents the drop and improves results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that the usual DPO objective can decrease the model's probability of generating its preferred completions, provided the ratio to dispreferred ones grows. This effect appears on typical training sets, especially when the two responses differ by only a few tokens. The authors introduce DPO-Positive, which adds a direct term that raises the likelihood of the preferred side. Models trained this way outperform standard DPO on reasoning, summarization, and alignment benchmarks, including those with larger differences between completions, and they also score higher on unrelated tests such as MT-Bench. The same procedure produces Smaug-72B, the first open-source model to clear 80 percent average accuracy on the Hugging Face Open LLM Leaderboard.

Core claim

The standard DPO loss can reduce the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. This occurs in practice on common fine-tuning datasets. DPO-Positive avoids the reduction by including an extra positive term that explicitly encourages higher likelihood for the preferred completions. The resulting models show stronger performance across a range of datasets and tasks, including independent benchmarks.
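The claim can be checked with a small numeric sketch, assuming the per-pair DPO loss has the usual form -log σ(β · margin), where the margin is the policy-versus-reference log-ratio of the preferred completion minus that of the dispreferred one. The log-probabilities, `beta = 0.1`, and the function names below are hypothetical illustrations, not values or code from the paper.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities."""
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    return -log_sigmoid(beta * margin)


# Hypothetical reference log-probabilities for one preference pair.
ref_w, ref_l = -10.0, -10.0

# Start of training: the policy equals the reference model.
loss_start = dpo_loss(-10.0, -10.0, ref_w, ref_l)

# Later: the preferred log-prob FELL (-10 -> -12), but the dispreferred
# one fell further (-10 -> -20), so the margin grew and the loss dropped.
loss_later = dpo_loss(-12.0, -20.0, ref_w, ref_l)

print(f"loss at start: {loss_start:.4f}")
print(f"loss later:    {loss_later:.4f}")
```

In this toy case the loss improves while the preferred completion becomes less likely in absolute terms, which is the failure mode described above.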

What carries the argument

DPO-Positive (DPOP), a loss function that augments standard DPO with a term ensuring the preferred response receives higher absolute likelihood rather than only higher relative likelihood.
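This summary does not reproduce the DPOP formula, so the following is only a sketch of one way such a term could look. It assumes a hinge penalty, max(0, log π_ref(y_w) - log π(y_w)), scaled by a hypothetical weight `lam` and subtracted inside the sigmoid. The names `dpop_loss`, `beta`, and `lam`, and all numeric values, are assumptions for illustration; consult the paper for the authors' exact loss.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpop_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1, lam=5.0):
    """Sketch of a DPO-style loss with an added penalty on the preferred side.

    The penalty is active only when the policy assigns the preferred
    completion a lower log-probability than the reference model does.
    With lam = 0 this reduces to the standard DPO loss.
    """
    margin = (logp_w - ref_logp_w) - (logp_l - ref_logp_l)
    penalty = max(0.0, ref_logp_w - logp_w)  # assumed form of the extra term
    return -log_sigmoid(beta * (margin - lam * penalty))


# Hypothetical pair in which the preferred log-prob dropped from -10 to -12.
plain = dpop_loss(-12.0, -20.0, -10.0, -10.0, lam=0.0)     # standard DPO
penalised = dpop_loss(-12.0, -20.0, -10.0, -10.0, lam=5.0)  # with penalty
print(f"lam=0: {plain:.4f}   lam=5: {penalised:.4f}")
```

Under these assumptions the penalised loss exceeds the plain one whenever the preferred completion has lost likelihood relative to the reference, and the two coincide otherwise, so the optimiser is no longer rewarded for trading away absolute likelihood.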

If this is right

  • DPOP produces higher scores on reasoning, summarization, and alignment tasks than standard DPO.
  • The gains hold for both low- and high-edit-distance preference pairs.
  • DPOP-tuned models outperform DPO-tuned models on benchmarks that do not share data with the fine-tuning set, such as MT-Bench.
  • The method yields Smaug-72B, the first open-source LLM above 80 percent average on the Hugging Face Open LLM Leaderboard.
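The low versus high edit-distance distinction in the list above can be made concrete with a token-level Levenshtein distance. This is a minimal sketch: the whitespace tokenisation and the length normalisation are assumptions, since this summary does not state how the paper measures edit distance, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(b) + 1))
    for i, tok_a in enumerate(a, start=1):
        curr = [i]
        for j, tok_b in enumerate(b, start=1):
            cost = 0 if tok_a == tok_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def normalized_edit_distance(chosen: str, rejected: str) -> float:
    """Edit distance over whitespace tokens, divided by the longer length."""
    a, b = chosen.split(), rejected.split()
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


# A pair that differs in a single token counts as "low edit distance".
print(normalized_edit_distance("the answer is 42", "the answer is 41"))
```

A dataset could then be bucketed by this score to check whether the likelihood drop concentrates in the low-distance pairs.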

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Many existing models trained with standard DPO could be improved by a second pass using the modified loss on the same data.
  • The absolute-likelihood behavior may matter more than relative odds alone in other alignment methods that use preference pairs.
  • The same failure mode could appear in any loss that optimizes only a ratio of probabilities without an anchoring term for the numerator.
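The last point can be seen from the shape of the standard DPO loss. The following is a sketch that treats the two sequence log-probabilities as free variables, which is a simplifying assumption because in a real model they share parameters; the symbol m is introduced here only for illustration.

```latex
\begin{align*}
m &= \bigl[\log \pi_\theta(y_w \mid x) - \log \pi_{\mathrm{ref}}(y_w \mid x)\bigr]
   - \bigl[\log \pi_\theta(y_l \mid x) - \log \pi_{\mathrm{ref}}(y_l \mid x)\bigr], \\
\mathcal{L}_{\mathrm{DPO}} &= -\log \sigma(\beta m), \\
\frac{\partial \mathcal{L}_{\mathrm{DPO}}}{\partial \log \pi_\theta(y_w \mid x)}
  &= -\beta\,\sigma(-\beta m), \qquad
\frac{\partial \mathcal{L}_{\mathrm{DPO}}}{\partial \log \pi_\theta(y_l \mid x)}
  = +\beta\,\sigma(-\beta m).
\end{align*}
```

The loss depends on the policy only through m, so lowering both log-probabilities by the same amount leaves it unchanged, and lowering the dispreferred one faster than the preferred one reduces it. Nothing in the objective anchors the preferred term by itself.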

Load-bearing premise

That the observed drop in preferred-example likelihood is the main performance bottleneck and that the added term fixes it without creating new side effects on model behavior.

What would settle it

Train the same base model on the same preference pairs with both losses and measure whether the DPOP version assigns higher likelihood to the preferred completions and whether downstream accuracy gains disappear when the likelihood term is removed.
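The first half of that test could be scored along the following lines. This is a sketch, assuming summed sequence log-probabilities of the preferred completions have already been computed under the reference model and under each tuned model. The function name and every number below are hypothetical placeholders, not results from the paper.

```python
from statistics import mean


def likelihood_shift(ref_logps, tuned_logps):
    """Summarise how preferred-completion log-probs moved after tuning."""
    if len(ref_logps) != len(tuned_logps) or not ref_logps:
        raise ValueError("need two non-empty lists of equal length")
    deltas = [t - r for r, t in zip(ref_logps, tuned_logps)]
    return {
        "mean_delta": mean(deltas),  # average change in log-prob
        "frac_decreased": sum(d < 0 for d in deltas) / len(deltas),
    }


# Made-up log-probabilities for four preferred completions.
ref = [-10.0, -12.0, -8.0, -15.0]
dpo_tuned = [-11.5, -12.5, -7.0, -18.0]
dpop_tuned = [-9.0, -11.0, -7.5, -14.0]

print("DPO :", likelihood_shift(ref, dpo_tuned))
print("DPOP:", likelihood_shift(ref, dpop_tuned))
```

The second half of the test, whether downstream gains vanish without the likelihood term, would need the same evaluation repeated with the extra term switched off.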

Original abstract

Direct Preference Optimisation (DPO) is effective at significantly improving the performance of large language models (LLMs) on downstream tasks such as reasoning, summarisation, and alignment. Using pairs of preferred and dispreferred data, DPO models the relative probability of picking one response over another. In this work, first we show theoretically that the standard DPO loss can lead to a reduction of the model's likelihood of the preferred examples, as long as the relative probability between the preferred and dispreferred classes increases. We then show empirically that this phenomenon occurs when fine-tuning LLMs on common datasets, especially datasets in which the edit distance between pairs of completions is low. Using these insights, we design DPO-Positive (DPOP), a new loss function and training procedure which avoids this failure mode. Surprisingly, we find that DPOP outperforms DPO and other fine-tuning procedures across a wide variety of datasets and downstream tasks, including datasets with high edit distances between completions. Furthermore, we find that the DPOP-tuned model outperforms the DPO-tuned model (all else equal) on benchmarks independent of the fine-tuning data, such as MT-Bench. Finally, using DPOP, we create and open-source Smaug-34B and Smaug-72B, with the latter becoming the first open-source LLM to surpass an average accuracy of 80% on the HuggingFace Open LLM Leaderboard.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that the standard DPO loss can reduce the log-likelihood of preferred completions (y_w) while still increasing the log-ratio between preferred and dispreferred pairs, as derived in the theoretical section (Eq. 3–5). It reports empirical evidence of this phenomenon on common preference datasets, especially those with low edit distances between completions. To address it, the authors introduce DPO-Positive (DPOP), which augments the loss with an explicit positive term on preferred examples. End-to-end experiments show DPOP outperforming DPO and other fine-tuning baselines across datasets and tasks (including high edit-distance cases), as well as on independent benchmarks such as MT-Bench. The work also releases Smaug-34B and Smaug-72B, with the latter claimed as the first open-source model exceeding 80% average accuracy on the HuggingFace Open LLM Leaderboard.

Significance. If the central claims hold, the work identifies a concrete and previously under-discussed limitation of DPO and supplies a lightweight, interpretable fix that yields measurable gains on both in-distribution and out-of-distribution evaluations. The open release of Smaug-72B, which reaches a new milestone on a widely used public leaderboard, would constitute a tangible community contribution beyond the algorithmic insight.

major comments (2)
  1. [Experiments] Experiments section: end-to-end comparisons establish that DPOP outperforms DPO, yet no ablation isolates whether the gains arise specifically from restoring/increasing log π(y_w) versus from incidental changes in gradient magnitude, effective β, or optimization trajectory. Without such a controlled comparison (e.g., a regularizer that boosts preferred likelihood by a different mechanism), the causal link between the identified failure mode and the observed improvements remains unestablished.
  2. [Theoretical analysis] Theoretical section, Eq. (3–5): the derivation correctly shows that the DPO objective can decrease log π(y_w) while increasing the log-ratio, but the manuscript does not quantify how frequently or severely this occurs under the exact training regimes, learning rates, and β values used in the empirical sections. A short analysis or plot of log π(y_w) trajectories on the actual training runs would strengthen the claim that this is the primary limiter rather than a theoretical edge case.
minor comments (2)
  1. [Abstract] Abstract and §5: the claim that Smaug-72B is 'the first open-source LLM to surpass an average accuracy of 80%' should include the precise leaderboard snapshot date and version number to allow independent verification.
  2. [Figures] Figure captions and axis labels in the empirical results could be expanded to explicitly state whether the plotted metrics are averaged over multiple seeds and what error bars represent.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight opportunities to strengthen the causal evidence and empirical quantification in the manuscript. We address each major comment below and commit to revisions that directly incorporate the suggested analyses.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: end-to-end comparisons establish that DPOP outperforms DPO, yet no ablation isolates whether the gains arise specifically from restoring/increasing log π(y_w) versus from incidental changes in gradient magnitude, effective β, or optimization trajectory. Without such a controlled comparison (e.g., a regularizer that boosts preferred likelihood by a different mechanism), the causal link between the identified failure mode and the observed improvements remains unestablished.

    Authors: We agree that the current end-to-end results do not fully isolate the contribution of restoring log π(y_w) from other optimization effects. In the revised manuscript we will add a controlled ablation that introduces an auxiliary positive regularizer on preferred responses (independent of the preference ratio term) and compare its performance and likelihood trajectories directly against DPOP. This will clarify whether the observed gains are attributable to addressing the specific failure mode identified in the theoretical analysis. revision: yes

  2. Referee: [Theoretical analysis] Theoretical section, Eq. (3–5): the derivation correctly shows that the DPO objective can decrease log π(y_w) while increasing the log-ratio, but the manuscript does not quantify how frequently or severely this occurs under the exact training regimes, learning rates, and β values used in the empirical sections. A short analysis or plot of log π(y_w) trajectories on the actual training runs would strengthen the claim that this is the primary limiter rather than a theoretical edge case.

    Authors: We acknowledge that while the manuscript demonstrates the phenomenon on common preference datasets, a more precise quantification under the exact training settings would be valuable. We will add plots of log π(y_w) trajectories for both DPO and DPOP runs using the precise learning rates, β values, and datasets from the empirical sections. These plots will be included in the theoretical analysis section to show the frequency and magnitude of likelihood reduction during actual training. revision: yes

Circularity Check

0 steps flagged

No significant circularity; theoretical derivation of DPO failure mode is independent of paper inputs

Full rationale

The paper's core chain begins with the standard DPO loss (Eqs. 1–2), derives the possibility of decreasing log π(y_w) while increasing the log-ratio (Eqs. 3–5), observes this empirically on common datasets, and introduces DPOP as an explicit additive term on preferred completions. This derivation uses the externally defined DPO objective without redefining terms in terms of the new loss or fitting parameters to force the outcome. Performance gains are reported via end-to-end benchmarks including MT-Bench and the Open LLM Leaderboard, with no load-bearing self-citations, uniqueness theorems, or ansatzes from prior author work. The central claims remain self-contained against external DPO formulations and do not reduce to the paper's own fitted values or definitions by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on the theoretical analysis of DPO failure modes and empirical observations on common datasets. The main addition is the DPOP method introduced to address the failure mode.

axioms (1)
  • [domain assumption] Standard DPO loss can reduce the model's likelihood of preferred examples as long as the relative probability between preferred and dispreferred increases.
    This is the core theoretical result used to motivate the new method.
invented entities (1)
  • DPO-Positive (DPOP) loss function [no independent evidence]
    purpose: Avoids the identified failure mode in standard DPO by ensuring increases in likelihood of preferred examples.
    New method introduced in the paper without prior independent validation outside the reported experiments.

pith-pipeline@v0.9.0 · 5576 in / 1540 out tokens · 85227 ms · 2026-05-17T22:59:45.344560+00:00 · methodology



Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Cancellation Hypothesis in Critic-Free RL: From Outcome Rewards to Token Credits

    cs.LG 2026-05 unverdicted novelty 7.0

    The cancellation hypothesis shows how rollout-level rewards produce token-level credit assignment in critic-free RL through cancellation of opposing signals on shared tokens, with empirical support and batching interv...

  2. IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 7.0

    IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of...

  3. Visual Preference Optimization with Rubric Rewards

    cs.CV 2026-04 unverdicted novelty 7.0

    rDPO uses offline-built rubrics to generate on-policy preference data for DPO, raising benchmark scores in visual tasks over outcome-based filtering and style baselines.

  4. DiffCoT: Diffusion-styled Chain-of-Thought Reasoning in LLMs

    cs.CL 2026-01 unverdicted novelty 7.0

    DiffCoT applies diffusion-style iterative denoising to chain-of-thought steps with a causal noise schedule, outperforming standard CoT optimization methods on multi-step reasoning benchmarks.

  5. Vocabulary Hijacking in LVLMs: Unveiling Critical Attention Heads by Excluding Inert Tokens to Mitigate Hallucination

    cs.MM 2026-05 unverdicted novelty 6.0

    LVLMs show vocabulary hijacking by inert tokens that decode to hijacking anchors; HABI locates them, NHAR finds resilient heads, and HAVAE boosts those heads to cut hallucinations.

  6. Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs

    cs.CL 2026-05 unverdicted novelty 6.0

    TPAW uses teams of current and historical model checkpoints that collaborate and compete, plus adaptive weightings for responses and players, to improve self-supervised LLM alignment and outperform baselines.

  7. Segment-Aligned Policy Optimization for Multi-Modal Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    SAPO introduces segment-level policy optimization using a step-wise MDP abstraction to better align RL updates with reasoning structure in multi-modal LLM tasks.

  8. Towards Disentangled Preference Optimization Dynamics: Suppress the Loser, Preserve the Winner

    cs.LG 2026-04 unverdicted novelty 6.0

    A unified incentive-score decomposition of preference optimization reveals the disentanglement band condition and reward calibration method that enables suppressing losers while preserving winners in LLM training.

  9. GroupDPO: Memory efficient Group-wise Direct Preference Optimization

    cs.CL 2026-04 unverdicted novelty 6.0

    GroupDPO decouples group-wise preference optimization during backpropagation to cut peak memory while keeping the same gradients, allowing larger groups and consistent gains over single-pair DPO plus an NLL term on positives.

  10. Future Policy Approximation for Offline Reinforcement Learning Improves Mathematical Reasoning

    cs.CL 2025-09 unverdicted novelty 6.0

    Future Policy Approximation (FPA) improves offline RL for LLM mathematical reasoning by extrapolating future policies in logit space to proactively reweight gradients, yielding consistent gains over DPO, RPO, KTO and ...

  11. Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization

    cs.CL 2024-11 conditional novelty 6.0

    Mixed Preference Optimization with the MMPR dataset boosts multimodal CoT reasoning, lifting InternVL2-8B to 67.0 accuracy on MathVista (+8.7 points) and matching the 76B model.

  12. MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

    cs.CL 2024-06 conditional novelty 6.0

    MMLU-Pro is a revised benchmark that makes language model evaluation harder and more stable by using ten options per question and emphasizing reasoning over simple knowledge recall.

  13. Boosting Reinforcement Learning with Verifiable Rewards via Randomly Selected Few-Shot Guidance

    cs.LG 2026-05 unverdicted novelty 5.0

    FEST improves RLVR sample efficiency on math and coding benchmarks by combining supervised signals, on-policy signals, and decaying weights on just 128 randomly chosen demonstrations, matching full-dataset baselines.

  14. Generating Place-Based Compromises Between Two Points of View

    cs.CL 2026-04 unverdicted novelty 5.0

    Empathic similarity feedback in prompts generates more acceptable compromises than chain-of-thought, and margin-based training on the resulting data lets smaller models produce them without ongoing empathy estimation.

  15. From Coarse to Fine: Self-Adaptive Hierarchical Planning for LLM Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    AdaPlan-H enables LLM agents to generate self-adaptive hierarchical plans that adjust detail level to task difficulty, improving success rates in multi-step tasks.

  16. InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

    cs.CV 2024-07 conditional novelty 5.0

    InternLM-XComposer-2.5 is a 7B vision-language model supporting up to 96K context that reaches GPT-4V-level performance on image, video, and multi-turn tasks and adds LoRA-driven text-image composition capabilities.

  17. Reinforcement Learning for Scalable and Trustworthy Intelligent Systems

    cs.LG 2026-05 unverdicted novelty 3.0

    Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.

Reference graph

Works this paper leans on

288 extracted references · 288 canonical work pages · cited by 17 Pith papers · 33 internal anchors

  1. [1]

    Yi-34b-200k, 2024

    01.AI. Yi-34b-200k, 2024. URL https://huggingface.co/01-ai/Yi-34B-200K

  2. [2]

    Ultrafeedback binarized clean, 2024

    AllenAI. Ultrafeedback binarized clean, 2024. URL https://huggingface.co/datasets/allenai/ultrafeedback_binarized_cleaned

  3. [3]

    Learning from mistakes makes llm better reasoner

    Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou, and Weizhu Chen. Learning from mistakes makes llm better reasoner. arXiv preprint arXiv:2310.20689, 2023

  4. [4]

    A general theoretical paradigm to understand learning from human preferences, 2023

    Mohammad Gheshlaghi Azar, Mark Rowland, Bilal Piot, Daniel Guo, Daniele Calandriello, Michal Valko, and Rémi Munos. A general theoretical paradigm to understand learning from human preferences, 2023

  5. [6]

    Training a helpful and harmless assistant with reinforcement learning from human feedback, 2022

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El-Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, ...

  6. [7]

    Open llm leaderboard

    Edward Beeching, Clémentine Fourrier, Nathan Habib, Sheon Han, Nathan Lambert, Nazneen Rajani, Omar Sanseviero, Lewis Tunstall, and Thomas Wolf. Open llm leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023

  7. [8]

    Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023

    BIG bench authors. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, 2023

  8. [9]

    R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39 0 (3/4): 0 324--345, 1952. doi:10.2307/2334029

  9. [10]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

  10. [12]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pages 1597--1607. PMLR, 2020

  11. [13]

    Gonzalez, Ion Stoica, and Eric P

    Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90\ URL https://lmsys.org/blog/2023-03-30-vicuna/

  12. [14]

    Palm: Scaling language modeling with pathways

    Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24 0 (240): 0 1--113, 2023

  13. [15]

    Deep reinforcement learning from human preferences

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017

  14. [16]

    Scaling Instruction-Finetuned Language Models

    Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416, 2022

  15. [17]

    Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

  16. [18]

    Training verifiers to solve math word problems, 2021

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021

  17. [19]

    Bagel-34b-v0.2, 2024 a

    Jon Durbin. Bagel-34b-v0.2, 2024 a . URL https://huggingface.co/jondurbin/bagel-34b-v0.2

  18. [20]

    Truthy dpo, 2024 b

    Jon Durbin. Truthy dpo, 2024 b . URL https://huggingface.co/datasets/jondurbin/truthy-dpo-v0.1

  19. [21]

    Orca-chat, 2024

    Shahul Es. Orca-chat, 2024. URL https://huggingface.co/datasets/shahules786/orca-chat

  20. [22]

    Human-centered loss functions (halos)

    Kawin Ethayarajh, Winnie Xu, Dan Jurafsky, and Douwe Kiela. Human-centered loss functions (halos). Technical report, Contextual AI, 2023

  21. [24]

    A framework for few-shot language model evaluation, September 2021

    Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, September 2021

  22. [25]

    Hadsell, S

    R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pages 1735--1742, 2006. doi:10.1109/CVPR.2006.100

  23. [26]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9729--9738, 2020

  24. [28]

    Orca dpo pairs, 2024

    Intel. Orca dpo pairs, 2024. URL https://huggingface.co/datasets/Intel/orca_dpo_pairs

  25. [29]

    Camels in a changing climate: Enhancing lm adaptation with tulu 2

    Hamish Ivison, Yizhong Wang, Valentina Pyatkin, Nathan Lambert, Matthew Peters, Pradeep Dasigi, Joel Jang, David Wadden, Noah A Smith, Iz Beltagy, et al. Camels in a changing climate: Enhancing lm adaptation with tulu 2. arXiv preprint arXiv:2311.10702, 2023

  26. [30]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

  27. [31]

    Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023

  28. [32]

    Reliability and Learnability of Human Bandit Feedback for Sequence-to-Sequence Reinforcement Learning

    Julia Kreutzer, Joshua Uyheng, and Stefan Riezler. Reliability and learnability of human bandit feedback for sequence-to-sequence reinforcement learning. arXiv preprint arXiv:1805.10627, 2018

  29. [33]

    Gonzalez, and Ion Stoica

    Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From live data to high-quality benchmarks: The arena-hard pipeline, April 2024. URL https://lmsys.org/blog/2024-04-19-arena-hard/

  30. [34]

    Truthfulqa: Measuring how models mimic human falsehoods

    Stephanie Lin, Jacob Hilton, and Owain Evans. Truthfulqa: Measuring how models mimic human falsehoods. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022

  31. [35]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

  32. [36]

    Individual choice behavior: A theoretical analysis

    R Duncan Luce. Individual choice behavior: A theoretical analysis. Courier Corporation, 2005

  33. [37]

    Cross-Task Generalization via Natural Language Crowdsourcing Instructions

    Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773, 2021

  34. [38]

    Momo-72b-lora-1.8.7-dpo, 2024

    Moreh. Momo-72b-lora-1.8.7-dpo, 2024. URL https://huggingface.co/moreh/MoMo-72B-lora-1.8.7-DPO

  35. [40]

    Gpt-4 technical report

    OpenAI. Gpt-4 technical report. Technical Report, 2023

  36. [41]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022

  37. [43]

    The analysis of permutations

    Robin L Plackett. The analysis of permutations. Journal of the Royal Statistical Society Series C: Applied Statistics, 24 0 (2): 0 193--202, 1975

  38. [44]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1 0 (8): 0 9, 2019

  39. [45]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 2023

  40. [47]

    To the cutoff

    Manley Roberts, Himanshu Thakur, Christine Herlihy, Colin White, and Samuel Dooley. To the cutoff... and beyond? a longitudinal perspective on llm data contamination. In Proceedings of the International Conference on Learning Representations (ICLR), 2024

  41. [48]

    Winogrande: An adversarial winograd schema challenge at scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale. Proceedings of the AAAI Conference on Artificial Intelligence, 34, 2020

  42. [49]

    A theoretical analysis of contrastive unsupervised representation learning

    Nikunj Saunshi, Orestis Plevrakis, Sanjeev Arora, Mikhail Khodak, and Hrishikesh Khandeparkar. A theoretical analysis of contrastive unsupervised representation learning. In International Conference on Machine Learning, pages 5628--5637. PMLR, 2019

  43. [50]

    Facenet: A unified embedding for face recognition and clustering

    Florian Schroff, Dmitry Kalenichenko, and James Philbin. Facenet: A unified embedding for face recognition and clustering. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815--823, 2015. doi:10.1109/CVPR.2015.7298682

  44. [51]

    Detect pretrain code contamination

    Weijia Shi. Detect pretrain code contamination. https://github.com/swj0419/detect-pretrain-code-contamination, 2023

  45. [52]

    Learning to summarize with human feedback

    Nisan Stiennon, Long Ouyang, Jeffrey Wu, Daniel Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F Christiano. Learning to summarize with human feedback. In Proceedings of the Annual Conference on Neural Information Processing Systems (NeurIPS), 2020

  46. [53]

    Alpaca: A strong, replicable instruction-following model

    Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li, Carlos Guestrin, Percy Liang, and Tatsunori B Hashimoto. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models. https://crfm. stanford. edu/2023/03/13/alpaca. html, 3 0 (6): 0 7, 2023

  47. [56]

    Zephyr: Direct distillation of lm alignment

    Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, Nathan Sarrazin, Omar Sanseviero, Alexander M. Rush, and Thomas Wolf. Zephyr: Direct distillation of lm alignment, 2023

  48. [57]

    Advances in prospect theory: Cumulative representation of uncertainty

    Amos Tversky and Daniel Kahneman. Advances in prospect theory: Cumulative representation of uncertainty. Journal of Risk and Uncertainty, 5(4):297--323, 1992

  49. [58]

    Trl: Transformer reinforcement learning

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, and Shengyi Huang. Trl: Transformer reinforcement learning. https://github.com/huggingface/trl, 2020

  50. [59]

    Understanding the behaviour of contrastive loss

    Feng Wang and Huaping Liu. Understanding the behaviour of contrastive loss. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2495--2504, 2021

  51. [61]

    Understanding contrastive representation learning through alignment and uniformity on the hypersphere

    Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning, pages 9929--9939. PMLR, 2020

  52. [62]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models, 2023

  53. [63]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...

  54. [64]

    WizardLM: Empowering large pre-trained language models to follow complex instructions

    Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, and Daxin Jiang. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244, 2023

  55. [66]

    Metamath: Bootstrap your own mathematical questions for large language models

    Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, and Weiyang Liu. Metamath: Bootstrap your own mathematical questions for large language models, 2023

  56. [67]

    Sharegpt_vicuna_unfiltered, 2024

    Z. Sharegpt_vicuna_unfiltered, 2024. URL https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered

  57. [68]

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791--4800, Florence, Italy, July 2019. Association for Computational Linguistics. doi:10.18653/v1/P19-1472. URL https://www.aclwe...

  58. [69]

    Yao Zhao, Misha Khalman, Rishabh Joshi, Shashi Narayan, Mohammad Saleh, and Peter J. Liu. Calibrating sequence likelihood improves conditional language generation, 2022

  59. [70]

    Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J. Liu. Slic-hf: Sequence likelihood calibration with human feedback, 2023

  60. [72]

    Fine-tuning language models from human preferences

    Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences, 2020

  61. [73]

    From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline

    Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From Live Data to High-Quality Benchmarks: The Arena-Hard Pipeline

  62. [74]

    FaceNet: A unified embedding for face recognition and clustering

    Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering

  63. [75]

    Dimensionality Reduction by Learning an Invariant Mapping

    R. Hadsell, S. Chopra, and Y. LeCun. Dimensionality Reduction by Learning an Invariant Mapping

  64. [76]

    Direct preference optimization: Your language model is secretly a reward model

  65. [77]

    Making large language models better reasoners with alignment

    Making large language models better reasoners with alignment. arXiv preprint arXiv:2309.02144

  66. [78]

    Kawin Ethayarajh, Winnie Xu, Dan Jurafsky, and Douwe Kiela

  67. [79]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

  68. [80]

    A General Theoretical Paradigm to Understand Learning from Human Preferences

    A General Theoretical Paradigm to Understand Learning from Human Preferences, 2023

  69. [81]

    Calibrating Sequence likelihood Improves Conditional Language Generation

    Calibrating Sequence likelihood Improves Conditional Language Generation, 2022

  70. [82]

    SLiC-HF: Sequence Likelihood Calibration with Human Feedback

    SLiC-HF: Sequence Likelihood Calibration with Human Feedback, 2023

  71. [83]

    Advances in Prospect Theory: Cumulative Representation of Uncertainty

    Advances in Prospect Theory: Cumulative Representation of Uncertainty. Journal of Risk and Uncertainty, 1992

  72. [84]

    Deep reinforcement learning from human preferences

    Deep reinforcement learning from human preferences. Advances in Neural Information Processing Systems

  73. [85]

    Fine-Tuning Language Models from Human Preferences

    Fine-Tuning Language Models from Human Preferences, 2020

  74. [86]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback, 2022

  75. [87]

    Training language models to follow instructions with human feedback

    Training language models to follow instructions with human feedback, 2022

  76. [88]

    Rank analysis of incomplete block designs: I. The method of paired comparisons

    Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 1952

  77. [89]

    Zephyr: Direct Distillation of LM Alignment

    Zephyr: Direct Distillation of LM Alignment, 2023

  78. [90]

    Individual choice behavior: A theoretical analysis

    Individual choice behavior: A theoretical analysis, 2005

  79. [91]

    The analysis of permutations

    The analysis of permutations. Journal of the Royal Statistical Society Series C: Applied Statistics, 1975

  80. [92]

    MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models

    MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models, 2023

Showing first 80 references.