KTO: Model Alignment as Prospect Theoretic Optimization

Dan Jurafsky; Douwe Kiela; Kawin Ethayarajh; Niklas Muennighoff; Winnie Xu

arxiv: 2402.01306 · v4 · submitted 2024-02-02 · 💻 cs.LG · cs.AI

Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, Douwe Kiela

Pith reviewed 2026-05-12 12:13 UTC · model grok-4.3

keywords: LLM alignment · prospect theory · human-aware loss · KTO · preference optimization · binary feedback · HALO · model alignment

The pith

KTO aligns LLMs by maximizing prospect-theoretic utility from binary desirability signals rather than paired preferences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that existing LLM alignment methods such as DPO implicitly build in human biases described by prospect theory, which partly explains their success over simple likelihood maximization. It introduces KTO, a new objective that uses a Kahneman-Tversky utility function to directly maximize the utility of generations. This approach requires only a binary desirability label for each generation instead of comparative preferences. KTO performs as well as or better than established preference-based methods across model sizes from 1 billion to 30 billion parameters. The work implies that alignment success depends on choosing the right human-aware loss for the setting rather than seeking a single best method.

Core claim

Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.

What carries the argument

KTO, a human-aware loss (HALO) that applies the prospect theory value function to assign utilities to model outputs based on whether they are desirable or not and maximizes the resulting expected utility.
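
For orientation, the value function from Tversky and Kahneman (1992) is usually written in the piecewise form below, where z is an outcome measured relative to a reference point. This is the textbook form only. This page does not reproduce the paper's exact objective, so how KTO adapts the function to model outputs is not shown here.

```latex
v(z) =
\begin{cases}
z^{\alpha} & \text{if } z \ge 0 \\
-\lambda\,(-z)^{\beta} & \text{if } z < 0
\end{cases}
```

Exponents α and β below 1 give diminishing sensitivity to larger gains and losses, and λ > 1 gives loss aversion.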

If this is right

KTO matches or exceeds the performance of preference-based methods at scales from 1B to 30B using only binary signals.
Current alignment objectives implicitly incorporate prospect theory biases, explaining part of their success over cross-entropy.
There is no universally superior HALO; the best loss depends on the inductive biases appropriate for the setting.
Alignment can succeed by directly optimizing a utility function rather than preference log-likelihood.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Binary desirability labels may be sufficient for high-quality alignment because they allow direct utility maximization without needing preference pairs.
This approach could make alignment more accessible by reducing the data collection burden compared to methods requiring comparative judgments.
The lack of a universal best HALO suggests that practitioners should select the loss function based on how well its biases match the target domain.

Load-bearing premise

That the specific utility function from prospect theory literature accurately captures human judgments of LLM outputs and that optimizing it with only binary desirability labels is sufficient without additional modeling assumptions or reference-point choices.

What would settle it

If models trained with KTO on binary labels receive significantly lower human preference win rates than DPO-trained models on paired data, or if collected human ratings of output desirability deviate from the shape of the prospect theory value function used by KTO.

Original abstract

Kahneman & Tversky's prospect theory tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call human-aware losses (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

KTO gives a binary-feedback alignment loss grounded in prospect theory that performs about as well as DPO in the reported runs, but the mapping from the classic value function to LLM outputs needs more scrutiny.

KTO is worth knowing about because it shows you can get competitive alignment results from a single binary label per generation instead of paired preferences. The authors define a broader class of human-aware losses that includes DPO and similar methods, then derive KTO by plugging a Kahneman-Tversky utility directly into the objective. That framing is new and lets them argue that the inductive bias comes from loss aversion and diminishing sensitivity rather than from maximizing likelihood of preferences. The experiments run from 1B to 30B models and report that KTO matches or beats the baselines on standard benchmarks while using cheaper data collection. That part is useful if the numbers hold up. The paper does a clean job of showing how existing losses implicitly encode some of the same human biases, which is a helpful way to organize the literature.

The main soft spot is the translation step. Prospect theory parameters were fitted on monetary gambles with known probabilities; here the authors have to choose a reference point for each generation, turn the binary label into a numeric gain or loss, and fix alpha, beta, and lambda. If those choices are doing heavy lifting or were selected after seeing results, the claimed theoretical advantage shrinks. The abstract states the performance claim without error bars or statistical tests visible in the summary, so the strength of the empirical support is still unclear.

Readers working on alignment objectives or data-efficient fine-tuning will find the HALO framing and the binary-feedback results directly relevant. The work is coherent on its own terms and engages the prior literature without obvious contradictions, so it deserves a full referee process rather than a desk reject. I would send it out for review.

Referee Report

3 major / 2 minor

Summary. The paper claims that existing LLM alignment methods (e.g., DPO) implicitly belong to a family of human-aware losses (HALOs) that encode prospect-theoretic biases from Kahneman-Tversky utility. It proposes KTO, which directly optimizes a prospect theory value function v(x) on binary desirability labels for generations rather than pairwise preferences, and reports that KTO matches or exceeds preference-based baselines across 1B–30B model scales.

Significance. If the empirical results hold under rigorous evaluation, the work is significant for showing that competitive alignment is possible with weaker (binary) supervision, which could reduce data collection costs. The HALO framing and observation that no single loss is universally optimal provide a useful conceptual lens for choosing alignment objectives based on inductive biases. The paper does not ship reproducible code or machine-checked proofs, so credit is limited to the conceptual contribution.

major comments (3)

[§3] §3 (KTO objective): The reference point used to classify binary labels as gains or losses is not explicitly defined or ablated. Prospect theory's value function is defined relative to this point, so the lack of justification for the choice (e.g., zero, model prior expectation, or other) and the scaling of binary signals into numeric gains/losses is load-bearing for the claim that the specific Kahneman-Tversky utility provides the performance advantage.
[§5] §5 (Experiments, Tables 1–3): Win-rate differences between KTO and DPO-style baselines are small (typically 1–3 points) at 7B–30B scales, yet no standard errors, number of evaluation prompts, or statistical tests are reported. This makes it impossible to assess whether KTO truly matches or exceeds the baselines, directly undermining the central empirical claim.
[§3.2] §3.2 (Utility parameters): The prospect theory coefficients (α, β, λ) are taken directly from the 1992 literature without ablation or sensitivity analysis on the alignment task. If performance is sensitive to these fixed values, the results may reflect a particular loss shape rather than the claimed theoretical grounding.

minor comments (2)

[§2] The definition of the HALO family in §2 could be made more precise by including an explicit mathematical characterization rather than a descriptive list.
[Figure 2] Figure 2 (loss curves) lacks axis labels on the y-scale in some panels, reducing clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and indicate the revisions that will be incorporated into the next version of the manuscript.

Referee: [§3] §3 (KTO objective): The reference point used to classify binary labels as gains or losses is not explicitly defined or ablated. Prospect theory's value function is defined relative to this point, so the lack of justification for the choice (e.g., zero, model prior expectation, or other) and the scaling of binary signals into numeric gains/losses is load-bearing for the claim that the specific Kahneman-Tversky utility provides the performance advantage.

Authors: We will revise §3 to explicitly state that the reference point is set to zero, with desirable generations assigned a positive scalar utility and undesirable generations a negative scalar utility. This choice follows directly from the binary supervision signal, which provides only a directional indicator rather than a magnitude; zero is the natural neutral point separating gains from losses. We will add a short paragraph justifying this mapping and noting that it preserves the key prospect-theoretic asymmetry (loss aversion) without requiring a model-dependent reference. A full ablation of alternative references is not performed, but the performance gains relative to symmetric losses (e.g., standard cross-entropy) are attributable to the functional form rather than the precise reference location.
revision: partial

Referee: [§5] §5 (Experiments, Tables 1–3): Win-rate differences between KTO and DPO-style baselines are small (typically 1–3 points) at 7B–30B scales, yet no standard errors, number of evaluation prompts, or statistical tests are reported. This makes it impossible to assess whether KTO truly matches or exceeds the baselines, directly undermining the central empirical claim.

Authors: We agree that the lack of standard errors and statistical tests weakens the ability to interpret the small observed differences. In the revised manuscript we will report the exact number of evaluation prompts per benchmark, include standard errors obtained via bootstrap resampling over the evaluation set, and add paired statistical tests (e.g., Wilcoxon signed-rank) comparing KTO against each baseline. While the absolute margins are modest, the consistent pattern across model scales and the fact that KTO succeeds with strictly weaker (binary) supervision remain the central empirical observations.
revision: yes

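
Editorial sketch, not part of the rebuttal: one way the bootstrap standard error mentioned in this response could be computed for a win rate. The function names `win_rate` and `bootstrap_se` are hypothetical, and the outcome list is made-up placeholder data, not a result from the paper. Outcomes are assumed to be recorded per evaluation prompt as 1 (win) or 0 (loss).

```python
import random
import statistics


def win_rate(outcomes):
    """Fraction of evaluation prompts judged a win (1 = win, 0 = loss)."""
    return sum(outcomes) / len(outcomes)


def bootstrap_se(outcomes, n_resamples=2000, seed=0):
    """Bootstrap standard error of the win rate.

    Resamples prompts with replacement, recomputes the win rate each
    time, and returns the standard deviation of those estimates.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    n = len(outcomes)
    resampled_rates = []
    for _ in range(n_resamples):
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        resampled_rates.append(sum(sample) / n)
    return statistics.stdev(resampled_rates)


if __name__ == "__main__":
    # Placeholder data for illustration only: 55 wins out of 100 prompts.
    outcomes = [1] * 55 + [0] * 45
    print("win rate:", win_rate(outcomes))
    print("bootstrap SE:", bootstrap_se(outcomes))
```

With only 100 prompts the standard error is on the order of several points, which is why the referee's 1 to 3 point margins cannot be judged without knowing the evaluation set size.
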
Referee: [§3.2] §3.2 (Utility parameters): The prospect theory coefficients (α, β, λ) are taken directly from the 1992 literature without ablation or sensitivity analysis on the alignment task. If performance is sensitive to these fixed values, the results may reflect a particular loss shape rather than the claimed theoretical grounding.

Authors: The parameters α=0.88, β=0.88, λ=2.25 are the canonical values reported by Tversky and Kahneman (1992) that produce the characteristic concave/convex shape and loss-aversion coefficient of prospect theory. Our contribution is to show that a loss derived from this established functional form is competitive for alignment, not to claim that these exact coefficients are optimal for the task. To address sensitivity concerns we will add an appendix analysis that perturbs the parameters within plausible ranges (e.g., λ ∈ [1.5, 3.0]) and demonstrates that KTO performance remains stable, supporting that the qualitative shape rather than the precise numerical values drives the results.
revision: yes
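
Editorial sketch, not part of the rebuttal: a numeric look at how perturbing λ over the range named in this response changes the textbook Kahneman-Tversky value function. It assumes a reference point at zero and is not the KTO training objective. `kt_value` and `loss_gain_ratio` are hypothetical names introduced for illustration.

```python
def kt_value(z, alpha=0.88, beta=0.88, lam=2.25):
    """Textbook Kahneman-Tversky value of outcome z relative to reference 0.

    Gains are concave (z ** alpha); losses are convex and scaled by the
    loss-aversion coefficient lam.
    """
    if z >= 0:
        return z ** alpha
    return -lam * ((-z) ** beta)


def loss_gain_ratio(magnitude, lam):
    """How much more a loss of `magnitude` weighs than an equal-sized gain."""
    return -kt_value(-magnitude, lam=lam) / kt_value(magnitude, lam=lam)


if __name__ == "__main__":
    # Sweep lambda across the range quoted in the response above.
    for lam in (1.5, 2.25, 3.0):
        print(
            f"lambda={lam}: v(+1)={kt_value(1.0, lam=lam):.3f}, "
            f"v(-1)={kt_value(-1.0, lam=lam):.3f}, "
            f"loss/gain ratio at 0.5={loss_gain_ratio(0.5, lam):.3f}"
        )
```

Because α and β are equal here, the loss-to-gain ratio equals λ at every magnitude, so a λ sweep changes only how steeply losses are weighted and leaves the curvature untouched. A sensitivity analysis that also varies α and β would probe the shape itself.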

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper adopts the Kahneman-Tversky prospect theory utility function directly from the 1992 external literature and defines KTO as a new HALO that maximizes this utility on binary desirability labels rather than preference log-likelihoods. No load-bearing step reduces by construction to a fitted parameter, self-defined quantity, or self-citation chain; the implicit-bias analysis of prior methods (DPO etc.) and the performance claims at 1B-30B scales rest on independent empirical evaluation outside any tautological mapping. The reference-point and parameter choices are taken as given from prospect theory rather than optimized against the paper's own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the applicability of the prospect theory utility function to LLM outputs and on the empirical performance being driven by that choice rather than other factors.

free parameters (1)

prospect theory parameters (e.g., loss aversion coefficient)
The utility function is taken from Kahneman-Tversky but its exact parameterization for LLM outputs may require selection or tuning.

axioms (1)

domain assumption: Humans perceive random variables in a biased but well-defined manner, according to prospect theory
Invoked to justify replacing log-likelihood of preferences with direct utility maximization.

invented entities (1)

Human-aware losses (HALOs) · no independent evidence
purpose: A family of loss functions that incorporate human decision biases
Introduced to categorize existing alignment objectives and position KTO within them.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning
cs.LG 2024-04 conditional novelty 8.0

NPO enables stable unlearning of 50%+ training data in LLMs on TOFU by making collapse exponentially slower than gradient ascent, preserving sensible outputs where prior methods fail.
ORPO: Monolithic Preference Optimization without Reference Model
cs.CL 2024-03 conditional novelty 8.0

ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
CrossVLA: Cross-Paradigm Post-Training and Inference Optimization for Vision-Language-Action Models
cs.CV 2026-05 conditional novelty 7.0

CrossVLA introduces a surrogate log-probability estimator to enable DPO on flow-matching VLAs, reports DoRA yielding +10.4 pp mean gains over SFT on LIBERO with 600 trials, and shows inference caching limited to 21% s...
Conditional Equivalence of DPO and RLHF: Implicit Assumption, Failure Modes, and Provable Alignment
cs.AI 2026-05 conditional novelty 7.0

DPO-RLHF equivalence holds only conditionally on the optimal policy preferring human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs, addressed by introducing CPO.
Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR
cs.LG 2026-05 conditional novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
Learning from Language Feedback via Variational Policy Distillation
cs.LG 2026-05 unverdicted novelty 7.0

VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming f...
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
cs.CL 2026-05 unverdicted novelty 7.0

TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
cs.CL 2026-05 unverdicted novelty 7.0

dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
cs.LG 2026-05 unverdicted novelty 7.0

Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...
Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
cs.LG 2026-05 unverdicted novelty 7.0

Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
cs.CV 2026-05 unverdicted novelty 7.0

PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
cs.LG 2026-05 unverdicted novelty 7.0

The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...
PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
cs.LG 2026-05 unverdicted novelty 7.0

PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
Mind the Gap: Structure-Aware Consistency in Preference Learning
cs.LG 2026-04 unverdicted novelty 7.0

Standard DPO surrogates are inconsistent for equicontinuous neural nets; SA-DPO provides structure-aware H-consistency bounds by adapting margins to semantic distance and shows heavy-tailed losses yield superior guara...
Three Models of RLHF Annotation: Extension, Evidence, and Authority
cs.CY 2026-04 unverdicted novelty 7.0

RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
cs.AI 2026-04 unverdicted novelty 7.0

HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.
ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization
cs.SE 2026-04 unverdicted novelty 7.0

ACE uses a solver-adversary loop with adversarial unit test generation and execution-based preference optimization to enable self-evolving LLM code generation, reporting 3-7% pass@1 gains over solver-verifier baseline...
DDO-RM: Distribution-Level Policy Improvement after Reward Learning
stat.ML 2026-04 unverdicted novelty 7.0

DDO-RM turns reward scores into a target distribution and applies KL-regularized mirror-descent projection on finite candidates to improve policies, outperforming DPO on Pythia-410M.
CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training
cs.LG 2026-02 unverdicted novelty 7.0

CapTrack shows post-training causes drift beyond facts, with instruction fine-tuning producing stronger behavioral changes than preference optimization across model families.
EyeMulator: Improving Code Language Models by Mimicking Human Visual Attention
cs.SE 2025-08 unverdicted novelty 7.0

EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.
Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing
cs.CL 2024-06 unverdicted novelty 7.0

Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, Ar...
Convex Optimization for Alignment and Preference Learning on a Single GPU
cs.LG 2026-05 unverdicted novelty 6.0

COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models...
Towards Context-Invariant Safety Alignment for Large Language Models
cs.CL 2026-05 unverdicted novelty 6.0

Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.
DEFLECT: Delay-Robust Execution via Flow-matching Likelihood-Estimated Counterfactual Tuning for VLA Policies
cs.RO 2026-05 unverdicted novelty 6.0

DEFLECT is an offline post-training method that improves async VLA policy success rates under high inference delays by using flow-matching likelihood ratios on counterfactual fresh/stale action pairs from a frozen ref...
General Preference Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

GPRL carries a k-dimensional skew-symmetric preference structure into policy updates with per-dimension advantages and a drift monitor, yielding 56.51% length-controlled win rate on AlpacaEval 2.0 from Llama-3-8B-Inst...
General Preference Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

GPRL carries k-dimensional skew-symmetric preference structure into policy updates via per-dimension advantages and context-dependent eigenvalues, yielding 56.51% length-controlled win rate on AlpacaEval 2.0 from Llam...
General Preference Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 6.0

GPRL applies a k-dimensional preference model with per-dimension normalized advantages and a drift monitor to LLM post-training, reporting 56.51% length-controlled win rate on AlpacaEval 2.0 and gains on other benchma...
TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
cs.CL 2026-05 unverdicted novelty 6.0

TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
cs.CV 2026-05 unverdicted novelty 6.0

Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
cs.CL 2026-05 conditional novelty 6.0

DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, wit...
Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
cs.CL 2026-05 unverdicted novelty 6.0

DISCA uses disagreement among WVS-grounded persona panels to apply loss-averse logit corrections that reduce cultural misalignment by 10-24% on MultiTP for models 3.8B and larger, without weight changes.
Positive Alignment: Artificial Intelligence for Human Flourishing
cs.AI 2026-05 unverdicted novelty 6.0

Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.
Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs
cs.CL 2026-05 unverdicted novelty 6.0

TPAW uses teams of current and historical model checkpoints that collaborate and compete, plus adaptive weightings for responses and players, to improve self-supervised LLM alignment and outperform baselines.
Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
cs.LG 2026-05 unverdicted novelty 6.0

GraphDPO generalizes pairwise DPO to a graph-structured Plackett-Luce objective over DAGs induced by rollout rankings, enforcing transitivity with linear complexity and recovering DPO as a special case.
Threshold-Guided Optimization for Visual Generative Models
cs.LG 2026-05 unverdicted novelty 6.0

A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.
Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models
cs.LG 2026-05 conditional novelty 6.0

Gate-DPO attenuates gradients on low-probability rejected responses to reduce probability collapse and improve chosen-response likelihood during preference optimization.
Multilingual Safety Alignment via Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0

MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
cs.AI 2026-05 unverdicted novelty 6.0

PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.
Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives
cs.CL 2026-05 unverdicted novelty 6.0

Perplexity gaps between finetuned and reference models on random-prefill completions often reveal the original finetuning objectives across diverse model organisms.
Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints
cs.SD 2026-04 unverdicted novelty 6.0

Rule-generated preference data aligned via sequential DPO and KTO reduces musical constraint violations and improves coherence in lyric-to-melody generation over baselines.
Representation-Guided Parameter-Efficient LLM Unlearning
cs.CL 2026-04 unverdicted novelty 6.0

REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems
cs.LG 2026-04 unverdicted novelty 6.0

AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.
Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation
cs.CL 2026-04 unverdicted novelty 6.0

SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...
Pioneer Agent: Continual Improvement of Small Language Models in Production
cs.AI 2026-04 unverdicted novelty 6.0

Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
Target Policy Optimization
cs.LG 2026-04 unverdicted novelty 6.0

TPO constructs a target distribution q proportional to the old policy times exp(utility) and trains the policy to match it via cross-entropy, matching or beating PPO and GRPO especially under sparse rewards.
Controlling Distributional Bias in Multi-Round LLM Generation via KL-Optimized Fine-Tuning
cs.CL 2026-04 unverdicted novelty 6.0

A hybrid fine-tuning objective using KL divergence for token calibration and Kahneman-Tversky optimization for semantic binding enables LLMs to produce outputs that match desired attribute distributions across repeate...
rePIRL: Learn PRM with Inverse RL for LLM Reasoning
cs.LG 2026-02 unverdicted novelty 6.0

rePIRL learns effective process reward models for LLM reasoning via a dual policy-PRM update process inspired by inverse RL, unifying online and offline methods with reported gains over prior approaches on math and co...
f-GRPO and Beyond: Divergence-Based Reinforcement Learning Algorithms for General LLM Alignment
cs.LG 2026-02 unverdicted novelty 6.0

f-GRPO and f-HAL estimate f-divergences between reward-aligned and reward-unaligned response distributions and prove expected reward improvement for general LLM alignment.
Multiplayer Nash Preference Optimization
cs.AI 2025-09 unverdicted novelty 6.0

MNPO extends NLHF to multiplayer Nash games, inheriting equilibrium guarantees while showing empirical gains on instruction-following benchmarks under diverse preferences.
RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards
cs.CL 2025-09 unverdicted novelty 6.0

RLBFF extracts binary principles from human feedback to train reward models that outperform Bradley-Terry models on RM-Bench and JudgeBench and enable customizable inference-time focus for LLM alignment.
LLM Hypnosis: Exploiting User Feedback for Unauthorized Knowledge Injection to All Users
cs.CL 2025-07 unverdicted novelty 6.0

A single attacker can use strategic upvoting and downvoting on language model outputs to inject facts, security flaws, or fake news that persist in the model for all users after preference tuning.
Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
cs.CL 2025-06 conditional novelty 6.0

High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.
Supervising the search process produces reliable and generalizable information-seeking agents
cs.CL 2025-02 unverdicted novelty 6.0

Process supervision via RAG-Gym produces more reliable and generalizable search agents, with gains driven by higher-quality queries on out-of-domain multi-hop tasks.
Preference Goal Tuning: Post-Training as Latent Control for Frozen Policies
cs.AI 2024-12 unverdicted novelty 6.0

PGT optimizes latent goal embeddings for frozen policies via trajectory-level preference objectives, reporting 72-81.6% relative gains on 17 Minecraft tasks and 13.4% better OOD performance than fine-tuning.
DataComp-LM: In search of the next generation of training sets for language models
cs.LG 2024-06 unverdicted novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
StarCoder 2 and The Stack v2: The Next Generation
cs.SE 2024-02 accept novelty 6.0

StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
Goal-Conditioned Supervised Learning for LLM Fine-Tuning
cs.LG 2026-05 unverdicted novelty 5.0

GCSL reframes LLM fine-tuning as supervised pursuit of quality thresholds using natural-language goals, outperforming SFT and DPO on toxicity, code, and recommendation tasks.
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
cs.CL 2026-05 unverdicted novelty 5.0

StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
Multilingual Safety Alignment via Self-Distillation
cs.LG 2026-05 unverdicted novelty 5.0

MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.
ACE: Self-Evolving LLM Coding Framework via Adversarial Unit Test Generation and Preference Optimization
cs.SE 2026-04 unverdicted novelty 5.0

ACE introduces a solver-adversary loop where an LLM generates both candidate programs and adversarial tests, using execution outcomes for preference optimization to achieve 3-7% pass@1 gains on code benchmarks without...

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 71 Pith papers · 16 internal anchors

[1]

Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

Bai, Y., Jones, A., Ndousse, K., Askell, A., Chen, A., DasSarma, N., Drain, D., Fort, S., Ganguli, D., Henighan, T., et al. Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv preprint arXiv:2204.05862,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Human irrationality: both bad and good for reward inference

Chan, L., Critch, A., and Dragan, A. Human irrationality: both bad and good for reward inference. arXiv preprint arXiv:2111.06956,

work page arXiv
[3]

Chen, M., Tworek, J., Jun, H., Yuan, Q., Pinto, H. P. d. O., Kaplan, J., Edwards, H., Burda, Y., Joseph, N., Brockman, G., et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models

Chen, Z., Deng, Y., Yuan, H., Ji, K., and Gu, Q. Self-play fine-tuning converts weak language models to strong language models. arXiv preprint arXiv:2401.01335,

work page internal anchor Pith review arXiv
[5]

Training Verifiers to Solve Math Word Problems

Cobbe, K., Kosaraju, V., Bavarian, M., Chen, M., Jun, H., Kaiser, L., Plappert, M., Tworek, J., Hilton, J., Nakano, R., Hesse, C., and Schulman, J. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Towards ecologically valid research on language user interfaces

De Vries, H., Bahdanau, D., and Manning, C. Towards ecologically valid research on language user interfaces. arXiv preprint arXiv:2007.14435,

work page arXiv 2007
[7]

The Llama 3 Herd of Models

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

Ganguli, D., Lovitt, L., Kernion, J., Askell, A., Bai, Y., Kadavath, S., Mann, B., Perez, E., Schiefer, N., Ndousse, K., et al. Red teaming language models to reduce harms: Methods, scaling behaviors, and lessons learned. arXiv preprint arXiv:2209.07858,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

ORPO: Monolithic Preference Optimization without Reference Model

Hong, J., Lee, N., and Thorne, J. Reference-free monolithic preference optimization with odds ratio. arXiv preprint arXiv:2403.07691,

work page internal anchor Pith review arXiv
[10]

Mistral 7B

Jiang, A. Q., Sablayrolles, A., Mensch, A., Bamford, C., Chaplot, D. S., Casas, D. d. l., Bressand, F., Lengyel, G., Lample, G., Saulnier, L., et al. Mistral 7b. arXiv preprint arXiv:2310.06825,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

OpenAssistant Conversations -- Democratizing Large Language Model Alignment

Köpf, A., Kilcher, Y., von Rütte, D., Anagnostidis, S., Tam, Z.-R., Stevens, K., Barhoum, A., Duc, N. M., Stanley, O., Nagyfi, R., et al. OpenAssistant conversations -- democratizing large language model alignment. arXiv preprint arXiv:2304.07327,

work page arXiv
[12]

When Humans Aren’t Optimal: Robots that Collaborate with Risk-Aware Humans

Kwon, M., Biyik, E., Talati, A., Bhasin, K., Losey, D. P., and Sadigh, D. When humans aren’t optimal: Robots that collaborate with risk-aware humans. In Proceedings of the 2020 ACM/IEEE international conference on human-robot interaction, pp. 43–52,

work page 2020
[13]

Nash Learning from Human Feedback

Munos, R., Valko, M., Calandriello, D., Azar, M. G., Rowland, M., Guo, Z. D., Tang, Y., Geist, M., Mesnard, T., Michi, A., et al. Nash learning from human feedback. arXiv preprint arXiv:2312.00886,

work page arXiv
[14]

Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

Peng, X. B., Kumar, A., Zhang, G., and Levine, S. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177,

work page internal anchor Pith review Pith/arXiv arXiv 1910
[15]

Direct Nash Optimization: Teaching Language Models to Self-Improve with General Preferences

Rosset, C., Cheng, C.-A., Mitra, A., Santacroce, M., Awadallah, A., and Xie, T. Direct nash optimization: Teaching language models to self-improve with general preferences. arXiv preprint arXiv:2404.03715,

work page arXiv
[16]

Proximal Policy Optimization Algorithms

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

A Minimaximalist Approach to Reinforcement Learning from Human Feedback

Swamy, G., Dann, C., Kidambi, R., Wu, Z. S., and Agarwal, A. A minimaximalist approach to reinforcement learning from human feedback. arXiv preprint arXiv:2401.04056,

work page arXiv
[19]

Fine-tuning Language Models for Factuality

Tian, K., Mitchell, E., Yao, H., Manning, C. D., and Finn, C. Fine-tuning language models for factuality. arXiv preprint arXiv:2311.08401,

work page arXiv
[20]

LLaMA: Open and Efficient Foundation Language Models

Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,

work page internal anchor Pith review Pith/arXiv arXiv
[21]

Xu, H., Sharaf, A., Chen, Y., Tan, W., Shen, L., Van Durme, B., Murray, K., and Kim, Y. J. Contrastive preference optimization: Pushing the boundaries of llm performance in machine translation. arXiv preprint arXiv:2401.08417,

work page arXiv
[22]

Qwen2 Technical Report

Yang, A., Yang, B., Hui, B., Zheng, B., Yu, B., Zhou, C., Li, C., Li, C., Liu, D., Huang, F., et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Self-Rewarding Language Models

Yuan, W., Pang, R. Y., Cho, K., Sukhbaatar, S., Xu, J., and Weston, J. Self-rewarding language models. arXiv preprint arXiv:2401.10020,

work page internal anchor Pith review arXiv
[24]

Zhao, Y., Joshi, R., Liu, T., Khalman, M., Saleh, M., and Liu, P. J. Slic-hf: Sequence likelihood calibration with human feedback. arXiv preprint arXiv:2305.10425,

work page arXiv
[25]

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E., et al. Judging llm-as-a-judge with mt-bench and chatbot arena. arXiv preprint arXiv:2306.05685,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Fine-Tuning Language Models from Human Preferences

Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., and Irving, G. Fine-tuning language models from human preferences. arXiv preprint arXiv:1909.08593,

work page internal anchor Pith review Pith/arXiv arXiv 1909
[27]

A. Related Work. LLM Alignment: Human feedback has been used to improve LLM capabilities in translation (Kreutzer et al., 2018), summarization (Stiennon et al., 2020), sentiment-conditioned generation (Ziegler et al., 2019), and instruction-following (Ouyang et al., 2022). The RLHF framework (Christian...

work page 2018
[28]

Still, momentum has largely shifted in favor of closed-form losses that directly operate on offline preferences, such as DPO (Rafailov et al., 2023)

traditionally used to accomplish this is detailed in §2. Still, momentum has largely shifted in favor of closed-form losses that directly operate on offline preferences, such as DPO (Rafailov et al., 2023). This single stage of optimization distinguishes DPO from the conventional approach in preference-based RL, which learns a reward and then fits the pol...

work page 2023
[29]

self-training

and IPO (Azar et al., 2024). Binary Feedback Despite not being a human-aware loss, unlikelihood training was among the first methods to align language models using a binary signal (Welleck et al., 2019). However, Korbak et al. (2023) found unlikelihood training to be worse than the CSFT baseline we tested in this work, which is among various approaches th...

work page 2024
[30]

As rθ tends to ±∞, the gradient will tend to zero since either (1 − σ(βz)) or σ(βz) will tend to zero

This gradient is simple to interpret: if y is desirable, then d(y) is negative and we push up the probability of πθ(y|x) to minimize the loss; if y is undesirable, then d(y) is positive and we push down the probability of πθ(y|x) to minimize the loss. As rθ tends to ±∞, the gradient will tend to zero since either (1 − σ(βz)) or σ(βz) will tend to zero. Th...

work page 2023
[31]

and (1 − p) ∈ (0, 0.5) respectively. If p^(1/β) πref(ya|x) < (1 − p)^(1/β) πref(yb|x), then the optimal DPO policy is more likely to produce the minority-preferred yb; the optimal KTO policy will strictly produce the majority-preferred ya for a loss-neutral value function (λD = λU). Proof. Where u = β(rθ(x, ya) − rθ(x, yb)), we can write the total DPO loss for x ...

work page 2023
