Recognition: 2 theorem links · Lean Theorem
KTO: Model Alignment as Prospect Theoretic Optimization
Pith reviewed 2026-05-12 12:13 UTC · model grok-4.3
The pith
KTO aligns LLMs by maximizing prospect-theoretic utility from binary desirability signals rather than paired preferences.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.
What carries the argument
KTO, a human-aware loss (HALO) that applies the prospect-theoretic value function to assign utilities to model outputs according to whether they are desirable, and maximizes the resulting expected utility.
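To make that loss shape concrete, here is a minimal PyTorch sketch of a KTO-style objective written from the description above; the tensor names, hyperparameter defaults, and the scalar reference point z_ref are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def kto_style_loss(policy_logps, ref_logps, is_desirable,
                   beta=0.1, lambda_d=1.0, lambda_u=1.0, z_ref=0.0):
    """Sketch of a prospect-theoretic utility-maximization loss on binary labels.

    policy_logps, ref_logps: log p(y|x) under the policy and a frozen reference model.
    is_desirable: boolean tensor carrying the binary desirability signal.
    z_ref: reference point separating gains from losses (its choice is debated below).
    """
    r = policy_logps - ref_logps                              # implied reward (log-ratio)
    gain = lambda_d * torch.sigmoid(beta * (r - z_ref))       # value of desirable outputs
    loss_val = lambda_u * torch.sigmoid(beta * (z_ref - r))   # value of undesirable outputs
    value = torch.where(is_desirable, gain, loss_val)
    weight = torch.where(is_desirable,
                         torch.full_like(value, lambda_d),
                         torch.full_like(value, lambda_u))
    # minimizing (weight - value) maximizes the expected prospect-theoretic utility
    return (weight - value).mean()
```

Raising lambda_u relative to lambda_d penalizes undesirable outputs more heavily, which is one way this sketch can encode the loss aversion the pith emphasizes.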
If this is right
- KTO matches or exceeds the performance of preference-based methods at scales from 1B to 30B using only binary signals.
- Current alignment objectives implicitly incorporate prospect theory biases, explaining part of their success over cross-entropy.
- There is no universally superior HALO; the best loss depends on the inductive biases appropriate for the setting.
- Alignment can succeed by directly optimizing a utility function rather than preference log-likelihood.
Where Pith is reading between the lines
- Binary desirability labels may be sufficient for high-quality alignment because they allow direct utility maximization without needing preference pairs.
- This approach could make alignment more accessible by reducing the data collection burden compared to methods requiring comparative judgments.
- The lack of a universal best HALO suggests that practitioners should select the loss function based on how well its biases match the target domain.
Load-bearing premise
That the specific utility function from prospect theory literature accurately captures human judgments of LLM outputs and that optimizing it with only binary desirability labels is sufficient without additional modeling assumptions or reference-point choices.
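For reference, the value function this premise points to is the canonical two-part form from Tversky and Kahneman (1992), with the parameter estimates quoted later in the rebuttal:

\[
v(x) =
\begin{cases}
x^{\alpha} & \text{if } x \ge 0,\\
-\lambda\,(-x)^{\beta} & \text{if } x < 0,
\end{cases}
\qquad \alpha = \beta = 0.88,\quad \lambda = 2.25,
\]

where x is an outcome measured relative to the reference point; α, β < 1 give concavity over gains and convexity over losses, and λ > 1 encodes loss aversion.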
What would settle it
If models trained with KTO on binary labels receive significantly lower human preference win rates than DPO-trained models on paired data, or if collected human ratings of output desirability deviate from the shape of the prospect theory value function used by KTO.
original abstract
Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that existing LLM alignment methods (e.g., DPO) implicitly belong to a family of human-aware losses (HALOs) that encode prospect-theoretic biases from Kahneman-Tversky utility. It proposes KTO, which directly optimizes a prospect theory value function v(x) on binary desirability labels for generations rather than pairwise preferences, and reports that KTO matches or exceeds preference-based baselines across 1B–30B model scales.
Significance. If the empirical results hold under rigorous evaluation, the work is significant for showing that competitive alignment is possible with weaker (binary) supervision, which could reduce data collection costs. The HALO framing and observation that no single loss is universally optimal provide a useful conceptual lens for choosing alignment objectives based on inductive biases. The paper does not ship reproducible code or machine-checked proofs, so credit is limited to the conceptual contribution.
major comments (3)
- [§3] §3 (KTO objective): The reference point used to classify binary labels as gains or losses is not explicitly defined or ablated. Prospect theory's value function is defined relative to this point, so the lack of justification for the choice (e.g., zero, model prior expectation, or other) and the scaling of binary signals into numeric gains/losses is load-bearing for the claim that the specific Kahneman-Tversky utility provides the performance advantage.
- [§5] §5 (Experiments, Tables 1–3): Win-rate differences between KTO and DPO-style baselines are small (typically 1–3 points) at 7B–30B scales, yet no standard errors, number of evaluation prompts, or statistical tests are reported. This makes it impossible to assess whether KTO truly matches or exceeds the baselines, directly undermining the central empirical claim.
- [§3.2] §3.2 (Utility parameters): The prospect theory coefficients (α, β, λ) are taken directly from the 1992 literature without ablation or sensitivity analysis on the alignment task. If performance is sensitive to these fixed values, the results may reflect a particular loss shape rather than the claimed theoretical grounding.
minor comments (2)
- [§2] The definition of the HALO family in §2 could be made more precise by including an explicit mathematical characterization rather than a descriptive list.
- [Figure 2] Figure 2 (loss curves) lacks axis labels on the y-scale in some panels, reducing clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below and indicate the revisions that will be incorporated into the next version of the manuscript.
point-by-point responses
- Referee: [§3] §3 (KTO objective): The reference point used to classify binary labels as gains or losses is not explicitly defined or ablated. Prospect theory's value function is defined relative to this point, so the lack of justification for the choice (e.g., zero, model prior expectation, or other) and the scaling of binary signals into numeric gains/losses is load-bearing for the claim that the specific Kahneman-Tversky utility provides the performance advantage.
  Authors: We will revise §3 to explicitly state that the reference point is set to zero, with desirable generations assigned a positive scalar utility and undesirable generations a negative scalar utility. This choice follows directly from the binary supervision signal, which provides only a directional indicator rather than a magnitude; zero is the natural neutral point separating gains from losses. We will add a short paragraph justifying this mapping and noting that it preserves the key prospect-theoretic asymmetry (loss aversion) without requiring a model-dependent reference. A full ablation of alternative references is not performed, but the performance gains relative to symmetric losses (e.g., standard cross-entropy) are attributable to the functional form rather than the precise reference location.
  revision: partial
- Referee: [§5] §5 (Experiments, Tables 1–3): Win-rate differences between KTO and DPO-style baselines are small (typically 1–3 points) at 7B–30B scales, yet no standard errors, number of evaluation prompts, or statistical tests are reported. This makes it impossible to assess whether KTO truly matches or exceeds the baselines, directly undermining the central empirical claim.
  Authors: We agree that the lack of standard errors and statistical tests weakens the ability to interpret the small observed differences. In the revised manuscript we will report the exact number of evaluation prompts per benchmark, include standard errors obtained via bootstrap resampling over the evaluation set, and add paired statistical tests (e.g., Wilcoxon signed-rank) comparing KTO against each baseline. While the absolute margins are modest, the consistent pattern across model scales and the fact that KTO succeeds with strictly weaker (binary) supervision remain the central empirical observations. A sketch of these statistics appears after the point-by-point responses.
  revision: yes
- Referee: [§3.2] §3.2 (Utility parameters): The prospect theory coefficients (α, β, λ) are taken directly from the 1992 literature without ablation or sensitivity analysis on the alignment task. If performance is sensitive to these fixed values, the results may reflect a particular loss shape rather than the claimed theoretical grounding.
  Authors: The parameters α=0.88, β=0.88, λ=2.25 are the canonical values reported by Tversky and Kahneman (1992) that produce the characteristic concave/convex shape and loss-aversion coefficient of prospect theory. Our contribution is to show that a loss derived from this established functional form is competitive for alignment, not to claim that these exact coefficients are optimal for the task. To address sensitivity concerns we will add an appendix analysis that perturbs the parameters within plausible ranges (e.g., λ ∈ [1.5, 3.0]) and demonstrates that KTO performance remains stable, supporting that the qualitative shape rather than the precise numerical values drives the results.
  revision: yes
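A minimal sketch of the first step of the promised sensitivity check, tabulating the Kahneman-Tversky value function over the λ range mentioned above; whether downstream alignment metrics move with λ is the empirical question the planned appendix would have to answer.

```python
import numpy as np

def kt_value(x, alpha=0.88, beta=0.88, lam=2.25):
    """Canonical Tversky-Kahneman (1992) two-part value function."""
    return np.where(x >= 0, np.abs(x) ** alpha, -lam * np.abs(x) ** beta)

outcomes = np.array([-1.0, -0.5, -0.1, 0.1, 0.5, 1.0])  # measured from the reference point
for lam in (1.5, 2.25, 3.0):
    vals = ", ".join(f"{v:+.2f}" for v in kt_value(outcomes, lam=lam))
    print(f"lambda={lam:.2f}: v = [{vals}]")
```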
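And a sketch of the statistics promised in the second response: a bootstrap standard error over evaluation prompts plus a paired Wilcoxon signed-rank test. The per-prompt scores below are synthetic placeholders, not the paper's evaluation data.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)

# placeholder per-prompt judge scores on a shared evaluation set
kto_scores = rng.normal(7.2, 1.0, size=500)
dpo_scores = rng.normal(7.1, 1.0, size=500)

wins = (kto_scores > dpo_scores).astype(float)

# bootstrap standard error of the KTO-vs-DPO win rate over prompts
boot = np.array([rng.choice(wins, size=wins.size, replace=True).mean()
                 for _ in range(2000)])
print(f"win rate {wins.mean():.3f} +/- {boot.std(ddof=1):.3f} (bootstrap SE)")

# paired test on per-prompt score differences
stat, p = wilcoxon(kto_scores - dpo_scores)
print(f"Wilcoxon signed-rank: statistic={stat:.1f}, p={p:.3g}")
```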
Circularity Check
No significant circularity; derivation is self-contained
full rationale
The paper adopts the Kahneman-Tversky prospect theory utility function directly from the 1992 external literature and defines KTO as a new HALO that maximizes this utility on binary desirability labels rather than preference log-likelihoods. No load-bearing step reduces by construction to a fitted parameter, self-defined quantity, or self-citation chain; the implicit-bias analysis of prior methods (DPO etc.) and the performance claims at 1B-30B scales rest on independent empirical evaluation outside any tautological mapping. The reference-point and parameter choices are taken as given from prospect theory rather than optimized against the paper's own outputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- prospect theory parameters (e.g., loss aversion coefficient)
axioms (1)
- Domain assumption: humans perceive random variables in a biased but well-defined manner according to prospect theory.
invented entities (1)
- Human-aware losses (HALOs): no independent evidence
Forward citations
Cited by 36 Pith papers
- TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
  TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
- Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
  dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
- Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
  Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
- Block-R1: Rethinking the Role of Block Size in Multi-domain Reinforcement Learning for Diffusion Large Language Models
  Block-R1 formulates domain block size conflicts in multi-domain RL for dLLMs, releases a 41K-sample dataset with per-sample best block sizes and a conflict score, and provides a benchmark plus simple cross-domain trai...
- Offline Preference Optimization for Rectified Flow with Noise-Tracked Pairs
  PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
- Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
  The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...
- PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization
  PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token co...
- Mind the Gap: Structure-Aware Consistency in Preference Learning
  Standard DPO surrogates are inconsistent for equicontinuous neural nets; SA-DPO provides structure-aware H-consistency bounds by adapting margins to semantic distance and shows heavy-tailed losses yield superior guara...
- Three Models of RLHF Annotation: Extension, Evidence, and Authority
  RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
- HiPO: Hierarchical Preference Optimization for Adaptive Reasoning in LLMs
  HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.
- DDO-RM: Distribution-Level Policy Improvement after Reward Learning
  DDO-RM turns reward scores into a target distribution and applies KL-regularized mirror-descent projection on finite candidates to improve policies, outperforming DPO on Pythia-410M.
- TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching
  TBPO derives a token-level preference optimization objective from sequence-level pairwise data via Bregman divergence ratio matching that generalizes DPO and improves alignment quality.
- Power Reinforcement Post-Training of Text-to-Image Models with Super-Linear Advantage Shaping
  Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to r...
- Positive Alignment: Artificial Intelligence for Human Flourishing
  Positive Alignment introduces AI systems that support human flourishing pluralistically and proactively while remaining safe, as a necessary complement to traditional safety-focused alignment research.
- Team-Based Self-Play With Dual Adaptive Weighting for Fine-Tuning LLMs
  TPAW uses teams of current and historical model checkpoints that collaborate and compete, plus adaptive weightings for responses and players, to improve self-supervised LLM alignment and outperform baselines.
- Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
  GraphDPO generalizes pairwise DPO to a graph-structured Plackett-Luce objective over DAGs induced by rollout rankings, enforcing transitivity with linear complexity and recovering DPO as a special case.
- Threshold-Guided Optimization for Visual Generative Models
  A threshold-guided alignment method lets visual generative models be optimized directly from scalar human ratings instead of requiring paired preference data.
- Gradient-Gated DPO: Stabilizing Preference Optimization in Language Models
  Gate-DPO attenuates gradients on low-probability rejected responses to reduce probability collapse and improve chosen-response likelihood during preference optimization.
- Multilingual Safety Alignment via Self-Distillation
  MSD enables cross-lingual safety transfer in LLMs via self-distillation with Dual-Perspective Safety Weighting, improving safety in low-resource languages without target response data.
- PERSA: Reinforcement Learning for Professor-Style Personalized Feedback with LLMs
  PERSA combines RLHF with selective parameter-efficient updates to top transformer layers, raising style alignment scores from 35% to 96% on code feedback benchmarks while holding correctness near 100%.
- Model Organisms Are Leaky: Perplexity Differencing Often Reveals Finetuning Objectives
  Perplexity gaps between finetuned and reference models on random-prefill completions often reveal the original finetuning objectives across diverse model organisms.
- Aligning Language Models for Lyric-to-Melody Generation with Rule-Based Musical Constraints
  Rule-generated preference data aligned via sequential DPO and KTO reduces musical constraint violations and improves coherence in lyric-to-melody generation over baselines.
- Representation-Guided Parameter-Efficient LLM Unlearning
  REGLU guides LoRA-based unlearning via representation subspaces and orthogonal regularization to outperform prior methods on forget-retain trade-off in LLM benchmarks.
- AutoOR: Scalably Post-training LLMs to Autoformalize Operations Research Problems
  AutoOR uses synthetic data generation and RL post-training with solver feedback to enable 8B LLMs to autoformalize linear, mixed-integer, and non-linear OR problems, matching larger models on benchmarks.
- Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation
  SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...
- Pioneer Agent: Continual Improvement of Small Language Models in Production
  Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
- Target Policy Optimization
  TPO constructs a target distribution q proportional to the old policy times exp(utility) and trains the policy to match it via cross-entropy, matching or beating PPO and GRPO especially under sparse rewards.
- Controlling Distributional Bias in Multi-Round LLM Generation via KL-Optimized Fine-Tuning
  A hybrid fine-tuning objective using KL divergence for token calibration and Kahneman-Tversky optimization for semantic binding enables LLMs to produce outputs that match desired attribute distributions across repeate...
- Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning
  High-entropy minority tokens drive RLVR gains, so restricting gradients to the top 20% maintains or improves performance over full updates on Qwen3 models, especially larger ones.
- StarCoder 2 and The Stack v2: The Next Generation
  StarCoder2-15B matches or beats CodeLlama-34B on code tasks despite being smaller, and StarCoder2-3B outperforms prior 15B models, with open weights and exact training data identifiers released.
- StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
  StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
- Multilingual Safety Alignment via Self-Distillation
  MSD transfers LLM safety from high-resource to low-resource languages via self-distillation and dual-perspective weighting without needing response data.
- Medical Reasoning with Large Language Models: A Survey and MR-Bench
  LLMs show strong exam performance on medical tasks but exhibit a clear gap in accuracy on authentic clinical decision-making as measured by the new MR-Bench benchmark and unified evaluations.
- Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models
  The paper unifies perspectives on Long CoT in reasoning LLMs by introducing a taxonomy, detailing characteristics of deep reasoning and reflection, and discussing emergence phenomena and future directions.
- Positive Alignment: Artificial Intelligence for Human Flourishing
  Positive Alignment is introduced as a distinct AI agenda that supports human flourishing through pluralistic and context-sensitive design, complementing traditional safety-focused alignment.
- K-CARE: Knowledge-driven Symmetrical Contextual Anchoring and Analogical Prototype Reasoning for E-commerce Relevance
  K-CARE uses behavior-derived anchoring and expert prototype analogies to ground LLMs and improve relevance on knowledge-intensive e-commerce cases.