NPO enables stable unlearning of 50%+ training data in LLMs on TOFU by making collapse exponentially slower than gradient ascent, preserving sensible outputs where prior methods fail.
hub Canonical reference
KTO: Model Alignment as Prospect Theoretic Optimization
Canonical reference. 73% of citing Pith papers cite this work as background.
abstract
Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect theory literature. Using a Kahneman-Tversky model of human utility, we propose a HALO that directly maximizes the utility of generations instead of maximizing the log-likelihood of preferences, as current methods do. We call this approach KTO, and it matches or exceeds the performance of preference-based methods at scales from 1B to 30B, despite only learning from a binary signal of whether an output is desirable. More broadly, our work suggests that there is no one HALO that is universally superior; the best loss depends on the inductive biases most appropriate for a given setting, an oft-overlooked consideration.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Kahneman & Tversky's $\textit{prospect theory}$ tells us that humans perceive random variables in a biased but well-defined manner (1992); for example, humans are famously loss-averse. We show that objectives for aligning LLMs with human feedback implicitly incorporate many of these biases -- the success of these objectives (e.g., DPO) over cross-entropy minimization can partly be ascribed to them belonging to a family of loss functions that we call $\textit{human-aware losses}$ (HALOs). However, the utility functions these methods attribute to humans still differ from those in the prospect th
- method ate a base set of N responses, denoted as Dbase = {(x, ri, ai)}N i=1, where ri is the textual response anda i ∈ Ais the corresponding attribute. To embed the target distributionP∗ into the train- ing data, we explicitly control the generation fre- quency such that the count Nk of responses exhibit- ing attributea k satisfies: Nk =round(N·P ∗(ak|x))(6) For instance, given a target distribution of {Male: 0.99, Female: 0.01} and N= 100 , Dbase will contain 99 responses with the Male attribute and 1
- method policy log-probability ratios against pairwise preference data relative to a fixed reference model. This reformulation reduces alignment to a stable classification-style objective while retaining strong em- pirical performance. As a result, DPO has inspired a growing family of reference-based, reward-free alignment methods, including IPO [11], KTO [12], SimPO [13], ORPO [14], and iterative or online variants such as SPIN [15]. Preprint. arXiv:2605.08037v1 [cs.LG] 8 May 2026 The pairwise and list
- background non-linear optimization problems involving phys- ical dynamics. We follow a scalable backtrans- lation based synthetic data generation strategy described in Section 3.2. 2.3. RL for Reasoning and Code Generation Group Relative Policy Optimization (GRPO) [31] eliminates the critic model from PPO [32] by sampling groups of outputs and normalizing ad- vantages within each group; DeepSeek-R1 [33] showed that complex reasoning strategies emerge from GRPO with verifiable rewards alone, and Dr. GRPO [3
- background Orpo: Monolithic preference optimization without reference model. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 11170-11189, 2024. [63] Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward.Advances in Neural Information Processing Systems, 37:124198- 124235, 2024. [64] Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. Kto: Model alignment as prospect theoretic opti
- background [172] Chongyu Fan, Yihua Zhang, Jinghan Jia, Alfred Hero, and Sijia Liu. Cyclicreflex: Im- proving large reasoning models via cyclical reflection token scheduling. arXiv preprint arXiv:2506.11077, 2025. [173] Siqi Fan, Peng Han, Shuo Shang, Yequan Wang, and Aixin Sun. Cothink: Token-efficient reasoning via instruct models guiding reasoning models. arXiv preprint arXiv:2505.22017, 2025. [174] Tiantian Fan, Lingjun Liu, Yu Yue, Jiaze Chen, Chengyi Wang, Qiying Yu, Chi Zhang, Zhiqi Lin, Ruofei Zhu,
- background [59] proposed a two-stage strat- egy combining SFT and Feasibility-and-Optimality-Aware Reinforcement Learning (FOARL) to guide LLMs and improve solution quality. 3.2.2 Reinforcement Learning RL strategies are introduced to enhance model robustness. To address hallucina- tion issues in LLMs, Jiang et al. [60] incorporated Kahneman-Tversky Optimization (KTO) [61] along with self-correction mechanisms, and proposed LLMOPT, which has been validated across six real-world datasets spanning 20 domains
co-cited works
representative citing papers
ORPO performs preference alignment during supervised fine-tuning via a monolithic odds ratio penalty, allowing 7B models to outperform larger state-of-the-art models on alignment benchmarks.
CrossVLA introduces a surrogate log-probability estimator to enable DPO on flow-matching VLAs, reports DoRA yielding +10.4 pp mean gains over SFT on LIBERO with 600 trials, and shows inference caching limited to 21% speedup with some strategies harming success rates.
DPO-RLHF equivalence holds only conditionally on the optimal policy preferring human-preferred responses; otherwise DPO optimizes relative advantage and can prefer worse outputs, addressed by introducing CPO.
Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.
VPD frames language feedback learning as variational EM so the teacher policy refines itself via trust-region updates on outcomes while the student learns dense token distributions on its own rollouts, outperforming fixed-teacher baselines on reasoning and code tasks.
TBPO posits a token-level Bradley-Terry model and derives a Bregman-divergence density-ratio matching loss that generalizes DPO while preserving token-level optimality.
dGRPO merges outcome-based policy optimization with dense teacher guidance from on-policy distillation, yielding more stable long-context reasoning on the new LongBlocks synthetic dataset.
Introduces Block-R1 benchmark, Block-R1-41K dataset, and a conflict score to handle domain-specific optimal block sizes in RL post-training of diffusion LLMs.
PNAPO augments preference data with prior noise pairs and uses straight-line interpolation to create a tighter surrogate objective for offline alignment of rectified flow models.
The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on mathematical reasoning tasks.
PairAlign learns compact audio token sequences via self-alignment of paired content views using an autoregressive decoder, achieving strong cross-view consistency and edit-distance preservation while reducing token count by 55% on TIMIT.
Standard DPO surrogates are inconsistent for equicontinuous neural nets; SA-DPO provides structure-aware H-consistency bounds by adapting margins to semantic distance and shows heavy-tailed losses yield superior guarantees for capacity-bounded models via the Margin-Capacity Profile.
RLHF should decompose annotations into dimensions each matched to one of three models—extension, evidence, or authority—instead of applying a single unified pipeline.
HiPO improves LLM reasoning performance by optimizing preferences separately on response segments rather than entire outputs.
DDO-RM turns reward scores into a target distribution and applies KL-regularized mirror-descent projection on finite candidates to improve policies, outperforming DPO on Pythia-410M.
CapTrack shows post-training causes drift beyond facts, with instruction fine-tuning producing stronger behavioral changes than preference optimization across model families.
EyeMulator augments CodeLLM fine-tuning loss with token weights derived from human eye-tracking scan paths, producing large gains on code translation and summarization across StarCoder, Llama-3.2 and DeepSeek-Coder.
Magpie synthesizes 300K high-quality alignment instructions from Llama-3-Instruct via auto-regressive prompting on partial templates, enabling fine-tuned models to match official instruct performance on AlpacaEval, ArenaHard, and WildBench.
COALA applies convex optimization reformulations of neural networks to direct preference optimization, claiming single-GPU training with ~18% of DPO's TFLOPs and competitive performance on multiple datasets and models up to 8B parameters.
Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.
DEFLECT is an offline post-training method that improves async VLA policy success rates under high inference delays by using flow-matching likelihood ratios on counterfactual fresh/stale action pairs from a frozen reference policy.
GPRL carries a k-dimensional skew-symmetric preference structure into policy updates with per-dimension advantages and a drift monitor, yielding 56.51% length-controlled win rate on AlpacaEval 2.0 from Llama-3-8B-Instruct while outperforming SimPO and SPPO on other benchmarks.
Super-Linear Advantage Shaping (SLAS) introduces a non-linear geometric policy update for RL post-training of text-to-image models that reshapes the local policy space via advantage-dependent Fisher-Rao weighting to reduce reward hacking and improve performance over GRPO baselines.
citing papers explorer
-
Beyond Pairs: Your Language Model is Secretly Optimizing a Preference Graph
GraphDPO generalizes pairwise DPO to a graph-structured Plackett-Luce objective over DAGs induced by rollout rankings, enforcing transitivity with linear complexity and recovering DPO as a special case.
-
Controlling Distributional Bias in Multi-Round LLM Generation via KL-Optimized Fine-Tuning
A hybrid fine-tuning objective using KL divergence for token calibration and Kahneman-Tversky optimization for semantic binding enables LLMs to produce outputs that match desired attribute distributions across repeated prompts.