pith. machine review for the scientific record. sign in

arxiv: 2503.10990 · v2 · submitted 2025-03-14 · 💻 cs.GT · cs.LG· econ.TH· math.ST· stat.ML· stat.TH

Recognition: unknown

Statistical Impossibility and Possibility of Aligning LLMs with Human Preferences: From Condorcet Paradox to Nash Equilibrium

Authors on Pith no claims yet
classification 💻 cs.GT cs.LGecon.THmath.STstat.MLstat.TH
keywords humanpreferencesllmsaligningmodelcondorcetstatisticalunder
0
0 comments X
read the original abstract

Aligning large language models (LLMs) with diverse human preferences is critical for ensuring fairness and informed outcomes when deploying these models for decision-making. In this paper, we seek to uncover fundamental statistical limits concerning aligning LLMs with human preferences, with a focus on the probabilistic representation of human preferences and the preservation of diverse preferences in aligned LLMs. We first show that human preferences can be represented by a reward model if and only if the preference among LLM-generated responses is free of any Condorcet cycle. Moreover, we prove that Condorcet cycles exist with probability converging to one exponentially fast under a general probabilistic preference model called the Luce model, thereby demonstrating the impossibility of fully aligning human preferences using reward-based approaches such as reinforcement learning from human feedback. Next, we explore the conditions under which LLMs would employ mixed strategies -- meaning they do not collapse to a single response -- when aligned in the limit using a non-reward-based approach, such as Nash learning from human feedback. We identify a necessary and sufficient condition for mixed strategies: the absence of a response that is preferred over all others by a majority. As a blessing, we prove that this condition holds with high probability under the Luce model, thereby highlighting the statistical possibility of preserving minority preferences without explicit regularization in aligning LLMs.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Near-Optimal Last-Iterate Convergence for Zero-Sum Games with Bandit Feedback and Opponent Actions

    cs.LG 2026-05 unverdicted novelty 8.0

    With opponent-action feedback in zero-sum games, an efficient algorithm achieves near-optimal t^{-1/2} last-iterate convergence in duality gap with high probability.

  2. Perturbation is All You Need for Extrapolating Language Models

    stat.ML 2026-05 unverdicted novelty 6.0

    Perturbing prefixes to semantic neighbors during training creates a hierarchical noise model that improves language model predictions on token sequences outside the training corpus support.

  3. Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    Kernel smoothing yields accurate value and gradient estimates for low-variance policy learning in LLM reasoning under tight per-prompt sampling budgets.