Statistical Impossibility and Possibility of Aligning LLMs with Human Preferences: From Condorcet Paradox to Nash Equilibrium

Jiancong Xiao; Kaizhao Liu; Qi Long; Weijie J. Su; Zhekun Shi

arXiv: 2503.10990 · v2 · submitted 2025-03-14 · cs.GT · cs.LG · econ.TH · math.ST · stat.ML · stat.TH

Pith reviewed 2026-05-23 00:43 UTC · model grok-4.3

classification cs.GT · cs.LG · econ.TH · math.ST · stat.ML · stat.TH

keywords Condorcet paradox · Nash equilibrium · Luce model · LLM alignment · reward model · human preferences · RLHF · mixed strategies

The pith

Human preferences admit a reward model if and only if they contain no Condorcet cycle, yet such cycles arise with probability approaching one under the Luce model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reward-based alignment of LLMs is statistically impossible when preferences contain Condorcet cycles, because those cycles prevent any single reward function from representing the data. It proves the cycles appear with probability that converges exponentially to one under the Luce probabilistic preference model. This rules out complete alignment by methods such as RLHF. In contrast, non-reward alignment via Nash equilibria from human feedback produces mixed strategies, and therefore preserves minority preferences, precisely when no response is majority-preferred over all others; that condition also holds with high probability under the same model.

Core claim

Human preferences can be represented by a reward model if and only if the preference among LLM-generated responses is free of any Condorcet cycle. Under the Luce model, Condorcet cycles exist with probability converging to one exponentially fast, demonstrating the impossibility of fully aligning human preferences using reward-based approaches. For non-reward-based alignment, mixed strategies are used if and only if no response is preferred over all others by a majority; this condition holds with high probability under the Luce model, allowing preservation of minority preferences.
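The two objects in this claim can be made concrete with a minimal Python sketch. The matrix layout `pref[i][j]` (the share of people who prefer response i to response j), the function names, and the reading of "Condorcet cycle" as a cycle in the strict-majority relation are assumptions made for illustration, not definitions taken from the paper. The sketch also assumes no exact 50/50 ties; in that case any cycle contains a cycle of length three, so checking triples is enough.

```python
from itertools import permutations

def majority_beats(pref, i, j):
    """True if a strict majority prefers response i to response j."""
    return pref[i][j] > 0.5

def find_condorcet_cycle(pref):
    """Return a triple (a, b, c) with a > b > c > a by majority, or None.

    Assumes no exact ties, so a 3-cycle exists whenever any cycle does.
    """
    n = len(pref)
    for a, b, c in permutations(range(n), 3):
        if (majority_beats(pref, a, b)
                and majority_beats(pref, b, c)
                and majority_beats(pref, c, a)):
            return (a, b, c)
    return None

def condorcet_winner(pref):
    """Return the response that a majority prefers to every other one, or None."""
    n = len(pref)
    for i in range(n):
        if all(majority_beats(pref, i, j) for j in range(n) if j != i):
            return i
    return None

# Three people with rankings 0>1>2, 1>2>0 and 2>0>1 (the classic paradox).
cyclic = [
    [0.5, 2 / 3, 1 / 3],
    [1 / 3, 0.5, 2 / 3],
    [2 / 3, 1 / 3, 0.5],
]
# A hypothetical transitive profile in which response 0 beats everything.
transitive = [
    [0.5, 0.7, 0.8],
    [0.3, 0.5, 0.6],
    [0.2, 0.4, 0.5],
]

print(find_condorcet_cycle(cyclic), condorcet_winner(cyclic))
print(find_condorcet_cycle(transitive), condorcet_winner(transitive))
```

In the cyclic profile there is a cycle and no majority-preferred response; in the transitive profile there is no cycle and response 0 is majority-preferred.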

What carries the argument

The if-and-only-if equivalence between reward representability and the absence of Condorcet cycles in pairwise preferences, together with the majority-preference condition that triggers mixed strategies in Nash equilibrium.
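One direction of the equivalence can be illustrated constructively. The sketch assumes an ordinal notion of representation: a reward vector `r` with `r[i] > r[j]` exactly when a strict majority prefers response i to response j. That reading and the name `reward_from_majority` are illustrative assumptions, and the paper's formal definition may differ. Under this reading, whenever some reward exists, the count of pairwise majority wins is itself a valid reward, so the function tests that single candidate.

```python
def reward_from_majority(pref):
    """Return a reward vector consistent with the majority relation, or None.

    pref[i][j] is the share of people preferring response i to response j.
    The candidate reward for i is the number of responses that i beats by
    strict majority; it is returned only if it reproduces every comparison.
    """
    n = len(pref)
    wins = [
        sum(1 for j in range(n) if j != i and pref[i][j] > 0.5)
        for i in range(n)
    ]
    for i in range(n):
        for j in range(n):
            if i != j and (pref[i][j] > 0.5) != (wins[i] > wins[j]):
                return None  # no ordinal reward can represent this relation
    return wins

# Hypothetical transitive profile: a reward exists.
transitive = [
    [0.5, 0.7, 0.8],
    [0.3, 0.5, 0.6],
    [0.2, 0.4, 0.5],
]
# Classic three-way cycle: every response wins once, so no reward works.
cyclic = [
    [0.5, 2 / 3, 1 / 3],
    [1 / 3, 0.5, 2 / 3],
    [2 / 3, 1 / 3, 0.5],
]

print(reward_from_majority(transitive))
print(reward_from_majority(cyclic))
```

The cyclic profile fails because all three responses tie on wins while the majority relation still ranks each pair strictly.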

If this is right

Reward-based methods such as RLHF cannot achieve full alignment whenever Condorcet cycles are present.
Nash learning from human feedback yields mixed strategies rather than collapse to a single response when the no-majority-dominant condition holds.
Minority preferences are preserved without explicit regularization because the required condition occurs with high probability.
The statistical possibility of diverse outputs follows directly from the generic absence of a universally majority-preferred response under the Luce model.
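The mixed-strategy consequence can be checked by hand on a small example. The sketch assumes the alignment game is the symmetric zero-sum game in which playing response i against response j pays `pref[i][j] - 0.5`, a common formulation of Nash learning from human feedback; the function name `exploitability` is hypothetical. Because the game has value 0, a strategy is a symmetric equilibrium when no single response earns a positive payoff against it.

```python
def exploitability(pref, p):
    """Best payoff any single response can earn against mixed strategy p.

    Assumes pref[i][j] + pref[j][i] == 1 and pref[i][i] == 0.5, so the payoff
    matrix pref - 0.5 is skew-symmetric and the result is never negative.
    p is a symmetric Nash equilibrium when the result is (numerically) zero.
    """
    n = len(pref)
    return max(
        sum((pref[i][j] - 0.5) * p[j] for j in range(n))
        for i in range(n)
    )

# Classic three-way cycle: no response is majority-preferred over all others.
cyclic = [
    [0.5, 2 / 3, 1 / 3],
    [1 / 3, 0.5, 2 / 3],
    [2 / 3, 1 / 3, 0.5],
]
# Hypothetical profile in which response 0 is majority-preferred over all others.
with_winner = [
    [0.5, 0.7, 0.8],
    [0.3, 0.5, 0.6],
    [0.2, 0.4, 0.5],
]

uniform = [1 / 3, 1 / 3, 1 / 3]
pure = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]

print(exploitability(cyclic, uniform))                # numerically zero
print([exploitability(cyclic, p) for p in pure])      # all positive
print(exploitability(with_winner, pure[0]))           # zero
```

In the cyclic profile every deterministic policy can be beaten while the uniform mixture cannot, so the equilibrium is mixed; once a majority-preferred response exists, the policy that always plays it is already an equilibrium.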

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Alignment pipelines may need to adopt equilibrium-finding algorithms rather than reward maximization to handle preference diversity at scale.
Deployed LLMs aligned via the Nash route could naturally output distributions over responses instead of deterministic answers.
The same cycle-probability analysis could be repeated for other probabilistic preference models to test whether the impossibility result is model-specific.

Load-bearing premise

Human preferences over LLM-generated responses are generated according to the Luce probabilistic preference model.
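For reference, the textbook form of Luce's choice rule is sketched here. The symbols $v_i$ (a positive weight attached to response $i$) and $S$ (the set of responses on offer) are notation assumed for this note, and the paper's formal definition may carry additional structure.

```latex
\[
  \Pr(i \text{ is chosen from } S) = \frac{v_i}{\sum_{j \in S} v_j},
  \qquad
  \Pr(i \succ j) = \frac{v_i}{v_i + v_j}.
\]
```

Under this rule a pairwise probability is close to 1/2 exactly when the two weights are close, which is the regime the referee's comments focus on.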

What would settle it

An empirical measurement of the frequency of Condorcet cycles in a large collection of human pairwise preference judgments on LLM responses; a frequency that remains bounded away from one would falsify the exponential-convergence claim.
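A measurement of this kind could be organized along the following lines. This is a sketch under stated assumptions: the data layout (one list of `(winner, loser)` judgments per prompt) and the names `majority_graph`, `has_cycle`, and `cycle_frequency` are hypothetical, and pairs with tied or missing judgments are simply left without an edge.

```python
from collections import defaultdict

def majority_graph(judgments):
    """Build the strict-majority digraph from (winner, loser) judgments."""
    wins = defaultdict(int)
    items = set()
    for winner, loser in judgments:
        wins[(winner, loser)] += 1
        items.update((winner, loser))
    edges = {x: set() for x in items}
    for a in items:
        for b in items:
            # Edge a -> b when more judges picked a over b than b over a.
            if a != b and wins[(a, b)] > wins[(b, a)]:
                edges[a].add(b)
    return edges

def has_cycle(edges):
    """Depth-first search for a directed cycle in the majority digraph."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {x: WHITE for x in edges}

    def visit(x):
        colour[x] = GREY
        for y in edges[x]:
            if colour[y] == GREY or (colour[y] == WHITE and visit(y)):
                return True
        colour[x] = BLACK
        return False

    return any(colour[x] == WHITE and visit(x) for x in edges)

def cycle_frequency(judgments_by_prompt):
    """Fraction of prompts whose majority relation contains a cycle."""
    flags = [has_cycle(majority_graph(j)) for j in judgments_by_prompt]
    return sum(flags) / len(flags)

# Two made-up prompts: the first has A > B > C > A, the second is transitive.
prompt_1 = [("A", "B"), ("A", "B"), ("B", "A"), ("B", "C"), ("C", "A")]
prompt_2 = [("A", "B"), ("B", "C"), ("A", "C")]
print(cycle_frequency([prompt_1, prompt_2]))
```

A depth-first search is used instead of a triple check because real judgment data tends to contain ties and unobserved pairs, where cycles longer than three can occur without any 3-cycle.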

Original abstract

Aligning large language models (LLMs) with diverse human preferences is critical for ensuring fairness and informed outcomes when deploying these models for decision-making. In this paper, we seek to uncover fundamental statistical limits concerning aligning LLMs with human preferences, with a focus on the probabilistic representation of human preferences and the preservation of diverse preferences in aligned LLMs. We first show that human preferences can be represented by a reward model if and only if the preference among LLM-generated responses is free of any Condorcet cycle. Moreover, we prove that Condorcet cycles exist with probability converging to one exponentially fast under a general probabilistic preference model called the Luce model, thereby demonstrating the impossibility of fully aligning human preferences using reward-based approaches such as reinforcement learning from human feedback. Next, we explore the conditions under which LLMs would employ mixed strategies -- meaning they do not collapse to a single response -- when aligned in the limit using a non-reward-based approach, such as Nash learning from human feedback. We identify a necessary and sufficient condition for mixed strategies: the absence of a response that is preferred over all others by a majority. As a blessing, we prove that this condition holds with high probability under the Luce model, thereby highlighting the statistical possibility of preserving minority preferences without explicit regularization in aligning LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Reward models cannot capture diverse preferences under the Luce model because Condorcet cycles appear with probability approaching one, but the Nash condition for mixed strategies holds with high probability; the application is new, but the load-bearing step depends on the model.

The paper's central claim is that human preferences over LLM outputs can be represented by a single reward if and only if the majority relation has no Condorcet cycle, and that under the Luce model such cycles arise with probability approaching 1 exponentially fast. This rules out full alignment via RLHF-style methods. On the other side, it gives a necessary and sufficient condition for an LLM to play a mixed strategy in the limit under Nash learning from human feedback, and shows that condition holds with high probability under Luce as well. That is the punchline worth knowing up front.

Referee Report

2 major / 1 minor

Summary. The paper claims that a reward model can represent aggregated human preferences over LLM responses if and only if the majority preference relation contains no Condorcet cycle. Under the Luce model it proves that the probability of a Condorcet cycle converges to 1 exponentially fast (implying statistical impossibility of full alignment via RLHF-style methods), while for Nash-learning alignment the necessary and sufficient condition for mixed-strategy equilibria (no Condorcet winner) holds with high probability under the same model (implying statistical possibility of preserving minority preferences).

Significance. If the exponential-convergence and high-probability claims are correct, the work supplies a clean theoretical separation between reward-based and game-theoretic alignment methods, linking the Condorcet paradox directly to the failure of reward models and the Nash condition to the preservation of diversity. The explicit if-and-only-if characterization and the probabilistic statements under a standard choice model constitute the main technical contribution.

major comments (2)

[Impossibility result (section containing the Luce-model convergence theorem)] The exponential convergence of the Condorcet-cycle probability to 1 is the load-bearing step for the impossibility claim. The manuscript must specify the precise generative process for the Luce parameters v_i (or the scaling regime relating number of responses m to number of humans n) that produces this rate; without it the claim cannot be verified and may fail under plausible semantic correlations among LLM outputs.
[Model definition and impossibility section] The statement that the Luce model is “general” is used to conclude impossibility for reward-based alignment. The paper should clarify whether the model assumes independent per-human draws or allows heterogeneity that could induce transitivity; the current presentation leaves open whether the cycle probability still converges to 1 when pairwise margins are bounded away from 1/2 by semantic structure.

minor comments (1)

Notation for the majority relation and the Luce parameters should be introduced once and used consistently; the abstract uses “Luce model” without a forward reference to the formal definition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight the need for greater precision in stating the model assumptions. We address each major comment below.

Referee: [Impossibility result (section containing the Luce-model convergence theorem)] The exponential convergence of the Condorcet-cycle probability to 1 is the load-bearing step for the impossibility claim. The manuscript must specify the precise generative process for the Luce parameters v_i (or the scaling regime relating number of responses m to number of humans n) that produces this rate; without it the claim cannot be verified and may fail under plausible semantic correlations among LLM outputs.

Authors: We agree that the generative process and scaling regime must be stated explicitly for the exponential-convergence claim to be verifiable. In the revision we will insert a new subsection that defines the i.i.d. sampling of the Luce parameters v_i from a fixed distribution whose support permits pairwise probabilities arbitrarily close to 1/2, together with the asymptotic regime (m = o(n) or the precise relation between m and n) under which the probability of a Condorcet cycle converges to 1 at an exponential rate. This directly addresses the concern about semantic correlations.
Revision: yes

Referee: [Model definition and impossibility section] The statement that the Luce model is “general” is used to conclude impossibility for reward-based alignment. The paper should clarify whether the model assumes independent per-human draws or allows heterogeneity that could induce transitivity; the current presentation leaves open whether the cycle probability still converges to 1 when pairwise margins are bounded away from 1/2 by semantic structure.

Authors: We accept the criticism. The current text is insufficiently precise. The Luce model employed in the paper assumes that each human’s pairwise preferences are drawn independently according to the Luce choice rule. We will revise the model-definition section to state this independence assumption explicitly and to add a remark that the exponential convergence result requires that pairwise margins are not bounded away from 1/2; if semantic structure enforces such a bound, the cycle probability need not converge to 1. A short discussion of this modeling limitation will be included.
Revision: yes
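The role of pairwise margins in this exchange can be probed with a small simulation. This is a sketch, not the paper's argument: it assumes each person's full ranking is drawn by repeated Luce choice (the Plackett-Luce form), uses an odd number of people so that majorities never tie, and all names, weights, and sample sizes are hypothetical. It says nothing about convergence rates; it only contrasts weights that keep margins at 1/2 with weights that push them far from 1/2.

```python
import random

def sample_ranking(weights, rng):
    """Draw one ranking by repeated Luce choice among the remaining responses."""
    remaining = list(range(len(weights)))
    ranking = []
    while remaining:
        pick = rng.choices(remaining, weights=[weights[i] for i in remaining])[0]
        ranking.append(pick)
        remaining.remove(pick)
    return ranking

def has_majority_cycle(rankings, n):
    """Check the strict-majority relation for a 3-cycle (no ties assumed)."""
    positions = [{r: k for k, r in enumerate(rk)} for rk in rankings]

    def beats(a, b):
        return 2 * sum(p[a] < p[b] for p in positions) > len(positions)

    return any(
        beats(a, b) and beats(b, c) and beats(c, a)
        for a in range(n) for b in range(n) for c in range(n)
        if len({a, b, c}) == 3
    )

def cycle_rate(weights, num_people, trials, seed=0):
    """Monte Carlo estimate of how often the sampled majority relation cycles."""
    rng = random.Random(seed)
    n = len(weights)
    hits = sum(
        has_majority_cycle(
            [sample_ranking(weights, rng) for _ in range(num_people)], n
        )
        for _ in range(trials)
    )
    return hits / trials

# Equal weights: every pairwise probability is exactly 1/2.
close = cycle_rate([1.0] * 5, num_people=3, trials=400)
# Widely separated weights: pairwise probabilities are far from 1/2.
separated = cycle_rate([1000.0 ** k for k in range(5)], num_people=3, trials=400)
print(close, separated)
```

With separated weights nearly everyone produces the same ranking, so the majority relation is almost always transitive, which is the qualitative behavior the authors' response concedes.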

Circularity Check

0 steps flagged

No significant circularity; the derivations rely on standard theory and on the Luce model as an external input.

The paper's central claims—an iff between reward representability and absence of Condorcet cycles, plus exponential convergence to cycles under the Luce model—are presented as following from standard tournament theory definitions and the stated probabilistic assumptions of the Luce model. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain. The Luce model is treated as an external input rather than constructed from the target results, and the probabilistic statement is derived from its properties without circular equivalence to the inputs. The analysis remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The analysis relies on the Luce model as the probabilistic preference model and on standard notions from social choice theory, in particular that failures of transitivity in the majority relation produce cycles.

axioms (1)

domain assumption: Human preferences over LLM responses are generated according to the Luce model.
The model is used to prove the probabilistic statements about cycles and majority preferences.

pith-pipeline@v0.9.0 · 5797 in / 1310 out tokens · 69557 ms · 2026-05-23T00:43:45.622191+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Near-Optimal Last-Iterate Convergence for Zero-Sum Games with Bandit Feedback and Opponent Actions
cs.LG · 2026-05 · unverdicted · novelty 8.0
With opponent-action feedback in zero-sum games, an efficient algorithm achieves near-optimal t^{-1/2} last-iterate convergence in duality gap with high probability.

Perturbation is All You Need for Extrapolating Language Models
stat.ML · 2026-05 · unverdicted · novelty 6.0
Perturbing prefixes to semantic neighbors during training creates a hierarchical noise model that improves language model predictions on token sequences outside the training corpus support.

Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning
cs.LG · 2026-04 · unverdicted · novelty 6.0
Kernel smoothing yields accurate value and gradient estimates for low-variance policy learning in LLM reasoning under tight per-prompt sampling budgets.

Kernelized Advantage Estimation: From Nonparametric Statistics to LLM Reasoning
cs.LG · 2026-04 · unverdicted · novelty 6.0
Kernel smoothing enables accurate low-variance value and gradient estimates for policy optimization in LLM reasoning under tight sampling constraints per prompt.