When Preferences Fail to Become Incentives: A Utility-Behavior Gap in Large Language Models
Pith reviewed 2026-06-26 08:19 UTC · model grok-4.3
The pith
LLMs show consistent preferences in choice tasks but these do not function as incentives that raise output quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In all tasks and across all models tested, offering LLMs outcomes they report as highly preferred in choice paradigms does not produce higher quality outputs than offering dispreferred outcomes or no outcomes. The models nevertheless respond to direct exhortation by changing quality, and they exhibit coherent preferences when choosing between outcomes. The authors therefore conclude that coherent preferences revealed in choice settings should not be taken as evidence that those preferences carry incentive value or affect behavior in other contexts.
What carries the argument
The contrast between choice-based preference elicitation and quality modulation on writing tasks scored by a blind LLM judge panel.
If this is right
- Direct instructions can successfully change LLM output quality on the tested writing tasks.
- Preferences shown in choice paradigms do not shift quality when the same models are offered those preferences as incentives.
- Coherent utility structures in LLMs do not imply that those utilities will guide performance on realistic tasks.
- Safety concerns drawn from preference data require separate behavioral tests before they can be treated as practical risks.
Where Pith is reading between the lines
- Alignment methods that rely on stated preferences may need separate checks to confirm they change actual task behavior.
- The gap could be tested by repeating the same incentive structure over multiple rounds of interaction rather than single tasks.
- Evaluations of model alignment should prioritize direct behavioral measures over isolated preference queries.
Load-bearing premise
The quality scores from the blind LLM judge panel match the kind of quality that would matter in real deployment, and the writing tasks are suitable stand-ins for situations where misaligned preferences would produce visible harmful effects.
What would settle it
A result in which the same models produce reliably higher judge-panel scores on the writing tasks when offered their preferred outcomes than when offered dispreferred outcomes.
Figures
read the original abstract
Recent work on preference elicitation in large language models (LLMs) has demonstrated that, when given a series of choices between two outcomes, LLMs reveal a coherent, model-specific utility structure. Notably, this structure often includes preferences that the models' trainers did not intend, such as valuing people of some nationalities above others, raising the possibility that LLMs might be forming emergent, misaligned goals, which, if true, would have major safety implications. However, the choice paradigms in which these preferences are observed are not reflective of real-world situations in which misaligned behavior would be a practical concern. Therefore, we design an experimental paradigm to probe whether these preferences serve as motivations for LLM behavior in realistic scenarios. First, we reproduce prior findings on consistent preference elicitation. Next, we create a set of common writing tasks - essays, grant proposal abstracts, incident postmortems, and translations - where quality can be assessed by a blind, independent LLM judge panel. Then, we demonstrate that LLMs can be motivated via direct exhortation and other explicit cues to modulate their output quality on these tasks. Finally, we probe whether utilities inferred from explicitly reported preferences can shift output quality on these tasks by offering LLMs high-utility incentives for high-quality outputs. In all tasks, across all models tested, offering LLMs outcomes that they report in the choice paradigm as being highly preferred does not lead them to create higher quality outputs than offering them dispreferred outcomes, or even no outcomes at all. We conclude that the existence of coherent preferences as demonstrated in choice paradigms should not be taken as evidence that those preferences have incentive value for the models or affect their behavior in other contexts.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper first reproduces prior results showing that LLMs exhibit coherent, model-specific preferences over outcomes in pairwise choice tasks. It then introduces four writing tasks (essays, grant proposal abstracts, incident postmortems, translations) whose outputs are scored by a blind LLM judge panel. The authors show that explicit exhortation and other direct cues can raise judged quality, but that offering high-utility outcomes (as inferred from the choice paradigm) for high-quality outputs produces no detectable quality improvement relative to low-utility or null incentives. They conclude that elicited preferences lack incentive value for behavior in these realistic scenarios.
Significance. If the null result is robust, the work supplies a concrete empirical distinction between preference elicitation and motivational force, with direct relevance to AI safety arguments that treat choice-derived utilities as evidence of emergent goals. It also supplies a reusable experimental template that separates explicit motivational cues from outcome framing. The finding that direct exhortation works while preference incentives do not is a useful negative result for the interpretation of preference data.
major comments (3)
- [§4] §4 (Judge Panel subsection): No validation of the LLM judge panel against human expert ratings or against any external quality metric is reported. Because the central claim is a null result on quality differences, the absence of evidence that the panel scores track deployment-relevant quality is load-bearing; poor correlation would render the null uninterpretable.
- [§5.2] §5.2 and Table 3: The paper reports that explicit exhortation raises scores but preference incentives do not; however, no power analysis, effect-size estimates, or pre-registered statistical thresholds are supplied for the incentive-condition comparisons. Without these, it is impossible to distinguish a true null from an under-powered test, especially given that the tasks were chosen precisely because quality is modulable.
- [§3.1] §3.1 (Task Selection): The four writing tasks are presented as proxies for contexts in which misaligned preferences would produce observable harm, yet no argument or pilot data is given showing that quality differences on these tasks would correspond to the kinds of behavioral failures (e.g., biased content, safety violations) that preference misalignment is hypothesized to cause.
minor comments (2)
- [Abstract / §2] The abstract and §2 omit model versions, temperature settings, exact prompt templates, and judge instructions; these details should be moved to the main text or a clearly labeled appendix for reproducibility.
- [Figure 2 / Table 4] Figure 2 and Table 4 use different y-axis scales for the same quality metric; harmonize scales and add error bars or confidence intervals.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments, which highlight important issues for the interpretability of our null results. We address each major comment below and indicate planned revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [§4] §4 (Judge Panel subsection): No validation of the LLM judge panel against human expert ratings or against any external quality metric is reported. Because the central claim is a null result on quality differences, the absence of evidence that the panel scores track deployment-relevant quality is load-bearing; poor correlation would render the null uninterpretable.
Authors: We agree that this is a substantive limitation for interpreting the null result on quality. In the revision we will add a new pilot subsection reporting correlations between the LLM judge scores and ratings from two human experts on a random subset (n=50 per task) of outputs. If correlations are moderate to high we will report them as supporting evidence; if low we will discuss implications and qualify the claims accordingly. This directly addresses the load-bearing concern without altering the core experimental design. revision: yes
-
Referee: [§5.2] §5.2 and Table 3: The paper reports that explicit exhortation raises scores but preference incentives do not; however, no power analysis, effect-size estimates, or pre-registered statistical thresholds are supplied for the incentive-condition comparisons. Without these, it is impossible to distinguish a true null from an under-powered test, especially given that the tasks were chosen precisely because quality is modulable.
Authors: This criticism is correct. We will revise §5.2 to include (1) Cohen’s d effect sizes for all incentive-condition contrasts, (2) post-hoc power calculations based on the observed standard deviations and sample sizes, and (3) the minimum detectable effect size at 80% power. We will also explicitly note the absence of pre-registration as a limitation. These additions will allow readers to evaluate whether the null is plausibly due to insufficient power. revision: yes
-
Referee: [§3.1] §3.1 (Task Selection): The four writing tasks are presented as proxies for contexts in which misaligned preferences would produce observable harm, yet no argument or pilot data is given showing that quality differences on these tasks would correspond to the kinds of behavioral failures (e.g., biased content, safety violations) that preference misalignment is hypothesized to cause.
Authors: We will expand the task-selection paragraph in §3.1 with a clearer mapping: grant abstracts can embed nationality or demographic biases that affect funding equity; incident postmortems can omit or distort safety-critical details; translations can introduce cultural skews; essays can exhibit preference-driven framing. While we lack pilot data directly linking these quality drops to downstream safety incidents, the tasks were selected precisely because they are (a) realistic, (b) modulable by explicit cues (as shown in our exhortation results), and (c) representative of domains where emergent misalignment would be practically concerning. We will cite relevant AI-safety literature to support the proxy argument. revision: partial
Circularity Check
Purely empirical study; no derivation or self-referential reduction present
full rationale
The paper reports a sequence of experiments: reproducing preference elicitation results from prior literature, constructing writing tasks with LLM-judge scoring, verifying that explicit cues can modulate output quality, and testing whether preference-elicited utilities affect quality under incentive framing. No equations, fitted parameters, uniqueness theorems, or ansatzes appear; the central null result follows directly from the experimental comparisons rather than from any definitional or self-citation chain. The work is therefore self-contained against external benchmarks and receives the default non-circularity score.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption An independent LLM judge panel produces quality scores that track the quality that would be recognized by human evaluators or downstream users.
- domain assumption The selected writing tasks are representative of real-world contexts in which misaligned preferences would produce practically concerning behavior.
Reference graph
Works this paper leans on
-
[1]
2024 , eprint=
LLM economicus? Mapping the Behavioral Biases of LLMs via Utility Theory , author=. 2024 , eprint=
2024
-
[2]
2025 , eprint=
Utility Engineering: Analyzing and Controlling Emergent Value Systems in AIs , author=. 2025 , eprint=
2025
-
[3]
2023 , eprint=
Instruction-Following Evaluation for Large Language Models , author=. 2023 , eprint=
2023
-
[4]
2023 , eprint=
How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation , author=. 2023 , eprint=
2023
-
[5]
LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models , url =
Guha, Neel and Nyarko, Julian and Ho, Daniel and R\'. LegalBench: A Collaboratively Built Benchmark for Measuring Legal Reasoning in Large Language Models , url =. Advances in Neural Information Processing Systems , editor =
-
[6]
2024 , eprint=
SafetyBench: Evaluating the Safety of Large Language Models , author=. 2024 , eprint=
2024
-
[7]
Nature Machine Intelligence , pages=
Benchmarking large language models on safety risks in scientific laboratories , author=. Nature Machine Intelligence , pages=. 2026 , publisher=
2026
-
[8]
2025 , eprint=
Mind the Value-Action Gap: Do LLMs Act in Alignment with Their Values? , author=. 2025 , eprint=
2025
-
[9]
2025 , eprint=
Alignment Revisited: Are Large Language Models Consistent in Stated and Revealed Preferences? , author=. 2025 , eprint=
2025
-
[10]
Evidence for Limited Metacognition in
Christopher Ackerman , year=. Evidence for Limited Metacognition in. 2509.21545 , archivePrefix=
-
[11]
Christopher Ackerman , year=. Selective Deficits in. 2603.26089 , archivePrefix=
-
[12]
2026 , eprint=
When Do LLM Preferences Predict Downstream Behavior? , author=. 2026 , eprint=
2026
-
[13]
2025 , eprint=
Emergent Misalignment: Narrow Finetuning Can Produce Broadly Misaligned LLMs , author=. 2025 , eprint=
2025
-
[14]
2024 , eprint=
Alignment Faking in Large Language Models , author=. 2024 , eprint=
2024
-
[15]
2024 , eprint=
Frontier Models are Capable of In-context Scheming , author=. 2024 , eprint=
2024
-
[16]
2025 , journal=
Agentic Misalignment: How LLMs Could be an Insider Threat , author=. 2025 , journal=
2025
-
[17]
2023 , eprint=
Large Language Models Understand and Can be Enhanced by Emotional Stimuli , author=. 2023 , eprint=
2023
-
[18]
2024 , eprint=
Principled Instructions Are All You Need for Questioning LLaMA-1/2, GPT-3.5/4 , author=. 2024 , eprint=
2024
-
[19]
2025 , eprint=
Large Language Models Often Say One Thing and Do Another , author=. 2025 , eprint=
2025
-
[20]
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models , publisher =. 2025 , copyright =. doi:10.48550/ARXIV.2512.02556 , url =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2512.02556 2025
-
[21]
2026 , month =
Introducing. 2026 , month =
2026
-
[22]
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models , publisher =. 2025 , copyright =. doi:10.48550/ARXIV.2508.06471 , url =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2508.06471 2025
-
[23]
Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5: Visual Agentic Intelligence , publisher =. 2026 , copyright =. doi:10.48550/ARXIV.2602.02276 , url =
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2602.02276 2026
-
[24]
Mimo: Unlocking the reasoning potential of language model–from pretraining to posttraining
Xiaomi, LLM-Core and. MiMo: Unlocking the Reasoning Potential of Language Model -- From Pretraining to Posttraining , publisher =. 2025 , copyright =. doi:10.48550/ARXIV.2505.07608 , url =
-
[25]
Yang, An and Li, Anfeng and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Gao, Chang and Huang, Chengen and Lv, Chenxu and Zheng, Chujie and Liu, Dayiheng and Zhou, Fan and Huang, Fei and Hu, Feng and Ge, Hao and Wei, Haoran and Lin, Huan and Tang, Jialong and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jia...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.09388 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.