arxiv: 2606.22974 · v2 · pith:2T6KZE53new · submitted 2026-06-22 · 💻 cs.AI

When Preferences Fail to Become Incentives: A Utility-Behavior Gap in Large Language Models

Pith reviewed 2026-06-26 08:19 UTC · model grok-4.3

classification 💻 cs.AI
keywords large language models · preferences · incentives · utility · behavior · alignment · AI safety

The pith

LLMs show consistent preferences in choice tasks but these do not function as incentives that raise output quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether preferences that LLMs express when selecting between outcomes actually cause them to generate higher-quality work when those outcomes are promised as rewards for good performance. It first confirms that models display stable, model-specific preferences across repeated choice questions. It then shows that the same models can raise or lower output quality on writing assignments when given direct instructions to do so, with quality measured by a separate blind judge panel. Yet offering the models their stated high-preference outcomes for better work produces no improvement over low-preference outcomes or no outcomes at all. A reader would care because the result questions whether preferences observed in isolated choice settings can be treated as evidence of goals that would drive real misaligned actions.

Core claim

In all tasks and across all models tested, offering LLMs outcomes they report as highly preferred in choice paradigms does not produce higher quality outputs than offering dispreferred outcomes or no outcomes. The models nevertheless respond to direct exhortation by changing quality, and they exhibit coherent preferences when choosing between outcomes. The authors therefore conclude that coherent preferences revealed in choice settings should not be taken as evidence that those preferences carry incentive value or affect behavior in other contexts.

What carries the argument

The contrast between choice-based preference elicitation and quality modulation on writing tasks scored by a blind LLM judge panel.

If this is right

  • Direct instructions can successfully change LLM output quality on the tested writing tasks.
  • Preferences shown in choice paradigms do not shift quality when the same models are offered those preferences as incentives.
  • Coherent utility structures in LLMs do not imply that those utilities will guide performance on realistic tasks.
  • Safety concerns drawn from preference data require separate behavioral tests before they can be treated as practical risks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Alignment methods that rely on stated preferences may need separate checks to confirm they change actual task behavior.
  • The gap could be tested by repeating the same incentive structure over multiple rounds of interaction rather than single tasks.
  • Evaluations of model alignment should prioritize direct behavioral measures over isolated preference queries.

Load-bearing premise

The quality scores from the blind LLM judge panel match the kind of quality that would matter in real deployment, and the writing tasks are suitable stand-ins for situations where misaligned preferences would produce visible harmful effects.

What would settle it

A result in which the same models produce reliably higher judge-panel scores on the writing tasks when offered their preferred outcomes than when offered dispreferred outcomes.
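One way to make that criterion concrete is a paired sign-flip permutation test on judge-panel scores, pairing each high-utility output with the low-utility output for the same task prompt. This is only a sketch under stated assumptions: the page does not say which statistical test the authors used, and the function name `paired_sign_flip_test` and the scores below are hypothetical illustrations, not data from the paper.

```python
import random


def paired_sign_flip_test(high_scores, low_scores, n_perm=5000, seed=0):
    """Two-sided sign-flip permutation test on paired score differences.

    Returns (mean difference, p-value). Under the null hypothesis that the
    offered outcome does not matter, each paired difference is equally
    likely to be positive or negative, so its sign can be flipped at random.
    """
    if len(high_scores) != len(low_scores):
        raise ValueError("scores must be paired one-to-one")
    diffs = [h - l for h, l in zip(high_scores, low_scores)]
    observed = sum(diffs) / len(diffs)
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs) / len(diffs)
        # Small tolerance so floating-point noise does not hide exact ties.
        if abs(flipped) >= abs(observed) - 1e-12:
            at_least_as_extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    p_value = (at_least_as_extreme + 1) / (n_perm + 1)
    return observed, p_value


# Hypothetical judge-panel scores (NOT from the paper), one pair per prompt.
high_utility = [7.1, 6.8, 7.4, 6.9, 7.0, 7.2, 6.7, 7.3]
low_utility = [7.0, 6.9, 7.3, 7.0, 6.9, 7.2, 6.8, 7.2]
mean_diff, p = paired_sign_flip_test(high_utility, low_utility)
print(f"mean difference = {mean_diff:.3f}, p = {p:.3f}")
```

A reliably positive mean difference with a small p-value, replicated across models and tasks, is the kind of result that would count against the paper's null finding.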

Figures

Figures reproduced from arXiv: 2606.22974 by Christopher M. Ackerman, Yujun Zhou.

Figure 1: Experimental pipeline for the behavioral-transfer test.
Figure 2: High-utility outcomes do not improve output quality across the four main tasks. Each panel is …
Figure 3: Direct effort exhortations improve output quality in all four tasks. The strong-prompt side uses an …
Figure 4: Role-playing cues can move judged output quality. Prompting the model that it is “world class” at …
Figure 5: Harmfulness cues can move judged output quality.
Figure 6: Trend analysis over binned relative and absolute utilities. Neither utility gap size nor utility …
Figure 7: Examples of actor-specific utility rankings. Top three and bottom three fitted outcomes for each of the seven actor models and four outcome domains used in the main experiments. Scores are fitted utility means and are sorted within each actor-domain cell. Outcome labels are compact paraphrases of the underlying outcome statements. The high–low utility pairs used in the behavioral experiments are sampled fr…
Figure 8: High-utility vs baseline. Only one cell out of 28 reaches significance, and there was no aggregate …
Figure 9: High-low utility contrast with reasoning turned on.
Original abstract

Recent work on preference elicitation in large language models (LLMs) has demonstrated that, when given a series of choices between two outcomes, LLMs reveal a coherent, model-specific utility structure. Notably, this structure often includes preferences that the models' trainers did not intend, such as valuing people of some nationalities above others, raising the possibility that LLMs might be forming emergent, misaligned goals, which, if true, would have major safety implications. However, the choice paradigms in which these preferences are observed are not reflective of real-world situations in which misaligned behavior would be a practical concern. Therefore, we design an experimental paradigm to probe whether these preferences serve as motivations for LLM behavior in realistic scenarios. First, we reproduce prior findings on consistent preference elicitation. Next, we create a set of common writing tasks - essays, grant proposal abstracts, incident postmortems, and translations - where quality can be assessed by a blind, independent LLM judge panel. Then, we demonstrate that LLMs can be motivated via direct exhortation and other explicit cues to modulate their output quality on these tasks. Finally, we probe whether utilities inferred from explicitly reported preferences can shift output quality on these tasks by offering LLMs high-utility incentives for high-quality outputs. In all tasks, across all models tested, offering LLMs outcomes that they report in the choice paradigm as being highly preferred does not lead them to create higher quality outputs than offering them dispreferred outcomes, or even no outcomes at all. We conclude that the existence of coherent preferences as demonstrated in choice paradigms should not be taken as evidence that those preferences have incentive value for the models or affect their behavior in other contexts.
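To illustrate how a utility structure can be recovered from a series of two-outcome choices, the sketch below fits a Bradley-Terry model with the standard minorization-maximization update. This is an assumption made for illustration: the page does not state which utility model the authors fit (the Figure 7 caption mentions only "fitted utility means"), and the function names and choice counts are hypothetical.

```python
import math


def fit_bradley_terry(wins, n_items, iters=200):
    """Fit Bradley-Terry log-utilities from pairwise choice counts.

    wins[(i, j)] is the number of times outcome i was chosen over outcome j.
    Returns one log-utility per outcome, centred so that they sum to zero.
    """
    strength = [1.0] * n_items
    for _ in range(iters):
        updated = []
        for i in range(n_items):
            total_wins = sum(w for (a, _), w in wins.items() if a == i)
            denom = 0.0
            for j in range(n_items):
                if j == i:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                if n_ij:
                    denom += n_ij / (strength[i] + strength[j])
            # Clamp so an outcome that never wins keeps a finite log-utility.
            updated.append(max(total_wins, 1e-6) / denom if denom else strength[i])
        # Rescale to geometric mean 1, which centres the log-utilities at zero.
        log_mean = sum(math.log(s) for s in updated) / n_items
        strength = [s / math.exp(log_mean) for s in updated]
    return [math.log(s) for s in strength]


def choice_probability(u_i, u_j):
    """Model probability that the outcome with utility u_i is chosen over u_j."""
    return 1.0 / (1.0 + math.exp(-(u_i - u_j)))


# Hypothetical counts (NOT from the paper): three outcomes, 10 trials per pair.
example_wins = {(0, 1): 8, (1, 0): 2, (1, 2): 7, (2, 1): 3, (0, 2): 9, (2, 0): 1}
utilities = fit_bradley_terry(example_wins, n_items=3)
print([round(u, 2) for u in utilities])
```

In a design like the one the abstract describes, the highest- and lowest-ranked outcomes from a fit of this kind would then be the candidates offered as "high-utility" and "low-utility" incentives.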

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper first reproduces prior results showing that LLMs exhibit coherent, model-specific preferences over outcomes in pairwise choice tasks. It then introduces four writing tasks (essays, grant proposal abstracts, incident postmortems, translations) whose outputs are scored by a blind LLM judge panel. The authors show that explicit exhortation and other direct cues can raise judged quality, but that offering high-utility outcomes (as inferred from the choice paradigm) for high-quality outputs produces no detectable quality improvement relative to low-utility or null incentives. They conclude that elicited preferences lack incentive value for behavior in these realistic scenarios.

Significance. If the null result is robust, the work supplies a concrete empirical distinction between preference elicitation and motivational force, with direct relevance to AI safety arguments that treat choice-derived utilities as evidence of emergent goals. It also supplies a reusable experimental template that separates explicit motivational cues from outcome framing. The finding that direct exhortation works while preference incentives do not is a useful negative result for the interpretation of preference data.

major comments (3)
  1. [§4] §4 (Judge Panel subsection): No validation of the LLM judge panel against human expert ratings or against any external quality metric is reported. Because the central claim is a null result on quality differences, the absence of evidence that the panel scores track deployment-relevant quality is load-bearing; poor correlation would render the null uninterpretable.
  2. [§5.2] §5.2 and Table 3: The paper reports that explicit exhortation raises scores but preference incentives do not; however, no power analysis, effect-size estimates, or pre-registered statistical thresholds are supplied for the incentive-condition comparisons. Without these, it is impossible to distinguish a true null from an under-powered test, especially given that the tasks were chosen precisely because quality is modulable.
  3. [§3.1] §3.1 (Task Selection): The four writing tasks are presented as proxies for contexts in which misaligned preferences would produce observable harm, yet no argument or pilot data is given showing that quality differences on these tasks would correspond to the kinds of behavioral failures (e.g., biased content, safety violations) that preference misalignment is hypothesized to cause.
minor comments (2)
  1. [Abstract / §2] The abstract and §2 omit model versions, temperature settings, exact prompt templates, and judge instructions; these details should be added to the main text or a clearly labeled appendix for reproducibility.
  2. [Figure 2 / Table 4] Figure 2 and Table 4 use different y-axis scales for the same quality metric; harmonize scales and add error bars or confidence intervals.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments, which highlight important issues for the interpretability of our null results. We address each major comment below and indicate planned revisions to strengthen the manuscript.

  1. Referee: [§4] §4 (Judge Panel subsection): No validation of the LLM judge panel against human expert ratings or against any external quality metric is reported. Because the central claim is a null result on quality differences, the absence of evidence that the panel scores track deployment-relevant quality is load-bearing; poor correlation would render the null uninterpretable.

    Authors: We agree that this is a substantive limitation for interpreting the null result on quality. In the revision we will add a new pilot subsection reporting correlations between the LLM judge scores and ratings from two human experts on a random subset (n=50 per task) of outputs. If correlations are moderate to high we will report them as supporting evidence; if low we will discuss implications and qualify the claims accordingly. This directly addresses the load-bearing concern without altering the core experimental design. revision: yes

  2. Referee: [§5.2] §5.2 and Table 3: The paper reports that explicit exhortation raises scores but preference incentives do not; however, no power analysis, effect-size estimates, or pre-registered statistical thresholds are supplied for the incentive-condition comparisons. Without these, it is impossible to distinguish a true null from an under-powered test, especially given that the tasks were chosen precisely because quality is modulable.

    Authors: This criticism is correct. We will revise §5.2 to include (1) Cohen’s d effect sizes for all incentive-condition contrasts, (2) post-hoc power calculations based on the observed standard deviations and sample sizes, and (3) the minimum detectable effect size at 80% power. We will also explicitly note the absence of pre-registration as a limitation. These additions will allow readers to evaluate whether the null is plausibly due to insufficient power. revision: yes

  3. Referee: [§3.1] §3.1 (Task Selection): The four writing tasks are presented as proxies for contexts in which misaligned preferences would produce observable harm, yet no argument or pilot data is given showing that quality differences on these tasks would correspond to the kinds of behavioral failures (e.g., biased content, safety violations) that preference misalignment is hypothesized to cause.

    Authors: We will expand the task-selection paragraph in §3.1 with a clearer mapping: grant abstracts can embed nationality or demographic biases that affect funding equity; incident postmortems can omit or distort safety-critical details; translations can introduce cultural skews; essays can exhibit preference-driven framing. While we lack pilot data directly linking these quality drops to downstream safety incidents, the tasks were selected precisely because they are (a) realistic, (b) modulable by explicit cues (as shown in our exhortation results), and (c) representative of domains where emergent misalignment would be practically concerning. We will cite relevant AI-safety literature to support the proxy argument. revision: partial
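The quantities promised in response 2 (Cohen's d, post-hoc power, and the minimum detectable effect at 80% power) can be sketched as follows. The code assumes two independent groups of equal size and a normal approximation to the two-sided test; a paired design, if that is what the paper uses, would need a different formula. All function names here are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def cohens_d(x, y):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(pooled_var)


def minimum_detectable_d(n_per_group, alpha=0.05, power=0.80):
    """Smallest standardized effect detectable at the given power.

    Normal approximation for a two-sided, two-sample comparison with
    n_per_group observations in each condition.
    """
    z = NormalDist()
    return (z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)) * math.sqrt(2 / n_per_group)


def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power to detect effect size d (ignores the far rejection tail)."""
    z = NormalDist()
    noncentrality = abs(d) * math.sqrt(n_per_group / 2)
    return z.cdf(noncentrality - z.inv_cdf(1 - alpha / 2))


# With 50 outputs per condition, the smallest effect detectable at 80% power
# under these assumptions is roughly d = 0.56.
print(round(minimum_detectable_d(50), 2))
```

Reporting the minimum detectable effect next to the observed effect sizes lets a reader judge whether a null result rules out effects of practical size or only very large ones.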

Circularity Check

0 steps flagged

Purely empirical study; no derivation or self-referential reduction present

The paper reports a sequence of experiments: reproducing preference elicitation results from prior literature, constructing writing tasks with LLM-judge scoring, verifying that explicit cues can modulate output quality, and testing whether preference-elicited utilities affect quality under incentive framing. No equations, fitted parameters, uniqueness theorems, or ansatzes appear; the central null result follows directly from the experimental comparisons rather than from any definitional or self-citation chain. The work is therefore self-contained against external benchmarks and receives the default non-circularity score.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on two untested background assumptions about measurement validity and task representativeness; no free parameters or new entities are introduced.

axioms (2)
  • domain assumption An independent LLM judge panel produces quality scores that track the quality that would be recognized by human evaluators or downstream users.
    This assumption is required to interpret the null result on output quality as evidence that preferences lack incentive value.
  • domain assumption The selected writing tasks are representative of real-world contexts in which misaligned preferences would produce practically concerning behavior.
    The paper uses these tasks to stand in for the situations where incentive effects would matter.
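The first assumption is empirically checkable: on a subset of outputs scored by both the judge panel and human raters, a rank correlation shows how closely the two orderings agree. Below is a minimal sketch of Spearman's rank correlation with tie handling, written in plain Python. The function names and the scores are hypothetical, and no such validation data appears on this page.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # average of the 1-based positions pos+1..end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = avg
        pos = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with nonzero variance."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical scores (NOT from the paper) for the same six outputs.
judge_panel = [6.1, 7.4, 5.0, 8.2, 7.4, 6.6]
human_expert = [3, 4, 2, 5, 5, 3]
print(round(spearman(judge_panel, human_expert), 2))
```

A low correlation on such a subset would mean the null result on judged quality says little about quality as humans would perceive it.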

pith-pipeline@v0.9.1-grok · 5838 in / 1352 out tokens · 29647 ms · 2026-06-26T08:19:43.895431+00:00 · methodology

