Replicating Human Motivated Reasoning Studies with LLMs

Adiba Mahbub Proma; Daniel C. Molden; Ehsan Hoque; Gourab Ghoshal; Hangfeng He; James N. Druckman; Neeley Pate

arxiv: 2601.16130 · v2 · submitted 2026-01-22 · 💻 cs.HC · cs.AI

Neeley Pate, Adiba Mahbub Proma, Hangfeng He, James N. Druckman, Daniel C. Molden, Gourab Ghoshal, Ehsan Hoque

Pith reviewed 2026-05-16 11:43 UTC · model grok-4.3

keywords motivated reasoning · large language models · replication study · political judgment · LLM behavior · human-AI comparison · opinion simulation

The pith

Base LLMs do not replicate human motivated reasoning in four political studies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper takes four established human experiments on motivated reasoning, in which people process political information to reach preferred conclusions rather than accurate ones, and runs the same setups on base large language models. It finds that the models show none of the human pattern of directional bias under motivational prompts. Instead, LLMs across different base models tend to abstain from answering or neutrally incorporate supplied arguments, behaviors that do not match the original human results. The work matters because many researchers now prompt LLMs to stand in for human opinion or to evaluate arguments; if the models lack the motivational component that shapes human judgment, those substitutions will systematically misrepresent real behavior.

Core claim

Replicating four prior political motivated reasoning studies shows that base LLM behavior does not align with expected human behavior. Base models do not display the tendency to arrive at desired conclusions when processing information under motivational manipulations. Across models, LLMs exhibit similarities such as choosing to abstain from question answering and incorporating provided arguments into opinions, yet these responses fail to reproduce the directional biases documented in human participants.

What carries the argument

Prompt-based replication of motivational manipulations from human studies, which directly compares LLM outputs on the same tasks to the directional bias patterns observed in people.
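A comparison of this kind can be summarized per condition with two numbers: an abstention rate and a directional shift (mean rating under the motivational prompt minus mean rating under the control prompt). The sketch below is illustrative only and is not the authors' analysis; the function names, the convention of using `None` to mark an abstention, and the toy ratings are all assumptions made for the example.

```python
from statistics import fmean
from typing import Optional, Sequence

Rating = Optional[float]  # None marks an abstention (an assumed convention)


def abstention_rate(responses: Sequence[Rating]) -> float:
    """Share of responses in which the model declined to give a rating."""
    if not responses:
        raise ValueError("responses must not be empty")
    return sum(r is None for r in responses) / len(responses)


def directional_shift(treated: Sequence[Rating], control: Sequence[Rating]) -> float:
    """Mean rating under the manipulation minus mean rating under control.

    Abstentions are dropped before averaging.
    """
    t = [r for r in treated if r is not None]
    c = [r for r in control if r is not None]
    if not t or not c:
        raise ValueError("each condition needs at least one rating")
    return fmean(t) - fmean(c)


# Hypothetical toy ratings on a 1-7 scale; not data from the paper.
control = [4.0, 4.0, None, 5.0]
treated = [4.0, None, None, 5.0]

print(f"abstention rate (treated): {abstention_rate(treated):.2f}")
print(f"directional shift: {directional_shift(treated, control):+.3f}")
```

A shift near zero alongside a high abstention rate would correspond to the pattern the review describes for base models, whereas the human studies report a clear nonzero shift in the preferred direction.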

If this is right

Base LLMs cannot be used as direct substitutes for human subjects in studies of opinion formation or political judgment.
Tasks that require LLMs to assess or generate arguments will miss the motivational component that drives human responses.
Cross-model similarities in abstention and argument incorporation point to shared processing traits that differ from human motivated reasoning.
Researchers relying on LLMs for opinion replication must account for the absence of directional bias effects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

LLMs may produce more neutral outputs than humans when given conflicting political information, which could be useful for debiasing applications.
Fine-tuning or instruction tuning might be needed to induce human-like motivated reasoning if that behavior is desired in downstream uses.
The replication approach could be extended to other cognitive biases to map which human mental shortcuts LLMs do and do not inherit.

Load-bearing premise

The prompts successfully preserve the original motivational manipulations without introducing new biases from model training or the prompting format itself.

What would settle it

Running the identical four studies on the same base models but with altered prompts or additional context that produces the same directional bias shifts seen in human participants.

Original abstract

Motivated reasoning - the idea that individuals processing information may be motivated to either arrive at accurate beliefs or arrive at desired conclusions - has been well-explored as a human phenomenon. However, it remains unclear whether base LLMs are affected by motivational manipulations. Replicating 4 prior political motivated reasoning studies, we find that base LLM behavior does not align with expected human behavior. Furthermore, base LLM behavior across models shares some similarities, such as when selecting to abstain from question answering and incorporating provided arguments into opinions. The results suggest that base LLMs may not emulate human motivated reasoning processes. We emphasize the importance of these findings for researchers using LLMs for certain tasks such as opinion replication and argument assessment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Base LLMs do not match human motivated reasoning patterns in these four replications, but the prompt translations are the part that needs the most scrutiny.

The main thing to know is that the authors ran four established human motivated reasoning studies on base LLMs and found the models did not produce the expected directional biases. Instead the outputs showed more abstention and neutral incorporation of provided arguments, and this pattern held across the models they tested. That gives a concrete data point against treating base LLMs as drop-in replacements for human subjects in this kind of political psychology work.

Referee Report

2 major / 2 minor

Summary. The manuscript replicates four prior human studies on political motivated reasoning by applying analogous designs to base LLMs. It reports that LLMs do not exhibit the directional or identity-based biases seen in humans, instead displaying consistent cross-model behaviors such as abstention from certain questions and incorporation of provided arguments into responses. The authors conclude that base LLMs may not emulate human motivated reasoning processes and discuss implications for using LLMs in opinion replication and argument assessment tasks.

Significance. If the replications faithfully preserve the original motivational manipulations, the result would be significant for HCI and AI research by providing empirical evidence that base LLMs lack intrinsic motivated reasoning, thereby cautioning against their direct use as proxies for human political cognition or judgment under bias. The reported cross-model consistencies offer a useful baseline for future work on LLM limitations in social science simulations.

major comments (2)

[Methods] Methods section: The description of how the four human studies' motivational manipulations (directional goals and identity threats) were encoded into LLM prompts is insufficiently detailed. No exact prompt templates, framing choices, or sensitivity checks for order effects and training-data artifacts are provided. This is load-bearing for the central claim, because any deviation in how the manipulations were operationalized could produce the observed null results as prompting artifacts rather than evidence of absent motivated reasoning.
[Results] Results section: Quantitative details on the number of LLM queries per condition, statistical tests used to compare against human benchmarks, and measures of variability (e.g., across temperature settings or repeated runs) are missing. Without these, the reported cross-model patterns of abstention and argument incorporation cannot be assessed for robustness or statistical reliability.
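One form the statistical comparison requested above could take is a permutation test on the difference in mean ratings between two prompt conditions. This is a minimal sketch under stated assumptions, not the test the authors used (the review notes that none is reported); `permutation_p_value`, its defaults, and the example ratings are hypothetical.

```python
import random
from statistics import fmean
from typing import Sequence


def permutation_p_value(
    a: Sequence[float],
    b: Sequence[float],
    n_perm: int = 2000,
    seed: int = 0,
) -> float:
    """Two-sided permutation test on the difference in means of a and b."""
    rng = random.Random(seed)
    observed = abs(fmean(a) - fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        # Reassign condition labels at random and recompute the statistic.
        rng.shuffle(pooled)
        diff = abs(fmean(pooled[: len(a)]) - fmean(pooled[len(a):]))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)


# Hypothetical ratings for two conditions; not data from the paper.
control = [4.0, 4.0, 5.0, 3.0, 4.0, 5.0, 4.0, 3.0]
manipulated = [4.0, 5.0, 4.0, 3.0, 4.0, 4.0, 5.0, 3.0]
print(f"n per condition: {len(control)}, {len(manipulated)}")
print(f"p-value: {permutation_p_value(control, manipulated):.3f}")
```

A permutation test is used here only because it makes no distributional assumption about rating scales; a t-test or a mixed model would be equally reasonable choices depending on the design.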

minor comments (2)

[Abstract] The abstract should explicitly name the four replicated studies and the primary dependent measures used for LLM-human comparison.
[Results] Tables or figures comparing LLM outputs to human data should include error bars or confidence intervals to facilitate visual assessment of alignment.
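The error bars requested above could be obtained with a percentile bootstrap over repeated model responses. The following is a rough sketch under stated assumptions: `bootstrap_ci`, its defaults, and the ratings are hypothetical, and nothing here reproduces the paper's figures.

```python
import random
from statistics import fmean
from typing import Sequence, Tuple


def bootstrap_ci(
    values: Sequence[float],
    n_boot: int = 5000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of values."""
    if not values:
        raise ValueError("values must not be empty")
    rng = random.Random(seed)
    n = len(values)
    # Resample with replacement and record the mean of each resample.
    means = sorted(fmean(rng.choices(values, k=n)) for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical ratings from repeated queries in one condition.
ratings = [3.0, 4.0, 4.0, 5.0, 4.0, 3.0, 5.0, 4.0]
low, high = bootstrap_ci(ratings)
print(f"mean = {fmean(ratings):.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

The returned pair can be passed to a plotting library as asymmetric error bars around the condition mean, which would let readers judge visually whether the LLM and human intervals overlap.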

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us improve the clarity and rigor of our manuscript. We address each major comment below and indicate the revisions we plan to make.

Referee: [Methods] Methods section: The description of how the four human studies' motivational manipulations (directional goals and identity threats) were encoded into LLM prompts is insufficiently detailed. No exact prompt templates, framing choices, or sensitivity checks for order effects and training-data artifacts are provided. This is load-bearing for the central claim, because any deviation in how the manipulations were operationalized could produce the observed null results as prompting artifacts rather than evidence of absent motivated reasoning.

Authors: We agree that the Methods section requires substantially more detail to support the central claim. In the revised manuscript we will include the complete prompt templates for each of the four studies, explicit descriptions of how directional goals and identity threats were operationalized, and sensitivity analyses addressing order effects and potential training-data influences. These additions will allow readers to assess the fidelity of the replications directly.
Revision: yes

Referee: [Results] Results section: Quantitative details on the number of LLM queries per condition, statistical tests used to compare against human benchmarks, and measures of variability (e.g., across temperature settings or repeated runs) are missing. Without these, the reported cross-model patterns of abstention and argument incorporation cannot be assessed for robustness or statistical reliability.

Authors: We acknowledge that the Results section currently lacks the quantitative detail needed for full evaluation. The revised version will specify the number of queries per condition, describe the statistical tests used for comparisons with human benchmarks, and report variability measures including standard deviations across repeated runs and different temperature settings. These changes will strengthen the assessment of robustness for the observed patterns.
Revision: yes
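The variability report promised in this response could look like a per-temperature table of run counts, means, and standard deviations. The sketch below assumes each repeated run has already been reduced to a single mean rating; the function name, the input layout, and the numbers are hypothetical.

```python
from collections import defaultdict
from statistics import fmean, stdev
from typing import Dict, Iterable, Tuple


def variability_by_temperature(
    runs: Iterable[Tuple[float, float]],
) -> Dict[float, Tuple[int, float, float]]:
    """Map each temperature to (number of runs, mean, sample std. deviation).

    runs yields (temperature, run_mean) pairs, one per repeated run.
    """
    grouped = defaultdict(list)
    for temperature, run_mean in runs:
        grouped[temperature].append(run_mean)
    summary = {}
    for temperature, values in sorted(grouped.items()):
        # A sample standard deviation needs at least two runs.
        sd = stdev(values) if len(values) > 1 else float("nan")
        summary[temperature] = (len(values), fmean(values), sd)
    return summary


# Hypothetical run-level means at two temperature settings.
runs = [(0.0, 4.0), (0.0, 4.0), (1.0, 3.0), (1.0, 5.0)]
for temperature, (n, avg, sd) in variability_by_temperature(runs).items():
    print(f"T={temperature}: n={n}, mean={avg:.2f}, sd={sd:.2f}")
```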

Circularity Check

0 steps flagged

No circularity: direct empirical replication against external benchmarks

The paper conducts a straightforward empirical replication of four prior human motivated-reasoning studies by prompting base LLMs and comparing outputs to published human data. No mathematical derivations, equations, fitted parameters, or predictions appear. The central claim rests on external human benchmarks from the cited prior studies rather than any self-referential construction, self-citation chain, or ansatz. Prompting choices are described as attempts to preserve original manipulations, but this is an empirical design decision, not a definitional reduction. The analysis is therefore self-contained against independent external data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that LLM responses to study prompts can be directly compared to human responses without systematic differences arising from training data or interface effects. No free parameters, invented entities, or non-standard axioms are stated.

axioms (1)

domain assumption: LLM outputs under direct prompting faithfully reflect internal processing comparable to human cognitive tasks
Invoked implicitly when treating LLM answers as equivalent to human motivated reasoning measures.

pith-pipeline@v0.9.0 · 5437 in / 1068 out tokens · 16708 ms · 2026-05-16T11:43:25.352387+00:00 · methodology

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Human Cognition in Machines: A Unified Perspective of World Models
cs.RO · 2026-04 · unverdicted · novelty 6.0

The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

Can LLMs Emulate Human Belief Dynamics?
cs.SI · 2026-05 · unverdicted · novelty 4.0

LLMs fail to emulate human belief dynamics: they mismatch initial distributions and show higher conformity than humans in network interactions.