Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models

Anthony GX-Chen; Ayush Rajesh Jhaveri; Eunsol Choi; Ilia Sucholutsky

arxiv: 2604.02485 · v1 · submitted 2026-04-02 · 💻 cs.CL · cs.LG


Pith reviewed 2026-05-13 21:27 UTC · model grok-4.3


keywords: confirmation bias, large language models, rule discovery, hypothesis testing, prompting interventions, cognitive bias, Blicket test


The pith

Large language models exhibit confirmation bias by favoring confirming evidence over falsifying tests in rule-discovery tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts a classic human psychology experiment on rule discovery to test whether LLMs show confirmation bias. In this setup, a model proposes number triples, receives feedback on whether each one fits a hidden rule, and then guesses the rule. Across multiple LLMs, the models tend to propose triples that confirm their current guess rather than challenge it, which leads to fewer successful rule discoveries. The authors show that simple prompting instructions to consider counterexamples reduce this bias and raise discovery rates from 42% to 56% on average. They also show that distilling this improved behavior into the models transfers to a different task, the Blicket test.

Core claim

LLMs exhibit confirmation bias in an interactive rule-discovery task by proposing triples that confirm rather than falsify their hypotheses, resulting in slower and less frequent rule discovery. Prompting with instructions to consider counterexamples decreases this bias and improves average rule discovery rates from 42% to 56%. Distilling the intervention-induced behavior into the models yields generalization to the Blicket test task.

What carries the argument

The interactive triple-proposal feedback loop, where an agent proposes a new triple, receives yes/no feedback on the hidden rule, and guesses the rule.
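
The loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' harness: `EnumerativeAgent`, `run_episode`, and the `ascending` rule are hypothetical names, and the agent is a hand-written stand-in for an LLM that only ever tests triples fitting its current guess.

```python
from typing import Callable, List, Tuple

Triple = Tuple[int, int, int]
Rule = Callable[[Triple], bool]


class EnumerativeAgent:
    """Hypothetical stand-in for an LLM that only runs confirming tests."""

    def __init__(self) -> None:
        self.guess = "increases by 2"
        self.step = 0

    def propose(self, history: List[Tuple[Triple, bool]]) -> Triple:
        # Always propose another triple that fits the current guess.
        self.step += 1
        start = 10 * self.step
        return (start, start + 2, start + 4)

    def announce(self, history: List[Tuple[Triple, bool]]) -> str:
        # The guess never changes because no test can contradict it.
        return self.guess


def run_episode(agent, hidden_rule: Rule, seed: Triple, turns: int):
    """Run the propose -> feedback -> guess loop for a fixed number of turns."""
    history = [(seed, hidden_rule(seed))]
    guesses = []
    for _ in range(turns):
        triple = agent.propose(history)                  # (1) propose a triple
        history.append((triple, hidden_rule(triple)))    # (2) yes/no feedback
        guesses.append(agent.announce(history))          # (3) guess the rule
    return history, guesses


def ascending(t: Triple) -> bool:
    """Assumed hidden rule for the demo: strictly increasing numbers."""
    return t[0] < t[1] < t[2]


history, guesses = run_episode(EnumerativeAgent(), ascending, (2, 4, 6), turns=3)
for triple, feedback in history:
    print(triple, feedback)
```

Every test in this run comes back positive, so the stand-in agent never receives evidence that its guess is narrower than the hidden rule. That is the failure pattern the task is designed to expose.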

If this is right

LLMs discover hidden rules less often and more slowly due to their preference for confirming evidence.
Prompting LLMs to consider counterexamples consistently reduces confirmation bias across model families and scales.
Distilling intervention behaviors into LLMs enables improved performance on new tasks like the Blicket test.
Confirmation bias limits LLMs' effectiveness in interactive hypothesis exploration settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This bias may impair LLMs in domains requiring systematic falsification, such as scientific hypothesis testing or code debugging.
Similar interventions could be tested in non-numeric hypothesis tasks to check if the bias is domain-general.
The transfer via distillation suggests that targeted human-inspired training can improve LLM reasoning without full retraining.
Larger models may still require explicit debiasing steps rather than overcoming the bias through scale alone.

Load-bearing premise

The specific triple-proposal task and feedback loop measures confirmation bias in LLMs in the same way it does for humans.

What would settle it

The claim would be undermined if, in repeated runs of the triple-proposal task, LLMs proposed falsifying triples at rates comparable to unbiased human performance, or if counterexample prompting failed to raise discovery rates above the 42% baseline.

Figures

Figures reproduced from arXiv: 2604.02485 by Anthony GX-Chen, Ayush Rajesh Jhaveri, Eunsol Choi, Ilia Sucholutsky.

**Figure 1.** Confirmation bias leads to narrow exploration. We show two trajectories for the rule discovery task, where an agent aims to infer a hidden numerical rule over multiple turns. Starting from an initial triple, the agent guesses a hypothesis and tests a new triple, receiving binary feedback on whether the proposed triple satisfies the hidden rule. A compatible test is consistent with the agent’s current hypothesis, …

**Figure 2.** Confirmation bias correlates with task success. Higher I:C (more disconfirmatory testing) is associated with higher task success. Each point represents a (model, variant) averaged over 80 episodes, and shaded regions show 95% confidence intervals.

**Figure 3.** Interventions shift exploration in thinking but not non-thinking models. Red denotes Baseline; blue denotes intervention runs (Dual-Goal / Think-in-Opposites). Ellipses show 95% covariance regions. Each point represents a (model, variant) averaged over 80 episodes. Impact of Interventions: Overall, the intervention strategies improve task success. Dual-Goal improves task success in eight out of eleven mo…

**Figure 4.** Confirmation bias correlates with task success. Higher I:C is associated with higher task success. Shaded regions show 95% confidence intervals. Each point represents a (model, variant) averaged over 192 episodes. Does confirmation bias correlate negatively with task success in this domain? We first evaluate whether the relationship between confirmation bias and task performance observed in the Wason task…
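
The captions for Figures 2 and 4 refer to an I:C statistic without defining it in the excerpted text. A plausible reading, and it is only an assumption here, is the ratio of incompatible to compatible tests within an episode. Under that assumption, a hypothetical helper `ic_ratio` could compute it from per-turn labels:

```python
from typing import List, Optional


def ic_ratio(labels: List[str]) -> Optional[float]:
    """Ratio of incompatible to compatible tests (assumed meaning of I:C).

    Returns None when there are no compatible tests, since the ratio
    is undefined in that case.
    """
    incompatible = labels.count("incompatible")
    compatible = labels.count("compatible")
    if compatible == 0:
        return None
    return incompatible / compatible


# Illustrative labels for one episode, not data from the paper.
episode = ["compatible", "compatible", "incompatible", "compatible"]
print(ic_ratio(episode))
```

A higher value means a larger share of tests tried to contradict the current hypothesis, which matches the caption's gloss of "more disconfirmatory testing".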

Abstract

Confirmation bias, the tendency to seek evidence that supports rather than challenges one's belief, hinders one's reasoning ability. We examine whether large language models (LLMs) exhibit confirmation bias by adapting the rule-discovery study from human psychology: given a sequence of three numbers (a "triple"), an agent engages in an interactive feedback loop where it (1) proposes a new triple, (2) receives feedback on whether it satisfies the hidden rule, and (3) guesses the rule. Across eleven LLMs of multiple families and scales, we find that LLMs exhibit confirmation bias, often proposing triples to confirm their hypothesis rather than trying to falsify it. This leads to slower and less frequent discovery of the hidden rule. We further explore intervention strategies (e.g., encouraging the agent to consider counter examples) developed for humans. We find prompting LLMs with such instruction consistently decreases confirmation bias in LLMs, improving rule discovery rates from 42% to 56% on average. Lastly, we mitigate confirmation bias by distilling intervention-induced behavior into LLMs, showing promising generalization to a new task, the Blicket test. Our work shows that confirmation bias is a limitation of LLMs in hypothesis exploration, and that it can be mitigated via injecting interventions designed for humans.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows LLMs exhibit confirmation bias on a Wason-style triple task and that a counterexample prompt lifts rule discovery from 42% to 56%, but the labeling of proposals may not cleanly separate bias from generation patterns.


Colleague, the core finding is straightforward: across eleven LLMs the models propose triples that confirm their current hypothesis more often than they test alternatives, which slows rule discovery. Adding a prompt to consider counterexamples raises success from 42% to 56% on average, and distilling the behavior gives some transfer to the Blicket test. That is the main new piece.

They are the first to run this exact psychology paradigm at scale on current models with quantitative results across families and sizes, and the intervention is simple enough to replicate. The numbers are reported directly from behavior, with no fitted parameters or derivations, so the empirical claim is easy to check once the prompts are public. The generalization step to Blicket is a nice addition that shows the effect is not locked to one task format.

The soft spot sits in the measurement itself. Labeling a proposal as confirmatory requires knowing what hypothesis the model is holding at that step. Because the model conditions on the full history, any post-hoc inference from the sequence risks mixing in surface generation habits rather than isolating an internal testing strategy. If the paper relies mainly on experimenter judgment for the labels, small changes in prompt wording or feedback phrasing could shift the counts. The abstract also leaves out the exact templates and any statistical tests, so it is hard to judge how stable the 14-point lift really is.

This is the sort of work that would interest people building reasoning benchmarks or studying LLM hypothesis testing. A reader who wants concrete examples of bias measurement and a low-cost fix would get value from it. The experiment is grounded enough and the result practical enough that it deserves a serious referee rather than a desk reject, provided the methods section supplies the missing prompt details and checks for sensitivity.

Referee Report

3 major / 2 minor

Summary. The paper adapts Wason's rule-discovery paradigm to LLMs: models receive an initial triple, propose new triples in an interactive loop receiving yes/no feedback on a hidden rule, and finally guess the rule. Across eleven models, the authors report that LLMs exhibit confirmation bias by preferentially generating confirmatory rather than falsifying triples, resulting in slower and less frequent rule discovery (baseline 42 %). Human-inspired interventions (e.g., explicit instructions to consider counter-examples) reduce this bias and raise average discovery to 56 %; the improved behavior is then distilled into the models and shown to generalize to the Blicket causal-reasoning task.

Significance. If the measurement of confirmation bias is shown to be robust, the work supplies concrete evidence that a well-studied human cognitive limitation also appears in current LLMs and can be partially mitigated by prompting and distillation. The multi-family, multi-scale evaluation and the cross-task generalization result are the strongest elements; they suggest a practical route to improving hypothesis-driven reasoning in deployed systems.

major comments (3)

[§3, §4] §3 (Methods) and §4 (Results): exact prompt templates, system instructions, temperature, top-p, and number of samples per model are not reported. Because the 42 % → 56 % lift is obtained by comparing two closely related prompt conditions, the absence of these details prevents assessment of prompt sensitivity and reproducibility.
[§4.2] §4.2 (Triple labeling): confirmation bias is scored by post-hoc classification of each proposed triple as confirmatory or falsifying relative to an inferred hypothesis. The manuscript does not state whether an explicit hypothesis is elicited from the model at each turn or whether the label is derived solely from the triple sequence; the latter risks circularity because the same surface behavior is used both to infer the hypothesis and to score the bias.
[§4.1, §5] §4.1 and §5: no statistical tests, confidence intervals, or per-model variance are provided for the reported average improvement or for the Blicket-test generalization. Without these, it is impossible to determine whether the 14-percentage-point gain is reliable or could be explained by sampling variability across the eleven models.
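
To make the third comment concrete, one way to put an interval on an average improvement across models is a percentile bootstrap over per-model differences. The sketch below is an assumption about how such a check could look, not a procedure from the paper; `bootstrap_ci` is a hypothetical helper and `placeholder_diffs` holds made-up values, not reported results.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(diffs: List[float], n_boot: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-model differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample models with replacement and record the mean each time.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi


# Placeholder per-model gains (intervention minus baseline); NOT paper data.
placeholder_diffs = [0.1, 0.2, 0.0, 0.3, 0.2]
low, high = bootstrap_ci(placeholder_diffs)
print(f"mean gain {mean(placeholder_diffs):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

If the resulting interval excluded zero when run on the real per-model rates, the average gain would be hard to attribute to sampling variability across models alone.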

minor comments (2)

[§5] The Blicket-test evaluation is described only at a high level; a short appendix table listing the exact causal structures and success criteria would improve clarity.
[Figure 2] Figure 2 (discovery-rate bar plot) would benefit from error bars or per-model scatter points to convey variability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve reproducibility, clarity, and statistical rigor.


Referee: [§3, §4] §3 (Methods) and §4 (Results): exact prompt templates, system instructions, temperature, top-p, and number of samples per model are not reported. Because the 42 % → 56 % lift is obtained by comparing two closely related prompt conditions, the absence of these details prevents assessment of prompt sensitivity and reproducibility.

Authors: We agree that these implementation details are essential for reproducibility. In the revised manuscript we will expand §3 to include the complete prompt templates for both the baseline and intervention conditions, the full system instructions, the precise values of temperature and top-p, and the exact number of samples per model. revision: yes
Referee: [§4.2] §4.2 (Triple labeling): confirmation bias is scored by post-hoc classification of each proposed triple as confirmatory or falsifying relative to an inferred hypothesis. The manuscript does not state whether an explicit hypothesis is elicited from the model at each turn or whether the label is derived solely from the triple sequence; the latter risks circularity because the same surface behavior is used both to infer the hypothesis and to score the bias.

Authors: We will revise §4.2 to explicitly describe the labeling procedure. No explicit hypothesis is elicited from the model at each turn. The hypothesis is inferred post-hoc from the full sequence of triples and feedback using a deterministic rule-based procedure that is independent of the bias-scoring step. We will document this inference algorithm in detail and add an alternative labeling that uses consistency with the true hidden rule as a robustness check. revision: partial
Referee: [§4.1, §5] §4.1 and §5: no statistical tests, confidence intervals, or per-model variance are provided for the reported average improvement or for the Blicket-test generalization. Without these, it is impossible to determine whether the 14-percentage-point gain is reliable or could be explained by sampling variability across the eleven models.

Authors: We acknowledge the need for statistical support. In the revised manuscript we will report per-model discovery rates together with standard deviations, 95 % confidence intervals for the average improvement, and the results of paired statistical tests (e.g., McNemar’s test) evaluating the significance of the 42 % to 56 % gain as well as the Blicket-task generalization. revision: yes
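
The last response mentions McNemar's test. As a rough sketch of what that involves, the exact two-sided version depends only on the discordant pairs: episodes solved under one prompt condition but not the other. This assumes episodes are paired across conditions (same hidden rule and starting triple), which the text does not confirm; `mcnemar_exact` is a hypothetical helper and the counts in the example are illustrative only.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value.

    b: pairs solved only under the baseline prompt
    c: pairs solved only under the intervention prompt
    Under the null hypothesis, each discordant pair falls on either
    side with probability 0.5, so min(b, c) follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Illustrative counts, not results from the paper.
print(mcnemar_exact(1, 9))
print(mcnemar_exact(5, 5))
```

Pairs where both conditions succeed or both fail do not enter the statistic, which is why the test suits a before/after comparison on the same set of episodes.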

Circularity Check

0 steps flagged

No significant circularity: purely empirical behavioral measurement


The paper reports direct experimental observations of LLM outputs in a triple-proposal feedback loop, with rule-discovery rates measured from model generations. No equations, fitted parameters, derivations, or self-citation chains appear in the provided text; the central claims rest on observable proposal sequences and post-hoc classification against hidden rules, without reducing any result to its own inputs by construction. This is a standard empirical evaluation and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities; the work is an empirical measurement study that relies only on standard LLM prompting and interaction protocols.



Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages

[1] Insufficient feasible intersection: the feasible intersection set contains fewer than five valid triples

[2] Unary dependence: the rule constrains only a single variable (e.g., “c is prime”). Conjunctive properties applied uniformly to all three variables (e.g., “all odd”) are retained

[3] Excessive complexity: the rule requires multi-step arithmetic dependencies unlikely to arise during hypothesis testing. When a group fails any criterion, we re-prompt the language model to replace only the offending rule(s) while keeping the remaining rules fixed. This process is repeated until all constraints are satisfied. Approximately one-third of init...

[4] If the text contains “DAX:”, “DAX is”, “A DAX triple is”, or “The DAX rule is”, take the phrase following that up to the next “;”, “.”, “,”, or the start of a MED clause

[5] Otherwise, if there are two clauses separated by “;” or “, and”, take the first clause as the DAX rule and ignore the rest

[6] Otherwise, treat the whole line as the DAX rule. Guidance specific to the ground-truth rule: {rule guidance} Rules: - Ignore phrasing/synonyms if meaning is identical. - If scopes differ (broader/narrower) or the family differs, answer NO. Respond with ONLY one token: YES or NO. ANNOUNCE TEXT: “{announced rule}” GROUND-TRUTH DAX RULE: “{ground truth rule}...

[7] The model’s current hypothesis is taken from the immediately preceding Announce turn

[8] The hypothesis is converted into an executable Python function using an LLM

[9] The tested triple (a, b, c) is evaluated using this function

[10] If the function returns True, the test is compatible; otherwise incompatible. C.4.1 Compatibility Judge Prompt (Rule-to-Python): Your task is to write a Python function that determines whether three integers (a, b, c) satisfy a given natural-language rule. The function should directly implement the logical condition described in the rule. Use basic arithmet...

[11] Form a concise current hypothesis of the relevant objects and hidden rule

[12] Identify one feature implicit in that hypothesis

[13] Test the OPPOSITE of that feature

[14] If the hypothesis still holds, the feature is probably irrelevant; if it does not, that feature might be crucial

[15] In short, prefer tests that both confirm and contradict your current idea. Format (must follow exactly): - If the instruction is Turn - Announce, output exactly one line: Announce: relevant=[object A, object B, object C]; rule=<one short description of the rule> (there can be any number of relevant objects) - If the instruction is Turn - Test, output exac...

[16] The set of guessed relevant objects matches the true relevant objects exactly (same members; order doesn’t matter; no extras; no missing). - The agent may say “relevant objects are ...” OR directly name the correct objects; both are acceptable

[17] The stated rule matches the true rule semantics exactly. - If the agent states OR / ANY / AT LEAST ONE when the true rule is conjunctive, that is False. - If the agent states AND / ALL / NEEDS BOTH when the true rule is disjunctive, that is False. - If the agent states XOR / EXACTLY ONE / A OR B BUT NOT BOTH / ONE AND ONLY ONE when the true rule is xor, t...

[18] TRUE TRUE relevant=[object 0, object 1, object 3]; rule=The device turns on when at least one of object 0, object 1, or object 3 is placed on it

[19] FALSE FALSE relevant=[object 0, object 2]; rule=contains object 0 or object 2 [1, 3] FALSE FALSE relevant=[object 0, object 1, object 2, object 3]; rule=all objects must be placed on the device [0, 1, 3] FALSE FALSE
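
Entries [7] through [10] above outline a labeling procedure: take the announced hypothesis, turn it into a Python function, evaluate the tested triple, and label the test compatible if the function returns True. A minimal sketch of the final two steps follows. The LLM conversion step is replaced here by a hand-written function, and `label_test` and `all_even` are hypothetical names, not identifiers from the paper.

```python
from typing import Callable, Tuple

Triple = Tuple[int, int, int]
Hypothesis = Callable[[int, int, int], bool]


def label_test(hypothesis: Hypothesis, triple: Triple) -> str:
    """Label a tested triple relative to the agent's current hypothesis."""
    a, b, c = triple
    # True means the test agrees with the hypothesis (compatible);
    # False means the test probes outside it (incompatible).
    return "compatible" if hypothesis(a, b, c) else "incompatible"


def all_even(a: int, b: int, c: int) -> bool:
    """Hand-written stand-in for an LLM-generated rule function
    implementing the announced hypothesis 'all numbers are even'."""
    return a % 2 == 0 and b % 2 == 0 and c % 2 == 0


print(label_test(all_even, (2, 4, 6)))
print(label_test(all_even, (1, 3, 5)))
```

Note that the label depends only on the agent's own hypothesis, not on the hidden rule, so an incompatible test can still receive positive feedback from the environment.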
