Failing to Falsify: Evaluating and Mitigating Confirmation Bias in Language Models
Pith reviewed 2026-05-13 21:27 UTC · model grok-4.3
The pith
Large language models exhibit confirmation bias by favoring confirming evidence over falsifying tests in rule-discovery tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs exhibit confirmation bias in an interactive rule-discovery task by proposing triples that confirm rather than falsify their hypotheses, resulting in slower and less frequent rule discovery. Prompting with instructions to consider counterexamples decreases this bias and improves average rule discovery rates from 42% to 56%. Distilling the intervention-induced behavior into the models yields generalization to the Blicket test task.
What carries the argument
The interactive triple-proposal feedback loop, where an agent proposes a new triple, receives yes/no feedback on the hidden rule, and guesses the rule.
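The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's harness: `run_episode`, `propose`, `guess`, the seed triple, and the turn budget are hypothetical names and settings, and the toy agent stands in for an LLM call.

```python
from typing import Callable, List, Tuple

Triple = Tuple[int, int, int]
History = List[Tuple[Triple, bool]]


def run_episode(
    hidden_rule: Callable[[Triple], bool],
    propose: Callable[[History], Triple],
    guess: Callable[[History], str],
    seed: Triple,
    max_turns: int = 5,
) -> Tuple[History, List[str]]:
    """Run one propose / feedback / guess episode (illustrative only)."""
    # Assumption: the seed triple is known to satisfy the hidden rule.
    history: History = [(seed, True)]
    guesses: List[str] = []
    for _ in range(max_turns):
        triple = propose(history)                        # (1) propose a new triple
        history.append((triple, hidden_rule(triple)))    # (2) yes/no feedback
        guesses.append(guess(history))                   # (3) guess the rule
    return history, guesses


def increasing(t: Triple) -> bool:
    """Toy hidden rule: strictly increasing numbers."""
    return t[0] < t[1] < t[2]


def confirming_agent(history: History) -> Triple:
    """Toy agent that only tests triples fitting its 'add 2' hypothesis."""
    a, b, c = history[-1][0]
    return (a + 2, b + 2, c + 2)


def fixed_guess(history: History) -> str:
    return "each number is 2 more than the previous one"


history, guesses = run_episode(
    increasing, confirming_agent, fixed_guess, seed=(2, 4, 6), max_turns=3
)
```

In this toy run the agent only proposes triples that fit its "add 2" idea, so every answer is yes and the narrower guess is never separated from the broader hidden rule.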
If this is right
- LLMs discover hidden rules less often and more slowly due to their preference for confirming evidence.
- Prompting LLMs to consider counterexamples consistently reduces confirmation bias across model families and scales.
- Distilling intervention behaviors into LLMs enables improved performance on new tasks like the Blicket test.
- Confirmation bias limits LLMs' effectiveness in interactive hypothesis exploration settings.
Where Pith is reading between the lines
- This bias may impair LLMs in domains requiring systematic falsification, such as scientific hypothesis testing or code debugging.
- Similar interventions could be tested in non-numeric hypothesis tasks to check if the bias is domain-general.
- The transfer via distillation suggests that targeted human-inspired training can improve LLM reasoning without full retraining.
- Larger models may still require explicit debiasing steps rather than overcoming the bias through scale alone.
Load-bearing premise
The specific triple-proposal task and feedback loop measures confirmation bias in LLMs in the same way it does for humans.
What would settle it
The claim would be undermined if LLMs, in repeated runs of the triple-proposal task, proposed falsifying triples at rates comparable to unbiased human performance, or if counterexample prompting failed to raise discovery rates above the 42% baseline.
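One way to make "rate of falsifying triples" concrete is sketched below. It assumes each test is scored against the hypothesis the agent held at that turn, with the hypothesis available as a Python predicate; `falsifying_rate`, `plus_two`, and the example triples are hypothetical and are not taken from the paper.

```python
from typing import Callable, List, Tuple

Triple = Tuple[int, int, int]
Hypothesis = Callable[[int, int, int], bool]


def falsifying_rate(tests: List[Triple], hypotheses: List[Hypothesis]) -> float:
    """Fraction of tests whose triple violates the hypothesis held at that turn."""
    if len(tests) != len(hypotheses):
        raise ValueError("need exactly one hypothesis per test")
    if not tests:
        return 0.0
    # A test counts as falsifying when the current hypothesis predicts "no" for it.
    incompatible = sum(1 for t, h in zip(tests, hypotheses) if not h(*t))
    return incompatible / len(tests)


def plus_two(a: int, b: int, c: int) -> bool:
    """Illustrative hypothesis: each number is 2 more than the previous one."""
    return b - a == 2 and c - b == 2


# Two tests fit the hypothesis, two deliberately violate it.
tests = [(4, 6, 8), (10, 12, 14), (1, 2, 3), (5, 3, 1)]
rate = falsifying_rate(tests, [plus_two] * len(tests))
```

A strongly confirming agent would score near 0 on this measure; what counts as an unbiased reference rate is an open choice that the sketch does not settle.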
Original abstract
Confirmation bias, the tendency to seek evidence that supports rather than challenges one's belief, hinders one's reasoning ability. We examine whether large language models (LLMs) exhibit confirmation bias by adapting the rule-discovery study from human psychology: given a sequence of three numbers (a "triple"), an agent engages in an interactive feedback loop where it (1) proposes a new triple, (2) receives feedback on whether it satisfies the hidden rule, and (3) guesses the rule. Across eleven LLMs of multiple families and scales, we find that LLMs exhibit confirmation bias, often proposing triples to confirm their hypothesis rather than trying to falsify it. This leads to slower and less frequent discovery of the hidden rule. We further explore intervention strategies (e.g., encouraging the agent to consider counter examples) developed for humans. We find prompting LLMs with such instruction consistently decreases confirmation bias in LLMs, improving rule discovery rates from 42% to 56% on average. Lastly, we mitigate confirmation bias by distilling intervention-induced behavior into LLMs, showing promising generalization to a new task, the Blicket test. Our work shows that confirmation bias is a limitation of LLMs in hypothesis exploration, and that it can be mitigated via injecting interventions designed for humans.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper adapts Wason's rule-discovery paradigm to LLMs: models receive an initial triple, propose new triples in an interactive loop receiving yes/no feedback on a hidden rule, and finally guess the rule. Across eleven models, the authors report that LLMs exhibit confirmation bias by preferentially generating confirmatory rather than falsifying triples, resulting in slower and less frequent rule discovery (baseline 42 %). Human-inspired interventions (e.g., explicit instructions to consider counter-examples) reduce this bias and raise average discovery to 56 %; the improved behavior is then distilled into the models and shown to generalize to the Blicket causal-reasoning task.
Significance. If the measurement of confirmation bias is shown to be robust, the work supplies concrete evidence that a well-studied human cognitive limitation also appears in current LLMs and can be partially mitigated by prompting and distillation. The multi-family, multi-scale evaluation and the cross-task generalization result are the strongest elements; they suggest a practical route to improving hypothesis-driven reasoning in deployed systems.
major comments (3)
- [§3, §4] §3 (Methods) and §4 (Results): exact prompt templates, system instructions, temperature, top-p, and number of samples per model are not reported. Because the 42 % → 56 % lift is obtained by comparing two closely related prompt conditions, the absence of these details prevents assessment of prompt sensitivity and reproducibility.
- [§4.2] §4.2 (Triple labeling): confirmation bias is scored by post-hoc classification of each proposed triple as confirmatory or falsifying relative to an inferred hypothesis. The manuscript does not state whether an explicit hypothesis is elicited from the model at each turn or whether the label is derived solely from the triple sequence; the latter risks circularity because the same surface behavior is used both to infer the hypothesis and to score the bias.
- [§4.1, §5] §4.1 and §5: no statistical tests, confidence intervals, or per-model variance are provided for the reported average improvement or for the Blicket-test generalization. Without these, it is impossible to determine whether the 14-percentage-point gain is reliable or could be explained by sampling variability across the eleven models.
minor comments (2)
- [§5] The Blicket-test evaluation is described only at a high level; a short appendix table listing the exact causal structures and success criteria would improve clarity.
- [Figure 2] Figure 2 (discovery-rate bar plot) would benefit from error bars or per-model scatter points to convey variability.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript to improve reproducibility, clarity, and statistical rigor.
- Referee: [§3, §4] §3 (Methods) and §4 (Results): exact prompt templates, system instructions, temperature, top-p, and number of samples per model are not reported. Because the 42 % → 56 % lift is obtained by comparing two closely related prompt conditions, the absence of these details prevents assessment of prompt sensitivity and reproducibility.
Authors: We agree that these implementation details are essential for reproducibility. In the revised manuscript we will expand §3 to include the complete prompt templates for both the baseline and intervention conditions, the full system instructions, the precise values of temperature and top-p, and the exact number of samples per model. revision: yes
- Referee: [§4.2] §4.2 (Triple labeling): confirmation bias is scored by post-hoc classification of each proposed triple as confirmatory or falsifying relative to an inferred hypothesis. The manuscript does not state whether an explicit hypothesis is elicited from the model at each turn or whether the label is derived solely from the triple sequence; the latter risks circularity because the same surface behavior is used both to infer the hypothesis and to score the bias.
Authors: We will revise §4.2 to explicitly describe the labeling procedure. No explicit hypothesis is elicited from the model at each turn. The hypothesis is inferred post-hoc from the full sequence of triples and feedback using a deterministic rule-based procedure that is independent of the bias-scoring step. We will document this inference algorithm in detail and add an alternative labeling that uses consistency with the true hidden rule as a robustness check. revision: partial
- Referee: [§4.1, §5] §4.1 and §5: no statistical tests, confidence intervals, or per-model variance are provided for the reported average improvement or for the Blicket-test generalization. Without these, it is impossible to determine whether the 14-percentage-point gain is reliable or could be explained by sampling variability across the eleven models.
Authors: We acknowledge the need for statistical support. In the revised manuscript we will report per-model discovery rates together with standard deviations, 95 % confidence intervals for the average improvement, and the results of paired statistical tests (e.g., McNemar’s test) evaluating the significance of the 42 % to 56 % gain as well as the Blicket-task generalization. revision: yes
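As an illustration of the paired test named in the last response, the sketch below implements the exact (binomial) form of McNemar's test using only the standard library. It assumes each baseline run can be paired with an intervention run on the same model and hidden rule; `mcnemar_exact` is a hypothetical helper and the outcome lists are invented for the example, not results from the paper.

```python
from math import comb
from typing import Sequence


def mcnemar_exact(baseline: Sequence[bool], intervention: Sequence[bool]) -> float:
    """Two-sided exact McNemar p-value for paired success/failure outcomes."""
    if len(baseline) != len(intervention):
        raise ValueError("outcomes must be paired one-to-one")
    # Only discordant pairs carry information about a change between conditions.
    b = sum(1 for x, y in zip(baseline, intervention) if x and not y)
    c = sum(1 for x, y in zip(baseline, intervention) if not x and y)
    n = b + c
    if n == 0:
        return 1.0
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Invented data: True means the hidden rule was discovered in that paired run.
baseline = [False] * 6 + [True] * 4
intervention = [True] * 6 + [True] * 3 + [False]
p_value = mcnemar_exact(baseline, intervention)
```

With these invented outcomes there are seven discordant pairs (six improved, one regressed). A test of this kind applies to per-run outcomes; comparing eleven per-model averages would need a different paired test.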
Circularity Check
No significant circularity: purely empirical behavioral measurement
The paper reports direct experimental observations of LLM outputs in a triple-proposal feedback loop, with rule-discovery rates measured from model generations. No equations, fitted parameters, derivations, or self-citation chains appear in the provided text; the central claims rest on observable proposal sequences and post-hoc classification against hidden rules, without reducing any result to its own inputs by construction. This is a standard empirical evaluation and receives the default non-circularity finding.
Reference graph
Works this paper leans on
-
[1]
Insufficient feasible intersection: the feasible intersection set contains fewer than five valid triples
-
[2]
Unary dependence: the rule constrains only a single variable (e.g., “c is prime”). Conjunctive properties applied uniformly to all three variables (e.g., “all odd”) are retained
-
[3]
all numbers above (or below) 50
Excessive complexity: the rule requires multi-step arithmetic dependencies unlikely to arise during hypothesis testing. When a group fails any criterion, we re-prompt the language model to replace only the offending rule(s) while keeping the remaining rules fixed. This process is repeated until all constraints are satisfied. Approximately one-third of init...
-
[4]
If the text contains “DAX:”, “DAX is”, “A DAX triple is”, or “The DAX rule is”, take the phrase following that up to the next “;”, “.”, “,”, or the start of a MED clause
-
[5]
Otherwise, if there are two clauses separated by “;” or “, and”, take the first clause as the DAX rule and ignore the rest
-
[6]
Otherwise, treat the whole line as the DAX rule. Guidance specific to the ground-truth rule: {rule guidance} Rules: - Ignore phrasing/synonyms if meaning is identical. - If scopes differ (broader/narrower) or the family differs, answer NO. Respond with ONLY one token: YES or NO. ANNOUNCE TEXT: “{announced rule}” GROUND-TRUTH DAX RULE: “{ground truth rule}...
-
[7]
The model’s current hypothesis is taken from the immediately preceding Announce turn
-
[8]
The hypothesis is converted into an executable Python function using an LLM
-
[9]
The tested triple (a, b, c) is evaluated using this function
-
[10]
If the function returns True, the test is compatible; otherwise incompatible. C.4.1 COMPATIBILITY JUDGE PROMPT (RULE-TO-PYTHON) Your task is to write a Python function that determines whether three integers (a, b, c) satisfy a given natural-language rule. The function should directly implement the logical condition described in the rule. Use basic arithmet...
-
[11]
Form a concise current hypothesis of the relevant objects and hidden rule
-
[12]
Identify one feature implicit in that hypothesis
-
[13]
Test the OPPOSITE of that feature
-
[14]
If the hypothesis still holds, the feature is probably irrelevant; if it does not, that feature might be crucial
-
[15]
In short, prefer tests that both confirm and contradict your current idea. Format (must follow exactly): - If the instruction is Turn - Announce, output exactly one line: Announce: relevant=[object A, object B, object C]; rule=<one short description of the rule> (there can be any number of relevant objects) - If the instruction is Turn - Test, output exac...
-
[16]
The set of guessed relevant objects matches the true relevant objects exactly (same members; order doesn’t matter; no extras; no missing). - The agent may say “relevant objects are ...” OR directly name the correct objects; both are acceptable
-
[17]
The stated rule matches the true rule semantics exactly. - If the agent states OR / ANY / AT LEAST ONE when the true rule is conjunctive, that is False. - If the agent states AND / ALL / NEEDS BOTH when the true rule is disjunctive, that is False. - If the agent states XOR / EXACTLY ONE / A OR B BUT NOT BOTH / ONE AND ONLY ONE when the true rule is xor, t...
-
[18]
TRUE TRUE relevant=[object 0, object 1, object 3]; rule=The device turns on when at least one of object 0, object 1, or object 3 is placed on it
-
[19]
FALSE FALSE relevant=[object 0, object 2]; rule=contains object 0 or object 2 [1, 3] FALSE FALSE relevant=[object 0, object 1, object 2, object 3]; rule=all objects must be placed on the device [0, 1, 3] FALSE FALSE