LLMs are not (consistently) Bayesian: Quantifying internal (in)consistencies of LLMs' probabilistic beliefs

Adam Goliński; Chacha Chen; Guillermo Sapiro; Masha Fedzechkina; Matthew Jörke; Nicholas Foti; Sinead Williamson

arxiv: 2605.06915 · v2 · pith:UN2O3Z3Znew · submitted 2026-05-07 · 💻 cs.LG

Chacha Chen, Matthew Jörke, Adam Goliński, Masha Fedzechkina, Guillermo Sapiro, Sinead Williamson, Nicholas Foti

Pith reviewed 2026-05-11 00:49 UTC · model grok-4.3

classification 💻 cs.LG

keywords: large language models · Bayesian updating · probabilistic beliefs · information processing gap · belief consistency · heuristics · model misspecification

The pith

Large language models do not consistently update probabilistic beliefs according to Bayesian rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to treat LLMs as information processing rules and introduces the information processing gap to measure how consistently they update beliefs when new evidence arrives. Experiments across multiple incorporation approaches show that some produce nearly Bayesian updates while others rely on learned heuristics. The heuristic approaches often deliver better performance on downstream tasks than exact Bayesian computation, which points to misspecification in the models' internal representations of the world. The gap measure additionally functions as a diagnostic for problems in LLM-based inference systems.

Core claim

LLMs are not consistently Bayesian: when evidence is incorporated, some methods produce nearly Bayesian belief updates while others follow learned heuristics, and the non-Bayesian heuristics frequently outperform exact Bayesian computation on tasks, indicating that the models' probabilistic world models are misspecified.

What carries the argument

The information processing gap, a quantitative measure of the inconsistency between an LLM's actual belief updates from evidence and the updates required by Bayesian probability.
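A minimal sketch of how such a gap could be computed for a discrete set of hypotheses. It assumes the gap is the KL divergence from the model's directly stated posterior to the Bayes posterior built from its separately elicited prior and likelihood, following the passage quoted later on this page; the paper's exact formula is not reproduced here. The function names and the numbers are hypothetical.

```python
import math

def bayes_posterior(prior, likelihood):
    """Bayes posterior over hypotheses, given a prior and the likelihood
    of the observed evidence under each hypothesis."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)
    if evidence <= 0:
        raise ValueError("evidence has zero probability under the elicited model")
    return [j / evidence for j in joint]

def kl_divergence(q, p):
    """D_KL(q || p) in nats; terms with q_i == 0 contribute zero."""
    total = 0.0
    for qi, pi in zip(q, p):
        if qi > 0:
            if pi <= 0:
                return math.inf
            total += qi * math.log(qi / pi)
    return total

def information_processing_gap(prior, likelihood, stated_posterior):
    """Assumed form of the gap: KL from the stated posterior to the Bayes posterior."""
    return kl_divergence(stated_posterior, bayes_posterior(prior, likelihood))

# Hypothetical elicited values for two hypotheses (not taken from the paper).
prior = [0.5, 0.5]
likelihood = [0.8, 0.2]

# A stated posterior equal to the Bayes posterior gives a zero gap.
gap_consistent = information_processing_gap(
    prior, likelihood, bayes_posterior(prior, likelihood)
)
# A stated posterior that under-reacts to the evidence gives a positive gap.
gap_heuristic = information_processing_gap(prior, likelihood, [0.6, 0.4])
```

Under this reading, a gap of zero means the stated posterior agrees with the model's own prior and likelihood, and larger values mean a larger departure from its own Bayesian update.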

If this is right

Some methods for incorporating evidence into LLMs achieve nearly Bayesian updates.
Other methods use learned heuristics that deviate from Bayesian standards.
Heuristic-based updates can produce higher downstream task performance than exact Bayesian computation.
The information processing gap serves as a diagnostic for identifying issues in LLM-powered inferential systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Forcing stricter Bayesian consistency might reduce rather than improve practical performance in some applications.
The gap could be applied to test consistency in other forms of reasoning beyond probability.
Training data may encourage approximate heuristics over exact inference rules.

Load-bearing premise

That LLMs maintain stable internal probabilistic beliefs which can be reliably elicited and that the information processing gap captures genuine inconsistencies rather than prompting artifacts.

What would settle it

Repeating the same evidence incorporation experiment with varied but logically equivalent prompt phrasings or interfaces and finding that the measured gaps and downstream performance rankings remain unchanged.
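One way such a check could be organised is sketched below. It assumes the gap has already been measured for each evidence-incorporation method under each prompt phrasing; the method names, phrasing labels, tolerance, and numbers are all hypothetical and not taken from the paper.

```python
def rank_methods(gaps_by_method):
    """Method names ordered from smallest to largest gap."""
    return sorted(gaps_by_method, key=gaps_by_method.get)

def robustness_report(gaps_by_phrasing, tolerance=0.05):
    """Summarise stability of gaps across logically equivalent phrasings.

    gaps_by_phrasing maps phrasing label -> {method name -> measured gap}.
    The tolerance is an arbitrary illustrative threshold.
    """
    tables = list(gaps_by_phrasing.values())
    if not tables:
        raise ValueError("need at least one phrasing")
    methods = list(tables[0])
    reference_ranking = rank_methods(tables[0])
    # The ranking is stable if every phrasing orders the methods identically.
    ranking_stable = all(rank_methods(t) == reference_ranking for t in tables)
    # Spread is the max-minus-min gap per method across phrasings.
    spread = {
        m: max(t[m] for t in tables) - min(t[m] for t in tables)
        for m in methods
    }
    gaps_stable = all(s <= tolerance for s in spread.values())
    return {
        "ranking_stable": ranking_stable,
        "gaps_stable": gaps_stable,
        "spread": spread,
    }

# Made-up measurements in which phrasing barely matters.
stable_case = {
    "phrasing_a": {"direct_prompt": 0.02, "stepwise_prompt": 0.31},
    "phrasing_b": {"direct_prompt": 0.03, "stepwise_prompt": 0.29},
}
stable_report = robustness_report(stable_case)

# Adding a phrasing that flips the ordering marks the result as prompt-sensitive.
shifted_case = dict(
    stable_case,
    phrasing_c={"direct_prompt": 0.40, "stepwise_prompt": 0.10},
)
shifted_report = robustness_report(shifted_case)
```

The same report could be run on downstream performance scores instead of gaps to check whether the performance ranking also survives rephrasing.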

Original abstract

Modern AI systems are being deployed in complex domains such as medicine, science, and law, where it is important that they not only produce correct answers, but also represent and update uncertain beliefs about the world as new evidence arrives. We introduce the novel technique of studying LLMs as information processing rules and utilize the information processing gap to study the internal (in)consistencies of how LLMs update their probabilistic beliefs from evidence. Our extensive experiments evaluate multiple approaches in which LLMs can incorporate evidence into their beliefs. Some of these approaches produce (nearly) Bayesian updates; others seem to use a learned heuristic. Surprisingly, the non-Bayesian heuristic updates often outperform exact Bayesian computation in terms of downstream task performance -- indicating the LLMs' probabilistic models of the world are misspecified. Lastly, we show how our measure can provide diagnostics to identify issues with LLM-powered inferential systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows LLMs update beliefs inconsistently with Bayesian rules and that their heuristics sometimes beat exact Bayes on tasks, but the elicitation method leaves open whether this reflects real internals or prompt artifacts.

The key things here are that LLMs don't update probabilities consistently like Bayesians, and that their heuristic updates can actually do better on tasks than proper Bayesian ones, pointing to misspecified internal models. They back this with a new way to measure the gap between what the model says directly and what Bayesian math would predict from its own elicited beliefs.

The paper does well by treating LLMs as black-box information processors and testing multiple update methods in experiments. The surprise result about heuristics winning is the kind of thing that makes you rethink assumptions about how these models represent uncertainty. It's a concrete step beyond just saying LLMs are bad at probabilities.

The main soft spot is the elicitation of those beliefs. Prompting for priors and likelihoods can be flaky, and if the paper doesn't show that the same values come out reliably under different phrasings, then the information processing gap might be measuring prompt inconsistency more than internal belief inconsistency. That undercuts the claim about the models being misspecified. Minor issues could include whether they controlled for order effects or used proper statistical tests across runs, but those are fixable.

This work is for researchers focused on making LLMs reliable for decisions under uncertainty, like in medical or legal AI. Readers who evaluate LLM capabilities or build systems that need probabilistic reasoning will find the diagnostics helpful. It deserves serious peer review because the idea is original and the experiments are broad, even if the interpretation needs tightening around the prompting assumption. I would recommend sending it out for review, but suggest the authors add robustness checks on elicitation as a condition.

Referee Report

2 major / 2 minor

Summary. The paper introduces the 'information processing gap' to quantify inconsistencies in how LLMs update probabilistic beliefs upon receiving evidence. It evaluates multiple evidence-incorporation methods via prompting, finding that some produce nearly Bayesian updates while others rely on learned heuristics; surprisingly, the heuristic updates often outperform exact Bayesian computation on downstream tasks, which the authors interpret as evidence that LLMs' internal world models are misspecified. The work also positions the gap as a diagnostic tool for LLM-powered inferential systems.

Significance. If the prompted elicitations reliably reflect stable internal probabilistic representations rather than surface artifacts, the results would be significant for understanding and improving LLM reasoning under uncertainty in domains like medicine and science. The empirical comparison of update rules and the counterintuitive finding that non-Bayesian heuristics can outperform exact Bayes provide a concrete, falsifiable lens on model misspecification that goes beyond standard accuracy benchmarks.

major comments (2)

[§3 and §4] §3 (Method) and §4 (Experiments): The information processing gap is defined by comparing a direct prompted posterior to a Bayesian posterior computed from separately elicited prior and likelihood. This comparison is load-bearing for the central claim that some updates are 'nearly Bayesian' while others are heuristic, yet the paper reports no systematic robustness checks across elicitation variants (e.g., different phrasings, order of evidence presentation, or temperature settings). Without such controls, the gap and the downstream performance advantage of heuristics could reflect prompt sensitivity rather than internal (in)consistency.
[§4.3] §4.3 (Downstream task evaluation): The claim that non-Bayesian heuristic updates 'often outperform exact Bayesian computation' is central to the misspecification interpretation. However, the reported effect sizes and statistical significance are not broken down by task or model scale, and it is unclear whether the Bayesian baseline uses the same elicited quantities or an oracle prior; this makes it difficult to isolate whether the advantage stems from misspecification or from differences in how evidence is represented.

minor comments (2)

[Eq. (2)] Notation for the information processing gap (Eq. 2) could be clarified by explicitly stating whether the gap is an absolute difference, KL divergence, or another metric, and by providing the exact formula used for the Bayesian posterior computation.
[Figure 3] Figure 3 (or equivalent) showing update consistency across methods would benefit from error bars or confidence intervals to allow readers to assess the reliability of the 'nearly Bayesian' classification.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help clarify the robustness and interpretability of our results on the information processing gap. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Referee: [§3 and §4] §3 (Method) and §4 (Experiments): The information processing gap is defined by comparing a direct prompted posterior to a Bayesian posterior computed from separately elicited prior and likelihood. This comparison is load-bearing for the central claim that some updates are 'nearly Bayesian' while others are heuristic, yet the paper reports no systematic robustness checks across elicitation variants (e.g., different phrasings, order of evidence presentation, or temperature settings). Without such controls, the gap and the downstream performance advantage of heuristics could reflect prompt sensitivity rather than internal (in)consistency.

Authors: We agree that systematic robustness checks are essential to ensure the information processing gap captures internal (in)consistencies rather than elicitation artifacts. While the original experiments used a consistent prompting protocol across models and tasks, we did not report exhaustive variations. In the revised manuscript, we will add a dedicated robustness subsection to §4 that includes: (i) multiple alternative phrasings for prior, likelihood, and posterior elicitation; (ii) variations in evidence presentation order; and (iii) results across temperatures from 0.0 to 1.0. We will quantify how the gap and downstream performance differences vary (or remain stable) under these conditions, with statistical summaries of sensitivity.
Revision: yes

Referee: [§4.3] §4.3 (Downstream task evaluation): The claim that non-Bayesian heuristic updates 'often outperform exact Bayesian computation' is central to the misspecification interpretation. However, the reported effect sizes and statistical significance are not broken down by task or model scale, and it is unclear whether the Bayesian baseline uses the same elicited quantities or an oracle prior; this makes it difficult to isolate whether the advantage stems from misspecification or from differences in how evidence is represented.

Authors: We appreciate the request for greater transparency in the downstream evaluation. The Bayesian baseline throughout §4.3 is computed from the same elicited priors and likelihoods used for the prompted posteriors (not an oracle). To address the lack of breakdowns, the revision will expand §4.3 with per-task and per-scale tables reporting: mean performance differences, effect sizes (Cohen's d), and p-values from paired statistical tests. These will be disaggregated by task domain (e.g., medical, scientific) and model size, allowing clearer assessment of whether the heuristic advantage is consistent with misspecification.
Revision: yes
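For readers unfamiliar with the statistics named in this response, here is a minimal sketch of a paired comparison between heuristic and Bayesian updates on the same tasks. The rebuttal does not say which variant of Cohen's d is intended; this sketch assumes the paired-samples form (mean difference divided by the standard deviation of the differences). The function name and scores are hypothetical.

```python
import math
import statistics

def paired_effect_size(heuristic_scores, bayes_scores):
    """Mean difference, paired Cohen's d, and paired t statistic.

    Both inputs are per-task scores in the same task order.
    """
    if len(heuristic_scores) != len(bayes_scores):
        raise ValueError("scores must be paired task by task")
    diffs = [h - b for h, b in zip(heuristic_scores, bayes_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; effect size undefined")
    d_z = mean_diff / sd_diff      # assumed variant: Cohen's d for paired data
    t_stat = d_z * math.sqrt(n)    # paired t statistic with n - 1 degrees of freedom
    return mean_diff, d_z, t_stat

# Made-up per-task accuracies, for illustration only.
heuristic = [0.70, 0.65, 0.80, 0.75]
bayes = [0.60, 0.60, 0.70, 0.60]
mean_diff, d_z, t_stat = paired_effect_size(heuristic, bayes)
```

A p-value would then come from a t distribution with n - 1 degrees of freedom, which this sketch leaves to a statistics library.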

Circularity Check

0 steps flagged

Empirical evaluation of LLM belief updates with no self-referential derivations or fitted predictions

The paper frames its contribution as an empirical study introducing the 'information processing gap' to compare LLM evidence incorporation methods against exact Bayesian updates. No load-bearing mathematical derivations, parameter fits renamed as predictions, or self-citation chains appear in the abstract or described methodology. All central claims rest on experimental comparisons of prompted outputs versus computed baselines, which are externally falsifiable and do not reduce to the paper's own inputs by construction. This is the expected non-circular outcome for a purely evaluative work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on treating LLMs as having elicitable probabilistic beliefs and on the validity of the new information processing gap as a diagnostic; no free parameters are described, but the approach introduces a new conceptual entity without external validation beyond the reported experiments.

axioms (1)

domain assumption: LLMs maintain internal probabilistic beliefs that can be accessed and compared to ideal Bayesian updates through prompting or other interfaces.
This is the foundational premise for defining the information processing gap and evaluating consistency.

invented entities (1)

information processing gap (no independent evidence)
purpose: Quantifies the difference between an LLM's actual belief update and the ideal Bayesian update from evidence.
New measure introduced to study internal inconsistencies; no independent evidence provided outside the paper's experiments.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation · washburn_uniqueness_aczel · matches
MATCHES: this paper passage directly uses, restates, or depends on the cited Recognition theorem or module.
Paper passage: Δ(q) ≜ I_out − I_in = D_KL(q||p) ≥ 0 ... zero only when the post-data distribution is obtained via Bayes’ theorem (Zellner, 1988).

Foundation.LogicAsFunctionalEquation · Translation Theorem / bilinear_family_forced · echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Paper passage: Δ(q) = D_KL(q||π) − I_LER ... the amount we move our beliefs ... should comport with how strong the evidence supports the hypothesis.
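
The two quoted expressions for Δ(q) can be connected by a short derivation. This is a sketch under assumptions this page does not confirm: π is the prior, ℓ(y | θ) the likelihood, m(y) the marginal likelihood of the evidence, p the Bayes posterior, and I_LER is read as the expected log likelihood-to-evidence ratio under q.

```latex
\begin{align}
p(\theta \mid y) &= \frac{\pi(\theta)\,\ell(y \mid \theta)}{m(y)},
  \qquad m(y) = \int \pi(\theta)\,\ell(y \mid \theta)\,d\theta \\
\Delta(q) &= D_{\mathrm{KL}}(q \,\|\, p)
  = \mathbb{E}_{q}\!\left[\log \frac{q(\theta)}{p(\theta \mid y)}\right] \\
&= \mathbb{E}_{q}\!\left[\log \frac{q(\theta)}{\pi(\theta)}\right]
  - \mathbb{E}_{q}\!\left[\log \frac{\ell(y \mid \theta)}{m(y)}\right] \\
&= D_{\mathrm{KL}}(q \,\|\, \pi) - I_{\mathrm{LER}}
\end{align}
```

Under this reading, Δ(q) ≥ 0 follows from the non-negativity of KL divergence, with equality exactly when q coincides with p almost everywhere. The decomposition says that how far beliefs move from the prior should match the strength of the evidence as measured under the updated beliefs.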

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning
cs.AI · 2026-06 · unverdicted · novelty 6.0

Humans and LLMs exhibit similar error patterns in common-sense reasoning, consistent with shared pattern-matching mechanisms rather than abstract world models.