LLMs are not (consistently) Bayesian: Quantifying internal (in)consistencies of LLMs' probabilistic beliefs
Pith reviewed 2026-05-11 00:49 UTC · model grok-4.3
The pith
Large language models do not consistently update probabilistic beliefs according to Bayesian rules.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLMs are not consistently Bayesian: when evidence is incorporated, some methods produce nearly Bayesian belief updates while others follow learned heuristics, and the non-Bayesian heuristics frequently outperform exact Bayesian computation on tasks, indicating that the models' probabilistic world models are misspecified.
What carries the argument
The information processing gap, a quantitative measure of the inconsistency between an LLM's actual belief updates from evidence and the updates required by Bayesian probability.
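A minimal sketch of how such a gap could be computed for a single question, assuming the gap is the KL divergence between the model's directly prompted posterior and the Bayes posterior built from its separately elicited prior and likelihood. That reading follows the passage quoted further down this page; the paper's exact formula may differ. All function names and numbers are hypothetical and chosen only for illustration.

```python
import math

def bayes_posterior(prior, likelihood):
    """Posterior over hypotheses from a prior and P(evidence | hypothesis)."""
    joint = [p * l for p, l in zip(prior, likelihood)]
    evidence = sum(joint)
    if evidence <= 0:
        raise ValueError("evidence has zero probability under the elicited model")
    return [j / evidence for j in joint]

def kl_divergence(q, p):
    """D_KL(q || p) in nats; terms with q_i == 0 contribute zero."""
    total = 0.0
    for qi, pi in zip(q, p):
        if qi > 0:
            if pi <= 0:
                return math.inf
            total += qi * math.log(qi / pi)
    return total

# Hypothetical elicited values for a two-hypothesis question (not from the paper).
elicited_prior = [0.2, 0.8]
elicited_likelihood = [0.9, 0.3]    # P(evidence | hypothesis)
prompted_posterior = [0.6, 0.4]     # what the model says after seeing the evidence

bayes = bayes_posterior(elicited_prior, elicited_likelihood)
gap = kl_divergence(prompted_posterior, bayes)
print(f"Bayes posterior: {bayes}, gap: {gap:.4f} nats")
```

Under this reading, the gap is zero exactly when the prompted posterior equals the Bayes posterior and grows as the two diverge.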
If this is right
- Some methods for incorporating evidence into LLMs achieve nearly Bayesian updates.
- Other methods use learned heuristics that deviate from Bayesian standards.
- Heuristic-based updates can produce higher downstream task performance than exact Bayesian computation.
- The information processing gap serves as a diagnostic for identifying issues in LLM-powered inferential systems.
Where Pith is reading between the lines
- Forcing stricter Bayesian consistency might reduce rather than improve practical performance in some applications.
- The gap could be applied to test consistency in other forms of reasoning beyond probability.
- Training data may encourage approximate heuristics over exact inference rules.
Load-bearing premise
That LLMs maintain stable internal probabilistic beliefs which can be reliably elicited and that the information processing gap captures genuine inconsistencies rather than prompting artifacts.
What would settle it
Repeating the same evidence incorporation experiment with varied but logically equivalent prompt phrasings or interfaces and finding that the measured gaps and downstream performance rankings remain unchanged.
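One way such a check could be organized, as a sketch only: collect the measured gap for each evidence-incorporation method under each prompt phrasing, then test whether the ordering of methods is the same under every phrasing and how far each method's gap moves. The phrasing labels, method names, and gap values below are hypothetical placeholders, not results from the paper.

```python
def rank_methods(gaps):
    """Method names ordered from smallest to largest gap."""
    return sorted(gaps, key=gaps.get)

def rankings_stable(gaps_by_phrasing):
    """True when every phrasing yields the same ordering of methods."""
    rankings = [rank_methods(g) for g in gaps_by_phrasing.values()]
    return all(r == rankings[0] for r in rankings)

def gap_spread(gaps_by_phrasing):
    """Per-method range (max minus min) of the gap across phrasings."""
    runs = list(gaps_by_phrasing.values())
    return {
        method: max(run[method] for run in runs) - min(run[method] for run in runs)
        for method in runs[0]
    }

# Hypothetical gaps (in nats) for two methods under three equivalent phrasings.
gaps_by_phrasing = {
    "phrasing_a": {"method_1": 0.02, "method_2": 0.31},
    "phrasing_b": {"method_1": 0.04, "method_2": 0.27},
    "phrasing_c": {"method_1": 0.03, "method_2": 0.35},
}

stable = rankings_stable(gaps_by_phrasing)
spread = gap_spread(gaps_by_phrasing)
print(f"rankings stable: {stable}, spread: {spread}")
```

The same structure would apply to downstream performance rankings by swapping the gap values for task scores and reversing the sort order.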
Original abstract
Modern AI systems are being deployed in complex domains such as medicine, science, and law, where it is important that they not only produce correct answers, but also represent and update uncertain beliefs about the world as new evidence arrives. We introduce the novel technique of studying LLMs as information processing rules and utilize the information processing gap to study the internal (in)consistencies of how LLMs update their probabilistic beliefs from evidence. Our extensive experiments evaluate multiple approaches in which LLMs can incorporate evidence into their beliefs. Some of these approaches produce (nearly) Bayesian updates; others seem to use a learned heuristic. Surprisingly, the non-Bayesian heuristic updates often outperform exact Bayesian computation in terms of downstream task performance -- indicating the LLMs' probabilistic models of the world are misspecified. Lastly, we show how our measure can provide diagnostics to identify issues with LLM-powered inferential systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the 'information processing gap' to quantify inconsistencies in how LLMs update probabilistic beliefs upon receiving evidence. It evaluates multiple evidence-incorporation methods via prompting, finding that some produce nearly Bayesian updates while others rely on learned heuristics; surprisingly, the heuristic updates often outperform exact Bayesian computation on downstream tasks, which the authors interpret as evidence that LLMs' internal world models are misspecified. The work also positions the gap as a diagnostic tool for LLM-powered inferential systems.
Significance. If the prompted elicitations reliably reflect stable internal probabilistic representations rather than surface artifacts, the results would be significant for understanding and improving LLM reasoning under uncertainty in domains like medicine and science. The empirical comparison of update rules and the counterintuitive finding that non-Bayesian heuristics can outperform exact Bayes provide a concrete, falsifiable lens on model misspecification that goes beyond standard accuracy benchmarks.
major comments (2)
- [§3 and §4] §3 (Method) and §4 (Experiments): The information processing gap is defined by comparing a direct prompted posterior to a Bayesian posterior computed from separately elicited prior and likelihood. This comparison is load-bearing for the central claim that some updates are 'nearly Bayesian' while others are heuristic, yet the paper reports no systematic robustness checks across elicitation variants (e.g., different phrasings, order of evidence presentation, or temperature settings). Without such controls, the gap and the downstream performance advantage of heuristics could reflect prompt sensitivity rather than internal (in)consistency.
- [§4.3] §4.3 (Downstream task evaluation): The claim that non-Bayesian heuristic updates 'often outperform exact Bayesian computation' is central to the misspecification interpretation. However, the reported effect sizes and statistical significance are not broken down by task or model scale, and it is unclear whether the Bayesian baseline uses the same elicited quantities or an oracle prior; this makes it difficult to isolate whether the advantage stems from misspecification or from differences in how evidence is represented.
minor comments (2)
- [Eq. (2)] Notation for the information processing gap (Eq. 2) could be clarified by explicitly stating whether the gap is an absolute difference, KL divergence, or another metric, and by providing the exact formula used for the Bayesian posterior computation.
- [Figure 3] Figure 3 (or equivalent) showing update consistency across methods would benefit from error bars or confidence intervals to allow readers to assess the reliability of the 'nearly Bayesian' classification.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which help clarify the robustness and interpretability of our results on the information processing gap. We address each major comment below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [§3 and §4] §3 (Method) and §4 (Experiments): The information processing gap is defined by comparing a direct prompted posterior to a Bayesian posterior computed from separately elicited prior and likelihood. This comparison is load-bearing for the central claim that some updates are 'nearly Bayesian' while others are heuristic, yet the paper reports no systematic robustness checks across elicitation variants (e.g., different phrasings, order of evidence presentation, or temperature settings). Without such controls, the gap and the downstream performance advantage of heuristics could reflect prompt sensitivity rather than internal (in)consistency.
  Authors: We agree that systematic robustness checks are essential to ensure the information processing gap captures internal (in)consistencies rather than elicitation artifacts. While the original experiments used a consistent prompting protocol across models and tasks, we did not report exhaustive variations. In the revised manuscript, we will add a dedicated robustness subsection to §4 that includes: (i) multiple alternative phrasings for prior, likelihood, and posterior elicitation; (ii) variations in evidence presentation order; and (iii) results across temperatures from 0.0 to 1.0. We will quantify how the gap and downstream performance differences vary (or remain stable) under these conditions, with statistical summaries of sensitivity.
  Revision: yes
- Referee: [§4.3] §4.3 (Downstream task evaluation): The claim that non-Bayesian heuristic updates 'often outperform exact Bayesian computation' is central to the misspecification interpretation. However, the reported effect sizes and statistical significance are not broken down by task or model scale, and it is unclear whether the Bayesian baseline uses the same elicited quantities or an oracle prior; this makes it difficult to isolate whether the advantage stems from misspecification or from differences in how evidence is represented.
  Authors: We appreciate the request for greater transparency in the downstream evaluation. The Bayesian baseline throughout §4.3 is computed from the same elicited priors and likelihoods used for the prompted posteriors (not an oracle). To address the lack of breakdowns, the revision will expand §4.3 with per-task and per-scale tables reporting: mean performance differences, effect sizes (Cohen's d), and p-values from paired statistical tests. These will be disaggregated by task domain (e.g., medical, scientific) and model size, allowing clearer assessment of whether the heuristic advantage is consistent with misspecification.
  Revision: yes
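The per-task statistics named in this response (mean differences, Cohen's d, paired tests) could be computed along the following lines. This is a sketch under stated assumptions: it uses the paired-samples variant of Cohen's d (mean difference divided by the standard deviation of the differences, often written d_z), which may not be the variant the authors intend, and the scores are hypothetical. It reports the paired t statistic and degrees of freedom only, using the standard library.

```python
import math
import statistics

def paired_summary(heuristic_scores, bayes_scores):
    """Mean difference, paired Cohen's d (d_z), and paired t statistic."""
    if len(heuristic_scores) != len(bayes_scores):
        raise ValueError("paired samples must have equal length")
    diffs = [h - b for h, b in zip(heuristic_scores, bayes_scores)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two pairs")
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)   # sample standard deviation (n - 1)
    if sd_diff == 0:
        raise ValueError("all differences are identical; d_z is undefined")
    d_z = mean_diff / sd_diff
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    return {"mean_diff": mean_diff, "cohens_d_z": d_z, "t_stat": t_stat, "df": n - 1}

# Hypothetical per-item scores on one task (not results from the paper).
heuristic = [0.72, 0.65, 0.80, 0.70, 0.68]
bayes = [0.70, 0.60, 0.74, 0.69, 0.66]

summary = paired_summary(heuristic, bayes)
print(summary)
```

Turning the t statistic into a p-value requires the Student t distribution, which the standard library does not provide; SciPy's `scipy.stats.ttest_rel` performs the paired test directly.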
Circularity Check
Empirical evaluation of LLM belief updates with no self-referential derivations or fitted predictions
The paper frames its contribution as an empirical study introducing the 'information processing gap' to compare LLM evidence incorporation methods against exact Bayesian updates. No load-bearing mathematical derivations, parameter fits renamed as predictions, or self-citation chains appear in the abstract or described methodology. All central claims rest on experimental comparisons of prompted outputs versus computed baselines, which are externally falsifiable and do not reduce to the paper's own inputs by construction. This is the expected non-circular outcome for a purely evaluative work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: LLMs maintain internal probabilistic beliefs that can be accessed and compared to ideal Bayesian updates through prompting or other interfaces.
invented entities (1)
- information processing gap: no independent evidence
Lean theorems connected to this paper
- Cost.FunctionalEquation · washburn_uniqueness_aczel (matches)
  MATCHES: this paper passage directly uses, restates, or depends on the cited Recognition theorem or module.
  Δ(q) ≜ I_out − I_in = D_KL(q||p) ≥ 0 ... zero only when the post-data distribution is obtained via Bayes’ theorem (Zellner, 1988).
- Foundation.LogicAsFunctionalEquation · Translation Theorem / bilinear_family_forced (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  Δ(q) = D_KL(q||π) − I_LER ... the amount we move our beliefs ... should comport with how strong the evidence supports the hypothesis.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Reasoning as Pattern Matching: Shared Mechanisms in Human and LLM Everyday Reasoning
  Humans and LLMs exhibit similar error patterns in common-sense reasoning, consistent with shared pattern-matching mechanisms rather than abstract world models.