Anchored Confabulation: Partial Evidence Non-Monotonically Amplifies Confident Hallucination in LLMs
Pith reviewed 2026-05-13 22:09 UTC · model grok-4.3
The pith
One confirmed intermediate fact increases confident wrong answers in LLMs before full evidence corrects them.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Anchored confabulation occurs when a partial anchor commits the model to confident parametric completion of remaining reasoning steps. This is formalized as Parametric Hallucination Confidence (PHC). A causal injection experiment shows the rate rising from 0.613 to 0.656 before falling to 0.595 and 0.536. The effect scales with model capability across five families with Spearman rho of 0.900. The Anchoring Threshold Law k*(n)=floor(n/3) predicts the amplification by hop depth and is confirmed in four cases. A LearnedRouter using PHC closes 81.1 percent of the oracle gap on 1,800 queries across four benchmarks.
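As a quick sanity check on the numbers above, the sketch below tabulates the Anchoring Threshold Law k*(n)=floor(n/3) for small hop counts and locates the peak of the reported PHC sequence. It assumes the four reported values correspond to 0, 1, 2, and 3 injected facts in that order, which the summary implies but does not state. The function and variable names are illustrative and are not taken from the paper.

```python
def anchoring_threshold(n_hops: int) -> int:
    """Anchoring Threshold Law as stated in the paper: k*(n) = floor(n / 3)."""
    if n_hops < 0:
        raise ValueError("n_hops must be non-negative")
    return n_hops // 3


# PHC values reported for the causal injection experiment (N=160).
# Assumption: list index i is the number of injected confirmed facts.
phc_by_facts = [0.613, 0.656, 0.595, 0.536]

# Index of the largest PHC value, and its rise over the no-evidence baseline.
peak_k = max(range(len(phc_by_facts)), key=phc_by_facts.__getitem__)
spike = phc_by_facts[peak_k] - phc_by_facts[0]

# Non-monotonic means the sequence both rises and falls somewhere.
steps = [b - a for a, b in zip(phc_by_facts, phc_by_facts[1:])]
is_non_monotonic = any(s > 0 for s in steps) and any(s < 0 for s in steps)

for n in range(1, 7):
    print(f"n={n} hops -> k*(n)={anchoring_threshold(n)}")
print(f"peak at k={peak_k}, rise={spike:.3f}, non-monotonic={is_non_monotonic}")
```

Under this reading the peak sits at one injected fact, which is the value the law returns for n = 3, 4, and 5.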
What carries the argument
Parametric Hallucination Confidence (PHC): the non-monotonic rise in confident errors triggered when a single confirmed intermediate fact anchors the model to complete the chain from its parameters.
If this is right
- The Anchoring Threshold Law predicts the size of the PHC amplification from the number of hops remaining in the chain.
- A LearnedRouter that exploits PHC closes 81.1 percent of the gap to oracle performance on RAG tasks across four benchmarks.
- An epistemic humility prompt lowers the PHC spike by 0.118.
- Explicit self-rating of confidence serves as a stronger routing signal than lexical confidence measures.
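The "oracle gap closed" figure is usually computed as the share of the baseline-to-oracle difference that a router recovers. The sketch below shows that arithmetic. It assumes the paper uses this standard definition, and the macro-F1 values in the example are hypothetical numbers chosen only to illustrate the calculation.

```python
def oracle_gap_closed(router_f1: float, baseline_f1: float, oracle_f1: float) -> float:
    """Fraction of the baseline-to-oracle gap recovered by the router."""
    gap = oracle_f1 - baseline_f1
    if gap <= 0:
        raise ValueError("oracle must score above the baseline")
    return (router_f1 - baseline_f1) / gap


# Hypothetical macro-F1 scores, not taken from the paper.
example = oracle_gap_closed(router_f1=0.45, baseline_f1=0.30, oracle_f1=0.50)
print(f"gap closed: {example:.1%}")
```

A value of 1.0 means the router matches the oracle, 0.0 means it is no better than the baseline, and negative values mean it is worse than the baseline.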
Where Pith is reading between the lines
- The same anchoring mechanism may produce overconfident completions in sequential tasks such as code generation or step-by-step planning.
- Routing systems could use self-reported confidence to decide when retrieved facts are sufficient to override parametric knowledge.
- The effect may appear in multimodal models when a single verified visual or textual anchor is supplied mid-reasoning.
- Testing whether the threshold law holds for chains longer than those examined would extend the current predictions.
Load-bearing premise
The PHC spike is caused by the model treating the partial fact as a confirmed anchor that triggers parametric completion rather than by prompt formatting, token position, or other surface features.
What would settle it
An experiment that changes only the semantic content of the injected fact while holding all surface formatting and token positions fixed and finds no PHC spike would falsify the anchoring account.
Original abstract
We identify a previously unknown calibration property of large language models: providing one confirmed intermediate fact toward a multi-step reasoning chain increases the model's confident-wrong-answer rate before full evidence eliminates it. We call this anchored confabulation: a partial anchor commits the model to confident parametric completion of remaining reasoning steps. We formalize it as Parametric Hallucination Confidence (PHC) and establish it across six lines of evidence including a causal injection experiment (PHC 0.613 to 0.656 to 0.595 to 0.536, N=160) and capability scaling across five model families (Spearman rho=0.900, p=0.037). The Anchoring Threshold Law k*(n)=floor(n/3) predicts PHC amplification by hop depth with four confirmed predictions. Applied to RAG routing, a LearnedRouter exploiting PHC closes 81.1% of the oracle performance gap (macro F1=0.426, p<1e-6) on 1,800 queries across four benchmarks with no model fine-tuning and 50x fewer labels than prior RL-based work. An epistemic humility prompt reduces the PHC spike by -0.118; explicit self-rating (PHC=0.684, p<0.001) outperforms lexical confidence as a routing signal.
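With only five model families, the reported scaling statistic can be reproduced by hand. The sketch below computes Spearman's rho from untied ranks and a two-sided p-value from the common t approximation with n-2 degrees of freedom. The rank vectors are hypothetical (the paper's per-family data is not shown here), and the paper's use of the t approximation is an assumption.

```python
import math


def spearman_rho_from_ranks(rank_x, rank_y):
    """Spearman's rho for untied ranks: 1 - 6 * sum(d^2) / (n * (n^2 - 1))."""
    n = len(rank_x)
    if n != len(rank_y) or n < 3:
        raise ValueError("need two rank vectors of equal length >= 3")
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))


def two_sided_p_df3(rho):
    """Two-sided p-value via the t approximation, valid only for n = 5 (df = 3).

    Uses the closed-form Student-t CDF for 3 degrees of freedom:
    F(t) = 1/2 + (1/pi) * (x / (1 + x^2) + atan(x)), with x = t / sqrt(3).
    """
    t = abs(rho) * math.sqrt(3 / (1 - rho * rho))
    x = t / math.sqrt(3)
    cdf = 0.5 + (x / (1 + x * x) + math.atan(x)) / math.pi
    return 2 * (1 - cdf)


# Hypothetical ranks: capability order vs. PHC-spike order, one adjacent swap.
capability_rank = [1, 2, 3, 4, 5]
phc_rank = [2, 1, 3, 4, 5]

rho = spearman_rho_from_ranks(capability_rank, phc_rank)
p_value = two_sided_p_df3(rho)
print(f"rho={rho:.3f}, p={p_value:.4f}")
```

For n = 5 without ties, rho = 0.900 corresponds to a sum of squared rank differences of 2, that is, a single swap of two adjacent ranks, so the correlation rests on a very small sample.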
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to identify a new calibration property in LLMs termed anchored confabulation, in which providing one confirmed intermediate fact toward a multi-step reasoning chain non-monotonically increases the model's confident-wrong-answer rate (measured as Parametric Hallucination Confidence, PHC) before full evidence eliminates it. This is formalized via the Anchoring Threshold Law k*(n)=floor(n/3) and supported by a causal injection experiment (PHC sequence 0.613→0.656→0.595→0.536, N=160), a scaling correlation across five model families (Spearman rho=0.900, p=0.037), four confirmed predictions of the threshold law, and a LearnedRouter application that closes 81.1% of the oracle gap (macro F1=0.426, p<1e-6) on 1,800 queries without fine-tuning.
Significance. If the central non-monotonic effect is robustly isolated to the confirmed-anchor mechanism, the result would offer a concrete, testable account of how partial evidence triggers overconfident parametric completion in LLMs, with immediate implications for RAG routing and calibration. The reported routing gains and cross-family scaling correlation constitute reproducible, falsifiable strengths that could inform practical systems; however, the ad-hoc character of the threshold law limits its theoretical contribution.
major comments (2)
- [Causal injection experiment] Causal injection experiment (abstract and associated results): the reported PHC rise from 0.613 to 0.656 is attributed to the model treating the injected fact as confirmed evidence that triggers parametric completion, yet the manuscript provides no verification (e.g., follow-up probe questions or acceptance checks) that the model actually registers the fact as true rather than responding to token position, lexical framing, or hypothetical status. This verification is load-bearing for the central claim.
- [Anchoring Threshold Law] Anchoring Threshold Law (k*(n)=floor(n/3), results section): the functional form is stated to predict amplification and is then confirmed by four predictions, but the divisor of 3 appears selected to match the hop-depth data rather than derived from an independent principle, creating circularity that weakens the claim of predictive confirmation.
minor comments (2)
- [Abstract] The exact operational definition of PHC (including how confident-wrong answers are scored and how the sequence of partial evidence is constructed) is referenced but not restated in the abstract; a concise inline definition would improve accessibility.
- [RAG routing results] The routing experiment reports p<1e-6 but does not specify the exact statistical test or correction for multiple comparisons across the four benchmarks; adding this detail would strengthen reproducibility.
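On the multiple-comparisons point, one standard option is Holm's step-down procedure, sketched below. This illustrates the kind of correction the comment asks for and is not a description of what the paper did. The per-benchmark p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm step-down correction.

    Returns one reject flag per hypothesis, in the same order as the input.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, idx in enumerate(order):
        # The smallest p-value is tested at alpha/m, the next at alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - step):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values also fail
    return reject


# Hypothetical p-values for four benchmarks, not taken from the paper.
per_benchmark_p = [0.001, 0.04, 0.03, 0.2]
decisions = holm_reject(per_benchmark_p)
print(decisions)
```

Holm's procedure controls the family-wise error rate and is never less powerful than a plain Bonferroni correction, which makes it a reasonable default for a small number of benchmarks.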
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which help clarify the presentation of our results. We address each major comment below and indicate the revisions we will make.
Point-by-point responses
- Referee: [Causal injection experiment] Causal injection experiment (abstract and associated results): the reported PHC rise from 0.613 to 0.656 is attributed to the model treating the injected fact as confirmed evidence that triggers parametric completion, yet the manuscript provides no verification (e.g., follow-up probe questions or acceptance checks) that the model actually registers the fact as true rather than responding to token position, lexical framing, or hypothetical status. This verification is load-bearing for the central claim.
  Authors: We agree that direct verification would strengthen the causal claim. In the revised version, we will add follow-up probe questions after the injection to confirm that the model accepts the anchor as true. This will rule out alternative explanations such as lexical framing or token position effects. We have already conducted preliminary probes showing acceptance rates above 85%, which we will report.
  Revision: yes
- Referee: [Anchoring Threshold Law] Anchoring Threshold Law (k*(n)=floor(n/3), results section): the functional form is stated to predict amplification and is then confirmed by four predictions, but the divisor of 3 appears selected to match the hop-depth data rather than derived from an independent principle, creating circularity that weakens the claim of predictive confirmation.
  Authors: The threshold law is indeed empirical, fitted to the observed non-monotonic pattern in our hop-depth experiments. However, the four confirmed predictions include tests on held-out data and different model scales, providing evidence of generalizability beyond the fitting data. We will revise the text to explicitly state that the law is an empirical finding rather than a theoretically derived formula, and we will discuss potential cognitive or architectural motivations for the 1/3 factor in the discussion section.
  Revision: partial
Circularity Check
The Anchoring Threshold Law k*(n)=floor(n/3) is presented as a predictive law, but its functional form matches the hop-depth data by construction.
specific steps
- fitted input called prediction
[Abstract]
"The Anchoring Threshold Law k*(n)=floor(n/3) predicts PHC amplification by hop depth with four confirmed predictions."
The law is invoked to predict amplification, yet its exact functional form floor(n/3) is not derived from first principles or external axioms; it directly encodes the hop-depth thresholds at which the paper's own PHC measurements show the non-monotonic spike, rendering the four 'confirmed predictions' statistically forced by the same data used to select the form.
full rationale
The paper's central formalization introduces the Anchoring Threshold Law as predicting PHC amplification by hop depth, with four confirmed predictions. The specific form floor(n/3) has no independent derivation shown and aligns exactly with the observed non-monotonic spike pattern (PHC rising then falling at partial evidence points), making the 'predictions' reduce to post-hoc fitting of the same empirical observations. The causal injection results (PHC 0.613→0.656→0.595→0.536) and scaling correlations remain independent of this law and do not rely on self-citation chains. No other load-bearing steps reduce to inputs by definition. This yields moderate circularity confined to the law's status as a derived result.
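The difference between a fitted form and a confirmed prediction can be made concrete. In the sketch below, a divisor d is chosen to match observed spike positions on one set of hop depths and then scored on a disjoint set. All observations are synthetic and exist only to show the procedure. Nothing here reproduces the paper's data or its four predictions.

```python
def agreement(d, observations):
    """Share of (hop depth n -> observed spike position k) pairs where floor(n/d) == k."""
    return sum(n // d == k for n, k in observations.items()) / len(observations)


def fit_divisor(observations, candidates=range(2, 7)):
    """Pick the candidate divisor with the highest in-sample agreement."""
    return max(candidates, key=lambda d: agreement(d, observations))


# Synthetic observations (hop depth -> spike position), purely illustrative.
fit_set = {3: 1, 6: 2}
held_out = {4: 1, 9: 3}

d_hat = fit_divisor(fit_set)
in_sample = agreement(d_hat, fit_set)      # high by construction, not evidence
out_of_sample = agreement(d_hat, held_out)  # the only score that tests the law
print(d_hat, in_sample, out_of_sample)
```

Only the held-out score carries evidential weight. Agreement on the fitting set is guaranteed by the selection step, which is the circularity flagged above.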
Axiom & Free-Parameter Ledger
free parameters (1)
- Anchoring Threshold Law divisor
axioms (1)
- domain assumption: Models treat a single confirmed intermediate fact as a reliable anchor that triggers parametric completion of remaining steps
invented entities (1)
- Parametric Hallucination Confidence (PHC): no independent evidence
Reference graph
Works this paper leans on
-
[1]
$a^{(k)}_1, \ldots, a^{(k)}_K$ are sampled at temperature $\tau = 0.85$
$\sum_{i<j} \mathrm{F1}\big(a^{(k)}_i, a^{(k)}_j\big)$ (3), where $a^{(k)}_1, \ldots, a^{(k)}_K$ are sampled at temperature $\tau = 0.85$. SSC is a model-internal certainty measure independent of lexical hedge phrases, prompt phrasing, or calibration training. High SSC means the model is committed to a specific completion; low SSC means it is genuinely uncertain. Results (N=160, K=3, GPT-4o). Oracle ...
-
[2]
Partial retrieval is the worst operating point
All-or-nothing retrieval: either retrieve sufficient context to exceed k∗(n) on the first pass, or retrieve nothing and admit uncertainty. Partial retrieval is the worst operating point
-
[3]
PHC-calibrated confidence: scale expressed confidence by $(1 - \mathrm{PHC}_{\mathrm{risk}}(n, k))$, where $\mathrm{PHC}_{\mathrm{risk}}(n, k) = \mathbb{1}[k = k^*(n)] \cdot \hat{\gamma}$ and $\hat{\gamma}$ is estimated from a small calibration set. We evaluate this calibration via the Epistemic Humility Prompt (Section G).
B.10 IRCoT Per-Iteration PHC: Completed Test
The Anchoring Threshold Law predicts a structural vulnerability for ite...
-
[4]
Signals must measure retrieval adequacy, not answer surface
-
[5]
For single-hop questions, passage relevance is a good proxy for retrieval adequacy (GraphRAG helps when top-k passages miss the answer)
-
[6]
For multi-hop questions, passage relevance is insufficient: individually relevant passages may collectively lack the bridge facts connecting entity chains
-
[7]
Bridge entity detection (Section 6) addresses finding (3)
This motivates type-aware signals: one signal for retrieval failure (Type C), another for reasoning gap (Type A/B). Our Grounded Self-Rating (Section 5) addresses finding (2) directly. Bridge entity detection (Section 6) addresses finding (3). The formal condition for Theorem 3’s bound to be achievable in practice is that U must capture the structure of t...
- [8]
-
[9]
Specificity (w = 0.25): fraction of named entities and numerals in the answer (low specificity ⇒ uncertain)
-
[10]
Reasoning struggle (w = 0.20): phrases like “I cannot determine”, “based on the context”, “it’s unclear”
-
[11]
Length anomaly (w = 0.10): very short or very long answers relative to dataset mean
5. Entity coverage (w = 0.15): fraction of question entities mentioned in the answer
Preprint. Under review.
Final confidence = $\sum_i w_i \cdot s_i$, thresholded at $\tau = 0.65$ to trigger escalation.
Why do lexical signals fail? Lexical features measure how the model expresses uncertainty, not...
-
[12]
Generate K = 3 answers using only the retrieved passages (passage-restricted, with explicit instruction not to use external knowledge)
-
[13]
Measure pairwise token-F1 agreement between all (K
-
[14]
cannot determine from passages
SSC confidence = mean pairwise token-F1
Passage-restriction is critical. Without passage restriction, the model uses parametric knowledge, producing highly consistent (but potentially wrong) answers even when passages are insufficient. A question about a bridge entity in HotpotQA generates the same confident answer three times at temperature T = 0.9 because...
-
[15]
This comparison is fully fair: same protocol, same data, only feature set differs
Combined GB beats pre-gen GB (5-fold): ∆ = +0.024, 95% CI [+0.012, +0.036], p < 0.0001. This comparison is fully fair: same protocol, same data, only feature set differs. Post-gen signals are the source of improvement
-
[16]
Question semantics alone cannot predict when GraphRAG will outperform VanillaRAG
BGE embedding router (768-dim) achieves the same F1 as HybridRouter (0.295), confirming that richer pre-gen representations do not close the gap. Question semantics alone cannot predict when GraphRAG will outperform VanillaRAG
-
[17]
pre-gen GB), directly validating Theorem 3 in downstream F1
Post-gen alone (0.314) beats all pre-gen baselines (p = 0.0285 vs. pre-gen GB), directly validating Theorem 3 in downstream F1
-
[18]
MuSiQue: the largest gain (+0.090 vs. HybridRouter). Query features cannot distinguish answerable from unanswerable multi-hop questions before generation; post-gen signals observe the model’s actual reasoning failure
-
[19]
Who is the director of the film starring Actor X?
2Wiki: LearnedRouter −0.041 vs. HybridRouter, where over-escalation of comparison questions hurts; type-aware suppression is the fix (Section F.20). Oracle gap closed: 45.2% (Combined GB) vs. 36.7% (pre-gen GB, 5-fold OOF) vs. 35.1% (HybridRouter). The post-gen combined signal closes 8.5 percentage points more of the oracle gap than the strongest fairly...
-
[20]
With retrieval (VanillaRAG): standard top-k = 5 ChromaDB passages (existing results)
-
[21]
No retrieval (pure parametric): identical prompt with passages removed. The model answers from parametric knowledge only. Result: retrieval amplifies PHC inversion by differentiating the two groups. Retrieved passages reduce confidence in both groups ($\Delta < 0$, $p < 10^{-6}$ for both, paired t-test), but with a critical asymmetry: the reduction is larger for oracle-Fa...
-
[22]
At 2-hop, passages cover the chain (HotpotQA bridge, 2Wiki: n.s.); at 3-hop, they do not
Multi-hop chain confabulation (MuSiQue 3-hop): 3-hop bridge chains are dense enough in parametric memory to fabricate confidently but exceed typical retrieved-passage coverage. At 2-hop, passages cover the chain (HotpotQA bridge, 2Wiki: n.s.); at 3-hop, they do not
-
[23]
Is X taller/older/faster than Y?
Comparative judgment confabulation (HotpotQA comparison): Questions of the form “Is X taller/older/faster than Y?” require explicit comparison that retrieved passages rarely state directly. The model draws on parametric comparative priors and expresses the result with high confidence, whether or not the prior is correct. Both mechanisms instantiate the same...
-
[24]
Context integration loss: Re-generation from flat KG text fails to leverage graph-structured relationships. Direct routing uses the GraphRAG pipeline’s structured prompt (which already formats KG paths and entity links explicitly). Implementation. ReasonRAGPipeline(direct_routing=True) in src/reason_rag.py: when should_escalate=True, call self.graph.run(q...
-
[25]
For 4-hop MuSiQue questions, a single sub-question captures only one hop, missing 2–3 required hops
Sub-question extraction loss. ReasonRAG generates a sub-question from the uncertain initial answer. For 4-hop MuSiQue questions, a single sub-question captures only one hop, missing 2–3 required hops. The most immediate fix: drop sub-question extraction; re-query GraphRAG with the original question directly (this is exactly what direct routing evaluates in...
-
[26]
Rate your confidence [CONFIDENCE: X/5]
Context integration failure. The re-generation prompt receives KG passages as flat text, failing to leverage graph-structured relationships (paths, entity links). Structured prompting that encodes the KG path explicitly may improve multi-hop answer synthesis. The direct routing result (Section E.4) demonstrates that fixing bottleneck (1), dropping sub-quest...
-
[27]
Better escalation signals (our Grounded Self-Rating, Section 5) help route correctly but have limited impact on macro F1 given current re-generation quality
-
[28]
The Expected Calibration Error (ECE) is 0.1069, indicating moderate miscalibration
Better re-generation (full GraphRAG pipeline reuse, multi-hop sub-question chaining) is the primary lever for macro F1 improvement
H.4 Calibration of Confidence Scores
Figure 5 shows reliability diagrams for ReasonRAG confidence scores across datasets. The Expected Calibration Error (ECE) is 0.1069, indicating moderate miscalibration. Confidence scores...
-
[29]
The maximum achievable gain from routing is +0.285 F1 (macro), assuming perfect escalation and perfect regeneration
-
[30]
The lexical baseline gap closed is 10.4%, leaving 89.6% of potential improvement on the table
-
[31]
ReasonRAG cascaded (10.4%) reflects the re-generation quality bottleneck, not routing quality
The pre-generation baseline (HybridRouter) closes approximately 35.1% of the oracle gap (direct routing), better than ReasonRAG lexical despite using only query features; the gap vs. ReasonRAG cascaded (10.4%) reflects the re-generation quality bottleneck, not routing quality
-
[32]
Both escalation and regeneration must improve to approach the oracle; improving one without the other yields diminishing returns.
These findings motivate our Grounded Self-Rating signal (Section 5): a single additional LLM call that directly measures retrieval adequacy (the root cause of escalation need) rather than answer style.
I DSPy Integration
ReasonRAG ca...
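The self-consistency confidence quoted in entries [1] and [12] to [14] is described as the mean pairwise token-F1 over K sampled answers. A minimal sketch of that computation follows. Whitespace tokenization, lowercasing, and the handling of empty answers are assumptions (the excerpts do not specify them), and the sample answers are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def token_f1(a: str, b: str) -> float:
    """Token-level F1 between two answers (lowercased whitespace tokens)."""
    tokens_a, tokens_b = a.lower().split(), b.lower().split()
    if not tokens_a or not tokens_b:
        # Assumption: two empty answers agree fully, otherwise no agreement.
        return float(tokens_a == tokens_b)
    overlap = sum((Counter(tokens_a) & Counter(tokens_b)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(tokens_a)
    recall = overlap / len(tokens_b)
    return 2 * precision * recall / (precision + recall)


def ssc(answers):
    """Mean pairwise token-F1 across all unordered pairs of sampled answers."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        raise ValueError("need at least two sampled answers")
    return sum(token_f1(x, y) for x, y in pairs) / len(pairs)


# Invented K = 3 samples for one question.
samples = ["Paris France", "Paris France", "Paris"]
score = ssc(samples)
print(f"SSC = {score:.3f}")
```

As the excerpt in entry [14] notes, a high score only indicates that the samples agree with each other. Without passage restriction, the agreement can reflect a consistent parametric answer that is wrong.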