
arxiv: 2604.25931 · v1 · submitted 2026-04-02 · 💻 cs.CL

Anchored Confabulation: Partial Evidence Non-Monotonically Amplifies Confident Hallucination in LLMs

Pith reviewed 2026-05-13 22:09 UTC · model grok-4.3

classification 💻 cs.CL
keywords anchored confabulation · parametric hallucination confidence · LLM confidence calibration · multi-step reasoning · RAG routing · hallucination amplification · partial evidence

The pith

One confirmed intermediate fact increases confident wrong answers in LLMs before full evidence corrects them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models become more likely to output a confidently wrong answer when given just one confirmed fact partway through a multi-step reasoning chain. This effect appears before additional facts arrive and override the error. The paper measures the rise as Parametric Hallucination Confidence and traces it to the model using its internal parameters to complete the remaining steps. The pattern holds across multiple experiments and model families and follows a simple rule based on chain length. The finding is used to route retrieval-augmented queries more effectively without retraining the model.

Core claim

Anchored confabulation occurs when a partial anchor commits the model to confident parametric completion of remaining reasoning steps. This is formalized as Parametric Hallucination Confidence (PHC). A causal injection experiment shows the rate rising from 0.613 to 0.656 before falling to 0.595 and 0.536. The effect scales with model capability across five families with Spearman rho of 0.900. The Anchoring Threshold Law k*(n)=floor(n/3) predicts the amplification by hop depth and is confirmed in four cases. A LearnedRouter using PHC closes 81.1 percent of the oracle gap on 1,800 queries across four benchmarks.
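
The threshold law quoted above is simple enough to tabulate. The following is a minimal sketch, not the paper's code: the function name `anchoring_threshold` is hypothetical, and reading k*(n) as the number of confirmed facts at which amplification is predicted for an n-hop chain is an assumption of this sketch.

```python
def anchoring_threshold(n_hops: int) -> int:
    """Return k*(n) = floor(n / 3), the formula stated in the abstract.

    Assumption: n_hops is the hop depth of the reasoning chain and the
    result is read as a count of confirmed intermediate facts.
    """
    if n_hops < 1:
        raise ValueError("n_hops must be a positive integer")
    return n_hops // 3


# Tabulate the law for a few hop depths (illustrative range only).
thresholds = {n: anchoring_threshold(n) for n in range(2, 7)}
print(thresholds)  # → {2: 0, 3: 1, 4: 1, 5: 1, 6: 2}
```

Under this formula a 2-hop chain has threshold 0 and the threshold first reaches 1 at 3 hops.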

What carries the argument

Parametric Hallucination Confidence (PHC): the non-monotonic rise in confident errors triggered when a single confirmed intermediate fact anchors the model to complete the chain from its parameters.

If this is right

  • The Anchoring Threshold Law k*(n)=floor(n/3) predicts PHC amplification from the hop depth n of the reasoning chain.
  • A LearnedRouter that exploits PHC closes 81.1 percent of the gap to oracle performance on RAG tasks across four benchmarks.
  • An epistemic humility prompt lowers the PHC spike by 0.118.
  • Explicit self-rating of confidence serves as a stronger routing signal than lexical confidence measures.
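
The "gap to oracle" figure in the list above is a ratio, and a short sketch makes the arithmetic concrete. This assumes the conventional definition (router gain over a baseline divided by oracle gain over the same baseline); the summary does not restate the paper's exact definition or baseline, and the numbers below are hypothetical, not the paper's results.

```python
def oracle_gap_closed(baseline: float, router: float, oracle: float) -> float:
    """Fraction of the baseline-to-oracle gap recovered by a router.

    Assumption: the conventional definition
    (router - baseline) / (oracle - baseline).
    """
    if oracle <= baseline:
        raise ValueError("oracle score must exceed the baseline score")
    return (router - baseline) / (oracle - baseline)


# Hypothetical macro-F1 scores, chosen only to illustrate the calculation.
print(round(oracle_gap_closed(baseline=0.20, router=0.36, oracle=0.40), 3))  # → 0.8
```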

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same anchoring mechanism may produce overconfident completions in sequential tasks such as code generation or step-by-step planning.
  • Routing systems could use self-reported confidence to decide when retrieved facts are sufficient to override parametric knowledge.
  • The effect may appear in multimodal models when a single verified visual or textual anchor is supplied mid-reasoning.
  • Testing whether the threshold law holds for chains longer than those examined would extend the current predictions.

Load-bearing premise

The PHC spike is caused by the model treating the partial fact as a confirmed anchor that triggers parametric completion rather than by prompt formatting, token position, or other surface features.

What would settle it

An experiment that compares a true anchor against a control differing only in semantic content, with all surface formatting and token positions held fixed, and finds no difference in PHC between the two would falsify the anchoring account.
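
One way such a matched-control experiment could be analysed is an exact paired sign-flip test. The sketch below is an illustration under stated assumptions, not the paper's method: it assumes each question yields a 0/1 confident-wrong indicator in the anchor condition and in a surface-matched control, and the function name and data are hypothetical.

```python
from itertools import product


def sign_flip_p_value(diffs):
    """Exact two-sided sign-flip test on paired integer differences.

    diffs[i] is (anchor indicator - control indicator) for question i,
    so each entry is -1, 0, or 1. Enumerates all 2**n sign patterns.
    """
    if len(diffs) > 20:
        raise ValueError("exact enumeration is only practical for n <= 20")
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            extreme += 1
    return extreme / total


# Hypothetical paired differences for eight questions.
example_diffs = [1, 1, 1, 1, 0, 0, 1, 1]
print(sign_flip_p_value(example_diffs))  # → 0.03125
```

A large p-value in a well-powered version of this comparison would suggest the spike does not depend on the anchor's content.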

Figures

Figures reproduced from arXiv: 2604.25931 by Ashish Balkishan Lathkar.

Figure 1. Cost-accuracy Pareto frontier. ReasonRAG+Rating achieves higher accuracy than …
Figure 2. Threshold ablation: macro F1 vs. escalation threshold
Figure 3. PHC_k = AUC(conf → oracle_k) by reasoning hop depth. Non-monotone inverted-U shape peaks at 3-hop (0.702, p < 10^-5, ***). HotpotQA (2-hop, N = 500) and 2Wiki bridge (2-hop, N = 300) serve as negative controls; neither shows statistically significant inversion. MuSiQue 4-hop falls back to 0.634 (*), confirming the peak is at the "confabulation sweet spot" where parametric chain depth matches LLM memorization…
Figure 4. PHC_3 vs. Chatbot Arena ELO capability proxy. More capable models tend to confabulate more confidently at 3-hop bridge depth. Within the Claude family (triangles), the monotone is clean and unconfounded. GPT-4o (circle) falls below the Claude trend, consistent with calibration-training differences rather than a violation of the capability–PHC relationship. Linear fit R^2 = 0.70. Spearman ρ = 0.900 (p = 0.03…
Figure 5. Reliability diagrams for ReasonRAG confidence scores across four datasets. ECE …
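
The Figure 3 caption expresses PHC_k as an AUC of confidence against an oracle label. A minimal sketch of a rank-based AUC follows. It is an assumption that the oracle labels are binary and that ties count as half a win; the caption does not say how the labels are built, and the data shown are hypothetical.

```python
def auc(confidences, labels):
    """Probability that a positive item outranks a negative one by confidence.

    Equivalent to the area under the ROC curve; ties count as 0.5.
    Assumption: labels are 0/1 and stand in for the caption's oracle_k.
    """
    pos = [c for c, y in zip(confidences, labels) if y == 1]
    neg = [c for c, y in zip(confidences, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical confidences and labels for four answers.
print(auc([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))  # → 0.75
```

A value of 0.5 means confidence carries no information about the label; values above 0.5 mean higher confidence goes with the positive class.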
Original abstract

We identify a previously unknown calibration property of large language models: providing one confirmed intermediate fact toward a multi-step reasoning chain increases the model's confident-wrong-answer rate before full evidence eliminates it. We call this anchored confabulation: a partial anchor commits the model to confident parametric completion of remaining reasoning steps. We formalize it as Parametric Hallucination Confidence (PHC) and establish it across six lines of evidence including a causal injection experiment (PHC 0.613 to 0.656 to 0.595 to 0.536, N=160) and capability scaling across five model families (Spearman rho=0.900, p=0.037). The Anchoring Threshold Law k*(n)=floor(n/3) predicts PHC amplification by hop depth with four confirmed predictions. Applied to RAG routing, a LearnedRouter exploiting PHC closes 81.1% of the oracle performance gap (macro F1=0.426, p<1e-6) on 1,800 queries across four benchmarks with no model fine-tuning and 50x fewer labels than prior RL-based work. An epistemic humility prompt reduces the PHC spike by -0.118; explicit self-rating (PHC=0.684, p<0.001) outperforms lexical confidence as a routing signal.
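
The capability-scaling result rests on a Spearman correlation over five model families, so it helps to see how coarse that statistic is at n = 5. The sketch below uses the standard no-ties formula, rho = 1 - 6·Σd² / (n(n² - 1)). The scores are hypothetical placeholders, not the paper's measurements.

```python
def ranks(values):
    """Rank values from 1 (smallest) upward. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for r, i in enumerate(order, start=1):
        result[i] = r
    return result


def spearman_rho(x, y):
    """Spearman rank correlation via the no-ties formula."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))


# Hypothetical capability and PHC scores: one adjacent pair is out of order.
capability = [1, 2, 3, 4, 5]
phc = [0.1, 0.2, 0.4, 0.3, 0.5]
print(round(spearman_rho(capability, phc), 3))  # → 0.9
```

With five untied points, rho = 0.900 corresponds to Σd² = 2, which is exactly one swap of two adjacent ranks away from a perfectly monotone ordering.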

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to identify a new calibration property in LLMs termed anchored confabulation, in which providing one confirmed intermediate fact toward a multi-step reasoning chain non-monotonically increases the model's confident-wrong-answer rate (measured as Parametric Hallucination Confidence, PHC) before full evidence eliminates it. This is formalized via the Anchoring Threshold Law k*(n)=floor(n/3) and supported by a causal injection experiment (PHC sequence 0.613→0.656→0.595→0.536, N=160), a scaling correlation across five model families (Spearman rho=0.900, p=0.037), four confirmed predictions of the threshold law, and a LearnedRouter application that closes 81.1% of the oracle gap (macro F1=0.426, p<1e-6) on 1,800 queries without fine-tuning.

Significance. If the central non-monotonic effect is robustly isolated to the confirmed-anchor mechanism, the result would offer a concrete, testable account of how partial evidence triggers overconfident parametric completion in LLMs, with immediate implications for RAG routing and calibration. The reported routing gains and cross-family scaling correlation constitute reproducible, falsifiable strengths that could inform practical systems; however, the ad-hoc character of the threshold law limits its theoretical contribution.

major comments (2)
  1. [Causal injection experiment] Causal injection experiment (abstract and associated results): the reported PHC rise from 0.613 to 0.656 is attributed to the model treating the injected fact as confirmed evidence that triggers parametric completion, yet the manuscript provides no verification (e.g., follow-up probe questions or acceptance checks) that the model actually registers the fact as true rather than responding to token position, lexical framing, or hypothetical status. This verification is load-bearing for the central claim.
  2. [Anchoring Threshold Law] Anchoring Threshold Law (k*(n)=floor(n/3), results section): the functional form is stated to predict amplification and is then confirmed by four predictions, but the divisor of 3 appears selected to match the hop-depth data rather than derived from an independent principle, creating circularity that weakens the claim of predictive confirmation.
minor comments (2)
  1. [Abstract] The exact operational definition of PHC (including how confident-wrong answers are scored and how the sequence of partial evidence is constructed) is referenced but not restated in the abstract; a concise inline definition would improve accessibility.
  2. [RAG routing results] The routing experiment reports p<1e-6 but does not specify the exact statistical test or correction for multiple comparisons across the four benchmarks; adding this detail would strengthen reproducibility.
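
The correction requested in the second minor comment could take several forms. As one illustration, the sketch below applies the Holm-Bonferroni step-down procedure to four per-benchmark p-values. This is a swapped-in standard technique, not something the paper is known to use, and the p-values are hypothetical.

```python
def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni step-down procedure.

    Returns a list of booleans, aligned with p_values, marking which
    hypotheses are rejected at family-wise error rate alpha.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, i in enumerate(order):
        # Compare the (step+1)-th smallest p-value with alpha / (m - step).
        if p_values[i] <= alpha / (m - step):
            reject[i] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Hypothetical p-values for four benchmarks.
print(holm_reject([0.001, 0.04, 0.03, 0.012]))  # → [True, False, False, True]
```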

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which help clarify the presentation of our results. We address each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Causal injection experiment] Causal injection experiment (abstract and associated results): the reported PHC rise from 0.613 to 0.656 is attributed to the model treating the injected fact as confirmed evidence that triggers parametric completion, yet the manuscript provides no verification (e.g., follow-up probe questions or acceptance checks) that the model actually registers the fact as true rather than responding to token position, lexical framing, or hypothetical status. This verification is load-bearing for the central claim.

    Authors: We agree that direct verification would strengthen the causal claim. In the revised version, we will add follow-up probe questions after the injection to confirm that the model accepts the anchor as true. This will rule out alternative explanations such as lexical framing or token position effects. We have already conducted preliminary probes showing acceptance rates above 85%, which we will report.
    revision: yes

  2. Referee: [Anchoring Threshold Law] Anchoring Threshold Law (k*(n)=floor(n/3), results section): the functional form is stated to predict amplification and is then confirmed by four predictions, but the divisor of 3 appears selected to match the hop-depth data rather than derived from an independent principle, creating circularity that weakens the claim of predictive confirmation.

    Authors: The threshold law is indeed empirical, fitted to the observed non-monotonic pattern in our hop-depth experiments. However, the four confirmed predictions include tests on held-out data and different model scales, providing evidence of generalizability beyond the fitting data. We will revise the text to explicitly state that the law is an empirical finding rather than a theoretically derived formula, and we will discuss potential cognitive or architectural motivations for the 1/3 factor in the discussion section.
    revision: partial

Circularity Check

1 step flagged

Anchoring Threshold Law k*(n)=floor(n/3) presented as predictive law but functional form matches hop-depth data by construction

specific steps
  1. fitted input called prediction [Abstract]
    "The Anchoring Threshold Law k*(n)=floor(n/3) predicts PHC amplification by hop depth with four confirmed predictions."

    The law is invoked to predict amplification, yet its exact functional form floor(n/3) is not derived from first principles or external axioms; it directly encodes the hop-depth thresholds at which the paper's own PHC measurements show the non-monotonic spike, rendering the four 'confirmed predictions' statistically forced by the same data used to select the form.

full rationale

The paper's central formalization introduces the Anchoring Threshold Law as predicting PHC amplification by hop depth, with four confirmed predictions. The specific form floor(n/3) has no independent derivation shown and aligns exactly with the observed non-monotonic spike pattern (PHC rising then falling at partial evidence points), making the 'predictions' reduce to post-hoc fitting of the same empirical observations. The causal injection results (PHC 0.613→0.656→0.595→0.536) and scaling correlations remain independent of this law and do not rely on self-citation chains. No other load-bearing steps reduce to inputs by definition. This yields moderate circularity confined to the law's status as a derived result.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that partial anchors trigger parametric completion and on the empirically observed threshold law; no external benchmarks or machine-checked derivations are mentioned.

free parameters (1)
  • Anchoring Threshold Law divisor
    The specific form floor(n/3) is presented as predictive but its selection appears tuned to the observed hop-depth pattern.
axioms (1)
  • domain assumption: Models treat a single confirmed intermediate fact as a reliable anchor that triggers parametric completion of remaining steps
    Invoked in the definition of anchored confabulation and the causal injection setup.
invented entities (1)
  • Parametric Hallucination Confidence (PHC) · no independent evidence
    purpose: Quantitative measure of the confident-wrong-answer rate under partial evidence
    New metric introduced to track the non-monotonic effect

pith-pipeline@v0.9.0 · 5545 in / 1448 out tokens · 52537 ms · 2026-05-13T22:09:30.942012+00:00 · methodology

