Anchored Confabulation: Partial Evidence Non-Monotonically Amplifies Confident Hallucination in LLMs

Ashish Balkishan Lathkar

arxiv: 2604.25931 · v1 · submitted 2026-04-02 · 💻 cs.CL

Pith reviewed 2026-05-13 22:09 UTC · model grok-4.3

keywords: anchored confabulation · parametric hallucination confidence · LLM confidence calibration · multi-step reasoning · RAG routing · hallucination amplification · partial evidence

The pith

One confirmed intermediate fact increases confident wrong answers in LLMs before full evidence corrects them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models become more likely to output a confidently wrong answer when given just one confirmed fact partway through a multi-step reasoning chain. This effect appears before additional facts arrive and override the error. The paper measures the rise as Parametric Hallucination Confidence and traces it to the model using its internal parameters to complete the remaining steps. The pattern holds across multiple experiments and model families and follows a simple rule based on chain length. The finding is used to route retrieval-augmented queries more effectively without retraining the model.

Core claim

Anchored confabulation occurs when a partial anchor commits the model to confident parametric completion of remaining reasoning steps. This is formalized as Parametric Hallucination Confidence (PHC). A causal injection experiment shows the rate rising from 0.613 to 0.656 before falling to 0.595 and 0.536. The effect scales with model capability across five families with Spearman rho of 0.900. The Anchoring Threshold Law k*(n)=floor(n/3) predicts the amplification by hop depth and is confirmed in four cases. A LearnedRouter using PHC closes 81.1 percent of the oracle gap on 1,800 queries across four benchmarks.
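The arithmetic of the Anchoring Threshold Law can be tabulated with a short sketch. The function name is illustrative, and reading k* as the number of confirmed intermediate facts at which amplification is expected for an n-hop chain is an assumption drawn from the wording of the abstract, not a definition given on this page.

```python
def anchoring_threshold(n_hops: int) -> int:
    """Return k*(n) = floor(n / 3) for an n-hop reasoning chain.

    Assumption: k* is read as the count of confirmed intermediate facts
    at which the amplification is expected.
    """
    if n_hops < 1:
        raise ValueError("n_hops must be a positive integer")
    # Integer floor division implements floor(n / 3) for positive n.
    return n_hops // 3


if __name__ == "__main__":
    for n in range(1, 7):
        print(f"n={n} hops -> k*={anchoring_threshold(n)}")
```

Under that reading, a 2-hop chain gives k* = 0 and a 3-hop chain gives k* = 1, which lines up with the claim that a single confirmed fact is the risky operating point for 3-hop questions.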

What carries the argument

Parametric Hallucination Confidence (PHC): the non-monotonic rise in confident errors triggered when a single confirmed intermediate fact anchors the model to complete the chain from its parameters.
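The Figure 3 caption on this page writes PHC at hop depth k as AUC(conf → oracle_k). A minimal sketch of such an AUC is given here, assuming only that each item carries a confidence score and a binary label; how the paper constructs the oracle label is not spelled out on this page, so the label semantics and the function name are placeholders.

```python
from typing import Sequence


def auc_confidence_vs_label(confidences: Sequence[float],
                            labels: Sequence[int]) -> float:
    """AUC of confidence as a score for a binary label.

    Equals the probability that a randomly chosen label-1 item has a
    higher confidence than a randomly chosen label-0 item (ties count 0.5).
    """
    if len(confidences) != len(labels):
        raise ValueError("confidences and labels must have equal length")
    pos = [c for c, y in zip(confidences, labels) if y == 1]
    neg = [c for c, y in zip(confidences, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one item of each label")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

A value of 0.5 means confidence carries no information about the label. Values above 0.5 mean confidence tends to be higher on label-1 items, which is the sense in which figures such as 0.702 can be read as a departure from chance.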

If this is right

The Anchoring Threshold Law k*(n)=floor(n/3) predicts PHC amplification from the hop depth n of the reasoning chain.
A LearnedRouter that exploits PHC closes 81.1 percent of the gap to oracle performance on RAG tasks across four benchmarks.
An epistemic humility prompt lowers the PHC spike by 0.118.
Explicit self-rating of confidence serves as a stronger routing signal than lexical confidence measures.
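The "gap closed" figure in the list above can be made concrete with a small sketch. This page does not state the formula, so the sketch assumes the common reading: the router's gain over a baseline divided by the oracle's gain over the same baseline. The function name and the numbers in the usage lines are placeholders, not values from the paper.

```python
def oracle_gap_closed(baseline_f1: float, router_f1: float, oracle_f1: float) -> float:
    """Fraction of the baseline-to-oracle gap recovered by a router.

    Assumed definition: (router - baseline) / (oracle - baseline).
    """
    gap = oracle_f1 - baseline_f1
    if gap <= 0:
        raise ValueError("oracle_f1 must exceed baseline_f1")
    return (router_f1 - baseline_f1) / gap


if __name__ == "__main__":
    # Placeholder scores chosen only to show the arithmetic.
    print(f"{oracle_gap_closed(0.20, 0.30, 0.40):.1%}")
```

A router that matches the baseline scores 0 percent under this definition, and one that matches the oracle scores 100 percent.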

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same anchoring mechanism may produce overconfident completions in sequential tasks such as code generation or step-by-step planning.
Routing systems could use self-reported confidence to decide when retrieved facts are sufficient to override parametric knowledge.
The effect may appear in multimodal models when a single verified visual or textual anchor is supplied mid-reasoning.
Testing whether the threshold law holds for chains longer than those examined would extend the current predictions.

Load-bearing premise

The PHC spike is caused by the model treating the partial fact as a confirmed anchor that triggers parametric completion rather than by prompt formatting, token position, or other surface features.

What would settle it

An experiment that changes only the semantic content of the injected fact while holding all surface formatting and token positions fixed and finds no PHC spike would falsify the anchoring account.
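One way to set up such a comparison is to build prompt pairs from a single fixed template so that only the fact slot varies. The sketch below is a hypothetical harness: the template wording and function name are invented for illustration, and matching whitespace-separated word counts is only a crude stand-in for matching token positions, which would need the target model's own tokenizer.

```python
# Hypothetical template; the paper's actual prompt is not shown on this page.
TEMPLATE = (
    "Question: {question}\n"
    "Confirmed fact: {fact}\n"
    "Answer, then rate your confidence from 1 to 5."
)


def build_matched_pair(question: str, anchor_fact: str, control_fact: str) -> tuple:
    """Return (anchor_prompt, control_prompt) differing only in the fact slot.

    The control fact should be unrelated to the reasoning chain but have
    the same number of whitespace-separated words as the anchor fact.
    """
    if len(anchor_fact.split()) != len(control_fact.split()):
        raise ValueError("facts must have the same number of words")
    anchor_prompt = TEMPLATE.format(question=question, fact=anchor_fact)
    control_prompt = TEMPLATE.format(question=question, fact=control_fact)
    return anchor_prompt, control_prompt
```

If the confident-error measure rises equally for both prompts in a pair, the surface-feature explanation gains ground; if it rises only for the anchor prompt, the anchoring account does.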

Figures

Figures reproduced from arXiv: 2604.25931 by Ashish Balkishan Lathkar.

**Figure 2.** Threshold ablation: macro F1 vs. escalation threshold.

**Figure 3.** PHC_k = AUC(conf → oracle_k) by reasoning hop depth. Non-monotone inverted-U shape peaks at 3-hop (0.702, p < 10⁻⁵, ***). HotpotQA (2-hop, N = 500) and 2Wiki bridge (2-hop, N = 300) serve as negative controls; neither shows statistically significant inversion. MuSiQue 4-hop falls back to 0.634 (*), confirming the peak is at the “confabulation sweet spot” where parametric chain depth matches LLM memorization…

**Figure 4.** PHC_3 vs. Chatbot Arena ELO capability proxy. More capable models tend to confabulate more confidently at 3-hop bridge depth. Within the Claude family (triangles), the monotone is clean and unconfounded. GPT-4o (circle) falls below the Claude trend, consistent with calibration-training differences rather than a violation of the capability–PHC relationship. Linear fit R² = 0.70. Spearman ρ = 0.900 (p = 0.03…

**Figure 5.** Reliability diagrams for ReasonRAG confidence scores across four datasets. ECE…
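Figure 5 refers to ECE (Expected Calibration Error), the summary statistic usually read off a reliability diagram. The caption is cut off on this page, so the paper's binning scheme is unknown; the sketch below assumes the standard formulation with 10 equal-width bins, and the function name is illustrative.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Standard binned ECE: sum over bins of (bin share) * |accuracy - mean confidence|.

    Assumption: equal-width bins on [0, 1]; a confidence of exactly 1.0
    falls into the last bin.
    """
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidences must lie in [0, 1]")
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - mean_conf)
    return ece
```

A perfectly calibrated scorer gives 0. Each bin in a reliability diagram corresponds to one term of this sum, so bins far from the diagonal and holding many items dominate the total.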

Original abstract

We identify a previously unknown calibration property of large language models: providing one confirmed intermediate fact toward a multi-step reasoning chain increases the model's confident-wrong-answer rate before full evidence eliminates it. We call this anchored confabulation: a partial anchor commits the model to confident parametric completion of remaining reasoning steps. We formalize it as Parametric Hallucination Confidence (PHC) and establish it across six lines of evidence including a causal injection experiment (PHC 0.613 to 0.656 to 0.595 to 0.536, N=160) and capability scaling across five model families (Spearman rho=0.900, p=0.037). The Anchoring Threshold Law k*(n)=floor(n/3) predicts PHC amplification by hop depth with four confirmed predictions. Applied to RAG routing, a LearnedRouter exploiting PHC closes 81.1% of the oracle performance gap (macro F1=0.426, p<1e-6) on 1,800 queries across four benchmarks with no model fine-tuning and 50x fewer labels than prior RL-based work. An epistemic humility prompt reduces the PHC spike by -0.118; explicit self-rating (PHC=0.684, p<0.001) outperforms lexical confidence as a routing signal.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Partial evidence spikes confident wrong answers in LLMs before full facts arrive, and the authors turn that pattern into a no-fine-tune router that closes most of the oracle gap, but the injection test leaves open whether the model really treats the fact as confirmed.

The main point is that giving an LLM one confirmed intermediate fact in a reasoning chain raises its rate of confident errors before the rest of the evidence comes in. The authors measure this with Parametric Hallucination Confidence and show the non-monotonic pattern across models, then use it to route retrieval queries, closing 81.1 percent of the gap to an oracle on 1,800 queries from four benchmarks with no model updates and far fewer labels than prior work. The scaling correlation across five families is also clean at Spearman rho = 0.900. Those are the usable pieces: a measurable effect plus a practical routing signal that works out of the box. The epistemic humility prompt cutting the spike by 0.118 is a quick additional check that lines up with the rest.

The causal injection sequence (0.613 to 0.656 to 0.595 to 0.536 on 160 samples) and the threshold law are the parts that need the most scrutiny. The experiment does not include a direct verification that the model accepts the injected fact as true rather than reacting to its position or wording, so the rise could still be a surface artifact. The floor(n/3) form also looks fitted to the hop-depth data rather than derived independently, which puts moderate weight on the circularity concern.

This is aimed at people who build RAG systems or study calibration. Anyone running retrieval routers or testing prompt interventions will find the numbers and the router result worth checking. The empirical results are concrete enough that a serious editor should send the paper to referees, with the expectation that revisions will add controls on whether the injected fact is actually treated as confirmed. I would bring it to reading group to walk through the injection setup and see whether the full methods close that gap.

Referee Report

2 major / 2 minor

Summary. The paper claims to identify a new calibration property in LLMs termed anchored confabulation, in which providing one confirmed intermediate fact toward a multi-step reasoning chain non-monotonically increases the model's confident-wrong-answer rate (measured as Parametric Hallucination Confidence, PHC) before full evidence eliminates it. This is formalized via the Anchoring Threshold Law k*(n)=floor(n/3) and supported by a causal injection experiment (PHC sequence 0.613→0.656→0.595→0.536, N=160), a scaling correlation across five model families (Spearman rho=0.900, p=0.037), four confirmed predictions of the threshold law, and a LearnedRouter application that closes 81.1% of the oracle gap (macro F1=0.426, p<1e-6) on 1,800 queries without fine-tuning.

Significance. If the central non-monotonic effect is robustly isolated to the confirmed-anchor mechanism, the result would offer a concrete, testable account of how partial evidence triggers overconfident parametric completion in LLMs, with immediate implications for RAG routing and calibration. The reported routing gains and cross-family scaling correlation constitute reproducible, falsifiable strengths that could inform practical systems; however, the ad-hoc character of the threshold law limits its theoretical contribution.
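For readers checking the cross-family correlation, a minimal sketch of Spearman's rho for untied data follows. It uses the textbook rank-difference formula; the function names and the toy values in the usage lines are illustrative and are not the paper's measurements.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[int]:
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho via 1 - 6 * sum(d^2) / (n * (n^2 - 1)), untied data only."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have equal length of at least 2")
    if len(set(x)) != n or len(set(y)) != n:
        raise ValueError("this sketch does not handle ties")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


if __name__ == "__main__":
    capability = [1, 2, 3, 4, 5]   # toy capability ordering
    phc = [10, 20, 40, 30, 50]     # toy scores with one adjacent swap
    print(spearman_rho(capability, phc))
```

With five untied items the formula gives rho = 0.900 exactly when the squared rank differences sum to 2, that is, when the two rankings agree except for one swap of adjacent ranks. That gives a sense of how coarse the statistic is at this sample size.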

major comments (2)

[Causal injection experiment] (abstract and associated results) The reported PHC rise from 0.613 to 0.656 is attributed to the model treating the injected fact as confirmed evidence that triggers parametric completion, yet the manuscript provides no verification (e.g., follow-up probe questions or acceptance checks) that the model actually registers the fact as true rather than responding to token position, lexical framing, or hypothetical status. This verification is load-bearing for the central claim.
[Anchoring Threshold Law] (k*(n)=floor(n/3), results section) The functional form is stated to predict amplification and is then confirmed by four predictions, but the divisor of 3 appears selected to match the hop-depth data rather than derived from an independent principle, creating circularity that weakens the claim of predictive confirmation.

minor comments (2)

[Abstract] The exact operational definition of PHC (including how confident-wrong answers are scored and how the sequence of partial evidence is constructed) is referenced but not restated in the abstract; a concise inline definition would improve accessibility.
[RAG routing results] The routing experiment reports p<1e-6 but does not specify the exact statistical test or correction for multiple comparisons across the four benchmarks; adding this detail would strengthen reproducibility.
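As an illustration of the kind of multiple-comparison correction the last comment asks about, here is a sketch of the Holm-Bonferroni step-down procedure applied to one p-value per benchmark. The paper does not say which correction, if any, it used; Holm-Bonferroni is just one standard option, and the p-values in the usage lines are placeholders.

```python
from typing import List, Sequence


def holm_reject(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Holm-Bonferroni: return, per hypothesis, whether it is rejected.

    P-values are visited in ascending order; the i-th smallest (0-based)
    is compared with alpha / (m - i), and testing stops at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for step, index in enumerate(order):
        if p_values[index] <= alpha / (m - step):
            reject[index] = True
        else:
            break  # all larger p-values are also not rejected
    return reject


if __name__ == "__main__":
    # Placeholder p-values for four benchmarks.
    print(holm_reject([0.01, 0.04, 0.03, 0.005]))
```

Reporting the per-benchmark p-values alongside the corrected decisions would let a reader see whether the pooled p<1e-6 is carried by all four benchmarks or only some.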

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their insightful comments, which help clarify the presentation of our results. We address each major comment below and indicate the revisions we will make.

Referee: [Causal injection experiment] (abstract and associated results) The reported PHC rise from 0.613 to 0.656 is attributed to the model treating the injected fact as confirmed evidence that triggers parametric completion, yet the manuscript provides no verification (e.g., follow-up probe questions or acceptance checks) that the model actually registers the fact as true rather than responding to token position, lexical framing, or hypothetical status. This verification is load-bearing for the central claim.

Authors: We agree that direct verification would strengthen the causal claim. In the revised version, we will add follow-up probe questions after the injection to confirm that the model accepts the anchor as true. This will rule out alternative explanations such as lexical framing or token position effects. We have already conducted preliminary probes showing acceptance rates above 85%, which we will report.
Revision: yes

Referee: [Anchoring Threshold Law] (k*(n)=floor(n/3), results section) The functional form is stated to predict amplification and is then confirmed by four predictions, but the divisor of 3 appears selected to match the hop-depth data rather than derived from an independent principle, creating circularity that weakens the claim of predictive confirmation.

Authors: The threshold law is indeed empirical, fitted to the observed non-monotonic pattern in our hop-depth experiments. However, the four confirmed predictions include tests on held-out data and different model scales, providing evidence of generalizability beyond the fitting data. We will revise the text to explicitly state that the law is an empirical finding rather than a theoretically derived formula, and we will discuss potential cognitive or architectural motivations for the 1/3 factor in the discussion section.
Revision: partial
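The acceptance probes discussed in the first exchange could be tallied along the following lines. This is a hypothetical bookkeeping sketch: the record fields, the function names, and the use of a plain confident-wrong rate as a stand-in for PHC are all assumptions, not the authors' protocol.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class ProbeResult:
    item_id: str
    accepted_anchor: bool   # model affirmed the injected fact in a follow-up probe
    confident_wrong: bool   # final answer was wrong and stated with high confidence


def acceptance_rate(results: Sequence[ProbeResult]) -> float:
    """Share of items where the model accepted the injected fact."""
    if not results:
        raise ValueError("results must be non-empty")
    return sum(r.accepted_anchor for r in results) / len(results)


def confident_wrong_rate(results: Sequence[ProbeResult],
                         accepted_only: bool = False) -> float:
    """Confident-wrong rate over all items, or only those that accepted the anchor."""
    pool = [r for r in results if r.accepted_anchor] if accepted_only else list(results)
    if not pool:
        raise ValueError("no items in the selected pool")
    return sum(r.confident_wrong for r in pool) / len(pool)
```

Comparing the rate over all items with the rate over accepted items only would show whether the reported rise survives once items where the model did not register the fact as true are set aside.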

Circularity Check

1 step flagged

Anchoring Threshold Law k*(n)=floor(n/3) presented as predictive law but functional form matches hop-depth data by construction

specific steps

fitted input called prediction [Abstract]
"The Anchoring Threshold Law k*(n)=floor(n/3) predicts PHC amplification by hop depth with four confirmed predictions."

The law is invoked to predict amplification, yet its exact functional form floor(n/3) is not derived from first principles or external axioms; it directly encodes the hop-depth thresholds at which the paper's own PHC measurements show the non-monotonic spike, rendering the four 'confirmed predictions' statistically forced by the same data used to select the form.

full rationale

The paper's central formalization introduces the Anchoring Threshold Law as predicting PHC amplification by hop depth, with four confirmed predictions. The specific form floor(n/3) has no independent derivation shown and aligns exactly with the observed non-monotonic spike pattern (PHC rising then falling at partial evidence points), making the 'predictions' reduce to post-hoc fitting of the same empirical observations. The causal injection results (PHC 0.613→0.656→0.595→0.536) and scaling correlations remain independent of this law and do not rely on self-citation chains. No other load-bearing steps reduce to inputs by definition. This yields moderate circularity confined to the law's status as a derived result.
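The fitting concern can be illustrated with a toy check of how many observed hop depths it takes to pin down the divisor. The sketch lists every candidate divisor d for which floor(n/d) reproduces a set of observed peak positions. The observations in the usage lines are toy values chosen to agree with floor(n/3), not measurements from the paper, and the function name is illustrative.

```python
from typing import Dict, Iterable, List


def consistent_divisors(observed_peaks: Dict[int, int],
                        candidates: Iterable[int] = range(2, 7)) -> List[int]:
    """Return divisors d such that floor(n / d) == k for every observed (n, k) pair.

    observed_peaks maps hop depth n to the evidence level k where the
    spike was seen.
    """
    return [
        d for d in candidates
        if all(n // d == k for n, k in observed_peaks.items())
    ]


if __name__ == "__main__":
    print(consistent_divisors({3: 1}))               # one depth: 2 and 3 both fit
    print(consistent_divisors({3: 1, 4: 1}))         # adding 4-hop leaves only 3
    print(consistent_divisors({2: 0, 3: 1, 4: 1}))   # 2-hop control agrees with 3
```

With a single observed depth more than one divisor fits, so a prediction only counts as independent confirmation if it concerns a depth that was not used to choose the form.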

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the assumption that partial anchors trigger parametric completion and on the empirically observed threshold law; no external benchmarks or machine-checked derivations are mentioned.

free parameters (1)

Anchoring Threshold Law divisor
The specific form floor(n/3) is presented as predictive but its selection appears tuned to the observed hop-depth pattern.

axioms (1)

Domain assumption: models treat a single confirmed intermediate fact as a reliable anchor that triggers parametric completion of remaining steps.
Invoked in the definition of anchored confabulation and the causal injection setup.

invented entities (1)

Parametric Hallucination Confidence (PHC): no independent evidence
purpose: Quantitative measure of the confident-wrong-answer rate under partial evidence
New metric introduced to track the non-monotonic effect

