Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict

Huan Xu; Pin Qian; Shuhuai Lin; Sipeng Zhang; Su Wang; Xinpeng Wei; Yihang Chen

arxiv: 2605.14473 · v3 · pith:77ND325Onew · submitted 2026-05-14 · 💻 cs.CL · cs.AI

Yihang Chen, Pin Qian, Su Wang, Sipeng Zhang, Huan Xu, Shuhuai Lin, Xinpeng Wei

Pith reviewed 2026-05-19 16:33 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords: retrieval-augmented generation, knowledge conflict, context compliance, belief decomposition, adversarial evaluation, temporal robustness, inference-time intervention

The pith

Context-Driven Decomposition diagnoses when RAG follows conflicting retrieved context instead of its own knowledge and intervenes to improve robustness.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Retrieval-augmented generation can let external documents override a model's built-in facts when the two sources disagree. The paper introduces Context-Driven Decomposition as an inference-time method to separate and measure each source's influence on the answer. In controlled tests with injected misconceptions, standard RAG drops to 15 percent accuracy while the new decomposition raises performance and holds up better when facts shift over time or include extra noise. Accuracy improvements appear across different model families, yet the underlying causal links between reasoning steps and final answers do not transfer uniformly. The work treats context compliance as a distinct structural feature of RAG that can be probed and adjusted separately from ordinary retrieval quality.

Core claim

The paper defines the Context-Compliance Regime as the case where retrieved context dominates the final answer even under direct conflict with parametric knowledge, and presents Context-Driven Decomposition (CDD) as an inference-time belief-decomposition probe and intervention mechanism. On TruthfulQA misconception injection, standard RAG reaches only 15.0 percent accuracy; on the Epi-Scale adversarial benchmark, CDD reaches 71.3 percent on temporal shifts and 69.9 percent on distractor evidence. Accuracy gains transfer to Gemini-2.5-Flash and the Claude variants, but rationale-answer causal coupling does not transfer, indicating that context compliance forms a measurable axis distinct from single-

What carries the argument

Context-Driven Decomposition (CDD), an inference-time belief-decomposition probe that isolates causal influence of context versus parametric knowledge and enables controlled intervention on retrieval conflicts.

If this is right

Standard RAG reaches only 15.0 percent accuracy when context injects misconceptions on TruthfulQA.
CDD accuracy gains transfer across Gemini and Claude model families, though causal sensitivity of rationale to answer does not.
Explicit conflict decomposition raises robustness to 71.3 percent under temporal drift and 69.9 percent with noisy distractors.
Context compliance operates as a structural axis separate from retrieval quality or single-method robustness.
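
To gauge how much sampling uncertainty sits behind the 15.0 percent figure, here is a minimal sketch of a 95% Wilson score interval. It assumes the figure corresponds to 75 correct answers out of the N=500 items quoted in the abstract; the function name `wilson_interval` is illustrative and does not come from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Assumption: 15.0% of N=500 means 75 correct answers.
low, high = wilson_interval(75, 500)
```

Under that assumption the interval runs from roughly 12 percent to roughly 18 percent, so the headline number is far below chance-level agreement with the truth regardless of sampling noise.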

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

CDD-style checks could be inserted into existing RAG pipelines as a lightweight conflict detector before final generation.
Model-family differences in causal coupling suggest that internal architecture affects how conflicts are resolved beyond surface accuracy.
The Epi-Scale benchmark release could encourage systematic comparison of conflict-handling methods across retrieval pipelines.
Similar decomposition approaches might extend to other generation tasks that combine external documents with pretrained knowledge.

Load-bearing premise

The CDD decomposition performed at inference time accurately isolates the causal contribution of context versus parametric knowledge without introducing its own artifacts or requiring model-specific tuning.

What would settle it

Measure whether turning CDD on or off on a fixed set of conflicting question-context pairs produces answer changes that match the decomposed context and parametric scores in a held-out test set.

Figures

Figures reproduced from arXiv: 2605.14473 by Huan Xu, Pin Qian, Shuhuai Lin, Sipeng Zhang, Su Wang, Xinpeng Wei, Yihang Chen.

**Figure 1.** CDD pipeline with the CDD-α NLI-gated bypass. High-conflict samples enter the full decomposition probe, while low-conflict samples follow the Standard RAG bypass before converging at the final answer.

4.2 Evaluation Scope (excerpt): We use three evaluation settings, each with a different diagnostic role. Epi-Scale synthetic conflict is a controlled perturbation stress test for isolating specific conflict types. Tr…

**Figure 2.** Adversarial accuracy on the full Epi-Scale adversarial split (…

**Figure 3.** Mistake-injection causal sensitivity across model families (N=100 per cell). On Gemini-2.5-…

**Figure 4.** Compute–accuracy trade-off on the gemini-2.5-flash-001 adversarial split. CDD-α at τ = 0.7 routes 30% of samples through deep decomposition and reaches 68.5%; the remaining 9.6 pp gap to Full CDD costs roughly 1.4× more compute. Relative compute is approximate; exact ratios depend on token-level prompt and rationale lengths.
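
The gated bypass described in the Figure 1 and Figure 4 captions can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: `conflict_score` is a hypothetical NLI contradiction probability, the rule "score at or above τ goes to the deep branch" is assumed, and the per-branch costs passed to `blended_compute` are placeholders rather than measured values.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Sample:
    question: str
    context: str
    conflict_score: float  # hypothetical NLI contradiction probability in [0, 1]


def route(sample: Sample, tau: float = 0.7) -> str:
    """Assumed gate rule: high-conflict samples take the full decomposition branch."""
    return "full_cdd" if sample.conflict_score >= tau else "standard_rag"


def deep_fraction(samples: Sequence[Sample], tau: float = 0.7) -> float:
    """Share of samples that the gate sends through full decomposition."""
    if not samples:
        raise ValueError("samples must be non-empty")
    return sum(route(s, tau) == "full_cdd" for s in samples) / len(samples)


def blended_compute(frac_deep: float, cost_bypass: float, cost_deep: float) -> float:
    """Expected per-sample cost when frac_deep of the traffic takes the deep branch."""
    return (1 - frac_deep) * cost_bypass + frac_deep * cost_deep
```

For example, ten samples with scores 0.0, 0.1, ..., 0.9 give `deep_fraction(...) == 0.3` at τ = 0.7, matching the 30% routing share in the caption. With placeholder costs of 1.0 (bypass) and 3.0 (deep), `blended_compute(0.3, 1.0, 3.0)` is about 1.6; the real ratio depends on prompt and rationale lengths, as the caption notes.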

Original abstract

The Context-Compliance Regime in Retrieval-Augmented Generation (RAG) occurs when retrieved context dominates the final answer even when it conflicts with the model's parametric knowledge. Accuracy alone does not reveal how retrieved context causally shapes answers under such conflict. We introduce Context-Driven Decomposition (CDD), a belief-decomposition probe that operates at inference time and serves as an intervention mechanism for controlled retrieval conflict. Across Epi-Scale stress tests, TruthfulQA misconception injection, and cross-model reruns, CDD exposes three patterns. P1: context compliance is measurable in an upper-bound adversarial setting, where Standard RAG reaches 15.0% accuracy on TruthfulQA misconception injection (N=500). P2: adversarial accuracy gains transfer across model families -- CDD improves accuracy on Gemini-2.5-Flash and on Claude Haiku/Sonnet/Opus -- but rationale-answer causal coupling does not transfer. CDD reaches 64.1% mistake-injection causal sensitivity on Gemini-2.5-Flash, while sensitivities for all three Claude variants fall in the [-3%, +7%] range, suggesting that the Claude-side accuracy gains operate through a mechanism distinct from the explicit conflict-resolution trace. P3: explicit conflict decomposition improves robustness under temporal drift and noisy distractors, with CDD reaching 71.3% on temporal shifts and 69.9% on distractor evidence on the full Epi-Scale adversarial benchmark. These three patterns identify context-compliance as a structural axis along which standard RAG can be probed and intervened on, distinct from retrieval-quality or single-method robustness questions, and motivate releasing Epi-Scale for systematic study across model families and retrieval pipelines.
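
The abstract does not spell out how mistake-injection causal sensitivity is computed. One plausible reading, offered here only as an assumption, is the accuracy drop when a mistake is injected into the rationale before the final answer is produced. That reading allows negative values, which is consistent with the reported [-3%, +7%] range. The names `accuracy` and `causal_sensitivity` are illustrative.

```python
from typing import Sequence


def accuracy(preds: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of predictions that exactly match the gold answers."""
    if len(preds) != len(gold) or not gold:
        raise ValueError("preds and gold must be non-empty and the same length")
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def causal_sensitivity(
    clean_preds: Sequence[str],
    injected_preds: Sequence[str],
    gold: Sequence[str],
) -> float:
    """Assumed definition: accuracy with the clean rationale minus accuracy
    after a mistake is injected into the rationale. A value near zero means
    the final answer barely depends on the written rationale."""
    return accuracy(clean_preds, gold) - accuracy(injected_preds, gold)
```

Under this reading, a model whose answers follow its rationale loses accuracy when the rationale is corrupted (large positive value), while a model that answers through some other route shows a value near zero even if its overall accuracy improved.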

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

CDD gives measurable accuracy lifts in RAG conflict settings and releases a new benchmark, but the causal-sensitivity claims look shaky because of the large Gemini-versus-Claude gap.

The main thing to know is that this paper introduces Context-Driven Decomposition as an inference-time probe that both diagnoses and intervenes on context compliance in RAG when retrieved material conflicts with parametric knowledge. It reports concrete accuracy gains and releases Epi-Scale for others to use.

The cross-model results are the clearest new observation: accuracy improves across Gemini and the Claude family, yet rationale-answer causal coupling only appears strongly in Gemini (64.1% sensitivity) while staying near zero for Claude variants. That split suggests the accuracy benefit and the explicit causal trace may not come from the same mechanism. The work also shows CDD helping on temporal drift (71.3%) and noisy distractors (69.9%) on the adversarial benchmark, and it documents standard RAG dropping to 15% accuracy under misconception injection on TruthfulQA. Those patterns are straightforward to check and move the discussion past raw accuracy.

The paper does a reasonable job of running the same probe on multiple models and benchmarks instead of staying with one setup. Releasing the stress-test suite is a concrete service to the community.

The soft spot is the assumption that CDD cleanly isolates causal influence without model-specific artifacts. The big divergence in sensitivity between Gemini and Claude is exactly what you would expect if the decomposition step interacts differently with each model's instruction-following or rationale style. Without reported ablations that remove the decomposition itself, hold the prompt template fixed, or test for prompt sensitivity, the causal interpretation rests on thinner ground than the accuracy numbers. The methods section would need to show those controls explicitly.

This paper is aimed at researchers and engineers who build or evaluate RAG pipelines and want diagnostics for knowledge conflict rather than another retrieval trick. A reader already working on robustness or calibration would get usable ideas and a benchmark to try. It has enough empirical grounding and a clear practical angle to deserve peer review, even if the causal claims will need tightening.

Referee Report

2 major / 2 minor

Summary. The paper introduces Context-Driven Decomposition (CDD), an inference-time belief-decomposition probe and intervention for RAG systems. It diagnoses context compliance when retrieved context conflicts with parametric knowledge, using experiments on TruthfulQA misconception injection (N=500) and Epi-Scale adversarial benchmarks across model families. The central claims are three patterns: standard RAG shows low adversarial accuracy (15.0%), CDD yields transferable accuracy gains but non-transferable rationale-answer causal coupling (64.1% sensitivity on Gemini-2.5-Flash vs. [-3%, +7%] on Claude variants), and CDD improves robustness to temporal drift (71.3%) and noisy distractors (69.9%).

Significance. If the CDD probe validly isolates causal context influence without introducing model-specific artifacts, the work identifies context compliance as a distinct structural axis in RAG separate from retrieval quality or single-method robustness. The cross-family transfer analysis and release of Epi-Scale for systematic study would be useful contributions to diagnosing and mitigating knowledge conflicts in retrieval-augmented systems.

major comments (2)

[§3 (CDD method and inference-time decomposition)] The claim that CDD accurately isolates the causal influence of context versus parametric knowledge (central to P2) is load-bearing but under-supported: the large divergence in causal sensitivity (64.1% on Gemini-2.5-Flash vs. near zero for all Claude variants), while accuracy gains transfer, is consistent with the probe interacting differently with each model's instruction-following or rationale behavior. No ablations of the decomposition step itself, fixed prompt templates, or identical intervention formats are described to rule out artifacts.
[§5 (Epi-Scale stress tests and results)] The Epi-Scale results present 71.3% on temporal shifts and 69.9% on distractor evidence as robustness gains, but without details on how temporal drift is constructed, the statistical significance of improvements over baselines, or controls for post-hoc benchmark choices, it is unclear whether these numbers establish the claimed improvement under drift and noise.

minor comments (2)

[Abstract] Abstract and §4 could more explicitly separate the accuracy metric from the causal sensitivity metric when describing P2 to prevent readers from conflating transferable performance with transferable causal structure.
[§3] The manuscript would benefit from a clearer operational definition or pseudocode for the CDD intervention format to allow replication across model families.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments identify key areas where additional evidence and clarification can strengthen the claims about the CDD probe's ability to isolate causal context influence and the robustness results on Epi-Scale. We address each major comment below, providing our strongest honest defense while noting planned revisions.

Referee: [§3 (CDD method and inference-time decomposition)] The claim that CDD accurately isolates the causal influence of context versus parametric knowledge (central to P2) is load-bearing but under-supported: the large divergence in causal sensitivity (64.1% on Gemini-2.5-Flash vs. near zero for all Claude variants), while accuracy gains transfer, is consistent with the probe interacting differently with each model's instruction-following or rationale behavior. No ablations of the decomposition step itself, fixed prompt templates, or identical intervention formats are described to rule out artifacts.

Authors: The observed divergence in causal sensitivity is not presented as a flaw but as a central finding supporting P2: accuracy gains transfer across families while rationale-answer causal coupling does not, indicating that context compliance can be resolved through distinct mechanisms depending on model architecture. This pattern is consistent with the paper's argument that context compliance is a structural axis separate from general instruction-following. That said, we acknowledge the concern that unablated prompt or intervention choices could introduce model-specific artifacts. In the revision we will add explicit ablations of the decomposition step, including alternative prompt templates and matched intervention formats across models, to more directly test whether the sensitivity differences persist independently of these factors. revision: yes

Referee: [§5 (Epi-Scale stress tests and results)] The Epi-Scale results present 71.3% on temporal shifts and 69.9% on distractor evidence as robustness gains, but without details on how temporal drift is constructed, the statistical significance of improvements over baselines, or controls for post-hoc benchmark choices, it is unclear whether these numbers establish the claimed improvement under drift and noise.

Authors: We agree that the current presentation leaves the construction details and statistical grounding implicit. The reported figures reflect CDD's performance on the full Epi-Scale adversarial benchmark under the described stress conditions, but additional documentation is required to allow readers to evaluate the improvements. In the revised manuscript we will expand the Epi-Scale section with a precise description of temporal-drift construction, report statistical significance of gains relative to the standard-RAG baseline, and document the controls applied to benchmark selection and evaluation order. revision: yes
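
One way the promised significance report could look is a paired percentile bootstrap over per-item correctness, since the baseline and CDD are scored on the same questions. The sketch below is an assumption about a reasonable procedure, not the authors' planned analysis; `paired_bootstrap_diff` and its parameters are hypothetical names, and the inputs are 0/1 correctness vectors that the source does not provide.

```python
import random
from typing import Sequence, Tuple


def paired_bootstrap_diff(
    baseline: Sequence[int],
    treated: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (point estimate, lower, upper) for accuracy(treated) - accuracy(baseline).

    Items are resampled with replacement as pairs, so the per-question
    correlation between the two systems is preserved.
    """
    if len(baseline) != len(treated) or not baseline:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(baseline)
    diffs = [t - b for b, t in zip(baseline, treated)]
    point = sum(diffs) / n
    stats = sorted(
        sum(diffs[rng.randrange(n)] for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return point, lower, upper
```

If the lower bound stays above zero, the gain over the standard-RAG baseline is unlikely to be a resampling artifact. An exact McNemar test on the discordant pairs would be a common alternative for the same paired design.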

Circularity Check

0 steps flagged

No significant circularity detected

The paper introduces Context-Driven Decomposition (CDD) as a new inference-time belief-decomposition probe and reports three empirical patterns (P1–P3) from direct measurements on external benchmarks including TruthfulQA misconception injection (15.0% accuracy) and Epi-Scale (71.3% on temporal shifts). These outcomes are obtained via accuracy, sensitivity, and robustness metrics across model families without any equations, fitted parameters, or self-citations that reduce the central claims to quantities defined by construction within the paper. The derivation chain is therefore self-contained and relies on observable experimental results rather than internal redefinitions or imported ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard assumptions in LLM evaluation that accuracy and sensitivity metrics reflect underlying causal mechanisms; no explicit free parameters or invented entities are described in the abstract.

axioms (1)

Domain assumption: Accuracy and causal sensitivity metrics on injected misconceptions accurately reflect context compliance behavior.
Role: Used to interpret the three patterns P1-P3 as evidence of structural properties of RAG.

pith-pipeline@v0.9.0 · 5856 in / 1189 out tokens · 36039 ms · 2026-05-19T16:33:59.176259+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We introduce Context-Driven Decomposition (CDD), a belief-decomposition probe that operates at inference time... five-step reasoning trace (Step 1: Contextual Extraction... Step 5: Resolution)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Mistake-injection causal sensitivity... 64.1% on Gemini-2.5-Flash... [-3%, +7%] on Claude variants

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.