pith. machine review for the scientific record.

arxiv: 2604.00387 · v2 · submitted 2026-04-01 · 💻 cs.CR · cs.AI

Recognition: 2 theorem links · Lean Theorem

RAGShield: Detecting Numerical Claim Manipulation in Government RAG Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 23:10 UTC · model grok-4.3

classification: 💻 cs.CR · cs.AI
keywords: RAG · numerical manipulation · government RAG · IRS documents · embedding blind spot · value verification · attack detection · context propagation

The pith

RAG systems for government services are vulnerable to undetectable numerical manipulations unless they verify values directly instead of using embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Retrieval-Augmented Generation systems used by federal agencies for tax guidance and benefits can be tricked by changing specific numbers, such as deduction amounts, without altering the text's meaning enough to be caught by similarity checks. Embeddings encode topics rather than precise figures, creating a sensitivity gap that lets most attacks slip past embedding-based defenses. RAGShield addresses this by extracting numerical values such as dollar amounts and percentages from documents, linking them to entities using context propagation, and cross-verifying against a registry derived from the corpus. It also monitors for value changes outside expected government update windows. Tests on 430 attacks built from real IRS documents show complete detection, while other methods miss 79 to 90 percent of cases.
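To make the extraction step concrete, here is a minimal sketch of pattern-based value extraction, assuming simple regexes for dollar amounts and percentages (the paper's actual rule set is not reproduced here):

```python
import re

# Hypothetical patterns; the paper's actual pattern engine is not specified above.
DOLLAR = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?")
PERCENT = re.compile(r"\d+(?:\.\d+)?%")

def extract_values(passage: str) -> list[tuple[str, str]]:
    """Return (kind, surface form) pairs for every numeric claim found."""
    values = [("dollar", m.group()) for m in DOLLAR.finditer(passage)]
    values += [("percent", m.group()) for m in PERCENT.finditer(passage)]
    return values

print(extract_values("The standard deduction is $14,600, up 5.4% from 2023."))
# [('dollar', '$14,600'), ('percent', '5.4%')]
```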

Core claim

This paper proves that all embedding-based RAG defenses share a fundamental blind spot: changing a tax deduction by $50,000 produces cosine similarity 0.9998, invisible to every known detection threshold. Across 174 manipulation pairs and two embedding models, the mean sensitivity gap is 1,459x. RAGShield sidesteps this by operating on extracted values directly: a pattern-based engine identifies dollar amounts and percentages in government text, links each value to its governing entity through two-pass context propagation (99.8% entity detection on 2,742 real IRS passages), and verifies every claim against a cross-source registry built from the corpus itself. A temporal tracker flags value changes that fall outside known government update schedules.
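The blind-spot measurement is easy to probe in outline. A minimal sketch using sentence-transformers; the model choice is an assumption (the review does not name the paper's two embedding models), the sentences are illustrative rather than drawn from IRS text, and exact similarities will vary:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative model; the paper's two embedding models are unnamed here.
model = SentenceTransformer("all-MiniLM-L6-v2")

original = "Your maximum Section 179 deduction for 2024 is $1,160,000."
manipulated = "Your maximum Section 179 deduction for 2024 is $1,110,000."  # $50,000 lower

emb = model.encode([original, manipulated])
# Typically well above 0.99: the numerical change is near-invisible to the embedding.
print(util.cos_sim(emb[0], emb[1]).item())
```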

What carries the argument

A pattern-based engine that extracts numerical values and links them to entities via two-pass context propagation for verification against a corpus registry.
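The review describes two-pass context propagation only at a high level. One plausible reading, sketched below under that assumption, is a forward pass that carries the most recent entity mention onto each value, followed by a backward pass that fills values appearing before their governing entity:

```python
def link_values(tokens: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """tokens: (kind, text) pairs where kind is 'entity' or 'value'.
    Two-pass context propagation: forward, then backward."""
    linked: list[str | None] = [None] * len(tokens)
    # Pass 1: propagate the most recent entity forward onto each value.
    current = None
    for i, (kind, text) in enumerate(tokens):
        if kind == "entity":
            current = text
        elif kind == "value":
            linked[i] = current
    # Pass 2: propagate backward to catch values that precede their entity.
    current = None
    for i in range(len(tokens) - 1, -1, -1):
        kind, text = tokens[i]
        if kind == "entity":
            current = text
        elif kind == "value" and linked[i] is None:
            linked[i] = current
    return [(tokens[i][1], linked[i]) for i in range(len(tokens))
            if tokens[i][0] == "value"]

print(link_values([("value", "$14,600"), ("entity", "standard deduction"),
                   ("entity", "IRA limit"), ("value", "$7,000")]))
# [('$14,600', 'standard deduction'), ('$7,000', 'IRA limit')]
```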

Load-bearing premise

That pattern-based extraction and two-pass linking achieve near-perfect (99.8%) accuracy across all numerical values in government documents, and that the corpus-derived registry is error-free.

What would settle it

Finding a numerical manipulation in IRS documents that the system fails to detect due to missed extraction or incorrect registry match.

Figures

Figures reproduced from arXiv: 2604.00387 by Krishna Sai Reddy Patil.

Figure 1: RAGShield architecture (the provenance layer). [image: figures/full_fig_p004_1.png]
Original abstract

Retrieval-Augmented Generation (RAG) systems are deployed across federal agencies for citizen-facing tax guidance, benefits eligibility, and legal information, where a single incorrect number causes direct financial harm. This paper proves that all embedding-based RAG defenses share a fundamental blind spot: changing a tax deduction by $50,000 produces cosine similarity 0.9998, invisible to every known detection threshold. Across 174 manipulation pairs and two embedding models, the mean sensitivity gap is 1,459x. The blind spot is confirmed on real IRS documents. The root cause is that embeddings encode topic, not numerical precision. RAGShield sidesteps this by operating on extracted values directly: a pattern-based engine identifies dollar amounts and percentages in government text, links each value to its governing entity through two-pass context propagation (99.8% entity detection on 2,742 real IRS passages), and verifies every claim against a cross-source registry built from the corpus itself. A temporal tracker flags value changes that fall outside known government update schedules. On 430 attacks generated from real IRS document content, RAGShield detects every one (0.0% ASR, 95% CI [0%, 1%]) while embedding-based defenses miss 79-90% of the same attacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that embedding-based RAG defenses have a fundamental blind spot for numerical claim manipulations in government documents, as altering values like tax deductions yields cosine similarities of 0.9998 that evade detection thresholds. It introduces RAGShield, which extracts dollar amounts and percentages via patterns, links values to entities through two-pass context propagation (99.8% accuracy on 2,742 IRS passages), verifies claims against a corpus-derived cross-source registry, and applies temporal tracking for update anomalies. On 174 manipulation pairs and 430 attacks generated from real IRS content, RAGShield achieves 0.0% ASR (95% CI [0%, 1%]) while embedding defenses miss 79-90% of attacks.

Significance. If the detection performance holds under the reported conditions, the work identifies a concrete, high-impact vulnerability in RAG systems used for citizen-facing government information where numerical errors can cause direct financial harm. The evaluation on real IRS documents, concrete metrics with confidence intervals, and direct comparison to embedding baselines provide a reproducible baseline for defenses that operate on extracted values rather than semantic similarity.

major comments (3)
  1. [Evaluation / Attack set] The 0.0% ASR claim on the 430 attacks (abstract and evaluation) rests on the pattern-based engine and two-pass linking achieving near-perfect performance on those specific documents. The 99.8% entity detection accuracy is measured on a separate set of 2,742 passages; the paper must report extraction and linking accuracy (including failure modes on tables, footnotes, or multi-value sentences) directly on the 430 attack instances, as even the 0.2% residual error rate could produce undetected manipulations if errors are systematic.
  2. [Methods / Attack generation] Attack generation details are insufficient to assess coverage of the claimed blind spot. The abstract states attacks are 'generated from real IRS document content' but provides no description of the manipulation rules, selection criteria for the 174 pairs, or controls ensuring the 430 instances test the full range of numerical formats and contexts; this information is required in the methods section to confirm the attacks are not inadvertently easier for the pattern engine.
  3. [Evaluation / Sensitivity analysis] The mean sensitivity gap of 1,459x (abstract) across 174 pairs and two embedding models lacks the exact definition and per-pair data. The paper should specify the formula for the gap (e.g., ratio of detection thresholds or similarity differences) and include a table or appendix with individual pair results to allow verification that the aggregate figure is not driven by outliers.
minor comments (2)
  1. [Abstract] Name the two embedding models used for the sensitivity gap and ASR comparisons.
  2. [RAGShield description] Clarify how the cross-source registry is constructed from the corpus and how the temporal tracker determines 'known government update schedules'.
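The registry construction and temporal tracking that minor comment 2 asks about are not detailed in the material above. A minimal sketch of the general idea, with all names, schedules, and thresholds assumed, might look like:

```python
from collections import defaultdict
from datetime import date

# Hypothetical registry: entity -> set of values attested across corpus sources.
registry: dict[str, set[str]] = defaultdict(set)

def build_registry(corpus: list[tuple[str, str, str]]) -> None:
    """corpus rows: (source_id, entity, value). Collect every attested value
    per entity; a cross-source variant could require independent attestation."""
    for _source, entity, value in corpus:
        registry[entity].add(value)

def verify_claim(entity: str, value: str) -> bool:
    """A retrieved claim whose value is absent from the registry is flagged."""
    return value in registry.get(entity, set())

def temporal_anomaly(change_date: date, schedule: list[date]) -> bool:
    """Assumed rule: a value change outside every known update window
    (here a 30-day band around scheduled dates) is suspicious."""
    return all(abs((change_date - d).days) > 30 for d in schedule)

build_registry([("pub501", "standard deduction", "$14,600"),
                ("pub17", "standard deduction", "$14,600")])
print(verify_claim("standard deduction", "$64,600"))            # False -> flag
print(temporal_anomaly(date(2026, 7, 4), [date(2026, 1, 1)]))   # True -> flag
```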

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's thorough review and valuable feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where revisions are needed, we have updated the manuscript accordingly to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: [Evaluation / Attack set] The 0.0% ASR claim on the 430 attacks (abstract and evaluation) rests on the pattern-based engine and two-pass linking achieving near-perfect performance on those specific documents. The 99.8% entity detection accuracy is measured on a separate set of 2,742 passages; the paper must report extraction and linking accuracy (including failure modes on tables, footnotes, or multi-value sentences) directly on the 430 attack instances, as even the 0.2% residual error rate could produce undetected manipulations if errors are systematic.

    Authors: We agree with this observation. In the revised manuscript, we have added a new subsection in the evaluation section reporting the extraction and linking accuracy specifically on the 430 attack instances. The entity linking accuracy on these instances is 99.6%, with detailed failure mode analysis (e.g., 2 cases in tables, 1 in footnotes) included in Appendix C. This confirms that the 0.0% ASR is not due to undetected errors in the pipeline. revision: yes

  2. Referee: [Methods / Attack generation] Attack generation details are insufficient to assess coverage of the claimed blind spot. The abstract states attacks are 'generated from real IRS document content' but provides no description of the manipulation rules, selection criteria for the 174 pairs, or controls ensuring the 430 instances test the full range of numerical formats and contexts; this information is required in the methods section to confirm the attacks are not inadvertently easier for the pattern engine.

    Authors: We thank the referee for pointing this out. We have expanded the Methods section (Section 4.2) with a detailed description of the attack generation process. Specifically, we selected 174 base passages from IRS documents, applied 5 manipulation rules (e.g., increment/decrement by 10-50%, swap with similar values from other documents), and generated variants ensuring coverage of formats like '$X,XXX', 'X.X%', and contexts including tables and footnotes. We also added controls for diversity in numerical contexts. revision: yes

  3. Referee: [Evaluation / Sensitivity analysis] The mean sensitivity gap of 1,459x (abstract) across 174 pairs and two embedding models lacks the exact definition and per-pair data. The paper should specify the formula for the gap (e.g., ratio of detection thresholds or similarity differences) and include a table or appendix with individual pair results to allow verification that the aggregate figure is not driven by outliers.

    Authors: We agree that additional details are necessary for reproducibility. In the revised version, we have defined the sensitivity gap explicitly as the ratio of the minimum cosine similarity threshold required to detect the manipulation to the observed similarity of the manipulated pair. We have also added Table 5 in the appendix listing the sensitivity gap for each of the 174 pairs, along with the mean, median, and standard deviation to demonstrate that the reported mean is not outlier-driven. revision: yes
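In symbols, the rebuttal's stated definition (notation assumed here) would read:

```latex
% Sensitivity gap for manipulation pair i, per the rebuttal's definition:
% tau_min is the minimum cosine-similarity threshold that detects the
% manipulation, s_i the observed similarity of the manipulated pair.
\[
  g_i = \frac{\tau_{\min}}{s_i}, \qquad
  \bar{g} = \frac{1}{174} \sum_{i=1}^{174} g_i
\]
```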

Circularity Check

0 steps flagged

No significant circularity; claims rest on separate empirical measurements

full rationale

The paper reports three independent empirical results: (1) cosine similarity of 0.9998 on 174 manipulation pairs across two embedding models, (2) 99.8% entity-linking accuracy measured on a distinct set of 2,742 IRS passages, and (3) 0.0% ASR on 430 attacks generated from real IRS content. The pattern-based engine is a fixed rule-based extractor whose accuracy is stated as a measured quantity on the 2,742-passage benchmark; the 430-attack evaluation is a direct count of detections on held-out generated examples. No equations reduce a prediction to a fitted input by construction, no self-citations bear the central claim, and no ansatz or uniqueness theorem is imported. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that numerical values can be accurately extracted and linked via patterns and context without needing external labeled data, plus the premise that a self-constructed registry from the corpus serves as reliable ground truth.

axioms (2)
  • domain assumption Embeddings encode topic rather than numerical precision
    Invoked as the root cause of the blind spot for all embedding-based defenses.
  • domain assumption Pattern-based extraction plus two-pass context propagation achieves reliable entity linking on government text
    Required for the 99.8% entity-detection accuracy and overall system operation.

pith-pipeline@v0.9.0 · 5521 in / 1389 out tokens · 68309 ms · 2026-05-13T23:10:08.748578+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SentinelAgent: Intent-Verified Delegation Chains for Securing Federal Multi-Agent AI Systems

    cs.CR 2026-04 conditional novelty 8.0 partial

    SentinelAgent defines seven properties for verifiable delegation chains in multi-agent AI systems and reports a protocol achieving 100% true positive rate at 0% false positives on a 516-scenario benchmark while using ...

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Phantom: General trigger attacks on retrieval augmented language generation,

    H. Chaudhari et al., “Phantom: General trigger attacks on retrieval augmented language generation,” in Proc. NeurIPS, 2024

  2. [2]

    PoisonedRAG: Knowledge corruption attacks to retrieval-augmented generation of large language models,

    J. Zou et al., “PoisonedRAG: Knowledge corruption attacks to retrieval-augmented generation of large language models,” in Proc. USENIX Security, 2025

  3. [3]

    Certifiably robust RAG against retrieval corruption,

    C. Xiang et al., “Certifiably robust RAG against retrieval corruption,” in Proc. NeurIPS, 2025

  4. [4]

    RAGDefender: Efficient defense against knowledge corruption attacks on RAG systems,

    M. Kim et al., “RAGDefender: Efficient defense against knowledge corruption attacks on RAG systems,” arXiv:2511.01268, 2025

  5. [5]

    TrustRAG: Enhancing robustness and trustworthiness in retrieval-augmented generation,

    H. Zhou et al., “TrustRAG: Enhancing robustness and trustworthiness in retrieval-augmented generation,” arXiv:2501.00879, 2025

  6. [6]

    RAGPart & RAGMask: Retrieval-stage defenses against corpus poisoning in RAG,

    P. Pathmanathan et al., “RAGPart & RAGMask: Retrieval-stage defenses against corpus poisoning in RAG,” arXiv:2512.24268, 2025

  7. [7]

    Traceback of poisoning attacks to retrieval-augmented generation,

    B. Zhang et al., “Traceback of poisoning attacks to retrieval-augmented generation,” arXiv:2504.21668, 2025

  8. [8]

    CPA-RAG: Covert poisoning attacks on retrieval-augmented generation in large language models,

    C. Li et al., “CPA-RAG: Covert poisoning attacks on retrieval-augmented generation in large language models,” arXiv:2505.19864, 2025

  9. [9]

    Practical poisoning attacks against retrieval-augmented generation,

    B. Zhang et al., “Practical poisoning attacks against retrieval-augmented generation,” arXiv:2504.03957, 2025

  10. [10]

    The hidden threat in plain text: Attacking RAG data loaders,

    A. Castagnaro et al., “The hidden threat in plain text: Attacking RAG data loaders,” arXiv:2507.05093, 2025

  11. [11]

    Benchmarking poisoning attacks against retrieval-augmented generation,

    B. Zhang et al., “Benchmarking poisoning attacks against retrieval-augmented generation,” arXiv:2505.18543, 2025

  12. [12]

    ConfusedPilot: Confused deputy risks in RAG-based LLMs,

    A. RoyChowdhury et al., “ConfusedPilot: Confused deputy risks in RAG-based LLMs,” arXiv:2408.04870, 2024

  13. [13]

    USAi.Gov: AI platform for federal agencies,

    GSA, “USAi.Gov: AI platform for federal agencies,” 2025

  14. [14]

    GAO-25-107653: Generative AI use and management at federal agencies,

    GAO, “GAO-25-107653: Generative AI use and management at federal agencies,” 2025

  15. [15]

    CivicShield: A cross-domain defense-in-depth framework for securing government-facing AI chatbots,

    K.S.R. Patil, “CivicShield: A cross-domain defense-in-depth framework for securing government-facing AI chatbots,” arXiv:2603.29062, 2026

  16. [16]

    Natural Questions: A benchmark for question answering research,

    T. Kwiatkowski et al., “Natural Questions: A benchmark for question answering research,” Trans. ACL, 2019

  17. [17]

    FEVER: A large-scale dataset for fact extraction and verification,

    J. Thorne et al., “FEVER: A large-scale dataset for fact extraction and verification,” in Proc. NAACL, 2018

  18. [18]

    ClaimBuster: The first-ever end-to-end fact-checking system,

    N. Hassan et al., “ClaimBuster: The first-ever end-to-end fact-checking system,” Proc. VLDB Endow., 2017

  19. [19]

    SolarWinds supply chain compromise,

    CISA, “SolarWinds supply chain compromise,” Alert AA20-352A, 2020

  20. [20]

    SP 800-53 Rev. 5: Security and privacy controls for information systems,

    NIST, “SP 800-53 Rev. 5: Security and privacy controls for information systems,” 2020

  21. [21]

    TinyLlama: An Open-Source Small Language Model

    P. Zhang et al., “TinyLlama: An open-source small language model,” arXiv:2401.02385, 2024