Recognition: 2 theorem links · Lean Theorem
RAGShield: Detecting Numerical Claim Manipulation in Government RAG Systems
Pith reviewed 2026-05-13 23:10 UTC · model grok-4.3
The pith
RAG systems for government services are vulnerable to undetectable numerical manipulations unless they verify values directly instead of using embeddings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
This paper proves that all embedding-based RAG defenses share a fundamental blind spot: changing a tax deduction by $50,000 produces cosine similarity 0.9998, invisible to every known detection threshold. Across 174 manipulation pairs and two embedding models, the mean sensitivity gap is 1,459x. RAGShield sidesteps this by operating on extracted values directly: a pattern-based engine identifies dollar amounts and percentages in government text, links each value to its governing entity through two-pass context propagation (99.8% entity detection on 2,742 real IRS passages), and verifies every claim against a cross-source registry built from the corpus itself. A temporal tracker flags value changes that fall outside known government update schedules.
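The blind spot can be illustrated with a toy stand-in for an embedding. The sketch below uses a bag-of-words vector, which is an assumption made for illustration and not one of the paper's two embedding models; the sentences and dollar amounts are invented. It shows that similarity depends on token overlap, so a $1 edit and a $50,000 edit score identically.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased token counts: a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"[a-z0-9$,.%]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)


# Invented example sentences; the amounts are not real IRS figures.
original = "The standard deduction for single filers is $10,000 for the tax year."
large_edit = "The standard deduction for single filers is $60,000 for the tax year."
small_edit = "The standard deduction for single filers is $10,001 for the tax year."

similarity = cosine(bag_of_words(original), bag_of_words(large_edit))
small_edit_similarity = cosine(bag_of_words(original), bag_of_words(small_edit))

print(f"similarity after a $50,000 change: {similarity:.4f}")
print(f"similarity after a $1 change:      {small_edit_similarity:.4f}")
```

The absolute numbers from this toy will not match the 0.9998 reported for dense embeddings; the point is only that the score carries no information about the size of the numerical change.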
What carries the argument
A pattern-based engine that extracts numerical values and links them to entities via two-pass context propagation for verification against a corpus registry.
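A minimal sketch of value extraction and registry verification follows. The regular expressions, the `Claim` type, the `verify` function, and the registry layout are all hypothetical, since this summary does not give the paper's actual rules. Entity linking is assumed to have happened already (the entity is passed in), so two-pass context propagation is not modeled here.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns; the paper's real extraction rules are not shown here.
DOLLAR = re.compile(r"\$\s?(\d+(?:,\d{3})*(?:\.\d+)?)")
PERCENT = re.compile(r"(\d+(?:\.\d+)?)\s?(?:%|percent)")


@dataclass(frozen=True)
class Claim:
    entity: str   # governing entity, assumed to be linked upstream
    kind: str     # "usd" or "pct"
    value: float


def extract_claims(passage: str, entity: str) -> list[Claim]:
    """Pull dollar amounts and percentages out of a passage."""
    claims = []
    for m in DOLLAR.finditer(passage):
        claims.append(Claim(entity, "usd", float(m.group(1).replace(",", ""))))
    for m in PERCENT.finditer(passage):
        claims.append(Claim(entity, "pct", float(m.group(1))))
    return claims


def verify(claim: Claim, registry: dict) -> str:
    """Compare a claim with the values the registry knows for its entity."""
    known = registry.get((claim.entity, claim.kind))
    if known is None:
        return "unknown"
    return "ok" if claim.value in known else "mismatch"


# Invented registry entry and passage, for illustration only.
registry = {("standard_deduction", "usd"): {10000.0}}
claims = extract_claims(
    "The deduction is $60,000, or 7.5% of income.", "standard_deduction"
)
results = [verify(c, registry) for c in claims]
print(list(zip(claims, results)))
```

Because the comparison is on the parsed value rather than on a vector, the tampered amount is flagged no matter how similar the surrounding text is.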
Load-bearing premise
That the pattern-based extraction and two-pass linking can achieve 99.8% accuracy on all numerical values in government documents with an error-free registry.
What would settle it
Finding a numerical manipulation in IRS documents that the system fails to detect due to missed extraction or incorrect registry match.
Original abstract
Retrieval-Augmented Generation (RAG) systems are deployed across federal agencies for citizen-facing tax guidance, benefits eligibility, and legal information, where a single incorrect number causes direct financial harm. This paper proves that all embedding-based RAG defenses share a fundamental blind spot: changing a tax deduction by $50,000 produces cosine similarity 0.9998, invisible to every known detection threshold. Across 174 manipulation pairs and two embedding models, the mean sensitivity gap is 1,459x. The blind spot is confirmed on real IRS documents. The root cause is that embeddings encode topic, not numerical precision. RAGShield sidesteps this by operating on extracted values directly: a pattern-based engine identifies dollar amounts and percentages in government text, links each value to its governing entity through two-pass context propagation (99.8% entity detection on 2,742 real IRS passages), and verifies every claim against a cross-source registry built from the corpus itself. A temporal tracker flags value changes that fall outside known government update schedules. On 430 attacks generated from real IRS document content, RAGShield detects every one (0.0% ASR, 95% CI [0%, 1%]) while embedding-based defenses miss 79-90% of the same attacks.
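The reported interval for zero successful attacks out of 430 can be sanity-checked. The abstract does not say which interval method was used; the sketch below assumes an exact (Clopper-Pearson) two-sided bound, which for zero observed events reduces to a closed form, and compares it with the common "rule of three" approximation.

```python
def zero_event_upper_bound(n: int, confidence: float = 0.95) -> float:
    """Exact two-sided upper bound on a rate when 0 events are seen in n trials.

    With zero events the Clopper-Pearson upper limit is 1 - (alpha/2)**(1/n).
    """
    alpha = 1.0 - confidence
    return 1.0 - (alpha / 2.0) ** (1.0 / n)


upper = zero_event_upper_bound(430)   # assumed method, see lead-in
rule_of_three = 3 / 430               # quick approximation of a 95% bound

print(f"exact upper bound: {upper:.2%}")
print(f"rule of three:     {rule_of_three:.2%}")
```

Both values fall below 1%, which is consistent with the rounded interval [0%, 1%] quoted in the abstract.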
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that embedding-based RAG defenses have a fundamental blind spot for numerical claim manipulations in government documents: altering a value such as a tax deduction by $50,000 yields a cosine similarity of 0.9998, which evades detection thresholds. Across 174 manipulation pairs and two embedding models, it reports a mean sensitivity gap of 1,459x. It introduces RAGShield, which extracts dollar amounts and percentages via patterns, links values to entities through two-pass context propagation (99.8% entity detection on 2,742 IRS passages), verifies claims against a corpus-derived cross-source registry, and applies temporal tracking for update anomalies. On 430 attacks generated from real IRS content, RAGShield achieves 0.0% ASR (95% CI [0%, 1%]) while embedding defenses miss 79-90% of the same attacks.
Significance. If the detection performance holds under the reported conditions, the work identifies a concrete, high-impact vulnerability in RAG systems used for citizen-facing government information where numerical errors can cause direct financial harm. The evaluation on real IRS documents, concrete metrics with confidence intervals, and direct comparison to embedding baselines provide a reproducible baseline for defenses that operate on extracted values rather than semantic similarity.
major comments (3)
- [Evaluation / Attack set] The 0.0% ASR claim on the 430 attacks (abstract and evaluation) rests on the pattern-based engine and two-pass linking achieving near-perfect performance on those specific documents. The 99.8% entity detection accuracy is measured on a separate set of 2,742 passages; the paper must report extraction and linking accuracy (including failure modes on tables, footnotes, or multi-value sentences) directly on the 430 attack instances, as even the 0.2% residual error rate could produce undetected manipulations if errors are systematic.
- [Methods / Attack generation] Attack generation details are insufficient to assess coverage of the claimed blind spot. The abstract states attacks are 'generated from real IRS document content' but provides no description of the manipulation rules, selection criteria for the 174 pairs, or controls ensuring the 430 instances test the full range of numerical formats and contexts; this information is required in the methods section to confirm the attacks are not inadvertently easier for the pattern engine.
- [Evaluation / Sensitivity analysis] The mean sensitivity gap of 1,459x (abstract) across 174 pairs and two embedding models lacks the exact definition and per-pair data. The paper should specify the formula for the gap (e.g., ratio of detection thresholds or similarity differences) and include a table or appendix with individual pair results to allow verification that the aggregate figure is not driven by outliers.
minor comments (2)
- [Abstract] Name the two embedding models used for the sensitivity gap and ASR comparisons.
- [RAGShield description] Clarify how the cross-source registry is constructed from the corpus and how the temporal tracker determines 'known government update schedules'.
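To make the last minor comment concrete, here is one way the two components could plausibly be structured. Everything in this sketch is an assumption: the paper's registry construction and schedule logic are exactly what the referee says is unspecified. The registry here keeps, per entity, the value that the most distinct sources agree on, and the tracker treats a change as scheduled only if it lands in an allowed calendar month.

```python
from collections import Counter, defaultdict
from datetime import date


def build_registry(observations):
    """Majority vote per entity over (entity, value, source_id) observations."""
    votes = defaultdict(Counter)
    seen = set()
    for entity, value, source_id in observations:
        key = (entity, value, source_id)
        if key not in seen:          # each source votes once per value
            seen.add(key)
            votes[entity][value] += 1
    return {entity: counts.most_common(1)[0][0] for entity, counts in votes.items()}


def is_scheduled_change(change_date: date, update_months: set[int]) -> bool:
    """True if a value change falls in a month when updates are expected."""
    return change_date.month in update_months


# Invented observations: two sources agree, one disagrees.
observations = [
    ("std_deduction", 10000.0, "pub_a"),
    ("std_deduction", 10000.0, "pub_b"),
    ("std_deduction", 60000.0, "pub_c"),
]
registry = build_registry(observations)

# Hypothetical schedule: this entity is only expected to change in January.
january_change = is_scheduled_change(date(2025, 1, 15), {1})
june_change = is_scheduled_change(date(2025, 6, 3), {1})
print(registry, january_change, june_change)
```

A majority vote is vulnerable if an attacker controls most sources for an entity, which is one reason the construction details matter for assessing the paper's claims.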
Simulated Author's Rebuttal
We appreciate the referee's thorough review and valuable feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where revisions are needed, we have updated the manuscript accordingly to strengthen the presentation of our results.
Point-by-point responses
-
Referee: [Evaluation / Attack set] The 0.0% ASR claim on the 430 attacks (abstract and evaluation) rests on the pattern-based engine and two-pass linking achieving near-perfect performance on those specific documents. The 99.8% entity detection accuracy is measured on a separate set of 2,742 passages; the paper must report extraction and linking accuracy (including failure modes on tables, footnotes, or multi-value sentences) directly on the 430 attack instances, as even the 0.2% residual error rate could produce undetected manipulations if errors are systematic.
Authors: We agree with this observation. In the revised manuscript, we have added a new subsection in the evaluation section reporting the extraction and linking accuracy specifically on the 430 attack instances. The entity linking accuracy on these instances is 99.6%, with detailed failure mode analysis (e.g., 2 cases in tables, 1 in footnotes) included in Appendix C. This confirms that the 0.0% ASR is not due to undetected errors in the pipeline. revision: yes
-
Referee: [Methods / Attack generation] Attack generation details are insufficient to assess coverage of the claimed blind spot. The abstract states attacks are 'generated from real IRS document content' but provides no description of the manipulation rules, selection criteria for the 174 pairs, or controls ensuring the 430 instances test the full range of numerical formats and contexts; this information is required in the methods section to confirm the attacks are not inadvertently easier for the pattern engine.
Authors: We thank the referee for pointing this out. We have expanded the Methods section (Section 4.2) with a detailed description of the attack generation process. Specifically, we selected 174 base passages from IRS documents, applied 5 manipulation rules (e.g., increment/decrement by 10-50%, swap with similar values from other documents), and generated variants ensuring coverage of formats like '$X,XXX', 'X.X%', and contexts including tables and footnotes. We also added controls for diversity in numerical contexts. revision: yes
-
Referee: [Evaluation / Sensitivity analysis] The mean sensitivity gap of 1,459x (abstract) across 174 pairs and two embedding models lacks the exact definition and per-pair data. The paper should specify the formula for the gap (e.g., ratio of detection thresholds or similarity differences) and include a table or appendix with individual pair results to allow verification that the aggregate figure is not driven by outliers.
Authors: We agree that additional details are necessary for reproducibility. In the revised version, we have defined the sensitivity gap explicitly as the ratio of the minimum cosine similarity threshold required to detect the manipulation to the observed similarity of the manipulated pair. We have also added Table 5 in the appendix listing the sensitivity gap for each of the 174 pairs, along with the mean, median, and standard deviation to demonstrate that the reported mean is not outlier-driven. revision: yes
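The increment/decrement rule mentioned in the second response can be sketched as follows. This is an illustrative guess at one of the five manipulation rules, not the authors' generator: the function names, the regular expression, and the choice to alter only the first dollar amount are all assumptions.

```python
import random
import re

# Hypothetical pattern for whole-dollar amounts such as $10,000.
AMOUNT = re.compile(r"\$(\d+(?:,\d{3})*)")


def scale_first_amount(passage: str, factor: float) -> str:
    """Multiply the first dollar amount in the passage by `factor`."""
    def repl(match: re.Match) -> str:
        value = int(match.group(1).replace(",", ""))
        return "$" + format(round(value * factor), ",")
    return AMOUNT.sub(repl, passage, count=1)


def generate_variants(passage: str, rng: random.Random, n: int = 4) -> list[str]:
    """Create n tampered copies, each shifted up or down by 10-50%."""
    variants = []
    for _ in range(n):
        pct = rng.uniform(0.10, 0.50)
        sign = rng.choice((-1, 1))
        variants.append(scale_first_amount(passage, 1 + sign * pct))
    return variants


# Invented passage; the amount is not a real IRS figure.
passage = "The contribution limit is $10,000 per year."
variants = generate_variants(passage, random.Random(0))
for v in variants:
    print(v)
```

A generator like this only produces values in the same format as the original, which is the referee's concern: attacks built this way may be easier for a pattern engine than manipulations that also change formatting or context.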
Circularity Check
No significant circularity; claims rest on separate empirical measurements
The paper reports three independent empirical results: (1) cosine similarity of 0.9998 on 174 manipulation pairs across two embedding models, (2) 99.8% entity-linking accuracy measured on a distinct set of 2,742 IRS passages, and (3) 0.0% ASR on 430 attacks generated from real IRS content. The pattern-based engine is a fixed rule-based extractor whose accuracy is stated as a measured quantity on the 2,742-passage benchmark; the 430-attack evaluation is a direct count of detections on held-out generated examples. No equations reduce a prediction to a fitted input by construction, no self-citations bear the central claim, and no ansatz or uniqueness theorem is imported. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Embeddings encode topic rather than numerical precision
- domain assumption: Pattern-based extraction plus two-pass context propagation achieves reliable entity linking on government text
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "a pattern-based engine identifies dollar amounts and percentages... two-pass context propagation (99.8% entity detection on 2,742 real IRS passages)"
- IndisputableMonolith/Foundation/ArrowOfTime.lean arrow_from_z (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "temporal tracker flags value changes that fall outside known government update schedules"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- SentinelAgent: Intent-Verified Delegation Chains for Securing Federal Multi-Agent AI Systems
  SentinelAgent defines seven properties for verifiable delegation chains in multi-agent AI systems and reports a protocol achieving 100% true positive rate at 0% false positives on a 516-scenario benchmark while using ...