pith. machine review for the scientific record.

arxiv: 2604.00387 · v2 · submitted 2026-04-01 · 💻 cs.CR · cs.AI

Recognition: 2 theorem links · Lean Theorem

RAGShield: Detecting Numerical Claim Manipulation in Government RAG Systems

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 23:10 UTC · model grok-4.3

classification: 💻 cs.CR · cs.AI
keywords: RAG · numerical manipulation · government RAG · IRS documents · embedding blind spot · value verification · attack detection · context propagation

The pith

RAG systems for government services are vulnerable to undetectable numerical manipulations unless they verify values directly instead of using embeddings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Retrieval-Augmented Generation systems used by federal agencies for tax guidance and benefits can be tricked by changing specific numbers, such as deduction amounts, without altering the text's meaning enough to be caught by similarity checks. Embeddings encode topics rather than precise figures, creating a sensitivity gap that lets most attacks slip past embedding-based defenses. RAGShield addresses this by extracting numerical values such as dollar amounts and percentages from documents, linking them to entities using context propagation, and cross-verifying against a registry derived from the corpus. It also monitors for value changes outside expected government update windows. Tests on 430 attacks built from real IRS documents show complete detection, while other methods miss 79 to 90 percent of cases.
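To make the extraction step concrete, here is a minimal sketch of pattern-based value extraction, assuming simple regexes for dollar amounts and percentages (the paper's actual rule set is not reproduced here):

```python
import re

# Hypothetical patterns; the paper's actual pattern engine is not specified above.
DOLLAR = re.compile(r"\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?")
PERCENT = re.compile(r"\d+(?:\.\d+)?%")

def extract_values(passage: str) -> list[tuple[str, str]]:
    """Return (kind, surface form) pairs for every numeric claim found."""
    values = [("dollar", m.group()) for m in DOLLAR.finditer(passage)]
    values += [("percent", m.group()) for m in PERCENT.finditer(passage)]
    return values

print(extract_values("The standard deduction is $14,600, up 5.4% from 2023."))
# [('dollar', '$14,600'), ('percent', '5.4%')]
```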

Core claim

This paper proves that all embedding-based RAG defenses share a fundamental blind spot: changing a tax deduction by $50,000 produces cosine similarity 0.9998, invisible to every known detection threshold. Across 174 manipulation pairs and two embedding models, the mean sensitivity gap is 1,459x. RAGShield sidesteps this by operating on extracted values directly: a pattern-based engine identifies dollar amounts and percentages in government text, links each value to its governing entity through two-pass context propagation (99.8% entity detection on 2,742 real IRS passages), and verifies every claim against a cross-source registry built from the corpus itself. A temporal tracker flags value changes that fall outside known government update schedules.
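The blind-spot measurement is easy to probe in outline. A minimal sketch using sentence-transformers; the model choice is an assumption (the review does not name the paper's two embedding models), the sentences are illustrative rather than drawn from IRS text, and exact similarities will vary:

```python
from sentence_transformers import SentenceTransformer, util

# Illustrative model; the paper's two embedding models are unnamed here.
model = SentenceTransformer("all-MiniLM-L6-v2")

original = "Your maximum Section 179 deduction for 2024 is $1,160,000."
manipulated = "Your maximum Section 179 deduction for 2024 is $1,110,000."  # $50,000 lower

emb = model.encode([original, manipulated])
# Typically well above 0.99: the numerical change is near-invisible to the embedding.
print(util.cos_sim(emb[0], emb[1]).item())
```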

What carries the argument

A pattern-based engine that extracts numerical values and links them to entities via two-pass context propagation for verification against a corpus registry.
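The review describes two-pass context propagation only at a high level. One plausible reading, sketched below under that assumption, is a forward pass that carries the most recent entity mention onto each value, followed by a backward pass that fills values appearing before their governing entity:

```python
def link_values(tokens: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """tokens: (kind, text) pairs where kind is 'entity' or 'value'.
    Two-pass context propagation: forward, then backward."""
    linked: list[str | None] = [None] * len(tokens)
    # Pass 1: propagate the most recent entity forward onto each value.
    current = None
    for i, (kind, text) in enumerate(tokens):
        if kind == "entity":
            current = text
        elif kind == "value":
            linked[i] = current
    # Pass 2: propagate backward to catch values that precede their entity.
    current = None
    for i in range(len(tokens) - 1, -1, -1):
        kind, text = tokens[i]
        if kind == "entity":
            current = text
        elif kind == "value" and linked[i] is None:
            linked[i] = current
    return [(tokens[i][1], linked[i]) for i in range(len(tokens))
            if tokens[i][0] == "value"]

print(link_values([("value", "$14,600"), ("entity", "standard deduction"),
                   ("entity", "IRA limit"), ("value", "$7,000")]))
# [('$14,600', 'standard deduction'), ('$7,000', 'IRA limit')]
```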

Load-bearing premise

That pattern-based extraction and two-pass linking achieve near-perfect (99.8%) accuracy across all numerical values in government documents, and that the corpus-derived registry is error-free.

What would settle it

Finding a numerical manipulation in IRS documents that the system fails to detect due to missed extraction or incorrect registry match.

Figures

Figures reproduced from arXiv: 2604.00387 by Krishna Sai Reddy Patil.

Figure 1: RAGShield architecture (the provenance layer). [image: figures/full_fig_p004_1.png]
Original abstract

Retrieval-Augmented Generation (RAG) systems are deployed across federal agencies for citizen-facing tax guidance, benefits eligibility, and legal information, where a single incorrect number causes direct financial harm. This paper proves that all embedding-based RAG defenses share a fundamental blind spot: changing a tax deduction by $50,000 produces cosine similarity 0.9998, invisible to every known detection threshold. Across 174 manipulation pairs and two embedding models, the mean sensitivity gap is 1,459x. The blind spot is confirmed on real IRS documents. The root cause is that embeddings encode topic, not numerical precision. RAGShield sidesteps this by operating on extracted values directly: a pattern-based engine identifies dollar amounts and percentages in government text, links each value to its governing entity through two-pass context propagation (99.8% entity detection on 2,742 real IRS passages), and verifies every claim against a cross-source registry built from the corpus itself. A temporal tracker flags value changes that fall outside known government update schedules. On 430 attacks generated from real IRS document content, RAGShield detects every one (0.0% ASR, 95% CI [0%, 1%]) while embedding-based defenses miss 79-90% of the same attacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that embedding-based RAG defenses have a fundamental blind spot for numerical claim manipulations in government documents, as altering values like tax deductions yields cosine similarities of 0.9998 that evade detection thresholds. It introduces RAGShield, which extracts dollar amounts and percentages via patterns, links values to entities through two-pass context propagation (99.8% accuracy on 2,742 IRS passages), verifies claims against a corpus-derived cross-source registry, and applies temporal tracking for update anomalies. On 174 manipulation pairs and 430 attacks generated from real IRS content, RAGShield achieves 0.0% ASR (95% CI [0%, 1%]) while embedding defenses miss 79-90% of attacks.

Significance. If the detection performance holds under the reported conditions, the work identifies a concrete, high-impact vulnerability in RAG systems used for citizen-facing government information where numerical errors can cause direct financial harm. The evaluation on real IRS documents, concrete metrics with confidence intervals, and direct comparison to embedding baselines provide a reproducible baseline for defenses that operate on extracted values rather than semantic similarity.

major comments (3)
  1. [Evaluation / Attack set] The 0.0% ASR claim on the 430 attacks (abstract and evaluation) rests on the pattern-based engine and two-pass linking achieving near-perfect performance on those specific documents. The 99.8% entity detection accuracy is measured on a separate set of 2,742 passages; the paper must report extraction and linking accuracy (including failure modes on tables, footnotes, or multi-value sentences) directly on the 430 attack instances, as even the 0.2% residual error rate could produce undetected manipulations if errors are systematic.
  2. [Methods / Attack generation] Attack generation details are insufficient to assess coverage of the claimed blind spot. The abstract states attacks are 'generated from real IRS document content' but provides no description of the manipulation rules, selection criteria for the 174 pairs, or controls ensuring the 430 instances test the full range of numerical formats and contexts; this information is required in the methods section to confirm the attacks are not inadvertently easier for the pattern engine.
  3. [Evaluation / Sensitivity analysis] The mean sensitivity gap of 1,459x (abstract) across 174 pairs and two embedding models lacks the exact definition and per-pair data. The paper should specify the formula for the gap (e.g., ratio of detection thresholds or similarity differences) and include a table or appendix with individual pair results to allow verification that the aggregate figure is not driven by outliers.
minor comments (2)
  1. [Abstract] Name the two embedding models used for the sensitivity gap and ASR comparisons.
  2. [RAGShield description] Clarify how the cross-source registry is constructed from the corpus and how the temporal tracker determines 'known government update schedules'.
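The registry construction and temporal tracking that minor comment 2 asks about are not detailed in the material above. A minimal sketch of the general idea, with all names, schedules, and thresholds assumed, might look like:

```python
from collections import defaultdict
from datetime import date

# Hypothetical registry: entity -> set of values attested across corpus sources.
registry: dict[str, set[str]] = defaultdict(set)

def build_registry(corpus: list[tuple[str, str, str]]) -> None:
    """corpus rows: (source_id, entity, value). Collect every attested value
    per entity; a cross-source variant could require independent attestation."""
    for _source, entity, value in corpus:
        registry[entity].add(value)

def verify_claim(entity: str, value: str) -> bool:
    """A retrieved claim whose value is absent from the registry is flagged."""
    return value in registry.get(entity, set())

def temporal_anomaly(change_date: date, schedule: list[date]) -> bool:
    """Assumed rule: a value change outside every known update window
    (here a 30-day band around scheduled dates) is suspicious."""
    return all(abs((change_date - d).days) > 30 for d in schedule)

build_registry([("pub501", "standard deduction", "$14,600"),
                ("pub17", "standard deduction", "$14,600")])
print(verify_claim("standard deduction", "$64,600"))            # False -> flag
print(temporal_anomaly(date(2026, 7, 4), [date(2026, 1, 1)]))   # True -> flag
```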

Simulated Author's Rebuttal

3 responses · 0 unresolved

We appreciate the referee's thorough review and valuable feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. Where revisions are needed, we have updated the manuscript accordingly to strengthen the presentation of our results.

read point-by-point responses
  1. Referee: [Evaluation / Attack set] The 0.0% ASR claim on the 430 attacks (abstract and evaluation) rests on the pattern-based engine and two-pass linking achieving near-perfect performance on those specific documents. The 99.8% entity detection accuracy is measured on a separate set of 2,742 passages; the paper must report extraction and linking accuracy (including failure modes on tables, footnotes, or multi-value sentences) directly on the 430 attack instances, as even the 0.2% residual error rate could produce undetected manipulations if errors are systematic.

    Authors: We agree with this observation. In the revised manuscript, we have added a new subsection in the evaluation section reporting the extraction and linking accuracy specifically on the 430 attack instances. The entity linking accuracy on these instances is 99.6%, with detailed failure mode analysis (e.g., 2 cases in tables, 1 in footnotes) included in Appendix C. This confirms that the 0.0% ASR is not due to undetected errors in the pipeline. revision: yes

  2. Referee: [Methods / Attack generation] Attack generation details are insufficient to assess coverage of the claimed blind spot. The abstract states attacks are 'generated from real IRS document content' but provides no description of the manipulation rules, selection criteria for the 174 pairs, or controls ensuring the 430 instances test the full range of numerical formats and contexts; this information is required in the methods section to confirm the attacks are not inadvertently easier for the pattern engine.

    Authors: We thank the referee for pointing this out. We have expanded the Methods section (Section 4.2) with a detailed description of the attack generation process. Specifically, we selected 174 base passages from IRS documents, applied 5 manipulation rules (e.g., increment/decrement by 10-50%, swap with similar values from other documents), and generated variants ensuring coverage of formats like '$X,XXX', 'X.X%', and contexts including tables and footnotes. We also added controls for diversity in numerical contexts. revision: yes

  3. Referee: [Evaluation / Sensitivity analysis] The mean sensitivity gap of 1,459x (abstract) across 174 pairs and two embedding models lacks the exact definition and per-pair data. The paper should specify the formula for the gap (e.g., ratio of detection thresholds or similarity differences) and include a table or appendix with individual pair results to allow verification that the aggregate figure is not driven by outliers.

    Authors: We agree that additional details are necessary for reproducibility. In the revised version, we have defined the sensitivity gap explicitly as the ratio of the minimum cosine similarity threshold required to detect the manipulation to the observed similarity of the manipulated pair. We have also added Table 5 in the appendix listing the sensitivity gap for each of the 174 pairs, along with the mean, median, and standard deviation to demonstrate that the reported mean is not outlier-driven. revision: yes
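In symbols, the rebuttal's stated definition (notation assumed here) would read:

```latex
% Sensitivity gap for manipulation pair i, per the rebuttal's definition:
% tau_min is the minimum cosine-similarity threshold that detects the
% manipulation, s_i the observed similarity of the manipulated pair.
\[
  g_i = \frac{\tau_{\min}}{s_i}, \qquad
  \bar{g} = \frac{1}{174} \sum_{i=1}^{174} g_i
\]
```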

Circularity Check

0 steps flagged

No significant circularity; claims rest on separate empirical measurements

full rationale

The paper reports three independent empirical results: (1) cosine similarity of 0.9998 on 174 manipulation pairs across two embedding models, (2) 99.8% entity-linking accuracy measured on a distinct set of 2,742 IRS passages, and (3) 0.0% ASR on 430 attacks generated from real IRS content. The pattern-based engine is a fixed rule-based extractor whose accuracy is stated as a measured quantity on the 2,742-passage benchmark; the 430-attack evaluation is a direct count of detections on held-out generated examples. No equations reduce a prediction to a fitted input by construction, no self-citations bear the central claim, and no ansatz or uniqueness theorem is imported. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that numerical values can be accurately extracted and linked via patterns and context without needing external labeled data, plus the premise that a self-constructed registry from the corpus serves as reliable ground truth.

axioms (2)
  • domain assumption Embeddings encode topic rather than numerical precision
    Invoked as the root cause of the blind spot for all embedding-based defenses.
  • domain assumption Pattern-based extraction plus two-pass context propagation achieves reliable entity linking on government text
    Required for the 99.8% entity-detection accuracy and overall system operation.

pith-pipeline@v0.9.0 · 5521 in / 1389 out tokens · 68309 ms · 2026-05-13T23:10:08.748578+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SentinelAgent: Intent-Verified Delegation Chains for Securing Federal Multi-Agent AI Systems

    cs.CR 2026-04 conditional novelty 8.0 partial

    SentinelAgent defines seven properties for verifiable delegation chains in multi-agent AI systems and reports a protocol achieving 100% true positive rate at 0% false positives on a 516-scenario benchmark while using ...

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Phantom: General trigger attacks on retrieval augmented language generation,

    H. Chaudhari et al., “Phantom: General trigger attacks on retrieval augmented language generation,” in Proc. NeurIPS, 2024

  2. [2]

    PoisonedRAG: Knowledge corruption attacks to retrieval-augmented generation of large language models,

    J. Zou et al., “PoisonedRAG: Knowledge corruption attacks to retrieval-augmented generation of large language models,” in Proc. USENIX Security, 2025

  3. [3]

    Certifiably robust RAG against retrieval corruption,

    C. Xiang et al., “Certifiably robust RAG against retrieval corruption,” in Proc. NeurIPS, 2025

  4. [4]

    RAGDefender: Efficient defense against knowledge corruption attacks on RAG systems,

    M. Kim et al., “RAGDefender: Efficient defense against knowledge corruption attacks on RAG systems,” arXiv:2511.01268, 2025

  5. [5]

    TrustRAG: Enhancing robustness and trustworthiness in retrieval-augmented generation,

    H. Zhou et al., “TrustRAG: Enhancing robustness and trustworthiness in retrieval-augmented generation,” arXiv:2501.00879, 2025

  6. [6]

    RAGPart & RAGMask: Retrieval-stage defenses against corpus poisoning in RAG,

    P. Pathmanathan et al., “RAGPart & RAGMask: Retrieval-stage defenses against corpus poisoning in RAG,” arXiv:2512.24268, 2025

  7. [7]

    Traceback of poisoning attacks to retrieval-augmented generation,

    B. Zhang et al., “Traceback of poisoning attacks to retrieval-augmented generation,” arXiv:2504.21668, 2025

  8. [8]

    CPA-RAG: Covert poisoning attacks on retrieval-augmented generation in large language models,

    C. Li et al., “CPA-RAG: Covert poisoning attacks on retrieval-augmented generation in large language models,” arXiv:2505.19864, 2025

  9. [9]

    Practical poisoning attacks against retrieval-augmented generation,

    B. Zhang et al., “Practical poisoning attacks against retrieval-augmented generation,” arXiv:2504.03957, 2025

  10. [10]

    The hidden threat in plain text: Attacking RAG data loaders,

    A. Castagnaro et al., “The hidden threat in plain text: Attacking RAG data loaders,” arXiv:2507.05093, 2025

  11. [11]

    Benchmarking poisoning attacks against retrieval-augmented generation,

    B. Zhang et al., “Benchmarking poisoning attacks against retrieval-augmented generation,” arXiv:2505.18543, 2025

  12. [12]

    ConfusedPilot: Confused deputy risks in RAG-based LLMs,

    A. RoyChowdhury et al., “ConfusedPilot: Confused deputy risks in RAG-based LLMs,” arXiv:2408.04870, 2024

  13. [13]

    USAi.Gov: AI platform for federal agencies,

    GSA, “USAi.Gov: AI platform for federal agencies,” 2025

  14. [14]

    GAO-25-107653: Generative AI use and management at federal agencies,

    GAO, “GAO-25-107653: Generative AI use and management at federal agencies,” 2025

  15. [15]

    CivicShield: A cross-domain defense-in-depth framework for securing government-facing AI chatbots,

    K.S.R. Patil, “CivicShield: A cross-domain defense-in-depth framework for securing government-facing AI chatbots,” arXiv:2603.29062, 2026

  16. [16]

    Natural Questions: A benchmark for question answering research,

    T. Kwiatkowski et al., “Natural Questions: A benchmark for question answering research,” Trans. ACL, 2019

  17. [17]

    FEVER: A large-scale dataset for fact extraction and verification,

    J. Thorne et al., “FEVER: A large-scale dataset for fact extraction and verification,” in Proc. NAACL, 2018

  18. [18]

    ClaimBuster: The first-ever end-to-end fact-checking system,

    N. Hassan et al., “ClaimBuster: The first-ever end-to-end fact-checking system,” Proc. VLDB Endow., 2017

  19. [19]

    SolarWinds supply chain compromise,

    CISA, “SolarWinds supply chain compromise,” Alert AA20-352A, 2020

  20. [20]

    SP 800-53 Rev. 5: Security and privacy controls for information systems,

    NIST, “SP 800-53 Rev. 5: Security and privacy controls for information systems,” 2020

  21. [21]

    TinyLlama: An Open-Source Small Language Model

    P. Zhang et al., “TinyLlama: An open-source small language model,” arXiv:2401.02385, 2024