Detecting RAG Extraction Attack via Dual-Path Runtime Integrity Game

Liya Su; Shouyou Song; Tingwen Liu; Xiaokun Chen; Yingjie Zhang; Yuanbo Xie; Yulin Li; Zhihan Liu

arXiv: 2604.10717 · v1 · submitted 2026-04-12 · 💻 cs.CR · cs.AI · cs.CL

Yuanbo Xie, Yingjie Zhang, Yulin Li, Shouyou Song, Xiaokun Chen, Zhihan Liu, Liya Su, Tingwen Liu

Pith reviewed 2026-05-10 15:15 UTC · model grok-4.3

classification 💻 cs.CR cs.AI cs.CL

keywords RAG security, knowledge base leakage, canary tokens, extraction attacks, runtime defense, LLM security, retrieval augmented generation, integrity monitoring

The pith

CanaryRAG detects RAG extraction attacks by embedding canary tokens and monitoring dual-path integrity at runtime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Retrieval-augmented generation systems leak proprietary knowledge when hit with adversarial prompts, and existing defenses struggle against adaptive, iterative extraction strategies. The paper introduces CanaryRAG, which inserts specially designed canary tokens into retrieved chunks and recasts defense as a dual-path runtime integrity game. Leakage triggers real-time detection as soon as either the target path or oracle path breaks its expected canary behavior, even when attackers attempt suppression or obfuscation. The method adds itself to any existing RAG pipeline without retraining or structural changes. Evaluations show substantially lower chunk recovery rates than prior baselines while leaving task accuracy and inference latency nearly unchanged.

Core claim

CanaryRAG embeds carefully designed canary tokens into retrieved chunks and reformulates RAG extraction defense as a dual-path runtime integrity game, detecting leakage in real time whenever the target or oracle path violates its expected canary behavior, even under adaptive suppression and obfuscation.

What carries the argument

Dual-path runtime integrity game in which canary tokens are embedded in chunks and checked across target and oracle paths to flag behavioral violations as evidence of leakage.
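As a rough illustration of the mechanism described above, the following Python sketch embeds one canary per chunk and checks two outputs. The expected behaviors are assumptions made for illustration, not the paper's specification: the target path is assumed never to emit a canary, and the oracle path is assumed to echo every canary it was given. The names `embed_canary` and `check_integrity` and the `[[cnry:...]]` marker format are hypothetical.

```python
import secrets
from dataclasses import dataclass


@dataclass
class MarkedChunk:
    text: str    # chunk text with the canary appended
    canary: str  # the canary string embedded in this chunk


def embed_canary(chunk: str) -> MarkedChunk:
    """Append a random marker to a retrieved chunk (hypothetical format)."""
    canary = f"[[cnry:{secrets.token_hex(4)}]]"
    return MarkedChunk(text=f"{chunk} {canary}", canary=canary)


def check_integrity(chunks, target_output: str, oracle_output: str) -> bool:
    """Return True if leakage is suspected.

    Assumed rules (not taken from the paper):
      - target path: no canary may appear in the user-facing answer;
      - oracle path: every canary must appear in the oracle answer,
        so a missing one suggests the prompt suppressed the markers.
    """
    target_violation = any(c.canary in target_output for c in chunks)
    oracle_violation = any(c.canary not in oracle_output for c in chunks)
    return target_violation or oracle_violation


# Illustrative usage with placeholder chunk text.
chunks = [embed_canary("first proprietary chunk"), embed_canary("second proprietary chunk")]
all_canaries = " ".join(c.canary for c in chunks)

benign = check_integrity(chunks, "a short paraphrased answer", all_canaries)
leaked = check_integrity(chunks, chunks[0].text, all_canaries)       # canary copied verbatim
suppressed = check_integrity(chunks, "first proprietary chunk", "")  # oracle lost its canaries
```

Under these assumed rules, a verbatim copy trips the target check, while a prompt that strips markers trips the oracle check, which is the intuition behind flagging a violation on either path.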

If this is right

RAG pipelines gain real-time leakage detection without retraining or redesign.
Proprietary chunks become harder to recover intact through prompt-based extraction.
The defense works against iterative and adaptive attack variants that suppress markers.
Task performance and inference speed stay close to undefended baselines.
The module integrates as plug-and-play into arbitrary RAG systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Canary designs may require periodic refresh as new suppression techniques emerge.
The dual-path check could apply to leakage monitoring in non-RAG retrieval or agent systems.
High-stakes domains may need additional tuning to keep false positives low.
Layering this runtime check with training-time or filtering defenses could raise the overall security bar.

Load-bearing premise

Specially designed canary tokens can be embedded so they remain effective against adaptive suppression and obfuscation, and their violations reliably indicate leakage without excessive false positives.

What would settle it

An attacker successfully extracts full chunks while preserving expected canary behavior in both target and oracle paths, or normal non-adversarial queries trigger frequent false-positive canary violations.
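The two outcomes named above can be turned into measurable quantities. The sketch below is one hedged way to do so; the matching rule (case-insensitive, whitespace-normalized verbatim containment) and the function names are assumptions for illustration, and the paper's own definition of chunk recovery rate may differ.

```python
def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so trivial reformatting still matches."""
    return " ".join(text.lower().split())


def chunk_recovery_rate(chunks, attacker_outputs) -> float:
    """Fraction of chunks that appear verbatim (after normalization) in any output."""
    if not chunks:
        return 0.0
    outputs = [_norm(o) for o in attacker_outputs]
    recovered = sum(any(_norm(c) in o for o in outputs) for c in chunks)
    return recovered / len(chunks)


def false_positive_rate(flags_on_benign) -> float:
    """Fraction of benign queries that the detector flagged."""
    if not flags_on_benign:
        return 0.0
    return sum(bool(f) for f in flags_on_benign) / len(flags_on_benign)


# Illustrative usage with placeholder data.
crr = chunk_recovery_rate(
    ["alpha beta gamma", "delta epsilon"],
    ["Here:  Alpha Beta   Gamma.", "nothing useful"],
)
fpr = false_positive_rate([False, False, True, False])
```

A high recovery rate on runs where the detector stayed silent, or a high false positive rate on benign traffic, would correspond to the two failure conditions listed above.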

Figures

Figures reproduced from arXiv: 2604.10717 by Liya Su, Shouyou Song, Tingwen Liu, Xiaokun Chen, Yingjie Zhang, Yuanbo Xie, Yulin Li, Zhihan Liu.

**Figure 2.** Latency Distribution Comparison Between No Defense and CanaryRAG.

**Figure 3.** Window-based false blocking probability under agent-specific per-query FPRs. Each curve shows the ...

**Figure 4.** Per-query latency scatter plots across datasets.

**Figure 5.** Effect of base per-query false positive rate on window-based blocking. Sweeping the base rate ...
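The captions for Figures 3 and 5 refer to window-based false blocking driven by a per-query false positive rate, but the blocking rule itself is not given in this page. As a hedged worked example only, assume false positives are independent across queries and that a user is blocked once at least `threshold` of the last `window` benign queries are flagged; both assumptions and all parameter names are illustrative. The false blocking probability is then a binomial tail.

```python
from math import comb


def window_block_probability(p: float, window: int, threshold: int) -> float:
    """P(at least `threshold` false flags among `window` independent benign queries).

    p is the per-query false positive rate. The k-of-n blocking rule is an
    assumption for illustration, not the rule used in the paper.
    """
    below = sum(
        comb(window, i) * p**i * (1 - p) ** (window - i)
        for i in range(threshold)
    )
    return 1.0 - below


# Sweep a few hypothetical per-query rates for a fixed window and threshold.
for rate in (0.01, 0.05, 0.10):
    print(rate, window_block_probability(rate, window=10, threshold=3))
```

Under this model, raising the threshold lowers the chance of wrongly blocking a benign user, at the cost of letting an attacker issue more flagged queries before being stopped.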

Original abstract

Retrieval-Augmented Generation (RAG) systems augment large language models with external knowledge, yet introduce a critical security vulnerability: RAG Knowledge Base Leakage, wherein adversarial prompts can induce the model to divulge retrieved proprietary content. Recent studies reveal that such leakage can be executed through adaptive and iterative attack strategies (named RAG extraction attack), while effective countermeasures remain notably lacking. To bridge this gap, we propose CanaryRAG, a runtime defense mechanism inspired by stack canaries in software security. CanaryRAG embeds carefully designed canary tokens into retrieved chunks and reformulates RAG extraction defense as a dual-path runtime integrity game. Leakage is detected in real time whenever either the target or oracle path violates its expected canary behavior, including under adaptive suppression and obfuscation. Extensive evaluations against existing attacks demonstrate that CanaryRAG provides robust defense, achieving substantially lower chunk recovery rates than state-of-the-art baselines while imposing negligible impact on task performance and inference latency. Moreover, as a plug-and-play solution, CanaryRAG can be seamlessly integrated into arbitrary RAG pipelines without requiring retraining or structural modifications, offering a practical and scalable safeguard for proprietary data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CanaryRAG puts canary tokens into RAG chunks and runs a dual-path check to flag extraction, a practical idea that still needs proof it holds when attackers know the defense.

The paper's main contribution is CanaryRAG, which embeds special canary tokens in retrieved chunks and treats defense as a dual-path runtime integrity game. One path is the normal target output and the other is an oracle check; any deviation from expected canary behavior triggers a leak alert. This runs at inference time, works as a plug-in on existing RAG setups, and claims almost no extra latency or accuracy drop while cutting chunk recovery rates versus prior baselines. That framing is new enough and directly tackles the real problem of proprietary knowledge bases leaking through clever prompts. The stack-canary analogy is applied cleanly here, and the plug-and-play property is a genuine engineering win if the numbers check out. The design avoids any circular math or fitted parameters, which keeps it straightforward.

The evaluations are described as extensive against existing attacks, including adaptive and iterative ones, with claims of substantially lower recovery rates and negligible overhead. If those experiments include attackers who know or can infer the canary tokens and deliberately suppress or alter them, the robustness story holds. The stress-test concern is worth checking: if the reported attacks predate this defense and do not model an adversary aware of the canary strategy, the measured margin does not yet prove resistance to informed suppression or obfuscation. The abstract itself gives no numbers, so the full paper's tables and attack descriptions will decide whether the evidence is solid. False-positive rates and canary token design details also need to be explicit.

This work is aimed at applied security researchers and teams running RAG in production. It is coherent on its own terms and addresses a timely gap, so it deserves a serious referee even if revisions on the adaptive evaluation are likely. I would bring it to a reading group for the mechanism discussion but would not cite it yet without seeing the full attack models and results.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces CanaryRAG, a runtime defense for RAG systems against knowledge-base extraction attacks. It embeds specially designed canary tokens into retrieved chunks and reformulates detection as a dual-path runtime integrity game (target path and oracle path). Leakage is flagged in real time when either path violates its expected canary behavior. The paper claims that this yields substantially lower chunk recovery rates than existing baselines, imposes negligible overhead on task accuracy and latency, and functions as a plug-and-play module that requires no retraining or pipeline changes.

Significance. A practical, low-overhead detection mechanism for RAG leakage would address a timely security gap. The dual-path canary formulation is a clean analogy to stack canaries and, if the empirical margin holds against adaptive adversaries, could be adopted quickly. The work's value hinges on whether the reported evaluations actually test the adaptive-suppression and obfuscation scenarios the abstract highlights; absent that, the claimed robustness remains unestablished.

major comments (1)

[Evaluation section] The central robustness claim (substantially lower chunk recovery rates under adaptive suppression, obfuscation, and iterative strategies) is load-bearing for the paper's contribution. The evaluation section states that experiments were run 'against existing attacks,' but does not indicate whether those attacks were re-implemented or adapted with knowledge of the canary-embedding strategy. If the attackers remain unaware of CanaryRAG, the measured defense margin does not demonstrate resilience to the adaptive case asserted in the abstract and §3.

minor comments (2)

[Abstract] The abstract asserts 'substantially lower chunk recovery rates' and 'negligible impact' without any numerical values, baseline names, or statistical measures; readers must reach the evaluation section to assess these claims.
[§3] The definition and generation procedure for the canary tokens (invented entity in the design) should be stated more explicitly, including any randomness or parameterization, so that reproducibility is immediate.
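To make the reproducibility request concrete, here is a minimal sketch of a canary generator whose randomness and parameters are fully explicit. It is not the paper's procedure: the keyed-HMAC construction, the function name `make_canary`, and the `length` parameter are all assumptions chosen for illustration, using only Python's standard `hmac` and `hashlib` modules.

```python
import hashlib
import hmac


def make_canary(secret_key: bytes, chunk_id: str, length: int = 12) -> str:
    """Derive a per-chunk canary from a secret key and a chunk identifier.

    Hypothetical design: the same (key, chunk_id) pair always yields the same
    canary, so the defender can recompute it at check time, while an attacker
    without the key cannot predict it.
    """
    digest = hmac.new(secret_key, chunk_id.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:length]


# Illustrative usage with a placeholder key and chunk identifiers.
key = b"example-key-not-for-production"
canary_a = make_canary(key, "doc-17/chunk-3")
canary_a_again = make_canary(key, "doc-17/chunk-3")
canary_b = make_canary(key, "doc-17/chunk-4")
```

Stating the key source, the identifier scheme, and the truncation length in this way is the kind of parameterization the comment asks the authors to spell out.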

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for your constructive and detailed feedback on our manuscript. We have reviewed the major comment carefully and provide a point-by-point response below, including planned revisions to improve clarity.

Referee: [Evaluation section] The central robustness claim (substantially lower chunk recovery rates under adaptive suppression, obfuscation, and iterative strategies) is load-bearing for the paper's contribution. The evaluation section states that experiments were run 'against existing attacks,' but does not indicate whether those attacks were re-implemented or adapted with knowledge of the canary-embedding strategy. If the attackers remain unaware of CanaryRAG, the measured defense margin does not demonstrate resilience to the adaptive case asserted in the abstract and §3.

Authors: We agree that distinguishing between attacks adapted with knowledge of CanaryRAG and those drawn directly from prior work is essential for substantiating the robustness claims. The experiments in the Evaluation section were performed using the attack strategies exactly as described in the referenced literature on RAG extraction attacks, without re-implementing or modifying them to incorporate knowledge of the canary-embedding strategy or dual-path detection. Thus, the attackers operated without awareness of CanaryRAG. This evaluates the defense against the adaptive suppression, obfuscation, and iterative behaviors highlighted in those works, but does not constitute a fully white-box adaptive adversary tailored to our mechanism. We will revise the Evaluation section to explicitly state this scope and clarify the distinction. We will also expand the discussion in §3 to explain how the dual-path integrity game (target and oracle paths) is structured to flag violations even under suppression and obfuscation attempts, as the independent oracle path check does not rely on attacker ignorance. These changes will better align the empirical results with the claims in the abstract while acknowledging the current evaluation's limitations regarding fully informed adaptive adversaries.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical system design without derivations or self-referential steps

The paper proposes CanaryRAG as a plug-and-play runtime defense using canary token embedding and a dual-path integrity game to detect RAG extraction attacks. No mathematical derivations, equations, fitted parameters, or prediction steps appear in the provided text. Central claims rest on empirical evaluations against existing attacks rather than any self-citation chain, ansatz, or uniqueness theorem imported from prior author work. The design is self-contained as a system proposal with reported performance metrics; no load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim depends on the unverified effectiveness of canary token placement and dual-path violation detection against adaptive adversaries; these are introduced without independent external benchmarks or formal proofs in the provided abstract.

invented entities (1)

Canary tokens (no independent evidence)
purpose: Embedded markers placed in retrieved chunks to enable detection of leakage via expected behavior violations
These tokens are a core invented component of the defense, with no independent evidence of their undetectability or robustness provided.

