Machine Against the RAG: Jamming Retrieval-Augmented Generation with Blocker Documents
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4
representative citing papers
citing papers explorer
- Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs
  RAG models exhibit a monitoring-control gap: they acknowledge epistemic conflicts in accumulating documents yet fail to constrain unsafe recommendations, with single-turn tests overestimating safety.
- One Shot Dominance: Knowledge Poisoning Attack on Retrieval-Augmented Generation Systems
  AuthChain poisons a single document to achieve high-success attacks on RAG systems for multi-hop queries across six LLMs while evading defenses.
- Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction
  The method prompts LLMs to output both answers and references to the executed instructions, then filters out any answers not linked to the original input instructions, reducing attack success rates to zero in tested scenarios while preserving utility.
- PRA-RAG: Provably Robust Aggregation in Retrieval-Augmented Generation against Retrieval Corruption
  PRA-RAG is a new aggregation algorithm for RAG that claims provable robustness bounds against poisoned retrieved texts and reduces attack success rate to 1% while keeping 71% accuracy.