FATH: Authentication-based Test-time Defense against Indirect Prompt Injection Attacks

Chaowei Xiao; Edward Suh; Fangzhou Wu; Jinsheng Pan; Jiongxiao Wang; Muhao Chen; Wendi Li; Z. Morley Mao

arxiv: 2410.21492 · v2 · pith:U5DIXMVVnew · submitted 2024-10-28 · 💻 cs.CR · cs.CL

FATH: Authentication-based Test-time Defense against Indirect Prompt Injection Attacks

Jiongxiao Wang , Fangzhou Wu , Wendi Li , Jinsheng Pan , Edward Suh , Z. Morley Mao , Muhao Chen , Chaowei Xiao This is my paper

classification 💻 cs.CR cs.CL

keywords attacksinstructionsdefensellmsmethodstest-timeauthenticationexternal

0 comments

read the original abstract

Large language models (LLMs) have been widely deployed as the backbone with additional tools and text information for real-world applications. However, integrating external information into LLM-integrated applications raises significant security concerns. Among these, prompt injection attacks are particularly threatening, where malicious instructions injected in the external text information can exploit LLMs to generate answers as the attackers desire. While both training-time and test-time defense methods have been developed to mitigate such attacks, the unaffordable training costs associated with training-time methods and the limited effectiveness of existing test-time methods make them impractical. This paper introduces a novel test-time defense strategy, named Formatting AuThentication with Hash-based tags (FATH). Unlike existing approaches that prevent LLMs from answering additional instructions in external text, our method implements an authentication system, requiring LLMs to answer all received instructions with a security policy and selectively filter out responses to user instructions as the final output. To achieve this, we utilize hash-based authentication tags to label each response, facilitating accurate identification of responses according to the user's instructions and improving the robustness against adaptive attacks. Comprehensive experiments demonstrate that our defense method can effectively defend against indirect prompt injection attacks, achieving state-of-the-art performance under Llama3 and GPT3.5 models across various attack methods. Our code is released at: https://github.com/Jayfeather1024/FATH

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
cs.CR 2026-05 unverdicted novelty 8.0

ShadowMerge poisons graph-based agent memory via relation-channel conflicts using an AIR pipeline, achieving 93.8% average attack success rate on Mem0 and three real-world datasets while bypassing existing defenses.
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
cs.CR 2026-05 unverdicted novelty 8.0

ShadowMerge poisons graph-based agent memory by creating relation-channel conflicts that get extracted and retrieved, achieving 93.8% attack success rate on Mem0 and datasets like PubMedQA while evading prior defenses.
ShadowMerge: A Novel Poisoning Attack on Graph-Based Agent Memory via Relation-Channel Conflicts
cs.CR 2026-05 unverdicted novelty 8.0

ShadowMerge exploits relation-channel conflicts to poison graph-based agent memory, achieving 93.8% average attack success rate on Mem0 and real-world datasets while bypassing existing defenses.
PsychoPass: Geometric Profiling of Multi-Turn Adversarial LLM Conversations
cs.CR 2026-06 unverdicted novelty 6.0

PsychoPass shows adversarial LLM conversations exhibit an early geometric fingerprint in representation space that persists after removing length confounds and is detectable from short prefixes.
Robustness via Referencing: Defending against Prompt Injection Attacks by Referencing the Executed Instruction
cs.CR 2025-04 unverdicted novelty 6.0

The method prompts LLMs to output both answers and references to the executed instructions, then filters out any answers not linked to the original input instructions, reducing attack success rates to zero in tested s...
Reframing LLM Agent Security as an Agent-Human Interaction Problem
cs.CR 2026-05 unverdicted novelty 5.0

LLM agent security is reframed as an agent-human interaction issue, supported by a survey showing industry preference for human-centric mechanisms over academic favorites and proposing a new research agenda.