NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.
Title resolution pending
17 Pith papers cite this work, alongside 37 external citations. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
baseline 1polarities
baseline 1representative citing papers
ROME generates deceptive safety benchmarks that degrade LLM agent judgment performance, while ARISE uses analogical retrieval to improve safety decisions at inference time without retraining.
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.
Domain-camouflaged injection attacks reduce detection rates from 93.8% to 9.7% on Llama 3.1 8B and 100% to 55.6% on Gemini 2.0 Flash, with the gap persisting in production classifiers and multi-agent debate setups.
Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.
AgentTrap shows that current LLM agents typically complete user tasks while silently accepting unsafe side effects from malicious third-party skills rather than refusing them.
EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.
LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.
Embedding-based defenses fail against attacks that align malicious message embeddings with benign ones in LLM multi-agent systems, but token-level confidence scores improve robustness by enabling better pruning of suspicious messages.
Alignment contracts define scope, allowed effects, budgets and disclosure rules as safety properties over finite effect traces, with decidable admissibility, refinement rules, and Lean-verified soundness under an observability assumption.
Agentic memory is lookup-based retrieval, not weight-based consolidation, creating a generalization ceiling on novel tasks and structural vulnerability to memory poisoning.
RouteGuard uses response-conditioned attention and hidden-state alignment to detect skill poisoning in LLM agents, achieving 0.8834 F1 on Skill-Inject benchmarks and recovering 90.51% of attacks missed by lexical screening.
ChatInject exploits LLM chat template structures to boost indirect prompt injection success rates on agents from ~5-15% to 32-52% across benchmarks, with multi-turn persuasion variants performing best.
The paper measures policy-carriage failures during LLM context assembly and evaluates SafeContext as a partial mitigation on Llama, Qwen, and Mistral models.
STARS fuses static priors and contextual risk scoring for agent skill invocations, achieving modest AUPRC gains on prompt injection attacks in a new SIA-Bench but concluding it supplements rather than replaces static auditing.
citing papers explorer
-
Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents
NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.
-
Enhancing Agent Safety Judgment: Controlled Benchmark Rewriting and Analogical Reasoning for Deceptive Out-of-Distribution Scenarios
ROME generates deceptive safety benchmarks that degrade LLM agent judgment performance, while ARISE uses analogical retrieval to improve safety decisions at inference time without retraining.
-
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
-
Many-Tier Instruction Hierarchy in LLM Agents
ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.
-
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.
-
Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems
Domain-camouflaged injection attacks reduce detection rates from 93.8% to 9.7% on Llama 3.1 8B and 100% to 55.6% on Gemini 2.0 Flash, with the gap persisting in production classifiers and multi-agent debate setups.
-
ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents
Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.
-
AgentTrap: Measuring Runtime Trust Failures in Third-Party Agent Skills
AgentTrap shows that current LLM agents typically complete user tasks while silently accepting unsafe side effects from malicious third-party skills rather than refusing them.
-
When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents
EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.
-
LoopTrap: Termination Poisoning Attacks on LLM Agents
LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.
-
When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems
Embedding-based defenses fail against attacks that align malicious message embeddings with benign ones in LLM multi-agent systems, but token-level confidence scores improve robustness by enabling better pruning of suspicious messages.
-
Alignment Contracts for Agentic Security Systems
Alignment contracts define scope, allowed effects, budgets and disclosure rules as safety properties over finite effect traces, with decidable admissibility, refinement rules, and Lean-verified soundness under an observability assumption.
-
Contextual Agentic Memory is a Memo, Not True Memory
Agentic memory is lookup-based retrieval, not weight-based consolidation, creating a generalization ceiling on novel tasks and structural vulnerability to memory poisoning.
-
RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents
RouteGuard uses response-conditioned attention and hidden-state alignment to detect skill poisoning in LLM agents, achieving 0.8834 F1 on Skill-Inject benchmarks and recovering 90.51% of attacks missed by lexical screening.
-
ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents
ChatInject exploits LLM chat template structures to boost indirect prompt injection success rates on agents from ~5-15% to 32-52% across benchmarks, with multi-turn persuasion variants performing best.
-
Ghost in the Context: Measuring Policy-Carriage Failures in Decision-Time Assembly
The paper measures policy-carriage failures during LLM context assembly and evaluates SafeContext as a partial mitigation on Llama, Qwen, and Mistral models.
-
STARS: Skill-Triggered Audit for Request-Conditioned Invocation Safety in Agent Systems
STARS fuses static priors and contextual risk scoring for agent skill invocations, achieving modest AUPRC gains on prompt injection attacks in a new SIA-Bench but concluding it supplements rather than replaces static auditing.