Title resolution pending

Qiusi Zhan, Zhixiang Liang, Zifan Ying, Daniel Kang · 2024 · Findings of the Association for Computational Linguistics ACL 2024 · DOI 10.18653/v1/2024.findings-acl.624

30 Pith papers cite this work, alongside 37 external citations. Polarity classification is still indexing.

30 Pith papers citing it

37 external citations · Crossref

open at publisher browse 30 citing papers

Title metadata for this work has not finished resolving. The hub is built from the citation graph; the title resolver retries DOI and OpenAlex on its next pass.

citation-role summary

background 1 baseline 1

citation-polarity summary

baseline 1 unclear 1

representative citing papers

Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents

cs.CR · 2026-04-25 · unverdicted · novelty 8.0

NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.

SkillAudit: From Fixed-Suite Benchmarking to Skill-Centered Assessment

cs.AI · 2026-06-21 · unverdicted · novelty 7.0

SkillAudit is an automated framework that generates capability-aligned tasks from skill packages, executes them in sandboxes, and produces reports on utility, cost, and safety via baseline comparisons and two-stage risk detection.

SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents

cs.CR · 2026-06-16 · accept · novelty 7.0

SafeClawBench supplies 600 staged adversarial tasks and three separate endpoints that show semantic acceptance, audit evidence, and sandbox-observed harm are distinct failure modes in tool-using LLM agents.

GitInject: Real-World Prompt Injection Attacks in AI-Powered CI/CD Pipelines

cs.CR · 2026-06-07 · unverdicted · novelty 7.0

GitInject is an open-source framework that runs live GitHub workflows to demonstrate prompt injection attacks on AI agents in CI/CD pipelines, finding all four tested providers vulnerable in default configurations due to structural issues in credential and config handling.

From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents

cs.CR · 2026-06-03 · unverdicted · novelty 7.0

The study identifies four memory write channels and nine structural vulnerabilities in LLM agents, proposes a taxonomy of six attack classes, introduces MPBench, and finds that aggressive memory use increases exploitability while existing defenses fail.

AIRGuard: Guarding Agent Actions with Runtime Authority Control

cs.CR · 2026-05-27 · unverdicted · novelty 7.0

AIRGuard is a runtime authority-control layer for tool-using agents that reduces attack success on AgentTrap from 36.3% to 5.5% while retaining higher benign utility than ARGUS or MELON on DTAP-150.

Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks

cs.CR · 2026-05-14 · unverdicted · novelty 7.0

A new 507-leaf taxonomy and 4x6 Target x Technique matrix audits six LLM attack benchmarks and finds they cover at most 25% of the threat surface with entire STRIDE categories untested.

Enhancing Agent Safety Judgment: Controlled Benchmark Rewriting and Analogical Reasoning for Deceptive Out-of-Distribution Scenarios

cs.AI · 2026-05-05 · unverdicted · novelty 7.0

ROME generates deceptive safety benchmarks that degrade LLM agent judgment performance, while ARISE uses analogical retrieval to improve safety decisions at inference time without retraining.

When Alignment Isn't Enough: Response-Path Attacks on LLM Agents

cs.CR · 2026-05-04 · unverdicted · novelty 7.0

A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.

When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems

cs.CR · 2026-05-01 · unverdicted · novelty 7.0

Embedding-based defenses fail against crafted attacks in LLM MAS; confidence scores from logits improve robustness but decay over communication rounds.

Many-Tier Instruction Hierarchy in LLM Agents

cs.CL · 2026-04-10 · unverdicted · novelty 7.0

ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.

Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents

cs.CR · 2024-10-03 · unverdicted · novelty 7.0

ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.

Security--Fidelity Tradeoffs: The Hidden Cost of Prompt Injection Defense

cs.CR · 2026-06-29 · unverdicted · novelty 6.0

Prompt injection defenses create a security-fidelity tradeoff with no model or defense achieving both high security and high fidelity on the SecFid benchmark across 1,168 examples.

Don't Blindly Trust It: How Unreliable Feedback Breaks Tool-Using LLM Agents

cs.AI · 2026-06-19 · unverdicted · novelty 6.0

Misleading tool feedback produces value inversion in LLM agents, with performance dropping below matched no-feedback baselines on HotpotQA and similar tasks.

MemVenom: Triggered Poisoning of Multimodal Memories in Web Agents

cs.CR · 2026-06-09 · unverdicted · novelty 6.0

MemVenom poisons multimodal memories in web agents via a two-stage trigger-conditioned retrieval and post-retrieval induction attack, achieving up to 99.15% success on GPT-5-family agents while preserving benign performance.

From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors

cs.CR · 2026-05-29 · unverdicted · novelty 6.0

Introduces ClawTrojan benchmark achieving 95.5% ASR for multi-step trojan attacks in agentic harnesses and DASGuard defense that sanitizes control content from untrusted sources.

Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems

cs.CR · 2026-05-21 · conditional · novelty 6.0

Domain-camouflaged injection attacks reduce detection rates from 93.8% to 9.7% on Llama 3.1 8B and 100% to 55.6% on Gemini 2.0 Flash, with the gap persisting in production classifiers and multi-agent debate setups.

ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents

cs.CR · 2026-05-17 · conditional · novelty 6.0

Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.

AgentTrap: Measuring Runtime Trust Failures in Third-Party Agent Skills

cs.CR · 2026-05-13 · conditional · novelty 6.0

AgentTrap shows that current LLM agents typically complete user tasks while silently accepting unsafe side effects from malicious third-party skills rather than refusing them.

When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents

cs.AI · 2026-05-09 · unverdicted · novelty 6.0 · 2 refs

EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.

LoopTrap: Termination Poisoning Attacks on LLM Agents

cs.CR · 2026-05-07 · unverdicted · novelty 6.0

LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.

Alignment Contracts for Agentic Security Systems

cs.CR · 2026-04-30 · conditional · novelty 6.0

Alignment contracts define scope, allowed effects, budgets and disclosure rules as safety properties over finite effect traces, with decidable admissibility, refinement rules, and Lean-verified soundness under an observability assumption.

Contextual Agentic Memory is a Memo, Not True Memory

cs.AI · 2026-04-30 · unverdicted · novelty 6.0

Agentic memory is lookup-based retrieval, not weight-based consolidation, creating a generalization ceiling on novel tasks and structural vulnerability to memory poisoning.

RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents

cs.CR · 2026-04-24 · unverdicted · novelty 6.0

RouteGuard uses response-conditioned attention and hidden-state alignment to detect skill poisoning in LLM agents, achieving 0.8834 F1 on Skill-Inject benchmarks and recovering 90.51% of attacks missed by lexical screening.

citing papers explorer

Showing 30 of 30 citing papers.

Ghost in the Agent: Redefining Information Flow Tracking for LLM Agents cs.CR · 2026-04-25 · unverdicted · none · ref 38
NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.
SkillAudit: From Fixed-Suite Benchmarking to Skill-Centered Assessment cs.AI · 2026-06-21 · unverdicted · none · ref 13
SkillAudit is an automated framework that generates capability-aligned tasks from skill packages, executes them in sandboxes, and produces reports on utility, cost, and safety via baseline comparisons and two-stage risk detection.
SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents cs.CR · 2026-06-16 · accept · none · ref 38
SafeClawBench supplies 600 staged adversarial tasks and three separate endpoints that show semantic acceptance, audit evidence, and sandbox-observed harm are distinct failure modes in tool-using LLM agents.
GitInject: Real-World Prompt Injection Attacks in AI-Powered CI/CD Pipelines cs.CR · 2026-06-07 · unverdicted · none · ref 5
GitInject is an open-source framework that runs live GitHub workflows to demonstrate prompt injection attacks on AI agents in CI/CD pipelines, finding all four tested providers vulnerable in default configurations due to structural issues in credential and config handling.
From Untrusted Input to Trusted Memory: A Systematic Study of Memory Poisoning Attacks in LLM Agents cs.CR · 2026-06-03 · unverdicted · none · ref 24
The study identifies four memory write channels and nine structural vulnerabilities in LLM agents, proposes a taxonomy of six attack classes, introduces MPBench, and finds that aggressive memory use increases exploitability while existing defenses fail.
AIRGuard: Guarding Agent Actions with Runtime Authority Control cs.CR · 2026-05-27 · unverdicted · none · ref 24
AIRGuard is a runtime authority-control layer for tool-using agents that reduces attack success on AgentTrap from 36.3% to 5.5% while retaining higher benign utility than ARGUS or MELON on DTAP-150.
Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks cs.CR · 2026-05-14 · unverdicted · none · ref 24
A new 507-leaf taxonomy and 4x6 Target x Technique matrix audits six LLM attack benchmarks and finds they cover at most 25% of the threat surface with entire STRIDE categories untested.
Enhancing Agent Safety Judgment: Controlled Benchmark Rewriting and Analogical Reasoning for Deceptive Out-of-Distribution Scenarios cs.AI · 2026-05-05 · unverdicted · none · ref 53
ROME generates deceptive safety benchmarks that degrade LLM agent judgment performance, while ARISE uses analogical retrieval to improve safety decisions at inference time without retraining.
When Alignment Isn't Enough: Response-Path Attacks on LLM Agents cs.CR · 2026-05-04 · unverdicted · none · ref 69
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems cs.CR · 2026-05-01 · unverdicted · none · ref 36
Embedding-based defenses fail against crafted attacks in LLM MAS; confidence scores from logits improve robustness but decay over communication rounds.
Many-Tier Instruction Hierarchy in LLM Agents cs.CL · 2026-04-10 · unverdicted · none · ref 32
ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.
Agent Security Bench (ASB): Formalizing and Benchmarking Attacks and Defenses in LLM-based Agents cs.CR · 2024-10-03 · unverdicted · none · ref 160
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.
Security--Fidelity Tradeoffs: The Hidden Cost of Prompt Injection Defense cs.CR · 2026-06-29 · unverdicted · none · ref 79
Prompt injection defenses create a security-fidelity tradeoff with no model or defense achieving both high security and high fidelity on the SecFid benchmark across 1,168 examples.
Don't Blindly Trust It: How Unreliable Feedback Breaks Tool-Using LLM Agents cs.AI · 2026-06-19 · unverdicted · none · ref 61
Misleading tool feedback produces value inversion in LLM agents, with performance dropping below matched no-feedback baselines on HotpotQA and similar tasks.
MemVenom: Triggered Poisoning of Multimodal Memories in Web Agents cs.CR · 2026-06-09 · unverdicted · none · ref 26
MemVenom poisons multimodal memories in web agents via a two-stage trigger-conditioned retrieval and post-retrieval induction attack, achieving up to 99.15% success on GPT-5-family agents while preserving benign performance.
From Prompt Injection to Persistent Control: Defending Agentic Harness Against Trojan Backdoors cs.CR · 2026-05-29 · unverdicted · none · ref 31
Introduces ClawTrojan benchmark achieving 95.5% ASR for multi-step trojan attacks in agentic harnesses and DASGuard defense that sanitizes control content from untrusted sources.
Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems cs.CR · 2026-05-21 · conditional · none · ref 1
Domain-camouflaged injection attacks reduce detection rates from 93.8% to 9.7% on Llama 3.1 8B and 100% to 55.6% on Gemini 2.0 Flash, with the gap persisting in production classifiers and multi-agent debate setups.
ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents cs.CR · 2026-05-17 · conditional · none · ref 13
Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.
AgentTrap: Measuring Runtime Trust Failures in Third-Party Agent Skills cs.CR · 2026-05-13 · conditional · none · ref 11
AgentTrap shows that current LLM agents typically complete user tasks while silently accepting unsafe side effects from malicious third-party skills rather than refusing them.
When Agents Overtrust Environmental Evidence: An Extensible Agentic Framework for Benchmarking Evidence-Grounding Defects in LLM Agents cs.AI · 2026-05-09 · unverdicted · none · ref 27 · 2 links
EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.
LoopTrap: Termination Poisoning Attacks on LLM Agents cs.CR · 2026-05-07 · unverdicted · none · ref 52
LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.
Alignment Contracts for Agentic Security Systems cs.CR · 2026-04-30 · conditional · full · ref 48
Alignment contracts define scope, allowed effects, budgets and disclosure rules as safety properties over finite effect traces, with decidable admissibility, refinement rules, and Lean-verified soundness under an observability assumption.
Contextual Agentic Memory is a Memo, Not True Memory cs.AI · 2026-04-30 · unverdicted · none · ref 4
Agentic memory is lookup-based retrieval, not weight-based consolidation, creating a generalization ceiling on novel tasks and structural vulnerability to memory poisoning.
RouteGuard: Internal-Signal Detection of Skill Poisoning in LLM Agents cs.CR · 2026-04-24 · unverdicted · none · ref 15
RouteGuard uses response-conditioned attention and hidden-state alignment to detect skill poisoning in LLM agents, achieving 0.8834 F1 on Skill-Inject benchmarks and recovering 90.51% of attacks missed by lexical screening.
ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents cs.CL · 2025-09-26 · conditional · none · ref 16
ChatInject exploits LLM chat template structures to boost indirect prompt injection success rates on agents from ~5-15% to 32-52% across benchmarks, with multi-turn persuasion variants performing best.
Agent libOS: A Runtime Substrate for Capability-Controlled Self-Evolving LLM Agents cs.OS · 2026-06-02 · unverdicted · none · ref 35
Agent libOS is a runtime substrate for capability-controlled self-evolving LLM agents that completed 27 deterministic tasks without unauthorized side effects while maintaining a 7% false-denial rate.
ClawHub Security Signals: When VirusTotal, Static Analysis, and SkillSpector Disagree cs.CR · 2026-05-31 · accept · none · ref 47
Analysis of 67,453 OpenClaw skills shows three scanners overlap on at most 10.4% of combined positives, with 81.9% flagged by only one scanner and distinct profiles for malicious versus suspicious skills.
Ghost in the Context: Policy-Carriage Integrity in LLM Agents cs.CR · 2026-05-02 · unverdicted · none · ref 40 · 3 links
Protected policy placements in LLM agents maintain integrity under replay pressure on AutoGen and OpenHands traces, unlike task-local placements which show eviction or weakening.
STARS: Skill-Triggered Audit for Request-Conditioned Invocation Safety in Agent Systems cs.AI · 2026-04-11 · unverdicted · none · ref 4
STARS fuses static priors and contextual risk scoring for agent skill invocations, achieving modest AUPRC gains on prompt injection attacks in a new SIA-Bench but concluding it supplements rather than replaces static auditing.
Toward Secure LLM Agents: Threat Surfaces, Attacks, Defenses, and Evaluation cs.CR · 2026-06-09 · unverdicted · none · ref 235
A synthesis of 247 papers on LLM agent security identifies prompt injection and tool hijacking as dominant threats, notes weakly compositional defenses, and argues for trust boundaries and realistic evaluations.

Title resolution pending

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer