NeuroTaint is the first taint tracking framework for LLM agents that uses offline auditing of semantic, causal, and persistent context to detect flows from untrusted sources to privileged sinks.
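As a rough illustration of what source-to-sink flow detection over a recorded trace can look like, here is a minimal sketch. It is not NeuroTaint's method: the `Event` schema, the `audit` function, and the example trace are all hypothetical, and the sketch follows only explicit causal links, not the semantic or persistent-context flows the summary mentions.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One step in a recorded agent trace (hypothetical schema)."""
    id: str
    kind: str                 # "source", "step", or "sink"
    untrusted: bool = False   # only meaningful for sources
    privileged: bool = False  # only meaningful for sinks
    parents: list[str] = field(default_factory=list)  # ids this event depends on


def audit(trace: list[Event]) -> list[tuple[str, set[str]]]:
    """Return (sink_id, untrusted_source_ids) for each privileged sink reached by taint.

    Assumes the trace is ordered so that parents appear before their children.
    """
    taint: dict[str, set[str]] = {}
    findings = []
    for ev in trace:
        labels: set[str] = set()
        if ev.kind == "source" and ev.untrusted:
            labels.add(ev.id)              # an untrusted source taints itself
        for parent_id in ev.parents:
            labels |= taint.get(parent_id, set())  # inherit taint causally
        taint[ev.id] = labels
        if ev.kind == "sink" and ev.privileged and labels:
            findings.append((ev.id, labels))
    return findings


# Hypothetical trace: a web page flows through a summary and a memory write
# into an email tool call, while a calendar read depends only on the user.
trace = [
    Event("user_prompt", "source"),
    Event("web_page", "source", untrusted=True),
    Event("summary", "step", parents=["user_prompt", "web_page"]),
    Event("memory_write", "step", parents=["summary"]),
    Event("send_email", "sink", privileged=True, parents=["memory_write"]),
    Event("read_calendar", "sink", privileged=True, parents=["user_prompt"]),
]

print(audit(trace))  # [('send_email', {'web_page'})]
```

Because the audit runs offline over the whole trace, the taint survives the intermediate memory write, which is the kind of persistent flow a purely per-call check would miss.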
30 Pith papers cite this work, alongside 37 external citations.
representative citing papers
SkillAudit is an automated framework that generates capability-aligned tasks from skill packages, executes them in sandboxes, and produces reports on utility, cost, and safety via baseline comparisons and two-stage risk detection.
SafeClawBench supplies 600 staged adversarial tasks and three separate endpoints that show semantic acceptance, audit evidence, and sandbox-observed harm are distinct failure modes in tool-using LLM agents.
GitInject is an open-source framework that runs live GitHub workflows to demonstrate prompt injection attacks on AI agents in CI/CD pipelines, finding all four tested providers vulnerable in default configurations due to structural issues in credential and config handling.
The study identifies four memory write channels and nine structural vulnerabilities in LLM agents, proposes a taxonomy of six attack classes, introduces MPBench, and finds that aggressive memory use increases exploitability while existing defenses fail.
AIRGuard is a runtime authority-control layer for tool-using agents that reduces attack success on AgentTrap from 36.3% to 5.5% while retaining higher benign utility than ARGUS or MELON on DTAP-150.
A new 507-leaf taxonomy and a 4×6 Target × Technique matrix are used to audit six LLM attack benchmarks, finding that they cover at most 25% of the threat surface and leave entire STRIDE categories untested.
ROME generates deceptive safety benchmarks that degrade LLM agent judgment performance, while ARISE uses analogical retrieval to improve safety decisions at inference time without retraining.
A malicious relay can strategically rewrite aligned LLM outputs in BYOK agent architectures to achieve up to 99.1% attack success on benchmarks like AgentDojo and ASB.
Embedding-based defenses fail against crafted attacks in LLM multi-agent systems (MAS); confidence scores from logits improve robustness but decay over communication rounds.
ManyIH and ManyIH-Bench address instruction conflicts in LLM agents with up to 12 privilege levels across 853 tasks, revealing frontier models achieve only ~40% accuracy.
ASB is a new benchmark that tests 10 prompt injection attacks, memory poisoning, a novel Plan-of-Thought backdoor attack, and 11 defenses on LLM agents across 13 models, finding attack success rates up to 84.3% and limited defense effectiveness.
Prompt injection defenses create a security-fidelity tradeoff: across the 1,168 examples of the SecFid benchmark, no model or defense achieves both high security and high fidelity.
Misleading tool feedback produces value inversion in LLM agents, with performance dropping below matched no-feedback baselines on HotpotQA and similar tasks.
MemVenom poisons multimodal memories in web agents via a two-stage trigger-conditioned retrieval and post-retrieval induction attack, achieving up to 99.15% success on GPT-5-family agents while preserving benign performance.
Introduces the ClawTrojan benchmark, which achieves a 95.5% attack success rate (ASR) for multi-step trojan attacks in agentic harnesses, and the DASGuard defense, which sanitizes control content from untrusted sources.
Domain-camouflaged injection attacks reduce detection rates from 93.8% to 9.7% on Llama 3.1 8B and from 100% to 55.6% on Gemini 2.0 Flash, with the gap persisting in production classifiers and multi-agent debate setups.
Clarification-seeking in LLM agents amplifies prompt injection attack success from ~2% to over 30% across ten frontier models in a new 728-scenario benchmark.
AgentTrap shows that current LLM agents typically complete user tasks while silently accepting unsafe side effects from malicious third-party skills rather than refusing them.
EnvTrustBench is a new agentic benchmark that measures evidence-grounding defects where LLM agents overtrust faulty environmental observations and take incorrect actions.
LoopTrap is an automated red-teaming framework that crafts termination-poisoning prompts to amplify LLM agent steps by 3.57x on average (up to 25x) across 8 agents.
Alignment contracts define scope, allowed effects, budgets and disclosure rules as safety properties over finite effect traces, with decidable admissibility, refinement rules, and Lean-verified soundness under an observability assumption.
Agentic memory is lookup-based retrieval, not weight-based consolidation, creating a generalization ceiling on novel tasks and structural vulnerability to memory poisoning.
RouteGuard uses response-conditioned attention and hidden-state alignment to detect skill poisoning in LLM agents, achieving 0.8834 F1 on Skill-Inject benchmarks and recovering 90.51% of attacks missed by lexical screening.
citing papers explorer
- SafeClawBench: Separating Semantic, Audit-Evidence, and Sandbox Harm in Tool-Using LLM Agents
SafeClawBench supplies 600 staged adversarial tasks and three separate endpoints that show semantic acceptance, audit evidence, and sandbox-observed harm are distinct failure modes in tool-using LLM agents.
- ClawHub Security Signals: When VirusTotal, Static Analysis, and SkillSpector Disagree
Analysis of 67,453 OpenClaw skills shows three scanners overlap on at most 10.4% of combined positives, with 81.9% flagged by only one scanner and distinct profiles for malicious versus suspicious skills.