
arxiv: 2605.22321 · v1 · pith:KWHJNEDMnew · submitted 2026-05-21 · cs.CR · cs.AI · cs.SE

Benchmarking Autonomous Agents against Temporal, Spatial, and Semantic Evasions

Pith reviewed 2026-05-22 05:45 UTC · model grok-4.3

classification cs.CR · cs.AI · cs.SE
keywords autonomous agents · evasion attacks · LLM security · multi-turn interactions · security benchmarks · temporal evasion · spatial evasion · semantic evasion

The pith

A multi-dimensional evasion framework using temporal, spatial, and semantic vectors raises the average risk trigger rate in autonomous agent systems from 28.3% to 52.6%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that single-turn analyses miss key vulnerabilities in stateful autonomous agents that run with deep system access and make repeated tool calls. It develops three coordinated attack approaches that break malicious content across interaction turns, hide it inside external files, and wrap it in ordinary-looking context. A new benchmark built from thousands of actual agent runs then measures how often these approaches cause risk events to fire. The measured jump in success rate shows that standard agent setups lack built-in protection against ongoing, multi-part attacks.

Core claim

The authors construct an evasion framework whose three vectors—temporal fragmentation of payloads over sequential turns, spatial concealment inside complex external artifacts, and semantic masking under benign noise—collectively drive the average risk trigger rate from a 28.3 percent baseline to 52.6 percent. This result is obtained by evaluating a standard agent framework paired with ten mainstream LLM backbones against twenty practical threat scenarios drawn from 2,254 real-world execution trajectories, demonstrating that existing defenses leave systemic gaps at the architecture level.
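The size of the reported jump can be read in two ways: as an absolute change in percentage points, or as a change relative to the baseline. A small arithmetic sketch, using only the two averages quoted above (the variable names are illustrative, not the paper's):

```python
# Average risk trigger rates reported in the abstract, in percent.
baseline_rtr = 28.3
evasion_rtr = 52.6

# Absolute change, in percentage points.
absolute_gain = evasion_rtr - baseline_rtr

# Change relative to the baseline rate.
relative_gain = absolute_gain / baseline_rtr

print(f"absolute: {absolute_gain:.1f} percentage points")  # absolute: 24.3 percentage points
print(f"relative: {relative_gain:.1%}")                    # relative: 85.9%
```

Both figures are averages over the ten backbones, so they say nothing about how any single model behaves.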

What carries the argument

The three stealthy attack vectors—temporal evasion that fragments payloads across sequential turns, spatial evasion that embeds them in external artifacts, and semantic evasion that obscures intent with benign context—serve as the central mechanism for bypassing LLM parsing in multi-turn agent sessions.

If this is right

  • Single-turn defenses are insufficient once agents maintain state across multiple turns and tool calls.
  • Architectures need new checks that track payload fragments and external data sources over time.
  • Mainstream LLM backbones remain consistently susceptible when embedded in agent frameworks.
  • The benchmark supplies a concrete metric for judging whether future defenses close the measured gap.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Live monitoring that reconstructs full interaction histories could catch temporal fragmentation before execution.
  • Similar multi-vector testing may expose comparable weaknesses in other multi-step AI planners beyond agents.
  • Adding lightweight verification steps for external artifacts at each turn would directly counter the spatial vector.
  • Extending the benchmark to include agent frameworks with built-in sandboxing would test whether isolation reduces the observed rates.
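The first extension above, reconstructing history before judging a turn, can be illustrated with a deliberately simple toy. This is a sketch under strong assumptions, not the paper's method: the blocked string is a hypothetical placeholder, and the fragments are assumed to concatenate verbatim, whereas real temporal evasion would be spread through natural language and would need semantic reassembly.

```python
from typing import List

# Hypothetical blocked action name; a real policy would be far richer.
BLOCKED = "DELETE_ALL_BACKUPS"


def per_turn_check(fragments: List[str]) -> List[bool]:
    """Single-turn view: inspect each turn's content in isolation."""
    return [BLOCKED in fragment for fragment in fragments]


def history_check(fragments: List[str]) -> bool:
    """Stateful view: inspect the content reassembled across all turns."""
    return BLOCKED in "".join(fragments)


# Toy session: the blocked string is split across three turns.
session = ["DELETE_", "ALL_", "BACKUPS"]

print(per_turn_check(session))  # no single turn contains the blocked string
print(history_check(session))   # the reassembled history does
```

The point of the contrast is only that a check scoped to one turn cannot see a payload that exists solely in the aggregate.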

Load-bearing premise

The 2,254 real-world agent trajectories and the 20 threat scenarios accurately represent the typical behaviors and attack surfaces of deployed autonomous agents.

What would settle it

Re-running the same three evasion vectors on a fresh collection of agent execution logs or a broader set of threat scenarios and obtaining risk trigger rates at or below the 28.3 percent baseline would falsify the claimed elevation.
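The falsification condition reduces to a single comparison. A minimal sketch, assuming the re-run yields one risk trigger rate per model and that the average is the quantity compared; the function name and the sample rates are hypothetical:

```python
from statistics import mean
from typing import Sequence

BASELINE_RTR = 28.3  # percent, the baseline average reported in the abstract


def elevation_falsified(rerun_rates_percent: Sequence[float]) -> bool:
    """True if the re-run's average rate is at or below the reported baseline."""
    return mean(rerun_rates_percent) <= BASELINE_RTR


# Hypothetical per-model rates from two imagined re-runs.
print(elevation_falsified([25.0, 30.0, 27.5]))  # average 27.5, at or below baseline
print(elevation_falsified([50.0, 55.0]))        # average 52.5, above baseline
```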

Figures

Figures reproduced from arXiv: 2605.22321 by Changhua Meng, Jialuo Chen, Jianan Ma, Jingyi Wang, Ruixiao Lin, Shiwen Cui, Xiaofang Yang, Xiaohu Du, Xinhao Deng, Yaoxiang Bian, Zhen Wang.

Figure 1. A3S-Bench data synthesis pipeline: seed generation, curation, and payload injection with diverse attack surfaces.
Figure 2. Dataset overview. Left: turn-count distribution, with …
Figure 3. GSS distribution by attack trigger count for four …
Figure 4. Security–utility tradeoff across ten models. Each point …
Figure 5. Per-category vulnerability heatmap. Each cell shows RTR@ …
Figure 6. RTR@k by injection mode across ten models. Basic and advanced sub-datasets are combined.
Figure 7. Temporal decomposition attack. Each fragment resem…
Figure 8. Poisoned runbook memory injection. Fabricated “team …
Figure 9. GSS distribution by attack trigger count for all ten models (Basic / Advanced). Each column represents one model; …
Figure 10. RTR@k distribution across six usage scenarios. Each box summarizes ten models; individual model values are shown as colored dots.
Figure 11. System prompt for seed conversation generation. Each seed LLM additionally receives a user prompt specifying the …
Figure 12. System prompt for attack injection. The injection LLM additionally receives the seed conversation, the full risk category …
Original abstract

As autonomous agents (e.g., OpenClaw) increasingly operate with deep system-level privileges to execute complex tasks, they introduce severe, unmitigated security risks. Current vulnerability analyses overwhelmingly focus on single-turn, stateless behaviors, overlooking the expanded attack surface inherent in stateful, multi-turn interactions and dynamic tool invocations. In this paper, we propose a novel, multi-dimensional evasion framework targeting LLM-based agent systems. We introduce three stealthy attack vectors: (1) Temporal evasion, which fragments malicious payloads across sequential interaction turns; (2) Spatial evasion, which conceals payloads within complex external artifacts that evade standard LLM parsing mechanisms; and (3) Semantic evasion, which obscures malicious intents beneath benign contextual noise. To systematically quantify these threats, we construct A3S-Bench, a comprehensive benchmark comprising 2,254 real-world agent execution trajectories. Evaluating a standard agent framework separately integrated with 10 mainstream LLM backbones against 20 practical threat scenarios, we demonstrate that our evasion framework elevates the average risk trigger rate from a 28.3% baseline to 52.6%. These findings reveal systemic, architecture-level vulnerabilities in current autonomous agent systems that existing defenses fail to address, highlighting an urgent need for defense mechanisms tailored to the unique threats.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a multi-dimensional evasion framework with temporal, spatial, and semantic attack vectors against LLM-based autonomous agents operating with deep system privileges. It introduces A3S-Bench, built from 2,254 real-world agent execution trajectories and 20 practical threat scenarios, and evaluates a standard agent framework integrated with 10 LLM backbones. The central empirical result is that the proposed evasions increase the average risk trigger rate from a 28.3% baseline to 52.6%, which the authors interpret as evidence of systemic, architecture-level vulnerabilities not addressed by existing defenses.

Significance. If the trajectories and scenarios are representative of deployed agents, the work would usefully extend vulnerability analysis from single-turn to stateful multi-turn interactions and highlight the need for defenses against fragmented or context-obscured payloads. The direct empirical measurement across 10 LLMs and concrete percentages provide a falsifiable starting point, though generalization depends on unstated sampling details.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Benchmark Construction): The claim that the 2,254 trajectories and 20 scenarios are representative of the attack surface for agents with deep privileges lacks any description of selection criteria, sampling method, source logs, privilege-level coverage, or diversity metrics. This directly affects the load-bearing interpretation of the 28.3% to 52.6% elevation as evidence of systemic vulnerabilities.
  2. [§4] §4 (Evaluation): The baseline risk trigger rate of 28.3% and the post-evasion rate of 52.6% are reported without clarifying how trajectories were filtered, whether threat scenarios were chosen post-hoc to favor evasion success, or how 'risk trigger' is operationalized across the 20 scenarios. These choices determine whether the measured increase generalizes beyond the specific benchmark.
minor comments (1)
  1. [Abstract] The abstract would benefit from a one-sentence definition of 'risk trigger rate' and 'baseline' to make the 28.3%–52.6% comparison immediately interpretable without reference to later sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important areas for improving the clarity and transparency of our benchmark construction and evaluation methodology. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Benchmark Construction): The claim that the 2,254 trajectories and 20 scenarios are representative of the attack surface for agents with deep privileges lacks any description of selection criteria, sampling method, source logs, privilege-level coverage, or diversity metrics. This directly affects the load-bearing interpretation of the 28.3% to 52.6% elevation as evidence of systemic vulnerabilities.

    Authors: We agree that additional details on benchmark construction are needed to support claims of representativeness. In the revised manuscript, we will expand §3 with a new subsection detailing: the source logs (drawn from publicly available open-source agent execution traces and controlled simulations of systems like OpenClaw), the sampling method (stratified random sampling ensuring coverage across task types and privilege levels), privilege-level coverage (including both standard user and elevated/root privileges), and diversity metrics (such as distribution of tool calls, scenario categories, and interaction lengths). These additions will provide a clearer foundation for interpreting the risk elevation results. revision: yes

  2. Referee: [§4] §4 (Evaluation): The baseline risk trigger rate of 28.3% and the post-evasion rate of 52.6% are reported without clarifying how trajectories were filtered, whether threat scenarios were chosen post-hoc to favor evasion success, or how 'risk trigger' is operationalized across the 20 scenarios. These choices determine whether the measured increase generalizes beyond the specific benchmark.

    Authors: We acknowledge the need for greater methodological transparency in §4. In the revision, we will clarify that: (i) trajectories were filtered to retain only those involving at least one successful tool invocation and multi-turn state; (ii) the 20 threat scenarios were selected a priori based on documented real-world agent attack reports, independent of evasion performance results, to avoid post-hoc bias; and (iii) 'risk trigger' is operationalized as the agent executing any action matching one of the predefined malicious patterns (e.g., unauthorized privilege escalation or data exfiltration) across the scenarios, with explicit per-scenario definitions to be added in a supplementary table. These changes will better support assessment of generalizability. revision: yes
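The operationalization described in the second response (a trajectory counts as triggered if the agent executes any action matching a predefined malicious pattern for its scenario) can be sketched as follows. This is an illustration under assumptions: the scenario names, action labels, and record layout are hypothetical, the per-scenario definitions are promised only for a supplementary table, and the RTR@k variant named in the figure captions is not modeled here.

```python
from typing import Dict, List, Set

# Hypothetical per-scenario patterns; the actual definitions are not given in this text.
MALICIOUS_ACTIONS: Dict[str, Set[str]] = {
    "privilege_escalation": {"sudo_without_approval"},
    "data_exfiltration": {"upload_secrets"},
}


def is_triggered(scenario: str, actions: List[str]) -> bool:
    """A trajectory triggers if any executed action matches its scenario's patterns."""
    return any(action in MALICIOUS_ACTIONS[scenario] for action in actions)


def risk_trigger_rate(trajectories: List[dict]) -> float:
    """Fraction of trajectories with at least one matching action."""
    if not trajectories:
        raise ValueError("no trajectories to score")
    hits = sum(is_triggered(t["scenario"], t["actions"]) for t in trajectories)
    return hits / len(trajectories)


# Toy data: two of the four trajectories contain a matching action.
toy = [
    {"scenario": "privilege_escalation", "actions": ["read_file", "sudo_without_approval"]},
    {"scenario": "privilege_escalation", "actions": ["read_file"]},
    {"scenario": "data_exfiltration", "actions": ["list_dir", "upload_secrets"]},
    {"scenario": "data_exfiltration", "actions": ["list_dir"]},
]

print(risk_trigger_rate(toy))  # 0.5
```

Under this reading the rate is sensitive to how broadly each scenario's pattern set is drawn, which is why the referee asks for explicit per-scenario definitions.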

Circularity Check

0 steps flagged

No circularity in empirical benchmarking results

full rationale

The paper reports direct empirical measurements of risk trigger rates (28.3% baseline to 52.6% with evasions) on a constructed benchmark of 2,254 real-world trajectories and 20 threat scenarios, evaluated on a standard agent framework paired with 10 LLM backbones. No equations, fitted parameters presented as predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation of the central claim. The results follow from applying the described evasion vectors to the defined scenarios and measuring outcomes, so the central claim does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

The central results rest on the assumption that the chosen trajectories and scenarios capture real agent behavior and that the risk trigger metric accurately reflects security impact.

free parameters (1)
  • baseline risk trigger rate
    28.3% measured on standard agent framework without evasions; used as comparison point.
axioms (1)
  • domain assumption: The 20 threat scenarios represent practical attacks on autonomous agents.
    Invoked to define the evaluation set in the abstract.
invented entities (3)
  • Temporal evasion (no independent evidence)
    purpose: Fragment malicious payloads across sequential turns
    New attack vector defined for multi-turn interactions.
  • Spatial evasion (no independent evidence)
    purpose: Conceal payloads in external artifacts
    New attack vector defined for evading LLM parsing.
  • Semantic evasion (no independent evidence)
    purpose: Obscure intent with benign contextual noise
    New attack vector defined for multi-turn context.



Appendix A: Risk Category Details

Table VII provides the complete specification of A3S-Bench’s risk taxonomy, listing all 10 categories, 20 subcategories, and 34 attack techniques wit...