ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis
Pith reviewed 2026-05-14 21:58 UTC · model grok-4.3
The pith
ATBench supplies 1,000 multi-step agent trajectories organized by risk source, failure mode, and real-world harm to evaluate safety in LLM deployments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ATBench is a trajectory-level benchmark that organizes agentic risk along three dimensions—risk source, failure mode, and real-world harm—and constructs 1,000 trajectories (503 safe, 497 unsafe) via heterogeneous tool pools and a long-context delayed-trigger protocol. These trajectories average 9.01 turns and 3.95k tokens while invoking 1,954 tools drawn from pools of 2,084 available tools. Experiments show that frontier LLMs, open-source models, and specialized guard systems remain challenged by the benchmark, which in turn enables taxonomy-stratified analysis and diagnosis of long-horizon failure patterns.
What carries the argument
The three-dimensional taxonomy of risk source, failure mode, and real-world harm together with the long-context delayed-trigger protocol that forces realistic multi-stage risk emergence in constructed trajectories.
If this is right
- Enables stratified diagnosis of which risk sources or failure modes current models handle worst.
- Supports direct comparison of new guard systems against the same long-horizon cases.
- Reveals patterns of delayed-trigger failures that single-turn tests overlook.
- Provides a fixed set of audited trajectories for reproducible safety reporting.
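The stratified diagnosis mentioned above can be sketched in a few lines. This is a minimal illustration, not ATBench's evaluation code: the record fields (`risk_source`, `label`, `pred`) and the category names are hypothetical stand-ins for whatever schema the benchmark actually ships.

```python
from collections import defaultdict

# Hypothetical records: field names and category values are illustrative
# assumptions, not ATBench's actual schema.
records = [
    {"risk_source": "tool_output", "label": "unsafe", "pred": "unsafe"},
    {"risk_source": "tool_output", "label": "unsafe", "pred": "safe"},
    {"risk_source": "user_input", "label": "safe", "pred": "safe"},
    {"risk_source": "user_input", "label": "unsafe", "pred": "unsafe"},
]


def stratified_accuracy(rows, axis):
    """Return {category: accuracy} for one taxonomy axis."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        key = row[axis]
        totals[key] += 1
        hits[key] += int(row["pred"] == row["label"])
    return {key: hits[key] / totals[key] for key in totals}


by_source = stratified_accuracy(records, "risk_source")
print(by_source)
```

Passing a different axis name (for example a hypothetical `failure_mode` field) would give the same breakdown along another dimension of the taxonomy.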
Where Pith is reading between the lines
- Agent developers could prioritize training data that targets the specific delayed-trigger scenarios the benchmark isolates.
- The taxonomy offers a reusable structure that future work could extend to new domains such as embodied or enterprise agents.
- Widespread adoption might shift industry safety testing from isolated prompt checks toward full trajectory auditing.
Load-bearing premise
The trajectories built from the taxonomy and delayed-trigger protocol accurately represent the multi-step safety risks that arise in real agent deployments.
What would settle it
A controlled study showing that safety scores on ATBench do not correlate with observed failure rates when the same models are deployed in live multi-turn agent environments would falsify the benchmark's realism claim.
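One way such a study could quantify the relationship is a rank correlation between per-model benchmark scores and live failure rates. The sketch below computes Spearman's rho in plain Python; the numbers are placeholders invented for illustration, not results from the paper or from any deployment.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Placeholder values for four hypothetical models (NOT real measurements).
bench_scores = [0.62, 0.71, 0.55, 0.80]     # benchmark safety score
live_fail_rates = [0.30, 0.22, 0.35, 0.15]  # observed failure rate in deployment

rho = spearman(bench_scores, live_fail_rates)
print(rho)
```

A rho near -1 would mean higher benchmark scores track lower live failure rates; a value near 0 would be the kind of evidence the falsification condition describes. The function divides by zero if either input is constant, so real use would need a guard.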
Original abstract
Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-level benchmark for structured, diverse, and realistic evaluation of agent safety. ATBench organizes agentic risk along three dimensions: risk source, failure mode, and real-world harm. Based on this taxonomy, we construct trajectories with heterogeneous tool pools and a long-context delayed-trigger protocol that captures realistic risk emergence across multiple stages. The benchmark contains 1,000 trajectories (503 safe and 497 unsafe), averaging 9.01 turns and 3.95k tokens, with 1,954 invoked tools drawn from pools spanning 2,084 available tools. Data quality is supported by rule-based and LLM-based filtering plus full human audit. Experiments on frontier LLMs, open-source models, and specialized guard systems show that ATBench is challenging even for strong evaluators, while enabling taxonomy-stratified analysis, cross-benchmark comparison, and diagnosis of long-horizon failure patterns.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ATBench, a trajectory-level benchmark with 1,000 agent trajectories (503 safe, 497 unsafe) for evaluating LLM-based agent safety. Risks are organized along three dimensions (risk source, failure mode, real-world harm). Trajectories are generated from heterogeneous tool pools (2,084 available tools) using a long-context delayed-trigger protocol, with quality supported by rule-based and LLM-based filtering plus a full human audit. Trajectories average 9.01 turns and 3.95k tokens, with 1,954 invoked tools. Experiments on frontier LLMs, open-source models, and guard systems indicate that ATBench is challenging and supports taxonomy-stratified analysis plus diagnosis of long-horizon failure patterns.
Significance. If the constructed trajectories accurately capture realistic multi-step agent risks, ATBench would fill a clear gap in existing benchmarks by enabling structured, diverse, and long-horizon safety evaluation. The three-axis taxonomy and scale provide a foundation for fine-grained diagnosis that prior work lacks. The inclusion of human audit and heterogeneous tools is a strength for internal consistency.
major comments (2)
- [Construction / Methodology] Construction section (inferred from abstract and methodology description): The central claim that trajectories reflect 'realistic' multi-step safety risks rests on internal generation plus human audit, but no comparison is provided against real deployed agent logs, incident reports, or observed failure distributions. This leaves external fidelity unverified and is load-bearing for the 'realistic evaluation' and 'diagnosis' claims.
- [Experiments] Experiments section: While the abstract states that ATBench is 'challenging even for strong evaluators' and enables 'diagnosis of long-horizon failure patterns,' the reported results lack quantitative metrics, error analysis, or stratified performance breakdowns by taxonomy dimension that would substantiate the diagnostic utility.
minor comments (2)
- [Abstract / Results] Report the standard deviation alongside the mean for turns (9.01) and tokens (3.95k) to better characterize trajectory length distribution.
- [Data Quality] Clarify the exact criteria and inter-annotator agreement for the 'full human audit' to strengthen the quality claim.
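Both minor comments ask for standard statistics. As a rough sketch of what could be reported, the snippet below computes a mean with sample standard deviation and Cohen's kappa for two annotators. The turn counts and annotator labels are placeholders, not ATBench data, and the paper does not say which agreement statistic the authors would use; kappa is assumed here as a common choice.

```python
import statistics

# Placeholder per-trajectory turn counts (NOT ATBench data).
turns = [7, 9, 11, 9, 8, 10]
mean_turns = statistics.mean(turns)
std_turns = statistics.stdev(turns)  # sample standard deviation (n - 1)


def cohen_kappa(a, b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum((a.count(lab) / n) * (b.count(lab) / n) for lab in labels)
    return (observed - expected) / (1 - expected)


# Placeholder audit labels from two hypothetical annotators.
annot_a = ["safe", "unsafe", "unsafe", "safe", "unsafe", "safe"]
annot_b = ["safe", "unsafe", "safe", "safe", "unsafe", "safe"]

kappa = cohen_kappa(annot_a, annot_b)
print(mean_turns, std_turns, kappa)
```

Kappa is undefined when chance agreement equals 1 (both annotators use a single identical label throughout), so a real report would need to handle that case.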
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, with clear indications of planned revisions.
- Referee: [Construction / Methodology] The central claim that trajectories reflect 'realistic' multi-step safety risks rests on internal generation plus human audit, but no comparison is provided against real deployed agent logs, incident reports, or observed failure distributions. This leaves external fidelity unverified and is load-bearing for the 'realistic evaluation' and 'diagnosis' claims.
  Authors: We acknowledge that direct external validation against real deployed agent logs or incident reports would provide stronger evidence of fidelity. Such logs are not publicly available due to privacy, security, and proprietary restrictions in real deployments. Our construction instead emphasizes internal validity through a heterogeneous pool of 2,084 tools, a long-context delayed-trigger protocol explicitly designed to model staged risk emergence, and full human audit of all 1,000 trajectories. In the revised manuscript we will add a dedicated Limitations section that transparently discusses the absence of external benchmarks, explains the rationale for our protocol, and provides qualitative alignment with publicly documented real-world agent incidents where available.
  Revision: partial
- Referee: [Experiments] While the abstract states that ATBench is 'challenging even for strong evaluators' and enables 'diagnosis of long-horizon failure patterns,' the reported results lack quantitative metrics, error analysis, or stratified performance breakdowns by taxonomy dimension that would substantiate the diagnostic utility.
  Authors: We agree that the experimental results would be strengthened by additional quantitative detail. The current version reports aggregate performance but does not include the requested breakdowns. In the revision we will expand the Experiments section with: (i) concrete metrics including safety violation detection rates, false-positive/false-negative rates, and F1 scores across frontier, open-source, and guard models; (ii) qualitative and quantitative error analysis of misclassified trajectories; and (iii) performance tables and figures stratified by each taxonomy axis (risk source, failure mode, real-world harm). These additions will directly demonstrate the benchmark's diagnostic value for long-horizon patterns.
  Revision: yes
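The metrics named in item (i) follow from a binary confusion matrix. The sketch below treats "unsafe" as the positive class, which is an assumption since the rebuttal does not fix a convention, and the labels and predictions are placeholders rather than benchmark results.

```python
def safe_div(num, den):
    """Return num / den, or 0.0 when the denominator is zero."""
    return num / den if den else 0.0


def binary_metrics(labels, preds, positive="unsafe"):
    """Detection rate (recall), FPR, FNR, and F1 for a safety classifier."""
    pairs = list(zip(labels, preds))
    tp = sum(l == positive and p == positive for l, p in pairs)
    fn = sum(l == positive and p != positive for l, p in pairs)
    fp = sum(l != positive and p == positive for l, p in pairs)
    tn = sum(l != positive and p != positive for l, p in pairs)
    precision = safe_div(tp, tp + fp)
    recall = safe_div(tp, tp + fn)
    return {
        "recall": recall,
        "fpr": safe_div(fp, fp + tn),
        "fnr": safe_div(fn, fn + tp),
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Placeholder gold labels and evaluator predictions (NOT benchmark results).
labels = ["unsafe", "unsafe", "unsafe", "safe", "safe", "safe", "safe", "unsafe"]
preds = ["unsafe", "safe", "unsafe", "safe", "unsafe", "safe", "safe", "unsafe"]

metrics = binary_metrics(labels, preds)
print(metrics)
```

Running the same function on the subset of trajectories belonging to one taxonomy category would yield the stratified tables described in item (iii).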
Circularity Check
No circularity: benchmark constructed via explicit external protocols and audit
The paper defines a three-axis taxonomy, heterogeneous tool pools, and delayed-trigger protocol as inputs, then applies them to generate 1,000 trajectories followed by rule/LLM filtering and human audit. No equations, fitted parameters, or predictions are present; the realism claim rests on the stated construction process rather than any reduction where outputs equal inputs by definition. No self-citation chains or uniqueness theorems are invoked to justify the central construction. The work is self-contained as a benchmark artifact with no derivation chain to inspect.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The taxonomy of risk source, failure mode, and real-world harm covers the relevant safety issues for agent trajectories.
Forward citations
Cited by 6 Pith papers
- AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
  AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.
- AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
  AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
- On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
  FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.
- MCPShield: Content-Aware Attack Detection for LLM Agent Tool-Call Traffic
  MCPShield models MCP tool-call sessions as graphs with SBERT embeddings and shows that content features raise AUROC above 0.89 while tree ensembles on pooled embeddings reach 0.975, outperforming GNNs and exposing inf...
- MCPShield: Content-Aware Attack Detection for LLM Agent Tool-Call Traffic
  MCPShield detects attacks on LLM agent tool-call traffic by encoding sessions as graphs enriched with SBERT content embeddings, achieving AUROC above 0.89 with content features versus 0.64 for metadata alone.
- Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-Codex
  ATBench-Claw and ATBench-Codex extend the ATBench framework by customizing a three-dimensional safety taxonomy for trajectory evaluation in OpenClaw and Codex agent settings.
[41]
\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...
-
[42]
\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...
-
[43]
@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.