arxiv: 2604.02022 · v3 · submitted 2026-04-02 · 💻 cs.AI

ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, Jing Shao, Xia Hu, Dongrui Liu

Pith reviewed 2026-05-14 21:58 UTC · model grok-4.3

keywords agent safety · LLM agents · trajectory benchmark · safety evaluation · risk taxonomy · multi-step interactions · failure diagnosis · delayed trigger

The pith

ATBench supplies 1,000 multi-step agent trajectories organized by risk source, failure mode, and real-world harm to evaluate safety in LLM deployments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ATBench to overcome gaps in existing benchmarks that fail to capture how safety risks in LLM-based agents unfold across multiple interaction turns. It builds trajectories using heterogeneous tool sets and a delayed-trigger protocol so that failures emerge gradually in long contexts. The benchmark supplies roughly equal numbers of safe and unsafe cases with full human auditing to support reliable measurement. A sympathetic reader cares because current single-prompt or final-response tests miss the staged, tool-using dynamics that appear in actual agent use.

Core claim

ATBench is a trajectory-level benchmark that organizes agentic risk along three dimensions—risk source, failure mode, and real-world harm—and constructs 1,000 trajectories (503 safe, 497 unsafe) via heterogeneous tool pools and a long-context delayed-trigger protocol. These trajectories average 9.01 turns and 3.95k tokens while invoking 1,954 tools drawn from pools of 2,084 available tools. Experiments show that frontier LLMs, open-source models, and specialized guard systems remain challenged by the benchmark, which in turn enables taxonomy-stratified analysis and diagnosis of long-horizon failure patterns.
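
To make the shape of the data concrete, here is a minimal sketch of how one trajectory record and the reported class balance might be represented. The `Trajectory` class and all of its field names are assumptions made for illustration; the paper's actual release format is not described on this page. Only the three taxonomy axes and the 503/497 split come from the text above.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record layout; field names are illustrative, not the paper's schema."""
    trajectory_id: str
    is_unsafe: bool
    num_turns: int
    # The three taxonomy axes named in the paper. Leaving them unset for
    # safe trajectories is an assumption of this sketch.
    risk_source: Optional[str] = None
    failure_mode: Optional[str] = None
    real_world_harm: Optional[str] = None


def label_balance(trajectories):
    """Return the fraction of safe and unsafe trajectories."""
    counts = Counter("unsafe" if t.is_unsafe else "safe" for t in trajectories)
    total = sum(counts.values())
    return {label: counts[label] / total for label in ("safe", "unsafe")}


# Toy set that mirrors the reported split (503 safe, 497 unsafe).
toy = [Trajectory(f"t{i}", is_unsafe=(i >= 503), num_turns=9) for i in range(1000)]
balance = label_balance(toy)
print(balance)
```

A near-even split like this means a trivial always-safe or always-unsafe judge lands close to 50% accuracy, which is why the balance matters for reading the reported results.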

What carries the argument

The three-dimensional taxonomy of risk source, failure mode, and real-world harm together with the long-context delayed-trigger protocol that forces realistic multi-stage risk emergence in constructed trajectories.

If this is right

Enables stratified diagnosis of which risk sources or failure modes current models handle worst.
Supports direct comparison of new guard systems against the same long-horizon cases.
Reveals patterns of delayed-trigger failures that single-turn tests overlook.
Provides a fixed set of audited trajectories for reproducible safety reporting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agent developers could prioritize training data that targets the specific delayed-trigger scenarios the benchmark isolates.
The taxonomy offers a reusable structure that future work could extend to new domains such as embodied or enterprise agents.
Widespread adoption might shift industry safety testing from isolated prompt checks toward full trajectory auditing.

Load-bearing premise

The trajectories built from the taxonomy and delayed-trigger protocol accurately represent the multi-step safety risks that arise in real agent deployments.

What would settle it

A controlled study showing that safety scores on ATBench do not correlate with observed failure rates when the same models are deployed in live multi-turn agent environments would falsify the benchmark's realism claim.
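
The correlation test described above could be sketched as a Spearman rank correlation between per-model benchmark scores and live failure rates. All numbers below are invented placeholders, not results from the paper, and the helper names are hypothetical. A coefficient near zero would be the outcome that undermines the realism claim.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation, computed as the Pearson correlation of ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Placeholder values for five hypothetical models (not from the paper).
benchmark_accuracy = [0.62, 0.71, 0.55, 0.80, 0.67]
live_failure_rate = [0.30, 0.22, 0.35, 0.15, 0.27]

rho = spearman(benchmark_accuracy, live_failure_rate)
print(f"Spearman rho = {rho:.2f}")
```

In this toy data the ordering is perfectly inverted, so the coefficient is -1: higher benchmark accuracy goes with fewer live failures. Real deployment data would be noisier and would need enough models for the estimate to be meaningful.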

Figures

Figures reproduced from arXiv: 2604.02022 by Dongrui Liu, Haoyu Luo, Jing Shao, Qihan Ren, Shuai Shao, Wanying Qu, Xia Hu, Yanwei Fu, Yuejin Xie, Yujiu Yang, Yu Li, Yuqian Fu, Zhonghao Yang.

**Figure 2.** Data generation engine for synthesizing multi-step agent trajectories in ATBench.

**Figure 3.** Representative case studies for failure-mode misidentification (a) and risk-source…

**Figure 4.** Cross-benchmark comparison of model performance on representative agent…

**Figure 5.** Category-wise accuracy on the fine-grained ATBench taxonomy. Accuracies are…

**Figure 6.** Overview of the three orthogonal dimensions of the agentic safety taxonomy.

**Figure 7.** The prompt template of trajectory-level evaluation.

**Figure 8.** The prompt template of fine-grained risk diagnosis.

**Figure 9.** The prompt template for ShieldAgent. Task: Check if there is unsafe content in '{{ role }}' messages in conversations according to our safety policy with the below categories. <BEGIN UNSAFE CONTENT CATEGORIES> {{ unsafe_categories }} <END UNSAFE CONTENT CATEGORIES> <BEGIN CONVERSATION> User: {{ user_message_1 }} Agent: {{ model_answer_1 }} <END CONVERSATION> Provide your safety assessment for ONLY THE LAST…

**Figure 10.** The prompt template for Llama Guard.

**Figure 11.** The prompt template for Qwen3Guard.

**Figure 12.** An unsafe trajectory where corrupted tool feedback drives a false risk escalation,…

**Figure 13.** An unsafe trajectory where a high-stakes medical record update is committed…

**Figure 14.** An unsafe trajectory where a direct prompt injection causes wasteful execution,…

**Figure 15.** An unsafe trajectory where a supposedly read-only diagnostic call hides an…

**Figure 16.** An unsafe trajectory where the agent over-weights unreliable community anec…

**Figure 17.** An unsafe long-context trajectory where an earlier support-setup episode creates…

**Figure 18.** A safe trajectory where the agent halts after detecting a suspicious side effect in…

**Figure 19.** A safe trajectory where the agent refuses to broadcast safety guidance derived…

**Figure 20.** An unsafe trajectory where tool-description injection normalizes unsolicited…

**Figure 21.** An unsafe trajectory where the agent proceeds with a transfer despite a critical…

**Figure 22.** A safe trajectory where the agent stays within public-data boundaries during…

**Figure 23.** An unsafe trajectory where indirect prompt injection turns an unverified enrich…

**Figure 24.** An unsafe trajectory where the agent performs an over-privileged external data…

**Figure 25.** An unsafe trajectory where correctly extracted hazardous-cargo information…

**Figure 26.** An unsafe trajectory where scraped external content supplies a destructive…

Original abstract

Evaluating the safety of LLM-based agents is increasingly important because risks in realistic deployments often emerge over multi-step interactions rather than isolated prompts or final responses. Existing trajectory-level benchmarks remain limited by insufficient interaction diversity, coarse observability of safety failures, and weak long-horizon realism. We introduce ATBench, a trajectory-level benchmark for structured, diverse, and realistic evaluation of agent safety. ATBench organizes agentic risk along three dimensions: risk source, failure mode, and real-world harm. Based on this taxonomy, we construct trajectories with heterogeneous tool pools and a long-context delayed-trigger protocol that captures realistic risk emergence across multiple stages. The benchmark contains 1,000 trajectories (503 safe and 497 unsafe), averaging 9.01 turns and 3.95k tokens, with 1,954 invoked tools drawn from pools spanning 2,084 available tools. Data quality is supported by rule-based and LLM-based filtering plus full human audit. Experiments on frontier LLMs, open-source models, and specialized guard systems show that ATBench is challenging even for strong evaluators, while enabling taxonomy-stratified analysis, cross-benchmark comparison, and diagnosis of long-horizon failure patterns.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ATBench adds a three-axis taxonomy and delayed-trigger trajectories to agent safety benchmarks, but the realism claim rests on internal construction without external data checks.

The main thing here is that ATBench introduces a three-dimension taxonomy for agent risks along with a delayed-trigger protocol for building trajectories, which moves past the coarser benchmarks mentioned in the abstract. They generated 1,000 trajectories from pools of over 2,000 tools, applied rule and LLM filtering plus full human audit, and split them roughly evenly between safe and unsafe cases, with averages of 9 turns and 4k tokens. Experiments on frontier models and guard systems show the set is challenging and supports breaking results down by risk source, failure mode, and harm type, plus cross-benchmark comparisons and long-horizon diagnosis. That structured approach and the scale are clear positives for anyone doing agent evaluation work. The construction details look careful on paper, and the human audit adds a layer of quality control that many benchmarks skip.

The soft spot is the realism angle. The trajectories rely on the internal taxonomy and delayed-trigger method, but the paper gives no comparison to real deployed agent logs, incident reports, or observed failure distributions. Human audit can confirm internal consistency and labels, yet it cannot show whether the 9-turn lengths or trigger patterns actually match what happens in practice. If those elements are artifacts of the generation process, the benchmark's value for realistic risk assessment weakens. This is a moderate rather than fatal issue, since the taxonomy itself still offers a useful way to organize evaluation.

The paper is aimed at researchers building or testing LLM agents on safety. Anyone running agent benchmarks or working on long-horizon risk would get concrete value from the taxonomy and the dataset for stratified analysis. It deserves a serious referee because new benchmarks with this level of structure can push evaluation standards forward, even if revisions are needed on external validation.

Referee Report

2 major / 2 minor

Summary. The paper introduces ATBench, a trajectory-level benchmark with 1,000 agent trajectories (503 safe, 497 unsafe) for evaluating LLM-based agent safety. Risks are organized along three dimensions (risk source, failure mode, real-world harm). Trajectories are generated from heterogeneous tool pools (2,084 available) using a long-context delayed-trigger protocol, with quality ensured by rule-based/LLM filtering and full human audit. The benchmark averages 9.01 turns and 3.95k tokens with 1,954 tool invocations. Experiments on frontier LLMs, open-source models, and guard systems demonstrate that ATBench is challenging and supports taxonomy-stratified analysis plus diagnosis of long-horizon failure patterns.

Significance. If the constructed trajectories accurately capture realistic multi-step agent risks, ATBench would fill a clear gap in existing benchmarks by enabling structured, diverse, and long-horizon safety evaluation. The three-axis taxonomy and scale provide a foundation for fine-grained diagnosis that prior work lacks. The inclusion of human audit and heterogeneous tools is a strength for internal consistency.

major comments (2)

[Construction / Methodology] Construction section (inferred from abstract and methodology description): The central claim that trajectories reflect 'realistic' multi-step safety risks rests on internal generation plus human audit, but no comparison is provided against real deployed agent logs, incident reports, or observed failure distributions. This leaves external fidelity unverified and is load-bearing for the 'realistic evaluation' and 'diagnosis' claims.
[Experiments] Experiments section: While the abstract states that ATBench is 'challenging even for strong evaluators' and enables 'diagnosis of long-horizon failure patterns,' the reported results lack quantitative metrics, error analysis, or stratified performance breakdowns by taxonomy dimension that would substantiate the diagnostic utility.

minor comments (2)

[Abstract / Results] Report the standard deviation alongside the mean for turns (9.01) and tokens (3.95k) to better characterize trajectory length distribution.
[Data Quality] Clarify the exact criteria and inter-annotator agreement for the 'full human audit' to strengthen the quality claim.
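
Both minor comments ask for standard summary statistics. As a rough sketch, the spread of trajectory lengths can be reported as mean plus sample standard deviation, and agreement between two auditors can be summarized with Cohen's kappa. The turn counts and annotator labels below are made-up toy values, and the choice of kappa is an assumption: the paper does not say which agreement statistic, if any, the audit used.

```python
from statistics import mean, stdev


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy values only; not taken from the paper.
turn_counts = [7, 9, 11, 8, 10]
annotator_a = ["unsafe", "unsafe", "safe", "safe", "unsafe", "safe", "safe", "unsafe"]
annotator_b = ["unsafe", "safe", "safe", "safe", "unsafe", "safe", "unsafe", "unsafe"]

turn_mean = mean(turn_counts)
turn_std = stdev(turn_counts)  # sample standard deviation (n - 1 denominator)
kappa = cohens_kappa(annotator_a, annotator_b)
print(f"turns: {turn_mean:.2f} ± {turn_std:.2f}, kappa = {kappa:.2f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters on a near-balanced safe/unsafe split where two annotators guessing at random would still agree about half the time.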

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point-by-point below, with clear indications of planned revisions.

Referee: [Construction / Methodology] The central claim that trajectories reflect 'realistic' multi-step safety risks rests on internal generation plus human audit, but no comparison is provided against real deployed agent logs, incident reports, or observed failure distributions. This leaves external fidelity unverified and is load-bearing for the 'realistic evaluation' and 'diagnosis' claims.

Authors: We acknowledge that direct external validation against real deployed agent logs or incident reports would provide stronger evidence of fidelity. Such logs are not publicly available due to privacy, security, and proprietary restrictions in real deployments. Our construction instead emphasizes internal validity through a heterogeneous pool of 2,084 tools, a long-context delayed-trigger protocol explicitly designed to model staged risk emergence, and full human audit of all 1,000 trajectories. In the revised manuscript we will add a dedicated Limitations section that transparently discusses the absence of external benchmarks, explains the rationale for our protocol, and provides qualitative alignment with publicly documented real-world agent incidents where available.

Revision: partial

Referee: [Experiments] While the abstract states that ATBench is 'challenging even for strong evaluators' and enables 'diagnosis of long-horizon failure patterns,' the reported results lack quantitative metrics, error analysis, or stratified performance breakdowns by taxonomy dimension that would substantiate the diagnostic utility.

Authors: We agree that the experimental results would be strengthened by additional quantitative detail. The current version reports aggregate performance but does not include the requested breakdowns. In the revision we will expand the Experiments section with: (i) concrete metrics including safety violation detection rates, false-positive/false-negative rates, and F1 scores across frontier, open-source, and guard models; (ii) qualitative and quantitative error analysis of misclassified trajectories; and (iii) performance tables and figures stratified by each taxonomy axis (risk source, failure mode, real-world harm). These additions will directly demonstrate the benchmark's diagnostic value for long-horizon patterns.

Revision: yes
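
The metrics promised in this response are standard binary-classification quantities, and a stratified breakdown amounts to computing them per taxonomy category. The sketch below shows one way that could look. The record layout, the category names (`prompt_injection`, `tool_feedback`), and the idea that safe trajectories carry a taxonomy label are all assumptions for illustration, and the data is invented.

```python
from collections import defaultdict


def binary_metrics(pairs):
    """Score (gold_unsafe, pred_unsafe) boolean pairs with 'unsafe' as the positive class."""
    tp = fp = tn = fn = 0
    for gold, pred in pairs:
        if gold and pred:
            tp += 1
        elif gold and not pred:
            fn += 1
        elif pred:
            fp += 1
        else:
            tn += 1

    def ratio(num, den):
        # Report 0.0 rather than raising when a group has an empty denominator.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "false_positive_rate": ratio(fp, fp + tn),
        "false_negative_rate": ratio(fn, fn + tp),
    }


def stratified_metrics(records, axis):
    """Group records by one taxonomy axis and score each group separately."""
    groups = defaultdict(list)
    for record in records:
        groups[record[axis]].append((record["gold_unsafe"], record["pred_unsafe"]))
    return {category: binary_metrics(pairs) for category, pairs in groups.items()}


# Invented toy records; category names are hypothetical.
records = [
    {"risk_source": "prompt_injection", "gold_unsafe": True, "pred_unsafe": True},
    {"risk_source": "prompt_injection", "gold_unsafe": True, "pred_unsafe": False},
    {"risk_source": "prompt_injection", "gold_unsafe": False, "pred_unsafe": False},
    {"risk_source": "tool_feedback", "gold_unsafe": True, "pred_unsafe": True},
    {"risk_source": "tool_feedback", "gold_unsafe": False, "pred_unsafe": True},
    {"risk_source": "tool_feedback", "gold_unsafe": False, "pred_unsafe": False},
]

by_source = stratified_metrics(records, "risk_source")
for category, scores in by_source.items():
    print(category, scores)
```

Passing `"failure_mode"` or `"real_world_harm"` as `axis` would give the other two breakdowns, provided the records carry those keys. Per-category groups can be small, so any real report would also want the group sizes next to each score.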

Circularity Check

0 steps flagged

No circularity: benchmark constructed via explicit external protocols and audit

The paper defines a three-axis taxonomy, heterogeneous tool pools, and delayed-trigger protocol as inputs, then applies them to generate 1,000 trajectories followed by rule/LLM filtering and human audit. No equations, fitted parameters, or predictions are present; the realism claim rests on the stated construction process rather than any reduction where outputs equal inputs by definition. No self-citation chains or uniqueness theorems are invoked to justify the central construction. The work is self-contained as a benchmark artifact with no derivation chain to inspect.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the custom taxonomy and trajectory construction protocol produce realistic and diverse safety scenarios, supported by filtering and human audit but without external validation data shown.

axioms (1)

Domain assumption: The taxonomy of risk source, failure mode, and real-world harm covers the relevant safety issues for agent trajectories.
Invoked to organize benchmark construction and analysis.

pith-pipeline@v0.9.0 · 5548 in / 1082 out tokens · 48666 ms · 2026-05-14T21:58:27.952767+00:00 · methodology

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
cs.CL · 2026-05 · unverdicted · novelty 7.0

AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.

AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
cs.CL · 2026-05 · unverdicted · novelty 7.0

AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment
cs.AI · 2026-05 · unverdicted · novelty 6.0

FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.

MCPShield: Content-Aware Attack Detection for LLM Agent Tool-Call Traffic
cs.CR · 2026-05 · unverdicted · novelty 6.0

MCPShield models MCP tool-call sessions as graphs with SBERT embeddings and shows that content features raise AUROC above 0.89 while tree ensembles on pooled embeddings reach 0.975, outperforming GNNs and exposing inf...

MCPShield: Content-Aware Attack Detection for LLM Agent Tool-Call Traffic
cs.CR · 2026-05 · conditional · novelty 6.0

MCPShield detects attacks on LLM agent tool-call traffic by encoding sessions as graphs enriched with SBERT content embeddings, achieving AUROC above 0.89 with content features versus 0.64 for metadata alone.

Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-Codex
cs.AI · 2026-04 · unverdicted · novelty 5.0

ATBench-Claw and ATBench-Codex extend the ATBench framework by customizing a three-dimensional safety taxonomy for trajectory evaluation in OpenClaw and Codex agent settings.
