arxiv: 2605.14859 · v2 · pith:4NLJ5O2Snew · submitted 2026-05-14 · 💻 cs.CR · cs.AI

Do Coding Agents Understand Least-Privilege Authorization?

Pith reviewed 2026-05-19 16:30 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords: least-privilege authorization · coding agents · permission inference · AI safety · AuthBench · terminal tasks · authorization attractors · Sufficiency-Tightness Decomposition

The pith

Coding agents often omit necessary permissions or grant sensitive unused ones, and a two-stage decomposition improves the balance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether current coding agents can infer least-privilege file permissions for realistic terminal tasks. It shows that models tend to miss accesses required by the full execution chain while also allowing unnecessary or risky ones, and that simply adding more reasoning pushes each model toward its own consistent failure pattern. To fix the bottleneck of generating a single policy that must do both discovery and rejection at once, the authors introduce Sufficiency-Tightness Decomposition: first forward-simulate the task to list every needed access, then audit each one for grounding and sensitivity. This change raises success on sensitive tasks and lowers attack rates across the models tested. Readers should care because agents with shell and file access cannot be deployed safely until they can draw this boundary correctly.

Core claim

Authorization is not a simple conservative-versus-permissive calibration problem: frontier models often omit permissions required by the execution chain while also granting unused or sensitive accesses. Increasing inference-time reasoning does not resolve this mismatch. Instead, each model moves toward a model-specific authorization attractor: more reasoning makes it more consistent in its own failure mode, whether broad-but-exposed or tight-but-brittle. This suggests that direct policy generation is the bottleneck, because a single generation must both discover all necessary accesses and reject all unnecessary ones. We therefore propose Sufficiency-Tightness Decomposition, which first generates a coverage-oriented policy by forward-simulating the task and then audits each granted entry for grounding and sensitivity.
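
The two-sided failure described above can be made concrete with a small comparison between a generated policy and a reference label. The sketch below is illustrative only: the `(path, mode)` grant representation, the `BoundaryReport` type, the `compare_policy` helper, and the example paths are all assumptions made for this page, not the paper's notation or AuthBench's scoring code.

```python
from dataclasses import dataclass

# Hypothetical representation: one grant is a (path, mode) pair,
# with mode in {"r", "w", "x"}. This is not the paper's notation.
Grant = tuple[str, str]


@dataclass(frozen=True)
class BoundaryReport:
    missing: frozenset  # required by the task but not granted (breaks execution)
    unused: frozenset   # granted but not required (over-authorization)
    exposed: frozenset  # unused grants that touch a sensitive path


def compare_policy(granted, required, sensitive_paths):
    """Split the mismatch between a generated policy and a reference label."""
    granted, required = frozenset(granted), frozenset(required)
    missing = required - granted
    unused = granted - required
    exposed = frozenset(g for g in unused if g[0] in sensitive_paths)
    return BoundaryReport(missing, unused, exposed)


# Toy example: the policy is too tight and too broad at the same time.
required = {("/usr/bin/openssl", "x"), ("/etc/nginx/nginx.conf", "w")}
granted = {("/usr/bin/openssl", "x"), ("/home/user/.ssh/id_rsa", "r")}
report = compare_policy(granted, required, {"/home/user/.ssh/id_rsa"})

print(sorted(report.missing))  # accesses the task needs but the policy omits
print(sorted(report.exposed))  # sensitive accesses the task never needed
```

A single policy can land in both `missing` and `exposed` at once, which is why the claim above treats the problem as more than a conservative-versus-permissive dial.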

What carries the argument

Sufficiency-Tightness Decomposition: first generate a coverage-oriented policy by forward-simulating the task, then audit each granted entry for grounding and sensitivity.
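
A minimal sketch of the two-stage shape, under stated assumptions: `decompose`, `toy_simulate`, the `TRACE` set, and the callbacks are hypothetical stand-ins for model calls, and the rule used here (drop any entry that is ungrounded or sensitive) is a simplification. The paper's audit may treat grounded-but-sensitive entries differently.

```python
from typing import Callable, Iterable

Grant = tuple[str, str]  # (path, mode), mode in {"r", "w", "x"}; hypothetical


def decompose(
    simulate: Callable[[str], Iterable[Grant]],
    is_grounded: Callable[[Grant], bool],
    is_sensitive: Callable[[Grant], bool],
    task: str,
) -> set[Grant]:
    # Stage 1 (sufficiency): collect every access the forward simulation
    # proposes, favouring coverage over tightness.
    candidate = set(simulate(task))
    # Stage 2 (tightness): audit each entry on its own. Assumption for this
    # sketch: keep an entry only if it is grounded and not sensitive.
    return {g for g in candidate if is_grounded(g) and not is_sensitive(g)}


# Toy stand-ins for the two model calls.
def toy_simulate(task: str) -> list[Grant]:
    return [
        ("/usr/bin/openssl", "x"),
        ("/etc/nginx/certs/server.key", "w"),
        ("/home/user/.ssh/id_rsa", "r"),  # speculative sensitive access
        ("/tmp/scratch.txt", "w"),        # not tied to any simulated step
    ]


# Paths that some simulated step actually refers to (hypothetical).
TRACE = {
    "/usr/bin/openssl",
    "/etc/nginx/certs/server.key",
    "/home/user/.ssh/id_rsa",
}

policy = decompose(
    toy_simulate,
    is_grounded=lambda g: g[0] in TRACE,
    is_sensitive=lambda g: g[0].startswith("/home/user/.ssh/"),
    task="enable HTTPS on the home server",
)
print(sorted(policy))
```

Separating the stages means the first call never has to reject anything and the second never has to discover anything, which is the division of labour the decomposition relies on.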

If this is right

  • Sensitive-task success rises by up to 15.8 percent on models that previously leaned too tight.
  • Attack success drops across all tested models because unnecessary sensitive accesses are removed.
  • More reasoning steps make each model more consistent in its own authorization failure mode rather than correcting it.
  • Direct single-pass policy generation cannot reliably handle both discovering required accesses and rejecting unused ones.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same separation of coverage discovery from sensitivity audit could be applied to other agent permissions such as API scopes or network access.
  • Training objectives that explicitly reward balanced authorization rather than task completion alone might reduce reliance on inference-time decomposition.
  • Independent verification tools could be inserted after the audit stage to further harden the resulting policies without changing the agent itself.

Load-bearing premise

The human-reviewed permission labels in AuthBench correctly identify the minimal set of file accesses required by each task's full execution chain without systematic bias or omission of edge cases.

What would settle it

Re-label the minimal permissions for the AuthBench tasks with a fresh set of independent reviewers and re-run the decomposition experiments; if the reported gains in task success and attack reduction disappear, the central claim is falsified.
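
One hypothetical way to quantify how far a fresh labeling diverges from the original is a per-task Jaccard overlap between the two permission sets. The paper does not report such a statistic; the `jaccard` and `label_agreement` helpers and the task names below are assumptions made for illustration.

```python
def jaccard(a, b):
    """Overlap between two permission sets; 1.0 means identical labels."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty labels agree trivially
    return len(a & b) / len(a | b)


def label_agreement(original, relabeled):
    """Per-task agreement for tasks present in both labelings."""
    shared = original.keys() & relabeled.keys()
    return {task: jaccard(original[task], relabeled[task]) for task in shared}


# Toy labels: each task maps to a set of (path, mode) grants.
original = {
    "t1": {("/app/a", "r"), ("/app/b", "w")},
    "t2": {("/usr/bin/c", "x")},
}
relabeled = {
    "t1": {("/app/a", "r")},
    "t2": {("/usr/bin/c", "x")},
}

scores = label_agreement(original, relabeled)
print(scores)
```

Tasks with low overlap are the ones where a re-run of the decomposition experiments would be most informative.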

Figures

Figures reproduced from arXiv: 2605.14859 by Carl Che, Charles Chen, Dengyun Peng, Ethan Qin, Fanqing Meng, Jiannan Guan, Jingxiang Weng, Jinhao Liu, Mengkang Hu, Qiming Yu, Yixin Yuan, Zheng Yan.

Figure 1: Motivation and overview of our work. AuthBench comprises 80 standard and 40 sensitive tasks, with human-reviewed permission specifications and executable validators. Given a task instruction and environment, an authorization model must produce a file-level read/write/execute policy before task execution begins …
Figure 2: Permission-boundary inference and dual-axis evaluation in AuthBench. Given a task instruction 𝐼 and terminal environment 𝐸, the authorization model f_θ generates a file-level permission policy 𝜋. AuthBench evaluates whether 𝜋 is tight enough to exclude non-essential files while preserving task execution, and whether it avoids sensitive surfaces that can enable attacks.
Figure 3: Task success under full access versus generated-policy execution. Solid bars show each model executing with full access; dotted bars show a fixed GPT-5 agent executing under that model’s generated policy. The gap measures the difficulty of permission-boundary inference beyond ordinary task execution.
Figure 4: Permission generation fails on both sides. Panels (a) and (b) compare executability with safety. Panel (c) shows low permission utilization. Panel (d) separates missing permissions from sensitive exposure.
Figure 5: Reasoning-effort vector field in the sufficiency–tightness space for three models. Arrows show the average displacement from low to high reasoning effort. Instead of moving toward the ideal point (0, 0), the fields converge to model-specific attractors. The over-authorization axis is log-compressed for display.
Figure 6: Authorization safety and execution safety are decoupled. Panel (a) shows the distribution of sensitive-task instances across four outcomes: whether the generated authorization policy is safe or unsafe, and whether the model’s full-access execution behavior is safe or unsafe. The off-diagonal cells are mismatch cases. Panel (b) summarizes mismatch bias: positive values indicate execution-riskier models, whe…
Figure 7: AuthBench task distribution across 10 professional domains.
Figure 8: Aggregated distribution of earliest execution-attributed problem events across permission-generation reasoning progress. Reasoning steps 1–5 correspond to five equal progress bins (0–20%, 20–40%, 40–60%, 60–80%, and 80–100%) rather than interaction-turn IDs. The dominant seeded problems appear in the first two reasoning steps and are concentrated in Tool and Path Resolution, Execution and Output Planning, …
Figure 9: Illustrative case study for home-server-https. GPT-5 identifies the high-level HTTPS setup workflow, but compresses the execution closure to three core executables (openssl, nginx, and mkdir). Execution then fails before certificate generation and server reload because the broader shell- and verification-facing execution chain is missing.
Figure 10: Illustrative case study for ar-static-library-creation. GPT-5 identifies the visible build workflow and grants permissions for the top-level shell script and driver binaries, but it omits the compiler backend closure that gcc actually spawns during compilation. Execution therefore fails before producing /workspace/solution.txt.
Figure 11: Illustrative case study for blind-maze-explorer-5x5. GPT-5 progressively enriches the boundary around the maze launcher and server, but then trims away the PTY helper path and the concrete interpreter path that the execution agent actually uses. Execution fails before it can create /app/maze_map.txt.
Figure 12: Illustrative case study for sqlite-with-gcov. GPT-5 rewrites the task around a direct-gcc compilation path instead of the provided build route, and then still fails to close over the compiler backend and final executable exposure required by that alternative plan. Execution never produces a working sqlite3 binary with coverage artifacts.
Figure 13: Illustrative case study for apache-access-log-forensics. GPT-5 emits a tight, minimal policy for a one-shot Python transformation, but the execution agent externalizes its reasoning into a scratch analyzer script. Because that scratch write is outside the generated boundary, execution fails before any log analysis is completed.
Original abstract

As coding agents gain access to shells, repositories, and user files, least-privilege authorization becomes a prerequisite for safe deployment: an agent should receive enough authority to complete the task, without unnecessary authority that exposes sensitive surfaces. To study whether current models can infer this boundary themselves, we first introduce permission-boundary inference, where a model maps a task instruction and terminal environment to a file-level read/write/execute policy, and AuthBench, a benchmark of 120 realistic terminal tasks with human-reviewed permission labels and executable validators for utility and attack outcomes. AuthBench shows that authorization is not a simple conservative-versus-permissive calibration problem: frontier models often omit permissions required by the execution chain while also granting unused or sensitive accesses. Increasing inference-time reasoning does not resolve this mismatch. Instead, each model moves toward a model-specific authorization attractor: more reasoning makes it more consistent in its own failure mode, whether broad-but-exposed or tight-but-brittle. This suggests that direct policy generation is the bottleneck, because a single generation must both discover all necessary accesses and reject all unnecessary ones. We therefore propose Sufficiency-Tightness Decomposition, which first generates a coverage-oriented policy by forward-simulating the task and then audits each granted entry for grounding and sensitivity. Across tested models, this decomposition improves sensitive-task success by up to 15.8% on tightness-biased models while reducing attack success across all evaluated models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces permission-boundary inference as the task of mapping a terminal task instruction and environment to a minimal file-level read/write/execute policy. It presents AuthBench, a benchmark of 120 realistic tasks equipped with human-reviewed permission labels and executable validators for both utility and attack outcomes. The central empirical findings are that frontier models exhibit model-specific authorization attractors (broad-but-exposed or tight-but-brittle) that are not resolved by additional inference-time reasoning, and that the proposed Sufficiency-Tightness Decomposition—forward simulation for coverage followed by grounding/sensitivity auditing—improves sensitive-task success by up to 15.8% on tightness-biased models while lowering attack success across all tested models.

Significance. If the results hold, the work is significant for the security of AI coding agents that receive shell and filesystem access. It supplies a reproducible benchmark grounded in executable validators rather than purely subjective judgment and demonstrates a practical decomposition that separates the discovery of necessary accesses from the rejection of unnecessary ones. The identification of authorization attractors provides a falsifiable characterization of current model behavior that can guide future prompt or training interventions.

major comments (2)
  1. [§3.2, §4.1] The human-reviewed minimal permission labels are load-bearing for every reported success and attack rate, yet the manuscript supplies no inter-annotator agreement statistics, no description of the review protocol for conditional execution paths, and no audit of omitted edge-case accesses. If reviewers systematically under-labeled files required only on rare branches, both the baseline failure rates and the measured 15.8% gain from Sufficiency-Tightness Decomposition could be artifacts of label error rather than evidence of improved least-privilege inference.
  2. [§5.3, Table 4] The attack-success reduction is reported as consistent across models, but the paper does not show whether the executable attack validators cover the full space of privilege-escalation paths that would be enabled by the over-granted permissions; incomplete validator coverage would weaken the claim that the decomposition reliably tightens the policy without sacrificing utility.
minor comments (2)
  1. [§2.1] The notation for read/write/execute triples is introduced without an explicit grammar or example in the main text; a small table would improve readability.
  2. [Figure 3] The caption should state the exact number of tasks per model and whether error bars represent standard deviation across runs or across tasks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below, explaining our position and the revisions we will make to improve clarity and transparency.

  1. Referee: [§3.2, §4.1] The human-reviewed minimal permission labels are load-bearing for every reported success and attack rate, yet the manuscript supplies no inter-annotator agreement statistics, no description of the review protocol for conditional execution paths, and no audit of omitted edge-case accesses. If reviewers systematically under-labeled files required only on rare branches, both the baseline failure rates and the measured 15.8% gain from Sufficiency-Tightness Decomposition could be artifacts of label error rather than evidence of improved least-privilege inference.

    Authors: We agree that transparency in the human review process is essential given the central role of the permission labels. The labels were produced via an iterative, collaborative review by the author team (with systems security expertise), in which each task was executed in the benchmark environment to enumerate all file accesses, including conditional branches addressed through input variation and path simulation. Inter-annotator agreement was not computed because the process was not designed as independent parallel annotations. In the revised manuscript we will expand §3.2 with a complete description of the review protocol, concrete examples of conditional-path handling, and the results of a post-hoc audit for omitted edge cases. These additions should directly address the possibility of systematic under-labeling.

    Revision: yes

  2. Referee: [§5.3, Table 4] The attack-success reduction is reported as consistent across models, but the paper does not show whether the executable attack validators cover the full space of privilege-escalation paths that would be enabled by the over-granted permissions; incomplete validator coverage would weaken the claim that the decomposition reliably tightens the policy without sacrificing utility.

    Authors: The attack validators target the concrete privilege-escalation surfaces that become reachable precisely when the over-granted permissions identified in each task are present (e.g., reading protected configuration files or executing scripts from unauthorized locations). While we cannot claim exhaustive coverage of every theoretical escalation path in an arbitrary filesystem, the validators are scoped to the over-grants actually observed in AuthBench. We will revise §5.3 to add an explicit discussion of validator scope, the attack classes they cover, and acknowledged limitations, thereby clarifying the evidential basis for the reported reductions in attack success.

    Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical measurements on external benchmark and models

The paper constructs AuthBench as a new benchmark with human-reviewed file-level permission labels and executable validators, then reports empirical improvements from Sufficiency-Tightness Decomposition on model task success and attack rates. No equations, fitted parameters, or self-citations are invoked to derive the 15.8% gain; the result is a direct measurement against the benchmark and tested models. The derivation chain is self-contained and does not reduce any claimed outcome to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The evaluation depends on the assumption that human reviewers can reliably determine minimal necessary permissions and that the 120 tasks plus validators faithfully represent real terminal usage and attack surfaces.

axioms (1)
  • Domain assumption: Human reviewers can accurately determine the minimal necessary permissions for each task without bias or omission of execution-chain dependencies.
    The benchmark construction and all reported improvements rest on these human-generated labels.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · 20 internal anchors

  1. [1]

    From code foundation models to agents and applications: A comprehensive survey and practical guide to code intelligence.arXiv preprint arXiv:2511.18538, 2025

    Jian Yang, Xianglong Liu, Weifeng Lv, Ken Deng, Shawn Guo, Lin Jing, Yizhi Li, Shark Liu, Xianzhen Luo, Yuyu Luo, et al. From code foundation models to agents and applications: A comprehensive survey and practical guide to code intelligence.arXiv preprint arXiv:2511.18538, 2025

  2. [2]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629, 2022

  3. [3]

    Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

    Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868, 2026

  4. [4]

    A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, 2024

    Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6):186345, 2024

  5. [5]

    Code agent can be an end-to-end system hacker: Benchmarking real-world threats of computer-use agent.arXiv preprint arXiv:2510.06607, 2025

    Weidi Luo, Qiming Zhang, Tianyu Lu, Xiaogeng Liu, Bin Hu, Hung-Chun Chiu, Siyuan Ma, Yizhe Zhang, Xusheng Xiao, Yinzhi Cao, et al. Code agent can be an end-to-end system hacker: Benchmarking real-world threats of computer-use agent.arXiv preprint arXiv:2510.06607, 2025

  6. [6]

    Breaking the code: Security assessment of AI code agents through systematic jailbreaking attacks, 2025

    Shoumik Saha, Jifan Chen, Sam Mayers, Sanjay Krishna Gouda, Zijian Wang, and Varun Kumar. Breaking the code: Security assessment of ai code agents through systematic jailbreaking attacks.arXiv preprint arXiv:2510.01359, 2025

  7. [7]

    Confusedpilot: Confused deputy risks in rag-based llms.arXiv preprint arXiv:2408.04870, 2024

    Ayush RoyChowdhury, Mulong Luo, Prateek Sahu, Sarbartha Banerjee, and Mohit Tiwari. Confusedpilot: Confused deputy risks in rag-based llms.arXiv preprint arXiv:2408.04870, 2024

  8. [8]

    Sok: Trust-authorization mismatch in llm agent interactions.arXiv preprint arXiv:2512.06914, 2025

    Guanquan Shi, Haohua Du, Zhiqiang Wang, Xiaoyu Liang, Weiwenpei Liu, Song Bian, and Zhenyu Guan. Sok: Trust-authorization mismatch in llm agent interactions.arXiv preprint arXiv:2512.06914, 2025

  9. [9]

    Human-in-the-loop software development agents

    Wannita Takerngsaksiri, Jirat Pasuksmit, Patanamon Thongtanunam, Chakkrit Tantithamthavorn, Ruixiong Zhang, Fan Jiang, Jing Li, Evan Cook, Kun Chen, and Ming Wu. Human-in-the-loop software development agents. In2025 IEEE/ACM 47th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), pages 342–352. IEEE, 2025

  10. [10]

    The protection of information in computer systems.Proceedings of the IEEE, 63(9):1278–1308, 1975

    Jerome H Saltzer and Michael D Schroeder. The protection of information in computer systems.Proceedings of the IEEE, 63(9):1278–1308, 1975

  11. [11]

    Saltzer & schroeder for 2030: Security engineering principles in a world of ai.arXiv preprint arXiv:2407.05710, 2024

    Nikhil Patnaik, Joseph Hallett, and Awais Rashid. Saltzer & schroeder for 2030: Security engineering principles in a world of ai.arXiv preprint arXiv:2407.05710, 2024

  12. [12]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770, 2023

  13. [13]

    Swe-polybench: A multi-language benchmark for repository level evaluation of coding agents.arXiv preprint arXiv:2504.08703, 2025

    Muhammad Shihab Rashid, Christian Bock, Yuan Zhuang, Alexander Buchholz, Tim Esler, Simon Valentin, Luca Franceschi, Martin Wistuba, Prabhu Teja Sivaprasad, Woo Jung Kim, et al. Swe-polybench: A multi-language benchmark for repository level evaluation of coding agents.arXiv preprint arXiv:2504.08703, 2025

  14. [14]

    Identifying the Risks of LM Agents with an LM-Emulated Sandbox

    Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J Maddison, and Tatsunori Hashimoto. Identifying the risks of lm agents with an lm-emulated sandbox.arXiv preprint arXiv:2309.15817, 2023. evolvent.co 11 AuthBench

  15. [15]

    AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

    Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, et al. Agentharm: A benchmark for measuring harmfulness of llm agents.arXiv preprint arXiv:2410.09024, 2024

  16. [16]

    Agent-SafetyBench: Evaluating the Safety of LLM Agents

    Zhexin Zhang, Shiyao Cui, Yida Lu, Jingzhuo Zhou, Junxiao Yang, Hongning Wang, and Minlie Huang. Agent-safetybench: Evaluating the safety of llm agents.arXiv preprint arXiv:2412.14470, 2024

  17. [17]

    Progent: Securing AI Agents with Privilege Control

    Tianneng Shi, Jingxuan He, Zhun Wang, Hongwei Li, Linyu Wu, Wenbo Guo, and Dawn Song. Progent: Programmable privilege control for llm agents.arXiv preprint arXiv:2504.11703, 2025

  18. [18]

    AgentBound: Securing Execution Boundaries of AI Agents

    Christoph Bühler, Matteo Biagiola, Luca Di Grazia, and Guido Salvaneschi. Securing ai agent execution.arXiv preprint arXiv:2510.21236, 2025

  19. [19]

    Qiguang Chen, Libo Qin, Jiaqi Wang, Jinxuan Zhou, and Wanxiang Che. Unlocking the capabilities of thought: A reasoning boundary framework to quantify and optimize chain-of-thought.Advances in Neural Information Processing Systems, 37:54872–54904, 2024

  20. [20]

    OpenThoughts-TBLite

    OpenThoughts Team. OpenThoughts-TBLite. https://github.com/open-thoughts/ OpenThoughts-TBLite, 2025. GitHub repository

  21. [21]

    OpenAI GPT-5 System Card

    OpenAI. OpenAI GPT-5 System Card, 2026. URLhttps://arxiv.org/abs/2601.03267

  22. [22]

    GPT-5.3-Codex System Card

    OpenAI. GPT-5.3-Codex System Card. https://deploymentsafety.openai.com/gpt-5-3-codex , February

  23. [23]

    Published February 5, 2026

  24. [24]

    GPT-5.4 Thinking System Card

    OpenAI. GPT-5.4 Thinking System Card. https://deploymentsafety.openai.com/gpt-5-4-thinking , March 2026. Published March 5, 2026

  25. [25]

    System Card: Claude Opus 4.6

    Anthropic. System Card: Claude Opus 4.6. https://www-cdn.anthropic.com/ 6a5fa276ac68b9aeb0c8b6af5fa36326e0e166dd.pdf, February 2026

  26. [26]

    Gemini 3.1 Pro Model Card

    Google DeepMind. Gemini 3.1 Pro Model Card. https://storage.googleapis.com/deepmind-media/ Model-Cards/Gemini-3-1-Pro-Model-Card.pdf, February 2026

  27. [27]

    Kimi K2.5: Visual Agentic Intelligence

    Kimi Team. Kimi K2.5: Visual Agentic Intelligence, 2026. URLhttps://arxiv.org/abs/2602.02276

  28. [28]

    MiniMax M2.7: Early Echoes of Self-Evolution

    MiniMax AI. MiniMax M2.7: Early Echoes of Self-Evolution. https://www.minimax.io/news/ minimax-m27-en, March 2026. Published March 18, 2026

  29. [29]

    Qwen3 Technical Report

    Qwen Team. Qwen3 Technical Report, 2025. URLhttps://arxiv.org/abs/2505.09388

  30. [30]

    Qwen3.5: Towards native multimodal agents, February 2026

    Qwen Team. Qwen3.5: Towards native multimodal agents, February 2026. URL https://qwen.ai/blog?id= qwen3.5

  31. [31]

    The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models

    Shishir G Patil, Huanzhi Mao, Fanjia Yan, Charlie Cheng-Jie Ji, Vishnu Suresh, Ion Stoica, and Joseph E Gonzalez. The berkeley function calling leaderboard (bfcl): From tool use to agentic evaluation of large language models. InForty-second International Conference on Machine Learning, 2025

  32. [32]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis.arXiv preprint arXiv:2307.16789, 2023. evolvent.co 12 AuthBench

  33. [33]

    AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    Edoardo Debenedetti, Jie Zhang, Mislav Balunović, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents.arXiv preprint arXiv:2406.13352, 2024

  34. [34]

    Confine: Automated system call policy generation for container attack surface reduction

    Seyedhamed Ghavamnia, Tapti Palit, Azzedine Benameur, and Michalis Polychronakis. Confine: Automated system call policy generation for container attack surface reduction. In23rd International Symposium on Research in Attacks, Intrusions and Defenses (RAID), pages 443–458. USENIX Association, 2020

  35. [35]

    Llm-assisted static analysis for detecting security vulnerabilities,

    Ziyang Li, Saikat Dutta, and Mayur Naik. Iris: Llm-assisted static analysis for detecting security vulnerabilities. arXiv preprint arXiv:2405.17238, 2024

  36. [36]

    Alps: Automated least-privilege enforcement for securing serverless functions.arXiv preprint arXiv:2603.25393, 2026

    Changhee Shin, Bom Kim, and Seungsoo Lee. Alps: Automated least-privilege enforcement for securing serverless functions.arXiv preprint arXiv:2603.25393, 2026

  37. [37]

    Rabin, J

    Rafiqul Rabin, Jesse Hostetler, Sean McGregor, Brett Weir, and Nick Judd. Sandboxeval: Towards securing test environment for untrusted code.arXiv preprint arXiv:2504.00018, 2025

  38. [38]

    Quantifying frontier llm capabilities for container sandbox escape.arXiv preprint arXiv:2603.02277, 2026

    Rahul Marchand, Art O Cathain, Jerome Wynne, Philippos Maximos Giavridis, Sam Deverett, John Wilkinson, Jason Gwartz, and Harry Coppock. Quantifying frontier llm capabilities for container sandbox escape.arXiv preprint arXiv:2603.02277, 2026

  39. [39]

    AgentSpec: Customizable Runtime Enforcement for Safe and Reliable LLM Agents

    Haoyu Wang, Christopher M Poskitt, and Jun Sun. Agentspec: Customizable runtime enforcement for safe and reliable llm agents.arXiv preprint arXiv:2503.18666, 2025

  40. [40]

    Pro2Guard: Proactive runtime enforcement of LLM agent safety via probabilistic model checking,

    Haoyu Wang, Christopher M Poskitt, Jun Sun, and Jiali Wei. Pro2guard: Proactive runtime enforcement of llm agent safety via probabilistic model checking.arXiv preprint arXiv:2508.00500, 2025

  41. [41]

    Authenticated delegation and authorized ai agents,

    Tobin South, Samuele Marro, Thomas Hardjono, Robert Mahari, Cedric Deslandes Whitney, Dazza Green- wood, Alan Chan, and Alex Pentland. Authenticated delegation and authorized ai agents.arXiv preprint arXiv:2501.09674, 2025

  42. [42]

    Delegated authorization for agents constrained to semantic task-to-scope matching.arXiv preprint arXiv:2510.26702, 2025

    Majed El Helou, Chiara Troiani, Benjamin Ryder, Jean Diaconu, Hervé Muyal, and Marcelo Yannuzzi. Delegated authorization for agents constrained to semantic task-to-scope matching.arXiv preprint arXiv:2510.26702, 2025

  43. [43]

    A vision for access control in llm-based agent systems

    Xinfeng Li, Dong Huang, Jie Li, Hongyi Cai, Zhenhong Zhou, Wei Dong, XiaoFeng Wang, and Yang Liu. A vision for access control in llm-based agent systems. arXiv preprint arXiv:2510.11108, 2025

  44. [44]

    The landscape of emerging ai agent architectures for reasoning, planning, and tool calling: A survey

    Tula Masterman, Sandi Besen, Mason Sawtell, and Alex Chao. The landscape of emerging ai agent architectures for reasoning, planning, and tool calling: A survey. arXiv (Cornell University), 2024

  45. [45]

    Ai agentic programming: A survey of techniques, challenges, and opportunities

    Huanting Wang, Jingzhi Gong, Huawei Zhang, Jie Xu, and Zheng Wang. Ai agentic programming: A survey of techniques, challenges, and opportunities. arXiv preprint arXiv:2508.11126, 2025

  46. [46]

    Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

    Qiguang Chen, Libo Qin, Jinhao Liu, Dengyun Peng, Jiannan Guan, Peng Wang, Mengkang Hu, Yuhang Zhou, Te Gao, and Wanxiang Che. Towards reasoning era: A survey of long chain-of-thought for reasoning large language models. arXiv preprint arXiv:2503.09567, 2025

  47. [47]

    Rbf++: Quantifying and optimizing reasoning boundaries across measurable and unmeasurable capabilities for chain-of-thought reasoning

    Qiguang Chen, Libo Qin, Jinhao Liu, Yue Liao, Jiaqi Wang, Jingxuan Zhou, and Wanxiang Che. Rbf++: Quantifying and optimizing reasoning boundaries across measurable and unmeasurable capabilities for chain-of-thought reasoning. arXiv preprint arXiv:2505.13307, 2025

  48. [48]

    R-judge: Benchmarking safety risk awareness for llm agents

    Tongxin Yuan, Zhiwei He, Lingzhong Dong, Yiming Wang, Ruijie Zhao, Tian Xia, Lizhen Xu, Binglin Zhou, Fangqi Li, Zhuosheng Zhang, et al. R-judge: Benchmarking safety risk awareness for llm agents. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1467–1490, 2024

  49. [49]

    Ras-eval: A comprehensive benchmark for security evaluation of llm agents in real-world environments

    Yuchuan Fu, Xiaohan Yuan, and Dongxia Wang. Ras-eval: A comprehensive benchmark for security evaluation of llm agents in real-world environments. arXiv preprint arXiv:2506.15253, 2025

  50. [50]

    ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

    Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, et al. Atbench: A diverse and realistic trajectory benchmark for long-horizon agent safety. arXiv preprint arXiv:2604.02022, 2026

  51. [51]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  52. [52]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

  53. [53]

    Guardians and offenders: A survey on harmful content generation and safety mitigation of llm

    Chi Zhang, Changjia Zhu, Junjie Xiong, Xiaoran Xu, Lingyao Li, Yao Liu, and Zhuo Lu. Guardians and offenders: A survey on harmful content generation and safety mitigation of llm. arXiv preprint arXiv:2508.05775, 2025

  54. [54]

    What matters for safety alignment?

    Xing Li, Hui-Ling Zhen, Lihao Yin, Xianzhi Yu, Zhenhua Dong, and Mingxuan Yuan. What matters for safety alignment? arXiv preprint arXiv:2601.03868, 2026

  55. [55]

    Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024

  56. [56]

    Os-harm: A benchmark for measuring safety of computer use agents

    Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion, and Maksym Andriushchenko. Os-harm: A benchmark for measuring safety of computer use agents. arXiv preprint arXiv:2506.14866, 2025

  57. [57]

    AgentHazard: A Benchmark for Evaluating Harmful Behavior in Computer-Use Agents

    Yunhao Feng, Yifan Ding, Yingshui Tan, Xingjun Ma, Yige Li, Yutao Wu, Yifeng Gao, Kun Zhai, and Yanming Guo. Agenthazard: A benchmark for evaluating harmful behavior in computer-use agents. arXiv preprint arXiv:2604.02947, 2026

  58. [58]

    Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents

    Qiusi Zhan, Zhixiang Liang, Zifan Ying, and Daniel Kang. Injecagent: Benchmarking indirect prompt injections in tool-integrated large language model agents. In Findings of the Association for Computational Linguistics: ACL 2024, pages 10471–10506, 2024

  59. [59]

    Evaluating Privilege Usage of Agents with Real-World Tools

    Quan Zhang, Lianhang Fu, Lvsi Lian, Gwihwan Go, Yujue Wang, Chijin Zhou, Yu Jiang, and Geguang Pu. Evaluating privilege usage of agents on real-world tools. arXiv preprint arXiv:2603.28166, 2026

  60. [60]

    Securing the model context protocol (mcp): Risks, controls, and governance

    Herman Errico, Jiquan Ngiam, and Shanita Sojan. Securing the model context protocol (mcp): Risks, controls, and governance. arXiv preprint arXiv:2511.20920, 2025

  61. [61]

    Supply-Chain Poisoning Attacks Against LLM Coding Agent Skill Ecosystems

    Yubin Qu, Yi Liu, Tongcheng Geng, Gelei Deng, Yuekang Li, Leo Yu Zhang, Ying Zhang, and Lei Ma. Supply-chain poisoning attacks against llm coding agent skill ecosystems. arXiv preprint arXiv:2604.03081, 2026

  62. [62]

    Agentleak: A full-stack benchmark for privacy leakage in multi-agent llm systems

    Faouzi El Yagoubi, Godwin Badu-Marfo, and Ranwa Al Mallah. Agentleak: A full-stack benchmark for privacy leakage in multi-agent llm systems. arXiv preprint arXiv:2602.11510, 2026

  63. [63]

    Levels of autonomy for ai agents

    Kevin J Feng, David W McDonald, and Amy X Zhang. Levels of autonomy for ai agents. arXiv preprint arXiv:2506.12469, 2025

  64. [64]

    Trism for agentic ai: A review of trust, risk, and security management in llm-based agentic multi-agent systems

    Shaina Raza, Ranjan Sapkota, Manoj Karkee, and Christos Emmanouilidis. Trism for agentic ai: A review of trust, risk, and security management in llm-based agentic multi-agent systems. arXiv preprint arXiv:2506.04133, 2025

  65. [65]

    The trust paradox in llm-based multi-agent systems: When collaboration becomes a security vulnerability

    Zijie Xu, Minfeng Qi, Shiqing Wu, Lefeng Zhang, Qiwen Wei, Han He, and Ningran Li. The trust paradox in llm-based multi-agent systems: When collaboration becomes a security vulnerability. arXiv preprint arXiv:2510.18563, 2025

  66. [66]

    Pierre Peigné, Mikolaj Kniejski, Filip Sondej, Matthieu David, Jason Hoelscher-Obermaier, Christian Schroeder de Witt, and Esben Kran. Multi-agent security tax: Trading off security and collaboration capabilities in multi-agent systems. Proceedings of the AAAI Conference on Artificial Intelligence, 39(26):27573–27581, 2025

  67. [67]

    Agentracer: Who is inducing failure in the llm agentic systems?

    Guibin Zhang, Junhao Wang, Junjie Chen, Wangchunshu Zhou, Kun Wang, and Shuicheng Yan. Agentracer: Who is inducing failure in the llm agentic systems? arXiv preprint arXiv:2509.03312, 2025

  68. [68]

    Which agent causes task failures and when? on automated failure attribution of llm multi-agent systems

    Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, et al. Which agent causes task failures and when? on automated failure attribution of llm multi-agent systems. arXiv preprint arXiv:2505.00212, 2025

  69. [69]

    Large language models must be taught to know what they don’t know

    Sanyam Kapoor, Nate Gruver, Manley Roberts, Katherine Collins, Arka Pal, Umang Bhatt, Adrian Weller, Samuel Dooley, Micah Goldblum, and Andrew G Wilson. Large language models must be taught to know what they don’t know. Advances in Neural Information Processing Systems, 37:85932–85972, 2024

  70. [70]

    Knowrl: Teaching language models to know what they know

    Sahil Kale and Devendra Singh Dhami. Knowrl: Teaching language models to know what they know. arXiv preprint arXiv:2510.11407, 2025

  71. [71]

    Do llms know what they know? measuring metacognitive efficiency with signal detection theory

    Jon-Paul Cacioli. Do llms know what they know? measuring metacognitive efficiency with signal detection theory. arXiv preprint arXiv:2603.25112, 2026

Appendix A. Authorization Safety, Execution Safety, and Permission-Boundary Awareness

This appendix expands the distinction that motivates AuthBench: a model can be safe or unsafe in...