Do Coding Agents Understand Least-Privilege Authorization?
Pith reviewed 2026-05-19 16:30 UTC · model grok-4.3
The pith
Coding agents often omit necessary permissions or grant sensitive unused ones, and a two-stage decomposition improves the balance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Authorization is not a simple conservative-versus-permissive calibration problem: frontier models often omit permissions required by the execution chain while also granting unused or sensitive accesses. Increasing inference-time reasoning does not resolve this mismatch. Instead, each model moves toward a model-specific authorization attractor: more reasoning makes it more consistent in its own failure mode, whether broad-but-exposed or tight-but-brittle. This suggests that direct policy generation is the bottleneck, because a single generation must both discover all necessary accesses and reject all unnecessary ones. We therefore propose Sufficiency-Tightness Decomposition, which first generates a coverage-oriented policy by forward-simulating the task and then audits each granted entry for grounding and sensitivity.
What carries the argument
Sufficiency-Tightness Decomposition: first generate a coverage-oriented policy by forward-simulating the task, then audit each granted entry for grounding and sensitivity.
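The two stages can be illustrated with a minimal sketch. This is not the paper's implementation: in the paper both stages are carried out by a model, whereas here they are replaced by deterministic stand-ins. The `Grant` and `Step` types, the `SENSITIVE_PATTERNS` list, and the rule that the audit drops ungrounded grants and flags grounded-but-sensitive ones are all assumptions made for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatch

# All names below are hypothetical; the paper's policy schema is not shown in this review.

@dataclass(frozen=True)
class Grant:
    path: str
    mode: str  # "r", "w", or "x"

@dataclass(frozen=True)
class Step:
    description: str
    accesses: tuple  # Grants that this simulated step needs

# Assumed sensitivity patterns, for illustration only.
SENSITIVE_PATTERNS = ("*/.ssh/*", "*.env", "/etc/shadow")

def is_sensitive(grant):
    """True if the grant's path matches any assumed sensitive pattern."""
    return any(fnmatch(grant.path, pattern) for pattern in SENSITIVE_PATTERNS)

def coverage_stage(steps, extra_guesses=()):
    """Stage 1 (sufficiency): collect every access the simulated steps need,
    plus any broader guesses a coverage-oriented generator might add."""
    policy = set(extra_guesses)
    for step in steps:
        policy.update(step.accesses)
    return policy

def audit_stage(policy, steps):
    """Stage 2 (tightness): keep only grants grounded in a simulated step,
    and report grounded-but-sensitive grants separately for review."""
    grounded = {grant for step in steps for grant in step.accesses}
    kept = {grant for grant in policy if grant in grounded}
    dropped = policy - kept
    flagged = {grant for grant in kept if is_sensitive(grant)}
    return kept, dropped, flagged

# Example: two simulated steps and one speculative, sensitive over-grant.
steps = [
    Step("read config", (Grant("/app/config.yaml", "r"),)),
    Step("write report", (Grant("/app/out/report.txt", "w"),)),
]
broad_policy = coverage_stage(steps, [Grant("/home/u/.ssh/id_rsa", "r")])
kept, dropped, flagged = audit_stage(broad_policy, steps)
```

Separating the stages means the first pass can err toward coverage without exposing sensitive files, because the second pass examines each entry individually rather than asking one generation to do both jobs.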
If this is right
- Sensitive-task success rises by up to 15.8 percent on models that previously leaned too tight.
- Attack success drops across all tested models because unnecessary sensitive accesses are removed.
- More reasoning steps make each model more consistent in its own authorization failure mode rather than correcting it.
- Direct single-pass policy generation cannot reliably handle both discovering required accesses and rejecting unused ones.
Where Pith is reading between the lines
- The same separation of coverage discovery from sensitivity audit could be applied to other agent permissions such as API scopes or network access.
- Training objectives that explicitly reward balanced authorization rather than task completion alone might reduce reliance on inference-time decomposition.
- Independent verification tools could be inserted after the audit stage to further harden the resulting policies without changing the agent itself.
Load-bearing premise
The human-reviewed permission labels in AuthBench correctly identify the minimal set of file accesses required by each task's full execution chain without systematic bias or omission of edge cases.
What would settle it
Re-label the minimal permissions for the AuthBench tasks with a fresh set of independent reviewers and re-run the decomposition experiments; if the reported gains in task success and attack reduction disappear, the central claim is falsified.
Original abstract
As coding agents gain access to shells, repositories, and user files, least-privilege authorization becomes a prerequisite for safe deployment: an agent should receive enough authority to complete the task, without unnecessary authority that exposes sensitive surfaces. To study whether current models can infer this boundary themselves, we first introduce permission-boundary inference, where a model maps a task instruction and terminal environment to a file-level read/write/execute policy, and AuthBench, a benchmark of 120 realistic terminal tasks with human-reviewed permission labels and executable validators for utility and attack outcomes. AuthBench shows that authorization is not a simple conservative-versus-permissive calibration problem: frontier models often omit permissions required by the execution chain while also granting unused or sensitive accesses. Increasing inference-time reasoning does not resolve this mismatch. Instead, each model moves toward a model-specific authorization attractor: more reasoning makes it more consistent in its own failure mode, whether broad-but-exposed or tight-but-brittle. This suggests that direct policy generation is the bottleneck, because a single generation must both discover all necessary accesses and reject all unnecessary ones. We therefore propose Sufficiency-Tightness Decomposition, which first generates a coverage-oriented policy by forward-simulating the task and then audits each granted entry for grounding and sensitivity. Across tested models, this decomposition improves sensitive-task success by up to 15.8% on tightness-biased models while reducing attack success across all evaluated models.
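The two failure modes named in the abstract, omitted required permissions and granted unused ones, can be made concrete by comparing a predicted policy with a reference label set. The sketch below is only an illustration: it assumes a policy is a set of `(path, mode)` pairs, and the function name and example paths are hypothetical. AuthBench itself scores utility and attack outcomes with executable validators, which this set difference does not reproduce.

```python
def compare_policies(predicted, reference):
    """Split the disagreement between a predicted policy and reference labels.

    Both arguments are iterables of (path, mode) pairs, mode in {"r", "w", "x"}.
    This representation is an assumption, not the benchmark's schema.
    """
    predicted, reference = set(predicted), set(reference)
    missing = reference - predicted  # required but not granted: tight-but-brittle
    excess = predicted - reference   # granted but not required: broad-but-exposed
    return {
        "missing": missing,
        "excess": excess,
        "sufficient": not missing,
        "tight": not excess,
    }

# Hypothetical example: one required grant omitted, one sensitive grant added.
reference = {("/repo/src/main.py", "r"), ("/repo/build.sh", "x")}
predicted = {("/repo/src/main.py", "r"), ("/home/user/.aws/credentials", "r")}
report = compare_policies(predicted, reference)
```

A single policy can fail both checks at once, which is why the abstract argues that authorization is not a one-dimensional conservative-versus-permissive calibration.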
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces permission-boundary inference as the task of mapping a terminal task instruction and environment to a minimal file-level read/write/execute policy. It presents AuthBench, a benchmark of 120 realistic tasks equipped with human-reviewed permission labels and executable validators for both utility and attack outcomes. The central empirical findings are that frontier models exhibit model-specific authorization attractors (broad-but-exposed or tight-but-brittle) that are not resolved by additional inference-time reasoning, and that the proposed Sufficiency-Tightness Decomposition—forward simulation for coverage followed by grounding/sensitivity auditing—improves sensitive-task success by up to 15.8% on tightness-biased models while lowering attack success across all tested models.
Significance. If the results hold, the work is significant for the security of AI coding agents that receive shell and filesystem access. It supplies a reproducible benchmark grounded in executable validators rather than purely subjective judgment and demonstrates a practical decomposition that separates the discovery of necessary accesses from the rejection of unnecessary ones. The identification of authorization attractors provides a falsifiable characterization of current model behavior that can guide future prompt or training interventions.
major comments (2)
- [§3.2] §3.2 and §4.1: The human-reviewed minimal permission labels are load-bearing for every reported success and attack rate, yet the manuscript supplies no inter-annotator agreement statistics, no description of the review protocol for conditional execution paths, and no audit of omitted edge-case accesses. If reviewers systematically under-labeled files required only on rare branches, both the baseline failure rates and the measured 15.8% gain from Sufficiency-Tightness Decomposition could be artifacts of label error rather than evidence of improved least-privilege inference.
- [§5.3] §5.3, Table 4: The attack-success reduction is reported as consistent across models, but the paper does not show whether the executable attack validators cover the full space of privilege-escalation paths that would be enabled by the over-granted permissions; incomplete validator coverage would weaken the claim that the decomposition reliably tightens the policy without sacrificing utility.
minor comments (2)
- [§2.1] The notation for read/write/execute triples is introduced without an explicit grammar or example in the main text; a small table would improve readability.
- [Figure 3] Figure 3 caption should state the exact number of tasks per model and whether error bars represent standard deviation across runs or across tasks.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We address each major comment below, explaining our position and the revisions we will make to improve clarity and transparency.
-
Referee: [§3.2] §3.2 and §4.1: The human-reviewed minimal permission labels are load-bearing for every reported success and attack rate, yet the manuscript supplies no inter-annotator agreement statistics, no description of the review protocol for conditional execution paths, and no audit of omitted edge-case accesses. If reviewers systematically under-labeled files required only on rare branches, both the baseline failure rates and the measured 15.8% gain from Sufficiency-Tightness Decomposition could be artifacts of label error rather than evidence of improved least-privilege inference.
Authors: We agree that transparency in the human review process is essential given the central role of the permission labels. The labels were produced via an iterative, collaborative review by the author team (with systems security expertise), in which each task was executed in the benchmark environment to enumerate all file accesses, including conditional branches addressed through input variation and path simulation. Inter-annotator agreement was not computed because the process was not designed as independent parallel annotations. In the revised manuscript we will expand §3.2 with a complete description of the review protocol, concrete examples of conditional-path handling, and the results of a post-hoc audit for omitted edge cases. These additions should directly address the possibility of systematic under-labeling. revision: yes
-
Referee: [§5.3] §5.3, Table 4: The attack-success reduction is reported as consistent across models, but the paper does not show whether the executable attack validators cover the full space of privilege-escalation paths that would be enabled by the over-granted permissions; incomplete validator coverage would weaken the claim that the decomposition reliably tightens the policy without sacrificing utility.
Authors: The attack validators target the concrete privilege-escalation surfaces that become reachable precisely when the over-granted permissions identified in each task are present (e.g., reading protected configuration files or executing scripts from unauthorized locations). While we cannot claim exhaustive coverage of every theoretical escalation path in an arbitrary filesystem, the validators are scoped to the over-grants actually observed in AuthBench. We will revise §5.3 to add an explicit discussion of validator scope, the attack classes they cover, and acknowledged limitations, thereby clarifying the evidential basis for the reported reductions in attack success. revision: partial
Circularity Check
No circularity: empirical measurements on external benchmark and models
The paper constructs AuthBench as a new benchmark with human-reviewed file-level permission labels and executable validators, then reports empirical improvements from Sufficiency-Tightness Decomposition on model task success and attack rates. No equations, fitted parameters, or self-citations are invoked to derive the 15.8% gain; the result is a direct measurement against the benchmark and tested models. The derivation chain is self-contained and does not reduce any claimed outcome to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Human reviewers can accurately determine the minimal necessary permissions for each task without bias or omission of execution-chain dependencies.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: We therefore propose Sufficiency-Tightness Decomposition, which first generates a coverage-oriented policy by forward-simulating the task and then audits each granted entry for grounding and sensitivity.
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: AuthBench shows that authorization is not a simple conservative-versus-permissive calibration problem: frontier models often omit permissions required by the execution chain while also granting unused or sensitive accesses.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.