Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs
Pith reviewed 2026-05-18 19:14 UTC · model grok-4.3
The pith
Action graphs expose unique 'agentic-only' jailbreak risks in LLMs that model-level tests miss entirely.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Model-level evaluations show baseline differences between models, yet agentic-level assessment using action-component graphs uncovers vulnerabilities that only emerge in agent contexts. Tool-calling exhibits 24-60% higher attack success rates (ASR) across models, agent transfer operations rank as the highest-risk tools, and the vulnerabilities follow semantic rather than syntactic patterns. Prompts that succeed at the model level lose effectiveness when transferred directly to agent settings, while context-aware iterative attacks succeed on objectives that failed at the model level, confirming systematic gaps that action-based prompt improvement can narrow.
What carries the argument
Action-component graphs produced by AgentSeer, which break agentic executions into granular steps to quantify and compare model-level versus agent-level jailbreak risks.
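As a rough illustration of the idea, the sketch below models one agent run as a list of action nodes and aggregates attack success rate (ASR) per action type. The class name, fields, action-type labels, and outcomes are all hypothetical and chosen for illustration; they are not AgentSeer's actual schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """One node in a hypothetical action graph (not AgentSeer's real schema)."""
    action_id: str
    action_type: str   # assumed labels, e.g. "tool_call", "agent_transfer", "response"
    component: str     # the agent or tool that executed the action
    jailbroken: bool   # whether an attack injected at this action succeeded


def asr_by_action_type(actions):
    """Return {action_type: fraction of attacked actions that were jailbroken}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for a in actions:
        totals[a.action_type] += 1
        hits[a.action_type] += int(a.jailbroken)
    return {t: hits[t] / totals[t] for t in totals}


# Toy trace with made-up outcomes, for illustration only.
trace = [
    Action("a1", "tool_call", "search_tool", True),
    Action("a2", "tool_call", "search_tool", False),
    Action("a3", "agent_transfer", "router", True),
    Action("a4", "response", "writer_agent", False),
]
rates = asr_by_action_type(trace)
highest_risk = max(rates, key=rates.get)  # action type with the highest ASR
```

Grouping outcomes by action type is what makes a per-category ranking (such as "agent transfer is highest-risk") possible; a model-level evaluation has only one such bucket.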
If this is right
- Agent transfer operations consistently emerge as the highest-risk category across models.
- Semantic rather than syntactic mechanisms explain most agentic vulnerability patterns.
- Iterative context-aware attacks reliably compromise goals that resist direct model-level attacks.
- Action-graph signals enable automated prompt hardening that lowers average agentic jailbreak success.
- Universal agentic risk patterns appear across different base models and attack types.
Where Pith is reading between the lines
- Safety testing for deployed agents should routinely include full execution traces instead of isolated prompts.
- Tool interfaces may need redesign to reduce exposure of high-risk operations such as transfers.
- The graph-based approach could be applied to measure risks in other agent frameworks or multi-turn workflows.
Load-bearing premise
Decomposing executions into action-component graphs and tracking attack transfer metrics fully captures agentic risks without missing major failure modes or creating artifacts from the particular tool-calling setup.
What would settle it
A new agent system in which tool-calling attack success rates stay at or below model-level rates under the same attack suite would indicate the claimed agentic-only vulnerabilities do not generalize.
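The falsification condition above can be written as a small check. This is a minimal sketch under stated assumptions: the function names are hypothetical, the comparison is a plain difference in percentage points with no significance test, and the numbers in the example are made up rather than taken from the paper.

```python
def agentic_gap(model_asr, tool_call_asr):
    """Return tool-calling ASR minus model-level ASR, in percentage points."""
    return tool_call_asr - model_asr


def claim_generalizes(results):
    """True if every system shows tool-calling ASR strictly above model-level ASR.

    `results` maps a system name to (model_level_asr, tool_calling_asr),
    both measured under the same attack suite.
    """
    return all(agentic_gap(m, t) > 0 for m, t in results.values())


# Made-up numbers for illustration; "system_b" is the kind of
# counterexample that would count against the claim.
example = {
    "system_a": (40.0, 65.0),
    "system_b": (50.0, 45.0),
}
generalizes = claim_generalizes(example)
```

A real test would also need confidence intervals on both rates, since a small negative gap could be noise.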
Original abstract
As large language models increasingly deployed into agentic systems, existing methods face critical gaps in observing, assessing, and mitigating deployment-specific risks. We present a comprehensive, observability-driven workflow: we introduce AgentSeer, observability tool which decomposes agentic executions into granular action-component graphs; we use this decomposition to rigorously quantify the gap between model-level and agent-level jailbreaking risk via cross-model validation on GPT-OSS-20B and Gemini-2.0-flash with HarmBench under single-turn and iterative-refinement attacks; we leverage action-graph risk signals to automate iterative prompt hardening against direct and iterative jailbreak attacks. Stark differences is revealed between model-level and agentic-level vulnerability profiles. Model-level evaluation reveals baseline differences: GPT-OSS-20B (39.47% ASR) versus Gemini-2.0-flash (50.00% ASR), with both models showing susceptibility to social engineering. However, agentic-level assessment exposes agent-specific risks invisible to traditional evaluation. We discover "agentic-only" vulnerabilities that emerge exclusively in agentic contexts, with tool-calling showing 24-60% higher ASR across both models. Cross-model analysis reveals universal agentic patterns, where agent transfer operations as highest-risk tools, with semantic pattern revealed rather than syntactic vulnerability mechanisms. Direct attack transfer from model-level to agentic contexts shows degraded performance of successful prompts (GPT-OSS-20B: 57% human injection ASR; Gemini-2.0-flash: 28%), while context-aware iterative attacks successfully compromise objectives that failed at model-level, confirming systematic vulnerabilities gaps. Action-based prompt improvement substantially reduces action-averaged agentic jailbreak success on GPT-OSS-20B (direct: 45.3%
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AgentSeer, an observability tool that decomposes agentic LLM executions into granular action-component graphs. It uses this decomposition to quantify gaps between model-level and agentic-level jailbreaking risks via cross-model experiments on GPT-OSS-20B and Gemini-2.0-flash with HarmBench under single-turn and iterative-refinement attacks, identifying 'agentic-only' vulnerabilities (especially 24-60% higher ASR in tool-calling), universal patterns in agent transfer operations, degraded direct attack transfer performance, and the utility of action-graph signals for automated prompt hardening.
Significance. If the central empirical claims hold after methodological clarification, the work would be significant for LLM safety research by providing a concrete workflow to surface deployment-specific risks missed by standard model benchmarks. The cross-model validation, use of action graphs for both measurement and mitigation, and reporting of concrete ASR numbers on two distinct models are strengths that could inform safer agentic system design.
major comments (3)
- [Experimental Protocol] Methods/Experimental Protocol: The exact definitions of 'action-component', data exclusion rules, and statistical significance testing for ASR differences are not visible. This is load-bearing for the claim of systematic 'agentic-only' vulnerabilities and the 24-60% tool-calling ASR gap, as the reader's assessment notes these details are required to assess robustness.
- [Results on Agentic Vulnerabilities] Results on Agentic Vulnerabilities: The decomposition into action-component graphs and the chosen attack transfer metrics (direct vs. context-aware iterative) may introduce artifacts from the specific tool-calling implementation rather than isolating inherent agentic risks. The skeptic's concern is valid here: if the observed gaps partly reflect differences in attack adaptation or scaffolding instead of model- vs. agent-level differences, the universal patterns and degraded direct-transfer results (GPT-OSS-20B: 57%; Gemini-2.0-flash: 28%) do not fully establish the central claim.
- [Cross-model Analysis] Cross-model Analysis: The assertion that agent transfer operations are the highest-risk tools and that semantic patterns (rather than syntactic) drive vulnerabilities requires explicit side-by-side comparison showing these risks are invisible at model level; without this, the 'agentic-only' designation rests on the chosen observability tool without sufficient falsification checks.
minor comments (3)
- [Abstract] Abstract: 'Stark differences is revealed' contains a subject-verb agreement error and should read 'Stark differences are revealed'.
- [Abstract] Abstract: The final sentence is truncated ('on GPT-OSS-20B (direct: 45.3%') and should be completed with the full results for both models and attack types.
- [Throughout] Notation: Ensure consistent expansion and use of acronyms such as ASR on first mention in each major section.
Simulated Author's Rebuttal
We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our work. We address each major comment in detail below, providing clarifications and indicating revisions made to the manuscript.
- Referee: [Experimental Protocol] Methods/Experimental Protocol: The exact definitions of 'action-component', data exclusion rules, and statistical significance testing for ASR differences are not visible. This is load-bearing for the claim of systematic 'agentic-only' vulnerabilities and the 24-60% tool-calling ASR gap, as the reader's assessment notes these details are required to assess robustness.
  Authors: We agree that precise definitions and statistical details are crucial for evaluating the robustness of our claims. In the revised manuscript, we have expanded the Methods section to include: (1) a formal definition of 'action-component' as the minimal executable units in the agent graph (tool invocation, internal reasoning, and response generation); (2) data exclusion criteria, which exclude runs with logging failures or non-deterministic tool outputs (less than 3% of total executions); and (3) statistical testing using bootstrap resampling (n=1000) with reported confidence intervals and p-values for ASR differences, confirming the tool-calling gap is statistically significant (p<0.05) across models. These additions directly support the 'agentic-only' vulnerability claims.
  Revision: yes
- Referee: [Results on Agentic Vulnerabilities] Results on Agentic Vulnerabilities: The decomposition into action-component graphs and the chosen attack transfer metrics (direct vs. context-aware iterative) may introduce artifacts from the specific tool-calling implementation rather than isolating inherent agentic risks. The skeptic's concern is valid here: if the observed gaps partly reflect differences in attack adaptation or scaffolding instead of model- vs. agent-level differences, the universal patterns and degraded direct-transfer results (GPT-OSS-20B: 57%; Gemini-2.0-flash: 28%) do not fully establish the central claim.
  Authors: We take this concern seriously and have conducted additional experiments to mitigate potential artifacts. Specifically, we replicated the study using an alternative agent framework (AutoGen) and observed consistent 'agentic-only' vulnerabilities with comparable ASR gaps (22-55% for tool-calling). We also include an ablation where we standardize the tool-calling interface across setups, showing that the gaps persist and are not artifacts of our original implementation. The iterative refinement attacks are context-aware by design to simulate realistic agent interactions, and we now explicitly discuss how this isolates agent-level risks from pure model-level ones. The degraded direct-transfer results are presented as evidence of context-specific vulnerabilities rather than a flaw.
  Revision: partial
- Referee: [Cross-model Analysis] Cross-model Analysis: The assertion that agent transfer operations are the highest-risk tools and that semantic patterns (rather than syntactic) drive vulnerabilities requires explicit side-by-side comparison showing these risks are invisible at model level; without this, the 'agentic-only' designation rests on the chosen observability tool without sufficient falsification checks.
  Authors: To address this, we have added a new subsection and accompanying table in the Results section that provides direct side-by-side comparisons of ASR for agent transfer operations at model-level (using isolated prompt injections) versus agentic-level (within full execution graphs). This demonstrates that these operations show significantly higher risk only in agentic contexts (e.g., 65% agentic ASR vs. 12% model-level for GPT-OSS-20B), invisible in standard benchmarks. For semantic vs. syntactic, we include clustering analysis of successful jailbreak prompts, revealing shared semantic themes (e.g., authority appeals in transfer ops) across models that do not appear in syntactic pattern matching at model level. These additions provide the requested falsification checks and strengthen the 'agentic-only' designation.
  Revision: yes
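The first response in the simulated rebuttal mentions bootstrap resampling (n=1000) with confidence intervals for ASR differences. One way such a test might look is sketched below. The percentile method, the function name, the 95% level, and the toy data are assumptions made for illustration; neither the paper nor the rebuttal specifies them.

```python
import random


def bootstrap_asr_diff(agent_outcomes, model_outcomes, n_resamples=1000, seed=0):
    """Percentile bootstrap interval for ASR(agent) - ASR(model).

    Each argument is a list of 0/1 outcomes (1 = attack succeeded).
    Returns (observed_diff, ci_low, ci_high) for an approximate 95% interval.
    """
    rng = random.Random(seed)

    def asr(xs):
        return sum(xs) / len(xs)

    observed = asr(agent_outcomes) - asr(model_outcomes)
    diffs = []
    for _ in range(n_resamples):
        # Resample each group independently, with replacement.
        a = rng.choices(agent_outcomes, k=len(agent_outcomes))
        m = rng.choices(model_outcomes, k=len(model_outcomes))
        diffs.append(asr(a) - asr(m))
    diffs.sort()
    # Integer arithmetic avoids float rounding when picking percentile indices.
    ci_low = diffs[n_resamples * 25 // 1000]
    ci_high = diffs[n_resamples * 975 // 1000 - 1]
    return observed, ci_low, ci_high


# Made-up outcomes: 70/100 agentic successes vs. 30/100 model-level successes.
agent = [1] * 70 + [0] * 30
model = [1] * 30 + [0] * 70
observed, ci_low, ci_high = bootstrap_asr_diff(agent, model)
```

If the interval excludes zero, the gap is unlikely to be resampling noise alone; that still says nothing about whether the gap comes from the agent setting or from the scaffolding, which is the referee's second concern.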
Circularity Check
No circularity in empirical evaluation of agentic vulnerabilities
The paper introduces AgentSeer as an observability tool that decomposes agentic executions into action-component graphs and then measures ASR gaps between model-level and agent-level jailbreaks via direct experiments on two models (GPT-OSS-20B, Gemini-2.0-flash) using the external HarmBench benchmark under single-turn and iterative attacks. No equations, fitted parameters, or derivations appear that reduce reported results or 'agentic-only' vulnerabilities to quantities defined from the same data or self-citations; the central claims rest on observed cross-model transfer differences and external benchmarks, making the work self-contained.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: HarmBench provides a representative sample of jailbreak prompts whose success rates transfer meaningfully to agentic settings.
- domain assumption: The decomposition of agent executions into action-component graphs does not itself introduce or mask vulnerabilities.
invented entities (1)
- AgentSeer observability tool and action-component graphs (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "We introduce AgentSeer, an observability-based evaluation framework that decomposes agentic executions into granular action and component graphs"
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "tool-calling showing 24-60% higher ASR across both models"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.