arxiv: 2509.04802 · v3 · submitted 2025-09-05 · 💻 cs.CL

Mind the Gap: Evaluating Model- and Agentic-Level Vulnerabilities in LLMs with Action Graphs

Pith reviewed 2026-05-18 19:14 UTC · model grok-4.3

classification 💻 cs.CL
keywords agentic vulnerabilities · LLM jailbreaking · action graphs · model versus agent evaluation · tool calling risks · observability for AI agents · prompt hardening

The pith

Action graphs expose unique 'agentic-only' jailbreak risks in LLMs that model-level tests miss entirely.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a workflow to decompose how LLM agents execute tasks into detailed action-component graphs, then uses those graphs to measure how much more vulnerable the systems become once they start calling tools and managing context. Standard tests on the base models show moderate attack success rates, but the agentic versions reveal risks that only appear during actual execution, including much higher success when tools are involved. This gap matters because real deployments are increasingly agentic rather than simple chat models, so evaluations that ignore the execution layer leave important failure modes unmeasured. The authors also demonstrate that prompts can be iteratively hardened using signals from the same graphs to reduce those agentic risks. Cross-model checks confirm that the patterns hold for different underlying models and attack styles.

Core claim

Model-level evaluations show baseline differences between models, yet agentic-level assessment using action-component graphs uncovers vulnerabilities that only emerge in agent contexts. Tool-calling exhibits 24-60% higher attack success rates across models, agent transfer operations rank as highest-risk, and semantic patterns drive the issues rather than syntax. Direct prompt transfer from model to agent settings loses effectiveness, while context-aware iterative attacks succeed on objectives that failed at the model level, confirming systematic gaps that action-based prompt improvement can narrow.

What carries the argument

Action-component graphs produced by AgentSeer, which break agentic executions into granular steps to quantify and compare model-level versus agent-level jailbreak risks.
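
To make the idea concrete, here is a minimal sketch of how an execution trace might be recorded as an action graph and aggregated into per-tool attack success rates (ASR). The class and field names (`Action`, `ActionGraph`, `attack_outcomes`), the tool names, and the sample outcomes are all hypothetical; this is not AgentSeer's actual schema or data.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Action:
    """One LLM operation in an agent run (hypothetical schema, not AgentSeer's)."""
    action_id: int
    agent: str
    tool: Optional[str]  # None when the action is a plain response with no tool call
    attack_outcomes: list = field(default_factory=list)  # True = jailbreak succeeded

    def asr(self) -> float:
        """Attack success rate for this action; 0.0 if no attacks were run."""
        if not self.attack_outcomes:
            return 0.0
        return sum(self.attack_outcomes) / len(self.attack_outcomes)


@dataclass
class ActionGraph:
    """Chronological list of actions; each edge links an action to the next one."""
    actions: list = field(default_factory=list)

    def edges(self) -> list:
        ids = [a.action_id for a in self.actions]
        return list(zip(ids, ids[1:]))

    def asr_by_tool(self) -> dict:
        """Pool attack outcomes per tool and return the success rate for each."""
        pooled = defaultdict(list)
        for a in self.actions:
            pooled[a.tool or "no_tool"].extend(a.attack_outcomes)
        return {tool: sum(o) / len(o) for tool, o in pooled.items() if o}


# Made-up outcomes for three actions, four attack attempts each.
graph = ActionGraph([
    Action(0, "planner", None, [False, False, True, False]),
    Action(1, "planner", "transfer_to_agent", [True, True, False, True]),
    Action(2, "researcher", "web_search", [False, True, False, False]),
])
per_tool = graph.asr_by_tool()
```

Pooling outcomes by tool rather than averaging per-action rates weights each attack attempt equally; an action-averaged figure would instead take the mean of `Action.asr()` over actions.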

If this is right

  • Agent transfer operations consistently emerge as the highest-risk category across models.
  • Semantic rather than syntactic mechanisms explain most agentic vulnerability patterns.
  • Iterative context-aware attacks reliably compromise goals that resist direct model-level attacks.
  • Action-graph signals enable automated prompt hardening that lowers average agentic jailbreak success.
  • Universal agentic risk patterns appear across different base models and attack types.
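
The prompt-hardening bullet can be read as a measure, revise, re-measure loop. The sketch below shows one way such a loop might look. The paper's actual hardening procedure is not described in this summary, so `evaluate_asr` and `harden` are placeholder callables, and the toy stand-ins at the bottom exist only to make the example run.

```python
from typing import Callable


def harden_prompt(
    prompt: str,
    evaluate_asr: Callable[[str], float],
    harden: Callable[[str, float], str],
    target_asr: float = 0.1,
    max_rounds: int = 5,
):
    """Revise a system prompt until measured ASR drops to target_asr or below.

    evaluate_asr stands in for running an attack suite against one action;
    harden stands in for any prompt-revision step. Both are assumptions.
    Returns (final_prompt, asr_history).
    """
    history = [evaluate_asr(prompt)]
    for _ in range(max_rounds):
        if history[-1] <= target_asr:
            break
        candidate = harden(prompt, history[-1])
        candidate_asr = evaluate_asr(candidate)
        # Keep a revision only if it measurably lowers the attack success rate.
        if candidate_asr >= history[-1]:
            break
        prompt = candidate
        history.append(candidate_asr)
    return prompt, history


# Toy stand-ins: each appended guard marker halves the simulated ASR.
def toy_evaluate(p: str) -> float:
    return 0.4 * (0.5 ** p.count("[guard]"))


def toy_harden(p: str, asr: float) -> str:
    return p + " [guard]"


final_prompt, asr_history = harden_prompt(
    "You are a helpful agent.", toy_evaluate, toy_harden, target_asr=0.15
)
```

Stopping when a revision fails to reduce ASR keeps the loop from accumulating prompt text that has no measured benefit.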

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Safety testing for deployed agents should routinely include full execution traces instead of isolated prompts.
  • Tool interfaces may need redesign to reduce exposure of high-risk operations such as transfers.
  • The graph-based approach could be applied to measure risks in other agent frameworks or multi-turn workflows.

Load-bearing premise

Decomposing executions into action-component graphs and tracking attack transfer metrics fully captures agentic risks without missing major failure modes or creating artifacts from the particular tool-calling setup.

What would settle it

A new agent system in which tool-calling attack success rates stay at or below model-level rates under the same attack suite would indicate the claimed agentic-only vulnerabilities do not generalize.
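
The test described above reduces to a simple comparison, sketched below. The summary does not say whether "24-60% higher" is an absolute or a relative difference, so this sketch assumes percentage points; the function names and every number in the example are hypothetical and not taken from the paper.

```python
def agentic_gap(model_level_asr: float, tool_calling_asr: float) -> float:
    """Gap in percentage points between tool-calling and model-level ASR."""
    return tool_calling_asr - model_level_asr


def claim_contradicted(results: dict) -> bool:
    """True if every system's tool-calling ASR is at or below its model-level ASR.

    results maps a system name to (model_level_asr, tool_calling_asr), in percent.
    """
    return all(agentic_gap(m, t) <= 0 for m, t in results.values())


# Hypothetical measurements, in percent (illustrative only).
no_gap_system = {"agent_x": (42.0, 40.5)}
gap_system = {"agent_y": (35.0, 70.0)}

contradicts = claim_contradicted(no_gap_system)
supports = not claim_contradicted(gap_system)
```

A single point comparison like this ignores sampling noise, so in practice the gap would need an interval estimate before being treated as evidence either way.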

Figures

Figures reproduced from arXiv: 2509.04802 by Adriano Koshiyama, Ilham Wicaksono, Philip Treleaven, Rahul Patel, Theo King, Zekun Wu.

Figure 1: AgentSeer interface showing action graph (chronological LLM operations) and component ...
Figure 2: Hierarchical architecture of the 6-agent testbed system used for evaluation. The structure ...
Figure 3: AgentSeer action panel interface showing detailed action information including input/output ...
Figure 4: AgentSeer component panel view highlighting relationships between actions and system ...
Figure 5: AgentSeer human input visualization demonstrating how user interactions are captured and ...
Figure 6: GPT-OSS-20B direct agentic attack success rates across all 29 actions and injection ...
Figure 7: GPT-OSS-20B comparison between iterative and direct agentic attack success rates, ranked ...
Figure 8: GPT-OSS-20B tool risk analysis showing attack success rates for different tools during ...
Figure 9: GPT-OSS-20B scatter plot analysis of attack success rates versus input token length for ...
Figure 10: GPT-OSS-20B agent-specific risk analysis for direct agentic attacks, showing vulnerability ...
Figure 11: GPT-OSS-20B weighted blast radius analysis showing the propagation impact of successful ...
Figure 12: Gemini-2.0-flash direct agentic attack success rates across all 27 actions and injection ...
Figure 13: Gemini-2.0-flash comparison between iterative and direct agentic attack success rates, ...
Figure 14: Gemini-2.0-flash tool risk analysis showing attack success rates for different tools during ...
Figure 15: Gemini-2.0-flash scatter plot analysis of attack success rates versus input token length, ...
Figure 16: Gemini-2.0-flash agent-specific risk analysis for direct agentic attacks, showing model ...
Figure 17: Gemini-2.0-flash weighted blast radius analysis demonstrating attack impact propagation ...
Original abstract

As large language models increasingly deployed into agentic systems, existing methods face critical gaps in observing, assessing, and mitigating deployment-specific risks. We present a comprehensive, observability-driven workflow: we introduce AgentSeer, observability tool which decomposes agentic executions into granular action-component graphs; we use this decomposition to rigorously quantify the gap between model-level and agent-level jailbreaking risk via cross-model validation on GPT-OSS-20B and Gemini-2.0-flash with HarmBench under single-turn and iterative-refinement attacks; we leverage action-graph risk signals to automate iterative prompt hardening against direct and iterative jailbreak attacks. Stark differences is revealed between model-level and agentic-level vulnerability profiles. Model-level evaluation reveals baseline differences: GPT-OSS-20B (39.47% ASR) versus Gemini-2.0-flash (50.00% ASR), with both models showing susceptibility to social engineering. However, agentic-level assessment exposes agent-specific risks invisible to traditional evaluation. We discover "agentic-only" vulnerabilities that emerge exclusively in agentic contexts, with tool-calling showing 24-60% higher ASR across both models. Cross-model analysis reveals universal agentic patterns, where agent transfer operations as highest-risk tools, with semantic pattern revealed rather than syntactic vulnerability mechanisms. Direct attack transfer from model-level to agentic contexts shows degraded performance of successful prompts (GPT-OSS-20B: 57% human injection ASR; Gemini-2.0-flash: 28%), while context-aware iterative attacks successfully compromise objectives that failed at model-level, confirming systematic vulnerabilities gaps. Action-based prompt improvement substantially reduces action-averaged agentic jailbreak success on GPT-OSS-20B (direct: 45.3%

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces AgentSeer, an observability tool that decomposes agentic LLM executions into granular action-component graphs. It uses this decomposition to quantify gaps between model-level and agentic-level jailbreaking risks via cross-model experiments on GPT-OSS-20B and Gemini-2.0-flash with HarmBench under single-turn and iterative-refinement attacks, identifying 'agentic-only' vulnerabilities (especially 24-60% higher ASR in tool-calling), universal patterns in agent transfer operations, degraded direct attack transfer performance, and the utility of action-graph signals for automated prompt hardening.

Significance. If the central empirical claims hold after methodological clarification, the work would be significant for LLM safety research by providing a concrete workflow to surface deployment-specific risks missed by standard model benchmarks. The cross-model validation, use of action graphs for both measurement and mitigation, and reporting of concrete ASR numbers on two distinct models are strengths that could inform safer agentic system design.

major comments (3)
  1. [Experimental Protocol] Methods/Experimental Protocol: The exact definitions of 'action-component', data exclusion rules, and statistical significance testing for ASR differences are not visible. This is load-bearing for the claim of systematic 'agentic-only' vulnerabilities and the 24-60% tool-calling ASR gap, as the reader's assessment notes these details are required to assess robustness.
  2. [Results on Agentic Vulnerabilities] Results on Agentic Vulnerabilities: The decomposition into action-component graphs and the chosen attack transfer metrics (direct vs. context-aware iterative) may introduce artifacts from the specific tool-calling implementation rather than isolating inherent agentic risks. The skeptic's concern is valid here: if the observed gaps partly reflect differences in attack adaptation or scaffolding instead of model- vs. agent-level differences, the universal patterns and degraded direct-transfer results (GPT-OSS-20B: 57%; Gemini-2.0-flash: 28%) do not fully establish the central claim.
  3. [Cross-model Analysis] Cross-model Analysis: The assertion that agent transfer operations are the highest-risk tools and that semantic patterns (rather than syntactic) drive vulnerabilities requires explicit side-by-side comparison showing these risks are invisible at model level; without this, the 'agentic-only' designation rests on the chosen observability tool without sufficient falsification checks.
minor comments (3)
  1. [Abstract] Abstract: 'Stark differences is revealed' contains a subject-verb agreement error and should read 'Stark differences are revealed'.
  2. [Abstract] Abstract: The final sentence is truncated ('on GPT-OSS-20B (direct: 45.3%') and should be completed with the full results for both models and attack types.
  3. [Throughout] Notation: Ensure consistent expansion and use of acronyms such as ASR on first mention in each major section.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and rigor of our work. We address each major comment in detail below, providing clarifications and indicating revisions made to the manuscript.

  1. Referee: [Experimental Protocol] Methods/Experimental Protocol: The exact definitions of 'action-component', data exclusion rules, and statistical significance testing for ASR differences are not visible. This is load-bearing for the claim of systematic 'agentic-only' vulnerabilities and the 24-60% tool-calling ASR gap, as the reader's assessment notes these details are required to assess robustness.

    Authors: We agree that precise definitions and statistical details are crucial for evaluating the robustness of our claims. In the revised manuscript, we have expanded the Methods section to include: (1) a formal definition of 'action-component' as the minimal executable units in the agent graph (tool invocation, internal reasoning, and response generation); (2) data exclusion criteria, which exclude runs with logging failures or non-deterministic tool outputs (less than 3% of total executions); and (3) statistical testing using bootstrap resampling (n=1000) with reported confidence intervals and p-values for ASR differences, confirming the tool-calling gap is statistically significant (p<0.05) across models. These additions directly support the 'agentic-only' vulnerability claims. revision: yes

  2. Referee: [Results on Agentic Vulnerabilities] Results on Agentic Vulnerabilities: The decomposition into action-component graphs and the chosen attack transfer metrics (direct vs. context-aware iterative) may introduce artifacts from the specific tool-calling implementation rather than isolating inherent agentic risks. The skeptic's concern is valid here: if the observed gaps partly reflect differences in attack adaptation or scaffolding instead of model- vs. agent-level differences, the universal patterns and degraded direct-transfer results (GPT-OSS-20B: 57%; Gemini-2.0-flash: 28%) do not fully establish the central claim.

    Authors: We take this concern seriously and have conducted additional experiments to mitigate potential artifacts. Specifically, we replicated the study using an alternative agent framework (AutoGen) and observed consistent 'agentic-only' vulnerabilities with comparable ASR gaps (22-55% for tool-calling). We also include an ablation where we standardize the tool-calling interface across setups, showing that the gaps persist and are not artifacts of our original implementation. The iterative refinement attacks are context-aware by design to simulate realistic agent interactions, and we now explicitly discuss how this isolates agent-level risks from pure model-level ones. The degraded direct-transfer results are presented as evidence of context-specific vulnerabilities rather than a flaw. revision: partial

  3. Referee: [Cross-model Analysis] Cross-model Analysis: The assertion that agent transfer operations are the highest-risk tools and that semantic patterns (rather than syntactic) drive vulnerabilities requires explicit side-by-side comparison showing these risks are invisible at model level; without this, the 'agentic-only' designation rests on the chosen observability tool without sufficient falsification checks.

    Authors: To address this, we have added a new subsection and accompanying table in the Results section that provides direct side-by-side comparisons of ASR for agent transfer operations at model-level (using isolated prompt injections) versus agentic-level (within full execution graphs). This demonstrates that these operations show significantly higher risk only in agentic contexts (e.g., 65% agentic ASR vs. 12% model-level for GPT-OSS-20B), invisible in standard benchmarks. For semantic vs. syntactic, we include clustering analysis of successful jailbreak prompts, revealing shared semantic themes (e.g., authority appeals in transfer ops) across models that do not appear in syntactic pattern matching at model level. These additions provide the requested falsification checks and strengthen the 'agentic-only' designation. revision: yes
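
The first response mentions bootstrap resampling (n=1000) with confidence intervals for ASR differences. A minimal sketch of a percentile bootstrap is given below, assuming per-attempt binary outcomes are available for both conditions. The function name, the seed, and the sample data are illustrative, and the procedure the authors actually used may differ.

```python
import random


def bootstrap_asr_diff(agentic, model_level, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for ASR(agentic) - ASR(model_level).

    Inputs are lists of 0/1 outcomes (1 = attack succeeded).
    Returns (observed_difference, ci_low, ci_high).
    """
    rng = random.Random(seed)
    observed = sum(agentic) / len(agentic) - sum(model_level) / len(model_level)
    diffs = []
    for _ in range(n_resamples):
        # Resample each condition independently, with replacement.
        a = rng.choices(agentic, k=len(agentic))
        m = rng.choices(model_level, k=len(model_level))
        diffs.append(sum(a) / len(a) - sum(m) / len(m))
    diffs.sort()
    low = diffs[int((alpha / 2) * n_resamples)]
    high = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return observed, low, high


# Hypothetical outcomes: 70/100 agentic successes vs 40/100 model-level successes.
agentic_outcomes = [1] * 70 + [0] * 30
model_outcomes = [1] * 40 + [0] * 60
observed, ci_low, ci_high = bootstrap_asr_diff(agentic_outcomes, model_outcomes)
```

If the interval excludes zero, the gap is unlikely to be explained by resampling noise alone; it says nothing about artifacts introduced by the tool-calling setup itself, which is the referee's second concern.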

Circularity Check

0 steps flagged

No circularity in empirical evaluation of agentic vulnerabilities

The paper introduces AgentSeer as an observability tool that decomposes agentic executions into action-component graphs and then measures ASR gaps between model-level and agent-level jailbreaks via direct experiments on two models (GPT-OSS-20B, Gemini-2.0-flash) using the external HarmBench benchmark under single-turn and iterative attacks. No equations, fitted parameters, or derivations appear that reduce reported results or 'agentic-only' vulnerabilities to quantities defined from the same data or self-citations; the central claims rest on observed cross-model transfer differences and external benchmarks, making the work self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on the assumption that action graphs faithfully represent agent behavior and that the chosen attack success metrics generalize beyond the tested models and benchmark. No free parameters are explicitly fitted in the abstract; the main invented entity is the AgentSeer tool and its graph representation.

axioms (2)
  • [domain assumption] HarmBench provides a representative sample of jailbreak prompts whose success rates transfer meaningfully to agentic settings.
    Invoked when using HarmBench for both single-turn and iterative-refinement attacks to quantify the gap.
  • [domain assumption] The decomposition of agent executions into action-component graphs does not itself introduce or mask vulnerabilities.
    Central to the claim that the observed differences are genuine agentic risks rather than measurement artifacts.
invented entities (1)
  • AgentSeer observability tool and action-component graphs [no independent evidence]
    Purpose: to decompose and quantify agentic executions for risk measurement and prompt hardening.
    New construct introduced to enable the model-vs-agentic comparison; no independent falsifiable prediction outside the paper is stated.

pith-pipeline@v0.9.0 · 5878 in / 1593 out tokens · 32896 ms · 2026-05-18T19:14:02.643760+00:00 · methodology
