
arXiv: 2605.11053 · v3 · pith:C43N4527new · submitted 2026-05-11 · 💻 cs.CR · cs.AI · cs.LG

Content-Aware Attack Detection in LLM Agent Tool-Call Traffic: An Empirical Study of Features, Architectures, and Evaluation Protocols

Pith reviewed 2026-05-25 05:55 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG
keywords attack detection · LLM agents · tool calls · Model Context Protocol · content embeddings · SBERT · graph neural networks · evaluation protocols

The pith

Content embeddings drive attack detection in LLM agent tool-call traffic, with tree ensembles on SBERT features reaching 0.975 AUROC and outperforming GNNs and MLPs under task-stratified evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates detectors for attacks on Model Context Protocol (MCP) sessions, which LLM agents use to call external tools. Sessions are represented as graphs whose nodes carry sentence embeddings of arguments and responses. Multiple architectures are compared on RAS-Eval with task-stratified splits, and GraphSAGE is retained as the GNN baseline on ATBench and a combined-source variant with label-stratified splits. Content embeddings prove essential, lifting AUROC from roughly 0.64 with metadata alone to above 0.89, while pooled SBERT features fed to tree ensembles achieve the highest score and render graph structure and neural layers largely unnecessary. The work also shows that random splits can inflate reported AUROC by up to 26 percentage points relative to task-disjoint splits, which motivates the stricter evaluation protocol.

Core claim

The detection signal resides primarily in the SBERT content embeddings: tree ensembles on pooled embeddings reached an AUROC of 0.975, outperforming the neural architectures in the primary RAS-Eval setting, including the GNNs (0.917) and the MLP (0.896). Metadata-only baselines plateau around 0.64 regardless of model, and task-stratified splits are required to avoid memorization confounds that prior work overlooked.

What carries the argument

Encoding each agent session as a graph (tool calls as nodes, sequential and data-flow links as edges) with nodes enriched by SBERT sentence-embedding features over arguments and responses, then classifying the resulting graphs or pooled embeddings.
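
The encoding described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: `ToolCall`, `toy_embed`, `encode_session`, and `mean_pool` are hypothetical names, the hashed bag-of-words `toy_embed` is only a stand-in for an SBERT encoder, and the substring rule used to draw data-flow edges is an assumption, since the page does not say how those links are derived.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ToolCall:
    tool: str
    arguments: str
    response: str


def toy_embed(text, dim=8):
    """Stand-in for a sentence encoder: hashed bag-of-words, normalized to sum to 1."""
    vec = [0.0] * dim
    for token in text.lower().split():
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % dim
        vec[bucket] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]


def encode_session(calls):
    """Return (node_features, sequential_edges, data_flow_edges) for one session."""
    # One node per tool call: argument embedding concatenated with response embedding.
    nodes = [toy_embed(c.arguments) + toy_embed(c.response) for c in calls]
    # Sequential edges link each call to the next one.
    seq_edges = [(i, i + 1) for i in range(len(calls) - 1)]
    # Assumed data-flow rule: an earlier response reappears inside a later argument.
    flow_edges = [
        (i, j)
        for i in range(len(calls))
        for j in range(i + 1, len(calls))
        if calls[i].response and calls[i].response in calls[j].arguments
    ]
    return nodes, seq_edges, flow_edges


def mean_pool(nodes):
    """Average node features into one fixed-length session vector (the no-graph input)."""
    dim = len(nodes[0])
    return [sum(n[k] for n in nodes) / len(nodes) for k in range(dim)]


session = [
    ToolCall("read_file", "path=notes.txt", "api_key=abc123"),
    ToolCall("http_post", "url=http://example.test body=api_key=abc123", "200 OK"),
]
nodes, seq_edges, flow_edges = encode_session(session)
pooled = mean_pool(nodes)
print(seq_edges, flow_edges, len(pooled))
```

The graph outputs would feed a GNN, while the pooled vector is the kind of input a tree ensemble or MLP would consume. In practice the stand-in encoder would be replaced by a real sentence-embedding model.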

If this is right

  • Content-level features are required to exceed 0.89 AUROC; metadata alone is insufficient.
  • Task-disjoint splits must be used, or reported AUROC can be inflated by up to 26 percentage points.
  • Self-supervised pre-training confers no label-efficiency advantage on this detection task.
  • Graph neural networks and MLPs do not improve upon simple tree ensembles once content embeddings are provided.
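
The task-disjoint requirement in the list above can be illustrated with a small split routine. This is a sketch under assumptions: `task_disjoint_split` is a hypothetical helper, and the page does not describe the paper's exact splitting procedure. The only property it demonstrates is that every session from a given task lands on one side of the split, so a model cannot score well by memorizing task-specific content.

```python
import random


def task_disjoint_split(task_ids, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) such that no task ID appears on both sides."""
    tasks = sorted(set(task_ids))
    rng = random.Random(seed)
    rng.shuffle(tasks)
    # Hold out whole tasks, never individual sessions.
    n_test = max(1, round(len(tasks) * test_fraction))
    test_tasks = set(tasks[:n_test])
    train_idx = [i for i, t in enumerate(task_ids) if t not in test_tasks]
    test_idx = [i for i, t in enumerate(task_ids) if t in test_tasks]
    return train_idx, test_idx


# Ten sessions drawn from five tasks (two sessions per task).
task_ids = ["t1", "t1", "t2", "t2", "t3", "t3", "t4", "t4", "t5", "t5"]
train_idx, test_idx = task_disjoint_split(task_ids, test_fraction=0.2, seed=0)
train_tasks = {task_ids[i] for i in train_idx}
test_tasks = {task_ids[i] for i in test_idx}
print(sorted(train_tasks), sorted(test_tasks))
```

A naive random split over session indices would typically place sessions from the same task on both sides, which is the memorization confound the paper warns about. scikit-learn's `GroupShuffleSplit` offers the same grouping behavior for larger pipelines.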

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Attack patterns in tool-call traffic may be captured by semantic content rather than call ordering or data-flow topology.
  • Deployment pipelines could favor lightweight tree models on pre-computed embeddings for lower latency and easier auditing.
  • The same content-embedding approach might transfer to detecting misuse in other structured agent interfaces beyond MCP.

Load-bearing premise

The RAS-Eval and ATBench datasets contain accurately labeled examples of benign and attacked MCP sessions that are representative of real deployment scenarios.

What would settle it

A new collection of MCP sessions with verified attack labels drawn from live agent deployments yields AUROC below 0.80 for the tree-ensemble model on pooled SBERT embeddings.
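
For readers who want to run this check on their own labeled sessions, AUROC can be computed directly from detector scores. The sketch below uses the pairwise (Mann-Whitney) definition with ties counted as one half. The function name `auroc` and the toy labels and scores are illustrative assumptions, while the 0.80 threshold is the one stated above.

```python
def auroc(labels, scores):
    """Probability that a random attacked session outscores a random benign one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Toy data only: 1 = attacked, 0 = benign; scores are detector outputs.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
value = auroc(labels, scores)
falls_below_threshold = value < 0.80
print(value, falls_below_threshold)
```

The quadratic loop is fine for a sanity check; for large evaluation sets a rank-based implementation such as scikit-learn's `roc_auc_score` would be the usual choice.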

Figures

Figures reproduced from arXiv: 2605.11053 by Sultan Zavrak.

Figure 1: Overview of the MCPShield framework. MCP tool-call sessions are encoded […]
Figure 2: Example MCP tool-call session showing a data exfiltration attack: the agent […]
Figure 3: Session-to-graph encoding. Left: sequential tool calls. Right: the resulting […]
Figure 4: Per-attack-category recall on the Combined dataset (GraphSAGE, content fea […]
Figure 5: Label efficiency across datasets. Supervised training (circles) matches or per […]

Original abstract

The Model Context Protocol (MCP) has become a widely adopted interface for LLM agents to invoke external tools, yet learned monitoring of MCP tool-call traffic remains underexplored. In this article, the proposed detector is presented as an attack detection framework for MCP tool-call traffic that encodes each agent session as a graph (tool calls as nodes, sequential and data-flow links as edges), enriches nodes with sentence-embedding features over arguments and responses, and classifies sessions as benign or attacked. Three GNN architectures (GAT, GCN, GraphSAGE), a no-graph MLP, and classical baselines (XGBoost, random forest, logistic regression, linear SVM) are evaluated, with the full architecture comparison conducted on RAS-Eval (task-stratified splits) and GraphSAGE retained as the GNN baseline on ATBench and a combined-source variant (both label-stratified). Three findings emerge. First, content-level features are essential: metadata-only detection plateaus around an AUROC of 0.64 regardless of architecture, while content embeddings push the AUROC above 0.89. Second, naive random-split evaluation inflates AUROC by up to 26 percentage points relative to task-disjoint splits, a memorization confound that prior agent-detection work has not addressed. Third, the detection signal resides primarily in the SBERT content embeddings: an AUROC of 0.975 was reached by tree ensembles on pooled embeddings, performing, for the most part, better than the neural architectures in the primary RAS-Eval setting including GNNs (0.917) and the MLP (0.896), and self-supervised pre-training does not deliver a label-efficiency advantage on this task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents an empirical study of attack detection for LLM agent tool-call traffic under the Model Context Protocol (MCP). Sessions are encoded as graphs (tool calls as nodes, sequential/data-flow edges) with nodes enriched by SBERT sentence embeddings of arguments and responses. It compares three GNNs (GAT, GCN, GraphSAGE), an MLP, and classical baselines (XGBoost, random forest, etc.) on the RAS-Eval dataset using task-stratified splits and on ATBench plus a combined variant using label-stratified splits. Key claims are that content embeddings are essential (AUROC >0.89 vs ~0.64 for metadata-only), random splits inflate performance by up to 26 points, and tree ensembles on pooled SBERT embeddings reach 0.975 AUROC, outperforming GNNs (0.917) and MLP (0.896) in the primary setting.

Significance. If the RAS-Eval and ATBench labels and attack instances are accurate and representative of deployment, the work would usefully demonstrate the dominance of content features over metadata, the necessity of task-disjoint evaluation to avoid memorization, and concrete performance benchmarks across architectures for MCP monitoring.

major comments (2)
  1. [Methods / Dataset description] Dataset construction (Methods section): No description is provided of how labels for benign vs. attacked MCP sessions were generated or validated in RAS-Eval or ATBench, nor of the attack synthesis process. This is load-bearing for the central claim that content embeddings carry the primary detection signal (AUROC 0.975 for tree ensembles on pooled SBERT embeddings vs. 0.917 for GNNs and 0.896 for MLP), because all reported AUROCs and architecture rankings rest on the correctness and realism of these labels.
  2. [Results / RAS-Eval evaluation] Results (architecture comparison): The claim that tree ensembles outperform the neural models is presented with specific AUROC numbers on RAS-Eval, but the text does not report statistical significance tests, standard deviations across multiple runs, or hyperparameter search details for the baselines. This weakens the strength of the conclusion that content embeddings make graph structure secondary.
minor comments (1)
  1. [Abstract] The abstract states that self-supervised pre-training does not deliver a label-efficiency advantage, but the corresponding experiment (if present) should be explicitly referenced with a table or figure number for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. We address the major comments below and plan to incorporate revisions to improve the clarity and rigor of the paper.

  1. Referee: [Methods / Dataset description] Dataset construction (Methods section): No description is provided of how labels for benign vs. attacked MCP sessions were generated or validated in RAS-Eval or ATBench, nor of the attack synthesis process. This is load-bearing for the central claim that content embeddings carry the primary detection signal (AUROC 0.975 for tree ensembles on pooled SBERT embeddings vs. 0.917 for GNNs and 0.896 for MLP), because all reported AUROCs and architecture rankings rest on the correctness and realism of these labels.

    Authors: We agree with the referee that a detailed description of the dataset construction, including label generation, validation, and attack synthesis for RAS-Eval and ATBench, is essential for interpreting the results. In the revised manuscript, we will expand the Methods section to include this information, drawing from the original dataset papers and our synthesis process. This will provide the necessary transparency regarding the label quality and realism. revision: yes

  2. Referee: [Results / RAS-Eval evaluation] Results (architecture comparison): The claim that tree ensembles outperform the neural models is presented with specific AUROC numbers on RAS-Eval, but the text does not report statistical significance tests, standard deviations across multiple runs, or hyperparameter search details for the baselines. This weakens the strength of the conclusion that content embeddings make graph structure secondary.

    Authors: We acknowledge that the current manuscript does not include statistical significance tests, standard deviations from multiple runs, or detailed hyperparameter search procedures for the baselines. We will add these elements in the revision to strengthen the empirical claims. While the performance gaps are large, we agree that reporting variance and significance will better support the conclusion that the detection signal is primarily in the content embeddings rather than the graph structure. revision: yes
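
The variance and significance reporting discussed in this exchange could take a form like the following. It is a sketch only: the per-seed AUROC values are placeholders invented for illustration and are not results from the paper, and the exact paired sign-flip permutation test is one reasonable choice rather than the test the authors commit to. `summarize` and `paired_sign_flip_p` are hypothetical helper names.

```python
import itertools
import statistics


def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_sign_flip_p(a, b):
    """Exact two-sided sign-flip permutation test on paired per-seed differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total


# Placeholder per-seed AUROCs for two models evaluated on the same five splits.
model_a = [0.80, 0.82, 0.81, 0.83, 0.79]
model_b = [0.70, 0.71, 0.72, 0.70, 0.69]

mean_a, std_a = summarize(model_a)
mean_b, std_b = summarize(model_b)
p_value = paired_sign_flip_p(model_a, model_b)
print(f"A: {mean_a:.3f} +/- {std_a:.3f}  B: {mean_b:.3f} +/- {std_b:.3f}  p = {p_value:.4f}")
```

With five paired runs the smallest attainable two-sided p-value under this test is 2/32 = 0.0625, so more seeds are needed before a gap can be called significant at the 0.05 level.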

Circularity Check

0 steps flagged

No circularity: pure empirical comparison on fixed external datasets


The paper reports AUROC results from training and evaluating GNNs, MLPs, and tree ensembles on the RAS-Eval and ATBench datasets using SBERT embeddings and graph encodings. No equations, fitted parameters, or predictions are defined in terms of the target metrics; all numbers are direct experimental outputs. No self-citation chains, uniqueness theorems, or ansatzes are load-bearing. The work is self-contained against the stated benchmarks with no reduction of claims to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical machine-learning study that introduces no new free parameters, theoretical axioms, or invented entities beyond standard supervised classification assumptions and existing embedding models.

axioms (1)
  • Domain assumption: ground-truth labels in RAS-Eval and ATBench correctly distinguish benign from attacked sessions.
    All reported AUROC figures and architecture rankings depend directly on the accuracy of these labels.

pith-pipeline@v0.9.0 · 5854 in / 1240 out tokens · 33291 ms · 2026-05-25T05:55:15.518432+00:00 · methodology
