Content-Aware Attack Detection in LLM Agent Tool-Call Traffic: An Empirical Study of Features, Architectures, and Evaluation Protocols
Pith reviewed 2026-05-25 05:55 UTC · model grok-4.3
The pith
Content embeddings drive attack detection in LLM agent tool-call traffic, with tree ensembles on SBERT features reaching 0.975 AUROC and outperforming GNNs and MLPs under task-stratified evaluation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The detection signal resides primarily in the SBERT content embeddings: tree ensembles on pooled embeddings reach an AUROC of 0.975, outperforming the neural architectures in the primary RAS-Eval setting, including GNNs (0.917) and the MLP (0.896). Metadata-only baselines plateau around 0.64 regardless of model, and task-stratified splits are required to avoid memorization confounds that prior work overlooked.
What carries the argument
Encoding each agent session as a graph (tool calls as nodes, sequential and data-flow links as edges) with nodes enriched by SBERT sentence-embedding features over arguments and responses, then classifying the resulting graphs or pooled embeddings.
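The encoding described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `ToolCall`, `SessionGraph`, `build_session_graph`, and `mean_pool` are hypothetical names, `toy_embed` is a deterministic stand-in for an SBERT encoder, and the substring rule used to infer data-flow edges is a guess at one plausible heuristic, not the paper's stated method.

```python
from typing import Callable, List, NamedTuple, Tuple


class ToolCall(NamedTuple):
    tool: str
    arguments: str  # serialized call arguments
    response: str   # serialized tool response


class SessionGraph(NamedTuple):
    node_features: List[List[float]]    # one feature vector per tool call
    edges: List[Tuple[int, int, str]]   # (source, target, edge type)


def toy_embed(text: str, dim: int = 8) -> List[float]:
    """Stand-in for an SBERT encoder: hashed bag of words, L1-normalized."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[sum(ord(c) for c in token) % dim] += 1.0
    total = sum(vec) or 1.0
    return [v / total for v in vec]


def build_session_graph(
    calls: List[ToolCall],
    embed: Callable[[str], List[float]] = toy_embed,
) -> SessionGraph:
    # Node features: argument embedding concatenated with response embedding.
    features = [embed(c.arguments) + embed(c.response) for c in calls]
    # Sequential edges link each call to the next one in the session.
    edges = [(i, i + 1, "sequential") for i in range(len(calls) - 1)]
    # Assumed data-flow heuristic: an earlier response reappears verbatim
    # inside a later call's arguments.
    for i, src in enumerate(calls):
        for j in range(i + 1, len(calls)):
            if src.response and src.response in calls[j].arguments:
                edges.append((i, j, "data_flow"))
    return SessionGraph(features, edges)


def mean_pool(graph: SessionGraph) -> List[float]:
    """Average node features into one session vector for non-graph models."""
    n = len(graph.node_features)
    if n == 0:
        return []
    dim = len(graph.node_features[0])
    return [sum(f[k] for f in graph.node_features) / n for k in range(dim)]


example_calls = [
    ToolCall("search_files", "query=report", "/tmp/report.txt"),
    ToolCall("read_file", "path=/tmp/report.txt", "quarterly numbers"),
    ToolCall("send_email", "to=bob body=summary", "sent"),
]
example_graph = build_session_graph(example_calls)
example_pooled = mean_pool(example_graph)
```

A graph model would consume `node_features` and `edges` directly, while a tree ensemble or MLP would consume only the pooled vector, which is the comparison the review describes.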
If this is right
- Content-level features are required to exceed 0.89 AUROC; metadata alone is insufficient.
- Task-disjoint splits must be used, or reported AUROC can be inflated by up to 26 percentage points.
- Self-supervised pre-training confers no label-efficiency advantage on this detection task.
- Graph neural networks and MLPs do not improve upon simple tree ensembles once content embeddings are provided.
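The task-disjoint requirement in the list above can be sketched as a grouped split. This is a minimal illustration that assumes every session carries a task identifier; the function name `task_disjoint_split` and the toy identifiers are hypothetical and not taken from the paper.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def task_disjoint_split(
    task_ids: Sequence[str],
    test_fraction: float = 0.25,
    seed: int = 0,
) -> Tuple[List[int], List[int]]:
    """Split session indices so that no task appears in both train and test."""
    by_task: Dict[str, List[int]] = defaultdict(list)
    for idx, task in enumerate(task_ids):
        by_task[task].append(idx)

    # Shuffle whole tasks, not individual sessions.
    tasks = sorted(by_task)
    random.Random(seed).shuffle(tasks)
    n_test = max(1, round(len(tasks) * test_fraction))
    test_tasks = set(tasks[:n_test])

    train = [i for i, t in enumerate(task_ids) if t not in test_tasks]
    test = [i for i, t in enumerate(task_ids) if t in test_tasks]
    return train, test


# Toy data: four tasks with two sessions each.
example_task_ids = ["t1", "t1", "t2", "t2", "t3", "t3", "t4", "t4"]
train_idx, test_idx = task_disjoint_split(example_task_ids)
```

A naive random split would shuffle the eight sessions individually, so sessions from the same task could land on both sides and let a model score well by recognizing the task instead of the attack.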
Where Pith is reading between the lines
- Attack patterns in tool-call traffic may be captured by semantic content rather than call ordering or data-flow topology.
- Deployment pipelines could favor lightweight tree models on pre-computed embeddings for lower latency and easier auditing.
- The same content-embedding approach might transfer to detecting misuse in other structured agent interfaces beyond MCP.
Load-bearing premise
The RAS-Eval and ATBench datasets contain accurately labeled examples of benign and attacked MCP sessions that are representative of real deployment scenarios.
What would settle it
A new collection of MCP sessions with verified attack labels drawn from live agent deployments yields AUROC below 0.80 for the tree-ensemble model on pooled SBERT embeddings.
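The check described above reduces to computing AUROC on the new sessions and comparing it with 0.80. The following is a minimal sketch that uses the pairwise (Mann-Whitney) definition of AUROC; the names `auroc`, `FALSIFICATION_THRESHOLD`, and `claim_is_challenged` are hypothetical, and the labels and scores are toy values, not results from the paper.

```python
from typing import Sequence

FALSIFICATION_THRESHOLD = 0.80  # threshold quoted in the text above


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random attacked session outscores a random benign one.

    Ties count as half a win. Labels: 1 = attacked, 0 = benign.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def claim_is_challenged(auroc_value: float) -> bool:
    """True when the measured AUROC falls below the stated threshold."""
    return auroc_value < FALSIFICATION_THRESHOLD


toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auroc = auroc(toy_labels, toy_scores)  # 3 of 4 pairs ranked correctly: 0.75
```

The quadratic pairwise loop is chosen for readability; a rank-based implementation gives the same value and scales better to large test sets.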
Original abstract
The Model Context Protocol (MCP) has become a widely adopted interface for LLM agents to invoke external tools, yet learned monitoring of MCP tool-call traffic remains underexplored. This article presents an attack detection framework for MCP tool-call traffic that encodes each agent session as a graph (tool calls as nodes, sequential and data-flow links as edges), enriches nodes with sentence-embedding features over arguments and responses, and classifies sessions as benign or attacked. Three GNN architectures (GAT, GCN, GraphSAGE), a no-graph MLP, and classical baselines (XGBoost, random forest, logistic regression, linear SVM) are evaluated, with the full architecture comparison conducted on RAS-Eval (task-stratified splits) and GraphSAGE retained as the GNN baseline on ATBench and a combined-source variant (both label-stratified). Three findings emerge. First, content-level features are essential: metadata-only detection plateaus around an AUROC of 0.64 regardless of architecture, while content embeddings push the AUROC above 0.89. Second, naive random-split evaluation inflates AUROC by up to 26 percentage points relative to task-disjoint splits, a memorization confound that prior agent-detection work has not addressed. Third, the detection signal resides primarily in the SBERT content embeddings: tree ensembles on pooled embeddings reach an AUROC of 0.975, for the most part outperforming the neural architectures in the primary RAS-Eval setting, including GNNs (0.917) and the MLP (0.896), and self-supervised pre-training does not deliver a label-efficiency advantage on this task.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents an empirical study of attack detection for LLM agent tool-call traffic under the Model Context Protocol (MCP). Sessions are encoded as graphs (tool calls as nodes, sequential and data-flow links as edges) with nodes enriched by SBERT sentence embeddings of arguments and responses. It compares three GNNs (GAT, GCN, GraphSAGE), an MLP, and classical baselines (XGBoost, random forest, logistic regression, linear SVM) on the RAS-Eval dataset using task-stratified splits, and on ATBench plus a combined variant using label-stratified splits. The key claims are that content embeddings are essential (AUROC above 0.89 versus about 0.64 for metadata-only), that random splits inflate AUROC by up to 26 percentage points, and that tree ensembles on pooled SBERT embeddings reach 0.975 AUROC, outperforming GNNs (0.917) and the MLP (0.896) in the primary setting.
Significance. If the RAS-Eval and ATBench labels and attack instances are accurate and representative of deployment, the work would usefully demonstrate the dominance of content features over metadata, the necessity of task-disjoint evaluation to avoid memorization, and concrete performance benchmarks across architectures for MCP monitoring.
major comments (2)
- [Methods / Dataset description] Dataset construction (Methods section): No description is provided of how labels for benign vs. attacked MCP sessions were generated or validated in RAS-Eval or ATBench, nor of the attack synthesis process. This is load-bearing for the central claim that content embeddings carry the primary detection signal (AUROC 0.975 for tree ensembles on pooled SBERT embeddings vs. 0.917 for GNNs and 0.896 for MLP), because all reported AUROCs and architecture rankings rest on the correctness and realism of these labels.
- [Results / RAS-Eval evaluation] Results (architecture comparison): The claim that tree ensembles outperform the neural models is presented with specific AUROC numbers on RAS-Eval, but the text does not report statistical significance tests, standard deviations across multiple runs, or hyperparameter search details for the baselines. This weakens the strength of the conclusion that content embeddings make graph structure secondary.
minor comments (1)
- [Abstract] The abstract states that self-supervised pre-training does not deliver a label-efficiency advantage, but the corresponding experiment (if present) should be explicitly referenced with a table or figure number for clarity.
Simulated Author's Rebuttal
We appreciate the referee's detailed feedback on our manuscript. We address the major comments below and plan to incorporate revisions to improve the clarity and rigor of the paper.
- Referee: [Methods / Dataset description] Dataset construction (Methods section): No description is provided of how labels for benign vs. attacked MCP sessions were generated or validated in RAS-Eval or ATBench, nor of the attack synthesis process. This is load-bearing for the central claim that content embeddings carry the primary detection signal (AUROC 0.975 for tree ensembles on pooled SBERT embeddings vs. 0.917 for GNNs and 0.896 for MLP), because all reported AUROCs and architecture rankings rest on the correctness and realism of these labels.
  Authors: We agree with the referee that a detailed description of the dataset construction, including label generation, validation, and attack synthesis for RAS-Eval and ATBench, is essential for interpreting the results. In the revised manuscript, we will expand the Methods section to include this information, drawing from the original dataset papers and our synthesis process. This will provide the necessary transparency regarding the label quality and realism.
  Revision: yes
- Referee: [Results / RAS-Eval evaluation] Results (architecture comparison): The claim that tree ensembles outperform the neural models is presented with specific AUROC numbers on RAS-Eval, but the text does not report statistical significance tests, standard deviations across multiple runs, or hyperparameter search details for the baselines. This weakens the strength of the conclusion that content embeddings make graph structure secondary.
  Authors: We acknowledge that the current manuscript does not include statistical significance tests, standard deviations from multiple runs, or detailed hyperparameter search procedures for the baselines. We will add these elements in the revision to strengthen the empirical claims. While the performance gaps are large, we agree that reporting variance and significance will better support the conclusion that the detection signal is primarily in the content embeddings rather than the graph structure.
  Revision: yes
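One way the significance reporting requested above could be approached is a paired bootstrap over test sessions, which yields a confidence interval for the AUROC gap between two models scored on the same test set. This is a sketch under assumptions, not the authors' planned procedure: the function names are hypothetical and the scores are toy values chosen so that the outcome is easy to verify by hand.

```python
import random
from typing import List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Pairwise AUROC with ties counted as half a win (1 = attacked)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_bootstrap_auroc_diff(
    labels: Sequence[int],
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 1000,
    seed: int = 0,
) -> Tuple[float, float]:
    """95% percentile interval for AUROC(model A) - AUROC(model B)."""
    if len(set(labels)) < 2:
        raise ValueError("both classes are required")
    rng = random.Random(seed)
    n = len(labels)
    diffs: List[float] = []
    while len(diffs) < n_resamples:
        # Resample sessions with replacement; both models see the same draw.
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUROC is undefined for single-class resamples
        a = auroc(ys, [scores_a[i] for i in idx])
        b = auroc(ys, [scores_b[i] for i in idx])
        diffs.append(a - b)
    diffs.sort()
    return diffs[int(0.025 * n_resamples)], diffs[int(0.975 * n_resamples) - 1]


# Toy case: model A separates the classes perfectly, model B is uninformative.
toy_labels = [0] * 10 + [1] * 10
toy_scores_a = [0.9 if y == 1 else 0.1 for y in toy_labels]
toy_scores_b = [0.5] * len(toy_labels)
ci_low, ci_high = paired_bootstrap_auroc_diff(toy_labels, toy_scores_a, toy_scores_b)
```

An interval that excludes zero would indicate that the gap is unlikely to be a resampling artifact. Run-to-run variance from random seeds and hyperparameter search is a separate source of uncertainty that this sketch does not cover.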
Circularity Check
No circularity: pure empirical comparison on fixed external datasets
The paper reports AUROC results from training and evaluating GNNs, MLPs, and tree ensembles on the RAS-Eval and ATBench datasets using SBERT embeddings and graph encodings. No equations, fitted parameters, or predictions are defined in terms of the target metrics; all numbers are direct experimental outputs. No self-citation chains, uniqueness theorems, or ansatzes are load-bearing. The work is self-contained against the stated benchmarks with no reduction of claims to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: ground-truth labels in RAS-Eval and ATBench correctly distinguish benign from attacked sessions.