Can Graph-Based Microservice Performance Detection Be Used for Microservice Intrusion Detection?

Yunjian Ma

arxiv: 2605.24283 · v1 · pith:ISJBN5BQnew · submitted 2026-05-22 · 💻 cs.SE

Can Graph-Based Microservice Performance Detection Be Used for Microservice Intrusion Detection?

Yunjian Ma This is my paper

Pith reviewed 2026-06-30 14:20 UTC · model grok-4.3

classification 💻 cs.SE

keywords microservicesintrusion detectiongraph convolutional networksdistributed tracesanomaly detectionsynthetic benchmarksmulti-modal features

0 comments

The pith

A two-layer graph convolutional network classifies microservice request graphs into normal and five attack types at 96.2 percent accuracy in random splits, yet non-graph baselines outperform it when splits respect trial boundaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether graph models built for performance anomaly detection can double as intrusion detectors in microservice systems. It builds request-level invocation graphs from traces, augments them with log and metric features, and trains a shallow GCN for six-way classification on over twenty-one thousand examples from a synthetic e-commerce benchmark. High accuracy appears under random graph splits, but stricter trial-level splits and modality ablations reveal that trace structure by itself is weak, while engineered flat features fed to non-graph classifiers currently win. The work therefore positions the graph approach as a promising but still immature direction rather than a ready replacement for existing methods.

Core claim

Request traces converted to multi-modal invocation graphs can be fed to a two-layer graph convolutional network to distinguish normal operation from five controlled attack types; the model reaches 96.2 percent test accuracy and 0.955 macro F1 under graph-level random splits, yet trial-level splits show trace structure alone is insufficient, logs and metrics add value, and flattened non-graph baselines exceed the shallow graph model on the same feature set.

What carries the argument

Two-layer graph convolutional network performing graph-level 6-way classification on request invocation graphs whose nodes carry timestamped log and per-service metric features.

If this is right

Trace structure contributes little without accompanying logs and metrics.
Flattened feature vectors currently yield higher accuracy than the two-layer GCN on the engineered data.
Modality ablation shows performance drops when any single data source is removed.
Runtime overhead of the graph model remains modest enough for online use in the synthetic setting.
Error analysis and t-SNE plots indicate separable clusters only when all modalities are present.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Production systems may require deeper graph models or richer node features before graph methods surpass current flat baselines.
The gap between random and trial splits suggests that attack signatures learned on one workload may not transfer to new workloads without retraining.
If real intrusions produce request graphs outside the synthetic distribution, hybrid detectors combining graph and non-graph signals will be needed.
Extending the approach to streaming, online graph construction could test whether the method scales beyond offline batch evaluation.

Load-bearing premise

The synthetic Docker Compose e-commerce benchmark and its five controlled attack types generate request graphs whose statistical properties match those of real production microservice systems under intrusion.

What would settle it

Running the identical pipeline on a production microservice deployment that experiences real intrusions and measuring whether accuracy falls below the reported trial-level figures.

Figures

Figures reproduced from arXiv: 2605.24283 by Yunjian Ma.

**Figure 2.** Figure 2: Confusion matrix for trial-level GCN with logs+metrics. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: t-SNE projection of trial-level GCN test logits. Separation is strongest for high-signal classes and [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Runtime cost comparison under trial-level split. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

read the original abstract

Microservice systems expose rich telemetry streams, including metrics, logs, and distributed traces. Existing performance anomaly detection methods increasingly model these systems as graphs, where nodes represent services and edges represent runtime dependencies. This paper asks whether graph-based microservice performance detection can also serve as a foundation for microservice intrusion detection. We deploy a Docker Compose based synthetic e-commerce microservice benchmark, run 50 controlled trials across five attack types under normal workloads, and collect metrics, logs, and distributed traces. Each request trace is converted into a request-level invocation graph with multi-modal node features derived from timestamped logs and per-service performance metrics. As a first baseline, we train a two-layer graph convolutional network for 6-way classification over 21,438 request graphs. The model achieves 96.2% test accuracy with a macro F1 of 0.955 under a graph-level random split. We then conduct modality ablation, trial-level split evaluation, non-graph baseline comparison, runtime analysis, t-SNE visualization, confusion-matrix analysis, and error-case inspection. The stricter trial-level results show that trace structure alone is insufficient, logs and metrics improve detection, and strong flattened baselines currently outperform the shallow graph model on the engineered feature set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows a GCN on microservice request graphs only reaches 96% under a random split that leaks trial artifacts, while the authors' own trial-level split and non-graph baselines tell a weaker story.

read the letter

The main point is that this work checks whether performance graphs can be reused for intrusion detection and ends up showing the limits of that idea once evaluation is done right. They built request-level invocation graphs from traces in a synthetic Docker Compose e-commerce setup across 50 trials and five attack types, added multi-modal features from logs and metrics, and trained a two-layer GCN for six-way classification on 21k graphs.

What stands out is the set of experiments: they report the optimistic random-split result but also run the stricter trial-level split, modality ablations, non-graph baselines, t-SNE, confusion matrices, and runtime checks. The abstract is upfront that trace structure alone is not enough and that flattened baselines currently beat the shallow GCN. That package of comparisons is the concrete contribution.

The soft spot is the evaluation. The headline 96.2% accuracy and 0.955 macro F1 come from a graph-level random split that lets graphs from the same trial appear in both train and test, which leaks workload and attack-instance artifacts. The trial-level split the authors also performed is the more relevant protocol for intrusion detection, and there the graph model loses ground to simple baselines. The data remains entirely synthetic with controlled attacks under normal load, so it is unclear how the patterns would transfer to production systems with real adversaries. The model is only two layers, which may also explain why the graph structure adds little once the features are engineered.

This is for researchers already working on microservice monitoring who want to test if their existing graphs can double for security tasks. A reader in that area gets a clear empirical check on the split issue and the baseline results. It is not a broad advance, but the honest reporting of where the graph approach falls short makes it worth a referee's time. I would send it for peer review so the community can weigh whether the trial split is sufficient or if more realistic attack data is required.

Referee Report

1 major / 2 minor

Summary. The manuscript explores whether graph convolutional networks, previously used for microservice performance anomaly detection, can be applied to intrusion detection. Using a synthetic Docker Compose e-commerce benchmark, the authors run 50 trials with five attack types, convert request traces into invocation graphs with multi-modal features from logs and metrics, and train a two-layer GCN for 6-class classification. They report 96.2% accuracy and 0.955 macro F1 under a graph-level random split on 21,438 graphs, while also presenting results from trial-level splits, modality ablations, non-graph baselines, and other analyses, concluding that trace structure alone is insufficient and that flattened baselines outperform the GCN.

Significance. If the findings hold, this work contributes an initial empirical investigation into repurposing performance-detection graphs for security tasks in microservices. It is notable for its balanced presentation of optimistic and conservative evaluation protocols, direct comparisons to baselines, and acknowledgment of limitations in generalization from synthetic data. This could inform future research on multi-modal graph models for combined performance and security monitoring.

major comments (1)

[Abstract] Abstract: the prominent reporting of 96.2% accuracy and 0.955 macro F1 under graph-level random split is undermined by the authors' own trial-level results (also summarized in the abstract), which indicate leakage of trial-specific artifacts and show that non-graph baselines outperform the GCN; for an intrusion-detection claim, the random-split protocol is not load-bearing and the headline result should be de-emphasized relative to the stricter protocol.

minor comments (2)

The methods section should provide explicit counts of graphs per trial and the precise definition of the trial-level split to support reproducibility of the stricter evaluation.
Figure captions and t-SNE visualizations would benefit from clearer labeling of attack vs. normal clusters to aid interpretation of the confusion-matrix analysis.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive comment on the abstract. We agree that the trial-level split is the more appropriate protocol for assessing intrusion-detection performance and that the random-split headline result should be de-emphasized.

read point-by-point responses

Referee: [Abstract] Abstract: the prominent reporting of 96.2% accuracy and 0.955 macro F1 under graph-level random split is undermined by the authors' own trial-level results (also summarized in the abstract), which indicate leakage of trial-specific artifacts and show that non-graph baselines outperform the GCN; for an intrusion-detection claim, the random-split protocol is not load-bearing and the headline result should be de-emphasized relative to the stricter protocol.

Authors: We agree with the referee that the graph-level random split permits leakage of trial-specific artifacts (e.g., workload patterns or attack timing that are consistent within a trial) and is therefore not the primary protocol for an intrusion-detection claim. The manuscript already reports the trial-level results in the abstract and body, shows that non-graph baselines outperform the GCN under that protocol, and concludes that trace structure alone is insufficient. To address the concern, we will revise the abstract to lead with the trial-level findings, move the 96.2% figure to a secondary clause that explicitly notes its limitations, and add a short sentence clarifying why the stricter split is required for security-related claims. revision: yes

Circularity Check

0 steps flagged

Empirical classification experiment with held-out test data exhibits no circularity

full rationale

The paper describes an empirical machine-learning experiment: request traces are converted to graphs, a two-layer GCN is trained for 6-way classification, and test accuracy (96.2%) plus macro F1 (0.955) are reported on a held-out portion of 21,438 graphs under a stated random split. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citations appear in the provided text. The result is a direct measurement on external test data rather than a quantity forced by construction from the training procedure or prior author work. The paper's own additional experiments (trial-level splits, modality ablations, baseline comparisons) are presented as separate evaluations and do not reduce the headline metric to an input by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical machine-learning study; no mathematical axioms, free parameters, or invented entities are introduced beyond standard supervised classification assumptions.

pith-pipeline@v0.9.1-grok · 5741 in / 1083 out tokens · 40841 ms · 2026-06-30T14:20:29.571611+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

14 extracted references · 7 canonical work pages

[1]

Graph based anomaly detection and description: A survey.Data Mining and Knowledge Discovery, 29(3):626–688, 2015

Leman Akoglu, Hanghang Tong, and Danai Koutra. Graph based anomaly detection and description: A survey.Data Mining and Knowledge Discovery, 29(3):626–688, 2015

2015
[2]

Tracegra: A trace-based anomaly detection for microservice using graph deep learning.Computer Communications, 204:109–117, 2023

Jian Chen, Fagui Liu, Jun Jiang, Guoxiang Zhong, Dishi Xu, Zhuanglun Tan, and Shangsong Shi. Tracegra: A trace-based anomaly detection for microservice using graph deep learning.Computer Communications, 204:109–117, 2023. doi: 10.1016/j.comcom.2023.03.028

work page doi:10.1016/j.comcom.2023.03.028 2023
[3]

Deeplog: Anomaly detection and diagnosis from system logs through deep learning

Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. Deeplog: Anomaly detection and diagnosis from system logs through deep learning. InProceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 1285–1298, 2017. doi: 10.1145/3133956.3134015

work page doi:10.1145/3133956.3134015 2017
[4]

Hamilton, Rex Ying, and Jure Leskovec

William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. InAdvances in Neural Information Processing Systems, 2017

2017
[5]

A closer look at different difficulty levels code generation abilities of chatgpt,

Jun Huang, Yang Yang, Hang Yu, Jianguo Li, and Xiao Zheng. Twin graph-based anomaly detection via attentive multi-modal learning for microservice system. In2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2023. doi: 10.1109/ASE56229.2023.00124. 9

work page doi:10.1109/ase56229.2023.00124 2023
[6]

Kipf and Max Welling

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. InInternational Conference on Learning Representations, 2017

2017
[7]

Cheryl Lee, Tianyi Yang, Zhuangbin Chen, Yuxin Su, and Michael R. Lyu. Eadro: An end-to- end troubleshooting framework for microservices on multi-source data. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1750–1762, 2023. doi: 10.1109/ ICSE48619.2023.00150

work page arXiv 2023
[8]

Servicerank: Root cause identification of anomaly in large-scale microservice architectures.IEEE Transactions on Dependable and Secure Computing, 19 (5):3087–3100, 2022

Meng Ma, Weilan Lin, Disheng Pan, and Ping Wang. Servicerank: Root cause identification of anomaly in large-scale microservice architectures.IEEE Transactions on Dependable and Secure Computing, 19 (5):3087–3100, 2022. doi: 10.1109/TDSC.2021.3083671

work page doi:10.1109/tdsc.2021.3083671 2022
[9]

Kitsune: An ensemble of autoen- coders for online network intrusion detection

Yisroel Mirsky, Tomer Doitshman, Yuval Elovici, and Asaf Shabtai. Kitsune: An ensemble of autoen- coders for online network intrusion detection. InProceedings of the Network and Distributed System Security Symposium, 2018

2018
[10]

Sigelman, Luiz Andr ´e Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Don- ald Beaver, Saul Jaspan, and Chandan Shanbhag

Benjamin H. Sigelman, Luiz Andr ´e Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Don- ald Beaver, Saul Jaspan, and Chandan Shanbhag. Dapper, a large-scale distributed systems tracing infrastructure. Technical report, Google, 2010

2010
[11]

Graph attention networks

Petar Veliˇckovi´c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Li `o, and Yoshua Bengio. Graph attention networks. InInternational Conference on Learning Representations, 2018

2018
[12]

Microrank: End-to-end latency issue localization with extended spectrum analysis in microservice environments

Guangba Yu, Pengfei Chen, Hongyang Chen, Zijie Guan, Zicheng Huang, Linxiao Jing, Tianjun Weng, Xinmeng Sun, and Xiaoyun Li. Microrank: End-to-end latency issue localization with extended spectrum analysis in microservice environments. InProceedings of The Web Conference 2021, pages 3087–3098, 2021. doi: 10.1145/3442381.3449905

work page doi:10.1145/3442381.3449905 2021
[13]

Deeptralog: Trace-log combined microservice anomaly detection through graph-based deep learning

Chenxi Zhang, Xin Peng, Chaofeng Sha, Ke Zhang, Zhenqing Fu, Xiya Wu, Qingwei Lin, and Dongmei Zhang. Deeptralog: Trace-log combined microservice anomaly detection through graph-based deep learning. InProceedings of the 44th International Conference on Software Engineering, pages 623–634,
[14]

doi: 10.1145/3510003.3510180. 10

work page doi:10.1145/3510003.3510180

[1] [1]

Graph based anomaly detection and description: A survey.Data Mining and Knowledge Discovery, 29(3):626–688, 2015

Leman Akoglu, Hanghang Tong, and Danai Koutra. Graph based anomaly detection and description: A survey.Data Mining and Knowledge Discovery, 29(3):626–688, 2015

2015

[2] [2]

Tracegra: A trace-based anomaly detection for microservice using graph deep learning.Computer Communications, 204:109–117, 2023

Jian Chen, Fagui Liu, Jun Jiang, Guoxiang Zhong, Dishi Xu, Zhuanglun Tan, and Shangsong Shi. Tracegra: A trace-based anomaly detection for microservice using graph deep learning.Computer Communications, 204:109–117, 2023. doi: 10.1016/j.comcom.2023.03.028

work page doi:10.1016/j.comcom.2023.03.028 2023

[3] [3]

Deeplog: Anomaly detection and diagnosis from system logs through deep learning

Min Du, Feifei Li, Guineng Zheng, and Vivek Srikumar. Deeplog: Anomaly detection and diagnosis from system logs through deep learning. InProceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 1285–1298, 2017. doi: 10.1145/3133956.3134015

work page doi:10.1145/3133956.3134015 2017

[4] [4]

Hamilton, Rex Ying, and Jure Leskovec

William L. Hamilton, Rex Ying, and Jure Leskovec. Inductive representation learning on large graphs. InAdvances in Neural Information Processing Systems, 2017

2017

[5] [5]

A closer look at different difficulty levels code generation abilities of chatgpt,

Jun Huang, Yang Yang, Hang Yu, Jianguo Li, and Xiao Zheng. Twin graph-based anomaly detection via attentive multi-modal learning for microservice system. In2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2023. doi: 10.1109/ASE56229.2023.00124. 9

work page doi:10.1109/ase56229.2023.00124 2023

[6] [6]

Kipf and Max Welling

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. InInternational Conference on Learning Representations, 2017

2017

[7] [7]

Cheryl Lee, Tianyi Yang, Zhuangbin Chen, Yuxin Su, and Michael R. Lyu. Eadro: An end-to- end troubleshooting framework for microservices on multi-source data. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 1750–1762, 2023. doi: 10.1109/ ICSE48619.2023.00150

work page arXiv 2023

[8] [8]

Servicerank: Root cause identification of anomaly in large-scale microservice architectures.IEEE Transactions on Dependable and Secure Computing, 19 (5):3087–3100, 2022

Meng Ma, Weilan Lin, Disheng Pan, and Ping Wang. Servicerank: Root cause identification of anomaly in large-scale microservice architectures.IEEE Transactions on Dependable and Secure Computing, 19 (5):3087–3100, 2022. doi: 10.1109/TDSC.2021.3083671

work page doi:10.1109/tdsc.2021.3083671 2022

[9] [9]

Kitsune: An ensemble of autoen- coders for online network intrusion detection

Yisroel Mirsky, Tomer Doitshman, Yuval Elovici, and Asaf Shabtai. Kitsune: An ensemble of autoen- coders for online network intrusion detection. InProceedings of the Network and Distributed System Security Symposium, 2018

2018

[10] [10]

Sigelman, Luiz Andr ´e Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Don- ald Beaver, Saul Jaspan, and Chandan Shanbhag

Benjamin H. Sigelman, Luiz Andr ´e Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Don- ald Beaver, Saul Jaspan, and Chandan Shanbhag. Dapper, a large-scale distributed systems tracing infrastructure. Technical report, Google, 2010

2010

[11] [11]

Graph attention networks

Petar Veliˇckovi´c, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Li `o, and Yoshua Bengio. Graph attention networks. InInternational Conference on Learning Representations, 2018

2018

[12] [12]

Microrank: End-to-end latency issue localization with extended spectrum analysis in microservice environments

Guangba Yu, Pengfei Chen, Hongyang Chen, Zijie Guan, Zicheng Huang, Linxiao Jing, Tianjun Weng, Xinmeng Sun, and Xiaoyun Li. Microrank: End-to-end latency issue localization with extended spectrum analysis in microservice environments. InProceedings of The Web Conference 2021, pages 3087–3098, 2021. doi: 10.1145/3442381.3449905

work page doi:10.1145/3442381.3449905 2021

[13] [13]

Deeptralog: Trace-log combined microservice anomaly detection through graph-based deep learning

Chenxi Zhang, Xin Peng, Chaofeng Sha, Ke Zhang, Zhenqing Fu, Xiya Wu, Qingwei Lin, and Dongmei Zhang. Deeptralog: Trace-log combined microservice anomaly detection through graph-based deep learning. InProceedings of the 44th International Conference on Software Engineering, pages 623–634,

[14] [14]

doi: 10.1145/3510003.3510180. 10

work page doi:10.1145/3510003.3510180