Can Graph-Based Microservice Performance Detection Be Used for Microservice Intrusion Detection?
Pith reviewed 2026-06-30 14:20 UTC · model grok-4.3
The pith
A two-layer graph convolutional network classifies microservice request graphs into normal and five attack types at 96.2 percent accuracy in random splits, yet non-graph baselines outperform it when splits respect trial boundaries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Request traces converted to multi-modal invocation graphs can be fed to a two-layer graph convolutional network to distinguish normal operation from five controlled attack types. The model reaches 96.2 percent test accuracy and 0.955 macro F1 under graph-level random splits. Under trial-level splits, however, trace structure alone proves insufficient, logs and metrics add value, and flattened non-graph baselines exceed the shallow graph model on the same feature set.
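As a rough illustration of the trace-to-graph step, the sketch below turns the spans of one request into a request-level invocation graph. The span fields (`span_id`, `parent_id`, `service`), the service names, and the two-number feature vectors are assumptions made for this example; the paper's actual trace schema and feature set are not described here.

```python
def build_invocation_graph(spans, node_features):
    """Convert one request trace into (nodes, edges, feature rows).

    spans: list of dicts with hypothetical keys span_id, parent_id, service.
    node_features: dict mapping service name -> list of floats, standing in
    for log and metric features aligned to this request (hypothetical).
    """
    by_id = {s["span_id"]: s for s in spans}
    nodes = sorted({s["service"] for s in spans})
    index = {name: i for i, name in enumerate(nodes)}

    edges = set()
    for s in spans:
        parent = by_id.get(s["parent_id"])
        # A caller -> callee edge exists when a span's parent ran in another service.
        if parent is not None and parent["service"] != s["service"]:
            edges.add((index[parent["service"]], index[s["service"]]))

    features = [list(node_features[name]) for name in nodes]
    return nodes, sorted(edges), features


# Toy trace: gateway calls auth and cart, cart calls db.
spans = [
    {"span_id": "a", "parent_id": None, "service": "gateway"},
    {"span_id": "b", "parent_id": "a", "service": "auth"},
    {"span_id": "c", "parent_id": "a", "service": "cart"},
    {"span_id": "d", "parent_id": "c", "service": "db"},
]
node_features = {
    "gateway": [0.1, 3.0],
    "auth": [0.0, 1.0],
    "cart": [0.2, 2.0],
    "db": [0.7, 5.0],
}
nodes, edges, features = build_invocation_graph(spans, node_features)
```

Nodes are indexed in sorted service-name order, so each edge is a pair of node indices pointing from caller to callee.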
What carries the argument
Two-layer graph convolutional network performing graph-level 6-way classification on request invocation graphs whose nodes carry timestamped log and per-service metric features.
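To make the model family concrete, here is a minimal forward-pass sketch of a two-layer GCN for graph-level classification, using the standard propagation rule of Kipf and Welling with a symmetrically normalized adjacency matrix. Several choices are assumptions made for illustration rather than details from the paper: edges are treated as undirected, the readout is mean pooling over nodes, the hidden width is 4, and the weights are random instead of trained.

```python
import math
import random


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalized_adjacency(n, edges):
    """Return D^-1/2 (A + I) D^-1/2, treating edges as undirected."""
    a = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for u, v in edges:
        a[u][v] = a[v][u] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]


def relu(m):
    return [[max(0.0, v) for v in row] for row in m]


def gcn_graph_logits(x, edges, w0, w1):
    """Two GCN layers followed by mean pooling (the readout is an assumption)."""
    a_hat = normalized_adjacency(len(x), edges)
    h1 = relu(matmul(matmul(a_hat, x), w0))   # layer 1: ReLU(A_hat X W0)
    h2 = matmul(matmul(a_hat, h1), w1)        # layer 2: A_hat H1 W1
    return [sum(col) / len(h2) for col in zip(*h2)]  # one logit per class


rng = random.Random(0)
x = [[0.2, 1.0], [0.9, 0.1], [0.4, 0.4]]  # 3 nodes, 2 toy features each
edges = [(0, 1), (0, 2)]
w0 = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(2)]  # 2 -> 4 hidden
w1 = [[rng.uniform(-1, 1) for _ in range(6)] for _ in range(4)]  # 4 -> 6 classes
logits = gcn_graph_logits(x, edges, w0, w1)
predicted_class = max(range(6), key=logits.__getitem__)
```

With only two propagation steps, each node sees at most its two-hop neighborhood, which is one reason a shallow model may extract little from trace structure beyond what the node features already carry.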
If this is right
- Trace structure contributes little without accompanying logs and metrics.
- Flattened feature vectors currently yield higher accuracy than the two-layer GCN on the engineered data.
- Modality ablation shows performance drops when any single data source is removed.
- Runtime overhead of the graph model remains modest enough for online use in the synthetic setting.
- Error analysis and t-SNE plots indicate separable clusters only when all modalities are present.
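A modality ablation of the kind listed above can be sketched as masking the feature columns that belong to one data source before training or evaluation. The column layout in `MODALITY_COLUMNS` is purely hypothetical, and zero-masking is only one possible way to remove a modality; the paper's actual procedure is not given here.

```python
# Hypothetical mapping from modality name to node-feature column indices.
MODALITY_COLUMNS = {"trace": [0, 1], "logs": [2, 3], "metrics": [4, 5]}


def ablate(features, drop):
    """Return a copy of the feature rows with the dropped modalities zeroed."""
    cols = {c for name in drop for c in MODALITY_COLUMNS[name]}
    return [[0.0 if j in cols else v for j, v in enumerate(row)] for row in features]


row = [[1.0, 2.0, 3.0, 4.0, 5.0, 6.0]]
without_logs = ablate(row, ["logs"])
trace_only = ablate(row, ["logs", "metrics"])
```

Re-running the same training and evaluation loop on each masked variant gives one accuracy figure per configuration, which is what lets the contribution of each modality be compared.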
Where Pith is reading between the lines
- Production systems may require deeper graph models or richer node features before graph methods surpass current flat baselines.
- The gap between random and trial splits suggests that attack signatures learned on one workload may not transfer to new workloads without retraining.
- If real intrusions produce request graphs outside the synthetic distribution, hybrid detectors combining graph and non-graph signals will be needed.
- Extending the approach to streaming, online graph construction could test whether the method scales beyond offline batch evaluation.
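The difference between the two evaluation protocols can be sketched as follows. A graph-level random split shuffles individual request graphs, so graphs from one trial can land on both sides. A trial-level split assigns whole trials to either train or test. The `trial` and `label` keys and the toy dataset are assumptions for illustration; the paper's exact split definition is not given here.

```python
import random


def graph_level_split(graphs, test_frac, seed=0):
    """Shuffle individual graphs; trials may straddle train and test."""
    rng = random.Random(seed)
    shuffled = graphs[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_frac))
    return shuffled[:cut], shuffled[cut:]


def trial_level_split(graphs, test_frac, seed=0):
    """Hold out whole trials so no trial contributes to both sides."""
    rng = random.Random(seed)
    trials = sorted({g["trial"] for g in graphs})
    rng.shuffle(trials)
    n_test = max(1, round(len(trials) * test_frac))
    test_trials = set(trials[:n_test])
    train = [g for g in graphs if g["trial"] not in test_trials]
    test = [g for g in graphs if g["trial"] in test_trials]
    return train, test


def shared_trials(train, test):
    """Trials that appear on both sides of a split (potential leakage)."""
    return {g["trial"] for g in train} & {g["trial"] for g in test}


# Toy dataset: 10 trials with 20 request graphs each (labels are placeholders).
graphs = [{"trial": t, "label": t % 6} for t in range(10) for _ in range(20)]

rand_train, rand_test = graph_level_split(graphs, test_frac=0.25)
trial_train, trial_test = trial_level_split(graphs, test_frac=0.25)

leaky = shared_trials(rand_train, rand_test)      # non-empty here
clean = shared_trials(trial_train, trial_test)    # always empty
```

In this toy setup the random split must share at least one trial between train and test, because a 50-graph test set cannot be assembled from whole 20-graph trials. Shared trials are exactly the condition under which trial-specific artifacts can inflate test accuracy.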
Load-bearing premise
The synthetic Docker Compose e-commerce benchmark and its five controlled attack types generate request graphs whose statistical properties match those of real production microservice systems under intrusion.
What would settle it
Running the identical pipeline on a production microservice deployment that experiences real intrusions and measuring whether accuracy falls below the reported trial-level figures.
Original abstract
Microservice systems expose rich telemetry streams, including metrics, logs, and distributed traces. Existing performance anomaly detection methods increasingly model these systems as graphs, where nodes represent services and edges represent runtime dependencies. This paper asks whether graph-based microservice performance detection can also serve as a foundation for microservice intrusion detection. We deploy a Docker Compose based synthetic e-commerce microservice benchmark, run 50 controlled trials across five attack types under normal workloads, and collect metrics, logs, and distributed traces. Each request trace is converted into a request-level invocation graph with multi-modal node features derived from timestamped logs and per-service performance metrics. As a first baseline, we train a two-layer graph convolutional network for 6-way classification over 21,438 request graphs. The model achieves 96.2% test accuracy with a macro F1 of 0.955 under a graph-level random split. We then conduct modality ablation, trial-level split evaluation, non-graph baseline comparison, runtime analysis, t-SNE visualization, confusion-matrix analysis, and error-case inspection. The stricter trial-level results show that trace structure alone is insufficient, logs and metrics improve detection, and strong flattened baselines currently outperform the shallow graph model on the engineered feature set.
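Because the abstract reports both accuracy and macro F1, a short sketch of how the two are computed from a confusion matrix may help. Macro F1 averages per-class F1 scores with equal weight, so it drops when a small class is classified poorly even if overall accuracy stays high. The 2-class confusion matrix below is invented for illustration and is not data from the paper.

```python
def accuracy(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    total = sum(map(sum, confusion))
    return sum(confusion[i][i] for i in range(len(confusion))) / total


def macro_f1(confusion):
    """Unweighted mean of per-class F1 scores."""
    k = len(confusion)
    scores = []
    for c in range(k):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp
        fp = sum(confusion[r][c] for r in range(k)) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / k


# Invented, imbalanced toy example: 90 samples of class 0, 10 of class 1.
toy_confusion = [
    [90, 0],
    [5, 5],
]
toy_accuracy = accuracy(toy_confusion)
toy_macro_f1 = macro_f1(toy_confusion)
```

In the toy example accuracy is 0.95 while macro F1 is roughly 0.82, because half of the minority class is misclassified. A macro F1 slightly below accuracy, as in the reported 0.955 versus 96.2%, is consistent with some classes being harder than others, though the per-class breakdown would be needed to confirm that.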
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript explores whether graph convolutional networks, previously used for microservice performance anomaly detection, can be applied to intrusion detection. Using a synthetic Docker Compose e-commerce benchmark, the authors run 50 trials with five attack types, convert request traces into invocation graphs with multi-modal features from logs and metrics, and train a two-layer GCN for 6-class classification. They report 96.2% accuracy and 0.955 macro F1 under a graph-level random split on 21,438 graphs, while also presenting results from trial-level splits, modality ablations, non-graph baselines, and other analyses, concluding that trace structure alone is insufficient and that flattened baselines outperform the GCN.
Significance. If the findings hold, this work contributes an initial empirical investigation into repurposing performance-detection graphs for security tasks in microservices. It is notable for its balanced presentation of optimistic and conservative evaluation protocols, direct comparisons to baselines, and acknowledgment of limitations in generalization from synthetic data. This could inform future research on multi-modal graph models for combined performance and security monitoring.
major comments (1)
- [Abstract] The prominent reporting of 96.2% accuracy and 0.955 macro F1 under the graph-level random split is undermined by the authors' own trial-level results (also summarized in the abstract), which indicate leakage of trial-specific artifacts and show that non-graph baselines outperform the GCN. For an intrusion-detection claim, the random-split protocol is not load-bearing, and the headline result should be de-emphasized relative to the stricter protocol.
minor comments (2)
- The methods section should provide explicit counts of graphs per trial and the precise definition of the trial-level split to support reproducibility of the stricter evaluation.
- Figure captions and t-SNE visualizations would benefit from clearer labeling of attack vs. normal clusters to aid interpretation of the confusion-matrix analysis.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comment on the abstract. We agree that the trial-level split is the more appropriate protocol for assessing intrusion-detection performance and that the random-split headline result should be de-emphasized.
- Referee: [Abstract] The prominent reporting of 96.2% accuracy and 0.955 macro F1 under the graph-level random split is undermined by the authors' own trial-level results (also summarized in the abstract), which indicate leakage of trial-specific artifacts and show that non-graph baselines outperform the GCN. For an intrusion-detection claim, the random-split protocol is not load-bearing, and the headline result should be de-emphasized relative to the stricter protocol.
Authors: We agree with the referee that the graph-level random split permits leakage of trial-specific artifacts (e.g., workload patterns or attack timing that are consistent within a trial) and is therefore not the primary protocol for an intrusion-detection claim. The manuscript already reports the trial-level results in the abstract and body, shows that non-graph baselines outperform the GCN under that protocol, and concludes that trace structure alone is insufficient. To address the concern, we will revise the abstract to lead with the trial-level findings, move the 96.2% figure to a secondary clause that explicitly notes its limitations, and add a short sentence clarifying why the stricter split is required for security-related claims.
Revision: yes
Circularity Check
Empirical classification experiment with held-out test data exhibits no circularity
The paper describes an empirical machine-learning experiment: request traces are converted to graphs, a two-layer GCN is trained for 6-way classification, and test accuracy (96.2%) plus macro F1 (0.955) are reported on a held-out portion of 21,438 graphs under a stated random split. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citations appear in the provided text. The result is a direct measurement on external test data rather than a quantity forced by construction from the training procedure or prior author work. The paper's own additional experiments (trial-level splits, modality ablations, baseline comparisons) are presented as separate evaluations and do not reduce the headline metric to an input by definition.