arxiv: 2606.09942 · v1 · pith:CU4ESYQOnew · submitted 2026-06-08 · 💻 cs.SE · cs.AI

Anomaly Detection and Root Cause Analysis for Microservice Systems

Pith reviewed 2026-06-27 16:00 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords: anomaly detection · root cause analysis · microservice systems · observability data · benchmarking · causal inference · cloud applications · failure diagnosis

The pith

BARO, EventADL and TORAI provide end-to-end anomaly detection and root cause analysis for microservice systems using observability data, plus the RCAEval benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The thesis targets five shortcomings shared by current techniques for detecting and diagnosing failures in complex microservice systems. It introduces BARO to combine detection and analysis on metric data, EventADL for event data such as API calls, and TORAI as a multimodal method that works without a supplied service call graph. Experiments on real systems test their effectiveness and robustness. RCAEval supplies ready-to-use datasets and reproducible baselines so that methods can be compared consistently. If the approaches hold up, automated diagnosis becomes more reliable when detection is noisy or delayed, cutting the downtime that follows inevitable failures.

Core claim

BARO, EventADL and TORAI are end-to-end anomaly detection and RCA approaches that exploit observability data independently and collectively; extensive experiments on real microservice systems demonstrate their effectiveness and robustness; RCAEval supplies ready-to-use datasets and reproducible baselines.

What carries the argument

BARO for metric data, EventADL for event data, and TORAI as a multimodal RCA framework that requires no service call graph, together with the RCAEval benchmark for datasets and baselines.

If this is right

  • Anomaly detection and RCA can proceed together even when initial detection is imprecise due to noise or delay.
  • Event data including API calls and configuration changes becomes usable for diagnosis alongside metrics.
  • Root cause analysis works without a given service call graph.
  • Standardized datasets and evaluation frameworks allow fair comparison across methods.
  • Systematic checks on causal inference approaches yield concrete guidance for their use in this domain.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A production system could route different data streams to the matching framework and combine their outputs for higher coverage.
  • The RCAEval benchmark could become the default testbed, reducing duplicated effort when new methods appear.
  • The no-graph requirement in TORAI might extend naturally to environments where call graphs change rapidly or are unavailable.
  • Improved RCA accuracy could feed directly into automated remediation steps that act on identified causes.

Load-bearing premise

The five listed limitations of prior work are both accurate and addressable by the proposed frameworks without introducing comparable new limitations, and the real-system experiments are representative.

What would settle it

Applying BARO, EventADL and TORAI to real microservice deployments outside the tested set and finding that they perform no better than prior separate detection-plus-RCA pipelines, or that they fail under noise levels seen in production.

Figures

Figures reproduced from arXiv: 2606.09942 by Luan Pham.

Figure 1.1: Anomaly Detection and Root Cause Analysis for Microservice Systems.
Figure 1.2: Thesis Organisation.
Figure 2.1: The architecture of the Train Ticket microservice system […
Figure 2.2: Metrics of ServiceA, including workload (rps), CPU usage (%), and latency (ms).
Figure 2.3: Sample log entries from microservice systems.
Figure 2.4: Distributed trace of an anomalous PlaceOrder request. The graph illustrates the service call topology, where each node details the service name, operation, and execution latency. The red text highlights the propagation of high latency from the root cause (paymentservice) upstream to the frontend-ui.
Figure 2.5: Event stream following the OCSF schema […
Figure 3.1: The overview. The monitoring system monitors the microservice system and …
Figure 3.2: An example of using Multivariate BOCPD to detect change points on mul…
Figure 3.3: The Robustness of RobustScorer against imprecise anomaly detection time. In …
Figure 3.4: Overview of our setup for microservice systems.
Figure 3.5: The performance of N-Sigma, ϵ-Diagnosis, CIRCA, RCD, and BARO with respect to different values of t_bias on the Online Boutique dataset. The figure presents the AC@1, AC@3, and Avg@5 scores from left to right.
Figure 3.6: The performance of CIRCA (a), RCD (b), and …
Figure 4.1: An event in the OCSF schema [Open Cybersecurity Schema Framework, 2022]. Limitations of Existing Work. While ADL has been actively studied for metrics [Pham, 2026b; Gu et al., 2024; Pham et al., 2024c; Gu et al., 2025; Li et al., 2022a; Chen et al., 2022] and unstructured logs [Ali et al., 2025; Le and Zhang, 2021; Du et al., 2017; Landauer et al., 2024; Meng et al., 2019; Li et al., 2020], event-based A…
Figure 4.2: Insights from real-world incidents. (a) Distribution of anomaly types. (b) …
Figure 4.3: Overview of EventADL. The framework operates in three phases: offline training (upper left), online anomaly detection (lower left), and root cause localisation (right). During offline training, EventADL learns ESPs and EFPs that capture behaviours observed in historical event data. In the online detection phase, incoming events are continuously evaluated against these patterns: ESP identifies pointwise a…
Figure 4.4: An ESP in the jsonLogic [Wadhams, n.d.] schema. Online Anomaly Detection. As new events arrive, they are continuously compared against the learned ESPs and EFPs. An event is flagged as a pointwise anomaly if it does not match any ESP (i.e., an Event Type or Event Value anomaly). A time-window is flagged as a frequency-based anomaly if the frequency of ESP-matching events deviates significantly from the c…
Figure 4.5: Relationship-agnostic and relationship-aware ESP generalizations.
Figure 4.6: Detecting anomalies with EFP. Each subsequence …
Figure 4.7: Scalability of EventADL. operates on event-based time series, and we observe that the number of unique time series extracted does not grow proportionally with the number of events. 4.5.8 RQ4: Ablation Study We conduct ablation experiments to better understand the contribution of each component in EventADL. First, we examine how each component contributes to the overall anomaly detection performance, then…
Figure 4.8: Magnitude-based vs. shape-based EFP. adaptation, further reduces false alarms, raising precision to 0.82. Importantly, when encountering system evolution for the first time, ESP and EFP alone cannot distinguish legitimate evolution from anomalies, and will report both. With root cause localisation, however, EventADL enables operators to identify the true sources of anomalies, discard false alarms, and tr…
Figure 4.9: Robustness analysis of EventADL w.r.t. different parameters and noise levels on the Falcon dataset.
Figure 5.1: Service Call Graph of the Online Boutique microservice system containing …
Figure 5.2: Overview of TORAI. (A) TORAI transforms telemetry data into time series.
Figure 5.3: CausalRanker analyses the multi-source time series data of all services within …
Figure 5.4: The Robustness of FineGrainer to Imprecise Anomaly Detection. At time t_A, a failure occurs, causing a spike in CPU that eventually leads to increased latency (TraceLat). At t̂_A, TraceLat surpasses the anomaly detection threshold, triggering an anomaly detection. This delayed detection introduces abnormal data (outliers) into the normal period of the CPU time series. The median and interquartile range (IQ…
Figure 5.5: Illustration of our setup for the microservice systems and the multi-source …
Figure 5.6: Sensitivity of TORAI to varying levels of blind spots on (a) the Online Boutique …
Figure 5.7: Example of a stack trace showing a fine-grained root cause of a code-level fault, …
Figure 5.8: The frequency of normal logs versus stack traces of cartservice.
Figure 6.1: Overview of the RCAEval benchmark. and fair comparison remain open challenges in RCA research [Cheng et al., 2023], hindering progress and preventing fair evaluation of new RCA approaches. There have been some related works that introduce datasets or evaluation frameworks, but all of them suffer from several limitations, see …
Figure 6.2: Illustration of our data collection setup.
Figure 6.3: Examples of heterogeneous telemetry data in microservice systems.
Figure 7.1: Overview of the causal inference-based root cause analysis for microservice …
Figure 7.2: Overview of our setup for microservice systems.
Figure 7.3: Performance of seven causal discovery methods on six synthetic datasets with different data lengths.
Figure 7.4: Performance of fourteen RCA methods on eight datasets with different data lengths.
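
The Figure 5.4 caption describes delayed detection leaking abnormal samples into the window treated as normal, and points to the median and interquartile range in that context. The Python sketch below illustrates only that general statistical idea; it is not the thesis's FineGrainer or RobustScorer implementation. The function names `robust_score` and `mean_std_score` and the CPU readings are hypothetical, chosen purely for illustration.

```python
import statistics

def robust_score(normal, suspect):
    """Largest deviation of `suspect` from the median of `normal`, in IQR units."""
    q1, median, q3 = statistics.quantiles(normal, n=4)
    iqr = (q3 - q1) or 1e-9  # guard against a zero IQR on flat series
    return max(abs(x - median) for x in suspect) / iqr

def mean_std_score(normal, suspect):
    """Same deviation measured against the mean and standard deviation."""
    mean = statistics.mean(normal)
    std = statistics.pstdev(normal) or 1e-9
    return max(abs(x - mean) for x in suspect) / std

# Hypothetical CPU readings (%): 18 clean samples, then 2 abnormal samples that
# leaked into the "normal" window because the anomaly was detected late.
clean = [20, 21, 19, 20, 22, 21, 20, 19, 20, 21, 20, 19, 21, 20, 22, 20, 21, 19]
contaminated_normal = clean + [90, 92]
after_detection = [91, 93, 95, 90]

robust = robust_score(contaminated_normal, after_detection)
naive = mean_std_score(contaminated_normal, after_detection)
print(f"median/IQR score: {robust:.1f}, mean/std score: {naive:.1f}")
```

With these made-up numbers, the two leaked samples inflate the mean and standard deviation, so the mean/std score shrinks, while the median and IQR stay close to the clean baseline and the CPU deviation remains prominent.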
Original abstract

Microservice systems are widely used to build cloud applications, yet their complexity makes failures inevitable, degrading user experience and causing economic loss. Automated anomaly detection and root cause analysis (RCA) are now active research areas, but existing techniques share five limitations. First, most treat anomaly detection and RCA separately, assuming anomalies are detected correctly, and falter when detection is imprecise due to noise or delay. Second, they focus on metrics, logs, and traces, leaving event data such as API calls and configuration changes underexplored. Third, many require a given service call graph and cannot diagnose without one. Fourth, the field lacks standardised datasets and evaluation frameworks, so methods are hard to compare fairly. Fifth, although causal inference-based RCA has become dominant, its effectiveness, efficiency, and robustness remain unclear. This thesis addresses these limitations through two groups of contributions. The first introduces methods that exploit observability data both independently and collectively. BARO is an end-to-end anomaly detection and RCA approach for metric data. EventADL is an end-to-end framework for event data. TORAI is a multimodal RCA framework that requires no service call graph. Extensive experiments on real microservice systems demonstrate their effectiveness and robustness. The second group delivers benchmarking datasets, an evaluation framework, and systematic evaluation efforts. RCAEval is a comprehensive benchmark providing ready-to-use datasets and reproducible baselines for future research. A systematic evaluation of existing RCA methods, especially causal inference-based approaches, offers insights that guide future directions. This thesis thereby advances automated anomaly detection and RCA for microservice failures, enabling future research on incident mitigation and remediation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript claims that prior anomaly detection and RCA techniques for microservice systems suffer from five limitations (separate treatment of detection/RCA, underuse of event data, reliance on call graphs, lack of standardized benchmarks, and unclear effectiveness of causal inference methods). It introduces BARO (end-to-end metric-based AD/RCA), EventADL (event-data framework), and TORAI (multimodal RCA without call graphs), plus the RCAEval benchmark supplying ready-to-use datasets and reproducible baselines. The work asserts that extensive experiments on real microservice systems demonstrate the effectiveness and robustness of these contributions, while a systematic evaluation of existing (especially causal) RCA methods yields guiding insights.

Significance. If the experimental evidence is rigorous, the integrated end-to-end methods and the provision of reproducible baselines and datasets could meaningfully advance the field by addressing fragmentation and evaluation gaps. The emphasis on real-system validation and multimodal data without call graphs is potentially valuable, though its impact depends on the quality and transparency of the supporting results.

major comments (1)
  1. [Abstract] The central claim that 'extensive experiments on real microservice systems demonstrate their effectiveness and robustness' supplies no method details, metrics, baselines, statistical tests, or failure cases, rendering the soundness of the primary empirical assertions impossible to evaluate from the provided text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and agree that the abstract can be strengthened for greater transparency.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'extensive experiments on real microservice systems demonstrate their effectiveness and robustness' supplies no method details, metrics, baselines, statistical tests, or failure cases, rendering the soundness of the primary empirical assertions impossible to evaluate from the provided text.

    Authors: We agree that the abstract, as currently written, is too high-level and does not convey the concrete evaluation details needed to assess the empirical claims. The full manuscript contains the requested information (specific metrics such as F1-score and latency for anomaly detection, precision@K and root-cause ranking accuracy for RCA, comparison against multiple baselines including causal methods, and statistical tests across repeated runs on real systems). We will revise the abstract to concisely incorporate key metrics, the main baselines, and a high-level note on robustness results while remaining within length limits.

    Revision: yes
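
The ranking-based RCA metrics mentioned in the authors' response, like the AC@1, AC@3, and Avg@5 scores named in the Figure 3.5 caption, are top-k accuracy measures. This page does not define them, so the Python sketch below assumes the definitions commonly used in the RCA literature: AC@k is the fraction of failure cases whose true root cause appears among the top k ranked candidates, and Avg@k is the mean of AC@1 through AC@k. The function names and the toy rankings are hypothetical.

```python
def ac_at_k(rankings, truths, k):
    """Fraction of cases whose true root cause is within the top-k candidates."""
    hits = sum(1 for ranked, truth in zip(rankings, truths) if truth in ranked[:k])
    return hits / len(truths)

def avg_at_k(rankings, truths, k):
    """Mean of AC@1 .. AC@k (assumed definition, see the lead-in)."""
    return sum(ac_at_k(rankings, truths, j) for j in range(1, k + 1)) / k

# Three hypothetical failure cases; each ranking lists services from most to
# least suspicious, and the true root cause is paymentservice every time.
rankings = [
    ["paymentservice", "cartservice", "frontend"],  # true cause ranked 1st
    ["cartservice", "paymentservice", "frontend"],  # true cause ranked 2nd
    ["frontend", "cartservice", "paymentservice"],  # true cause ranked 3rd
]
truths = ["paymentservice", "paymentservice", "paymentservice"]

for k in (1, 3):
    print(f"AC@{k} = {ac_at_k(rankings, truths, k):.2f}")
print(f"Avg@3 = {avg_at_k(rankings, truths, 3):.2f}")
```

Reporting several values of k together shows whether a method places the true cause first or merely somewhere near the top, which a single top-k number would hide.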

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is an empirical thesis presenting three new frameworks (BARO for metrics, EventADL for events, TORAI for multimodal RCA without call graphs) plus the RCAEval benchmark and systematic evaluations. No equations, fitted parameters, or derivation chains appear in the supplied abstract or description. The five limitations are listed as motivation for the work rather than as premises that the paper proves internally. No self-citations are invoked as load-bearing uniqueness theorems, and no predictions reduce to inputs by construction. The central claims rest on experimental results on real microservice systems, which are externally falsifiable and independent of any internal redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no visible free parameters, mathematical axioms, or invented entities; the contributions are algorithmic frameworks and empirical benchmarks whose internal modeling choices are not described.

pith-pipeline@v0.9.1-grok · 5813 in / 1143 out tokens · 23164 ms · 2026-06-27T16:00:50.887978+00:00 · methodology


Reference graph

Works this paper leans on

189 extracted references · 9 canonical work pages · 1 internal anchor

  1. Jacopo Soldani, Damian Andrew Tamburri, Willem-Jan. The pains and gains of microservices: A Systematic grey literature review. 2018.
  2. Mark A. Gregory. Law Society Journal.
  3. Root Cause Analysis of Outliers with Missing Structural Knowledge. The Thirty-ninth Annual Conference on Neural Information Processing Systems.
  4. Adaptive root cause localization for microservice systems with multi-agent recursion-of-thought. arXiv preprint arXiv:2508.20370.
  5. Causal structure-based root cause analysis of outliers. International conference on machine learning, 2022.
  6. Eadro: An end-to-end troubleshooting framework for microservices on multi-source data. 2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), 2023.
  7. Open Cybersecurity Schema Framework (OCSF).
  8. Basic concepts and taxonomy of dependable and secure computing. IEEE transactions on dependable and secure computing, 2004.
  9. Performance diagnosis in cloud microservices using deep learning. International Conference on Service-Oriented Computing, 2020.
  10. Root-cause metric location for microservice systems via log anomaly detection. 2020 IEEE international conference on web services (ICWS), 2020.
  11. Microrca: Root cause localization of performance issues in microservices. 2020 IEEE/IFIP Network Operations and Management Symposium.
  12. Microdiag: Fine-grained performance diagnosis for microservice systems. 2021 IEEE/ACM International Workshop on Cloud Intelligence (CloudIntelligence), 2021.
  13. Microrank: End-to-end latency issue localization with extended spectrum analysis in microservice environments. Proceedings of the Web Conference (WWW'21).
  14. Sarthak Chakraborty, Shaddy Garg, Shubham Agarwal, Ayush Chauhan, Shiv Kumar Saini. 2023.
  15. Graph based Incident Extraction and Diagnosis in Large-Scale Online Systems. Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering (ASE'22).
  16. Jacopo Soldani, Antonio Brogi. 2022.
  17. Azam Ikram, Sarthak Chakraborty, Subrata Mitra, Shiv Saini, Saurabh Bagchi, Murat Kocaoglu. Root Cause Analysis of Failures in Microservices through Causal Discovery.
  18. Ping Wang, Jingmin Xu, Meng Ma, Weilan Lin, Disheng Pan, Yuan Wang, Pengfei Chen. CloudRanger: Root Cause Identification for Cloud Native Systems.
  19. Yuan Meng, Shenglin Zhang, Yongqian Sun, Ruru Zhang, Zhilong Hu, Yiyin Zhang, Chenyang Jia, Zhaogang Wang, Dan Pei. Localizing Failure Root Causes in a Microservice through Causality Inference.
  20. Jakob Runge, Peer Nowack, Marlene Kretschmer, Seth Flaxman, Dino Sejdinovic. Science Advances.
  21. Causation, prediction, and search. 2000.
  22. Jinjin Lin, Pengfei Chen, Zibin Zheng. Microscope: Pinpoint Performance Issues with Causal Graphs in Micro-service Environments. Service-Oriented Computing, 2018.
  23. Pengfei Chen, Yong Qi, Pengfei Zheng, Di Hou. CauseInfer: Automatic and distributed performance diagnosis with hierarchical causality graph in large distributed systems.
  24. Causal modeling based fault localization in cloud systems using golden signals. 2021 IEEE 14th International Conference on Cloud Computing (CLOUD), 2021.
  25. Localization of operational faults in cloud applications by mining causal dependencies in logs using golden signals. International Conference on Service-Oriented Computing, 2020.
  26. CauseInfer: Automated end-to-end performance diagnosis with hierarchical causality graph in cloud environment. IEEE transactions on services computing, 2016.
  27. Meng Ma, Jingmin Xu, Yuan Wang, Pengfei Chen, Zonghua Zhang, Ping Wang. 2020.
  28. Microhecl: High-efficient root cause localization in large-scale microservice systems. 2021 IEEE/ACM 43rd International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 2021.
  29. Adaptive performance anomaly detection for online service systems via pattern sketching. Proceedings of the 44th International Conference on Software Engineering.
  30. CMDiagnostor: An Ambiguity-Aware Root Cause Localization Approach Based on Call Metric Data. Proceedings of the ACM Web Conference 2023.
  31. Meng Ma, Weilan Lin, Disheng Pan, Ping Wang. MS-Rank: Multi-Metric and Self-Adaptive Root Cause Diagnosis for Microservice Applications.
  32. BIRCH: an efficient data clustering method for very large databases. ACM sigmod record, 1996.
  33. Anomaly detection in streams with extreme value theory. Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining.
  34. The anatomy of a large-scale hypertextual Web search engine. 1998.
  35. Mingjie Li, Zeyan Li, Kanglin Yin, Xiaohui Nie, Wenchi Zhang, Kaixin Sui, Dan Pei. 2022.
  36. Amin Jaber, Murat Kocaoglu, Karthikeyan Shanmugam, Elias Bareinboim. Causal Discovery from Soft Interventions with Unknown Targets: Characterization and Learning.
  37. Automatic performance diagnosis and recovery in cloud microservices. 2022.
  38. CausalRCA: Causal inference based precise fine-grained root cause localization for microservice applications. 2023.
  39. Yue Yu, Jie Chen, Tian Gao, Mo Yu. 2019.
  40. Tuning causal discovery algorithms. International Conference on Probabilistic Graphical Models, 2020.
  41. Fault analysis and debugging of microservice systems: Industrial survey, benchmark system, and empirical study. IEEE Transactions on Software Engineering, 2018.
  42. Localizing faults in cloud systems. 2018 IEEE 11th International Conference on Software Testing, Verification and Validation (ICST), 2018.
  43. Experimentation in software engineering. 2012.
  44. Faster, deeper, easier: crowdsourcing diagnosis of microservice kernel failure from user space. Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis.
  45. Zeyan Li, Junjie Chen, Rui Jiao, Nengwen Zhao, Zhijun Wang, Shuwei Zhang, Yanjun Wu, Long Jiang, Leiqin Yan, Zikai Wang, Zhekang Chen, Wenchi Zhang, Xiaohui Nie, Kaixin Sui, Dan Pei. Practical Root Cause Localization for Microservice Systems via Trace Analysis.
  46. Testing for causality: A personal viewpoint. 1980.
  47. Peter Spirtes, Christopher Meek, Thomas Richardson. 1995.
  48. Shohei Shimizu, Patrik O. Hoyer, Aapo Hyvarinen, Antti Kerminen. Journal of Machine Learning Research.
  49. David Maxwell Chickering. 2002.
  50. Gideon Schwarz. The Annals of Statistics.
  51. Joseph D. Ramsey, Madelyn Glymour, Ruben Sanchez. A million variables and more: the Fast Greedy Equivalence Search algorithm for learning high-dimensional graphical causal models, with an application to functional magnetic resonance images.
  52. Frank Ludvig Spitzer.
  53. Vijay Arya, Karthikeyan Shanmugam, Pooja Aggarwal, Qing Wang, Prateeti Mohapatra, Seema Nagar. 2021.
  54. Graph-theoretic measures of multivariate association and prediction. The Annals of Statistics, 1983.
  55. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, 1979.
  56. Correlating events with time series for incident diagnosis. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining.
  57. Qing Wang, Larisa Shwartz, Genady Ya. Grabarnik, Vijay Arya, Karthikeyan Shanmugam. Detecting Causal Structure on Cloud Application Microservices Using Granger Causality Models.
  58. Dla: Detecting and localizing anomalies in containerized microservice architectures using markov models. 2019 7th International Conference on Future Internet of Things and Cloud (FiCloud), 2019.
  59. Graph-based root cause analysis for service-oriented and microservice architectures. Journal of Systems and Software, 2020.
  60. Dags with no tears: Continuous optimization for structure learning. Advances in neural information processing systems.
  61. Salesforce CausalAI Library: A Fast and Scalable Framework for Causal Analysis of Time Series and Tabular Data. arXiv preprint arXiv:2301.10859.
  62. An evaluation of change point detection algorithms. arXiv preprint arXiv:2003.06222.
  63. Li Wu, Johan Tordsson, Erik Elmroth, Odej Kao. Causal Inference Techniques for Microservice Performance Diagnosis: Evaluation and Guiding Recommendations.
  64. Sieve: Actionable insights from monitored metrics in distributed systems. Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference (Middleware'17).
  65. PyRCA: A Library for Metric-based Root Cause Analysis. arXiv preprint arXiv:2306.11417.
  66. Zeyan Li, Nengwen Zhao, Mingjie Li, Xianglin Lu, Lixin Wang, Dongdong Chang, Xiaohui Nie, Li Cao, Wenchi Zhang, Kaixin Sui, Yanhua Wang, Xu Du, Guoqing Duan, Dan Pei. Actionable and Interpretable Fault Localization for Recurring Failures in Online Service Systems.

  67. [67]

    2023 , author =

    Applications of statistical causal inference in software engineering , journal =. 2023 , author =

  68. [68]

    , title =

    Chen, Zhuangbin and Kang, Yu and Li, Liqun and Zhang, Xu and Zhang, Hongyu and Xu, Hui and Zhou, Yangfan and Yang, Li and Sun, Jeffrey and Xu, Zhangwei and Dang, Yingnong and Gao, Feng and Zhao, Pu and Qiao, Bo and Lin, Qingwei and Zhang, Dongmei and Lyu, Michael R. , title =. 2020 , booktitle =

  69. [69]

    An Empirical Investigation of Incident Triage for Online Service Systems , year=

    Chen, Junjie and He, Xiaoting and Lin, Qingwei and Xu, Yong and Zhang, Hongyu and Hao, Dan and Gao, Feng and Xu, Zhangwei and Dang, Yingnong and Zhang, Dongmei , booktitle=. An Empirical Investigation of Incident Triage for Online Service Systems , year=

  70. [70]

    and Camgoz, Necati Cihan and Bowden, Richard , title =

    Vowels, Matthew J. and Camgoz, Necati Cihan and Bowden, Richard , title =. ACM Computing Surveys , articleno =. 2022 , address =

  71. [71]

    Proceedings of 2018 ACM SIGKDD Workshop on Causal Discovery , pages=

    Evaluation of causal structure learning methods on mixed data types , author=. Proceedings of 2018 ACM SIGKDD Workshop on Causal Discovery , pages=. 2018 , organization=

  72. [72]

    and Devijver, Emilie and Gaussier, Eric , title =

    Assaad, Charles K. and Devijver, Emilie and Gaussier, Eric , title =. 2022 , issue_date =

  73. [73]

    The World Wide Web Conference , pages =

    Chen, Yujun and Yang, Xian and Lin, Qingwei and Zhang, Hongyu and Gao, Feng and Xu, Zhangwei and Dang, Yingnong and Zhang, Dongmei and Dong, Hang and Xu, Yong and Li, Hao and Kang, Yu , title =. The World Wide Web Conference , pages =. 2019 , publisher =

  74. [74]

    2019 , booktitle =

    Shan, Huasong and Chen, Yuan and Liu, Haifeng and Zhang, Yunpeng and Xiao, Xiao and He, Xiaofeng and Li, Min and Ding, Wei , title =. 2019 , booktitle =

  75. [75]

    Knowledge and Information Systems , pages =

    Moraffah, Raha and Sheth, Paras and Karami, Mansooreh and Bhattacharya, Anchit and Wang, Qianru and Tahir, Anique and Raglin, Adrienne and Liu, Huan , title =. Knowledge and Information Systems , pages =. 2021 , publisher =

  76. [76]

    Frontiers in Genetics , volume=

    Glymour, Clark and Zhang, Kun and Spirtes, Peter , title=. Frontiers in Genetics , volume=

  77. [77]

    2023 , note =

    Skipper Seabold and Josef Perktold , title =. 2023 , note =

  78. [78]

    iwankgb”) and Joe Davis (“dims

    Bobby Page and Ivan Wan-Geh (“iwankgb”) and Joe Davis (“dims”) and others , title =. 2024 , lastaccessed =

  79. [79]

    2023 , note =

    Gaspard Ducamp and Christophe Gonzales and Pierre-Henri Wuillemin , title =. 2023 , note =

  80. [80]

    2023 , note =

    Istio Project , title =. 2023 , note =

Showing first 80 references.