Praxium: Diagnosing Cloud Anomalies with AI-based Telemetry and Dependency Analysis
Pith reviewed 2026-05-21 09:58 UTC · model grok-4.3
The pith
Praxium detects microservice anomalies in cloud systems and infers their root causes from recent software installations by combining telemetry monitoring with dependency analysis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Praxium identifies anomalies by monitoring target metrics in telemetry data and then performs root cause analysis by measuring the causal impact of recent software installations, providing relevant diagnostic information to administrators even when package installations occur at shorter intervals.
What carries the argument
Causal impact analysis on dependency installation information, paired with anomaly detection on telemetry data.
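The pairing described above can be illustrated with a deliberately simplified sketch. It is not Praxium's implementation: the pre-installation mean stands in for the counterfactual that a fitted time-series model would normally forecast, and the package names, indices, and latency values are hypothetical.

```python
from statistics import mean, stdev

def effect_score(series, install_idx, window):
    """Standardised shift of the metric after an installation.

    The counterfactual is the mean of the `window` samples before the
    install (an assumption made here for brevity, not a fitted model).
    """
    pre = series[max(0, install_idx - window):install_idx]
    post = series[install_idx:install_idx + window]
    if len(pre) < 2 or not post:
        return 0.0
    spread = stdev(pre) or 1e-9  # avoid division by zero on flat data
    return (mean(post) - mean(pre)) / spread

def rank_installations(series, installs, window=10):
    """Return (package, score) pairs, largest absolute shift first."""
    scored = [(pkg, effect_score(series, idx, window))
              for pkg, idx in installs.items()]
    return sorted(scored, key=lambda item: abs(item[1]), reverse=True)

# Hypothetical latency telemetry (ms): stable, then a level shift at t=40.
latency = [100 + (i % 3) for i in range(40)] + [130 + (i % 3) for i in range(20)]
# Hypothetical installation log: package name -> sample index of the install.
installs = {"libfoo": 15, "libbar": 40}

ranking = rank_installations(latency, installs)
print(ranking[0][0])  # the installation with the largest estimated shift
```

Ranking by absolute standardised shift is one possible design choice; it treats improvements and regressions alike and leaves the decision about which direction matters to the administrator.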
If this is right
- Anomaly detection remains effective across varied synthetic test conditions and hyperparameter choices.
- Root cause inference continues to point to the correct installation even as update intervals shorten.
- The combination of telemetry and dependency data supplies administrators with targeted information for anomaly resolution.
- The framework reduces reliance on manual expert diagnosis in continuous deployment environments.
Where Pith is reading between the lines
- The same telemetry-plus-dependency pattern could apply to diagnosing issues in other distributed systems that use frequent updates.
- It opens the possibility of integrating such analysis directly into existing monitoring stacks to automate more of the troubleshooting loop.
- Testing against real production anomalies rather than synthetic ones would clarify how well the causal signals hold up outside controlled settings.
Load-bearing premise
The method assumes synthetic anomalies accurately represent real microservice behaviors and that dependency data supplies complete causal signals without unmeasured confounders.
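To make the premise concrete, here is a minimal sketch of how synthetic anomalies with known ground truth can be injected into a clean telemetry series. The two anomaly shapes shown (a point spike and a level shift), the function names, and all magnitudes are illustrative assumptions; the page does not say which four anomaly types the paper uses.

```python
import random

def baseline(n, level=100.0, noise=1.0, seed=0):
    """Clean telemetry: a constant level plus seeded Gaussian noise."""
    rng = random.Random(seed)
    return [level + rng.gauss(0, noise) for _ in range(n)]

def inject_spike(series, at, magnitude):
    """Add a single-sample spike at index `at`."""
    out = list(series)
    out[at] += magnitude
    return out

def inject_level_shift(series, start, magnitude):
    """Raise every sample from index `start` onward by `magnitude`."""
    return [v + magnitude if i >= start else v for i, v in enumerate(series)]

clean = baseline(60)
spiked = inject_spike(clean, at=30, magnitude=25.0)
shifted = inject_level_shift(clean, start=40, magnitude=15.0)

# Ground-truth labels for the level-shift case: 1 marks anomalous samples.
labels = [1 if i >= 40 else 0 for i in range(60)]
```

Because the injection points are chosen by the experimenter, the labels are exact, which is what makes synthetic evaluation repeatable. Whether such shapes resemble production failures is precisely the open question the premise raises.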
What would settle it
Running Praxium on a live production microservice deployment and checking whether it correctly identifies the root cause when a genuine anomaly occurs after a specific installation.
Original abstract
As the modern microservice architecture for cloud applications grows in popularity, cloud services are becoming increasingly complex and more vulnerable to misconfiguration and software bugs. Traditional approaches rely on expert input to diagnose and fix microservice anomalies, which lacks scalability in the face of the continuous integration and continuous deployment (CI/CD) paradigm. Microservice rollouts, containing new software installations, have complex interactions with the components of an application. Consequently, this added difficulty in attributing anomalous behavior to any specific installation or rollout results in potentially slower resolution times. To address the gaps in current diagnostic methods, this paper introduces Praxium, a framework for anomaly detection and root cause inference. Praxium aids administrators in evaluating target metric performance in the context of dependency installation information provided by a software discovery tool, PraxiPaaS. Praxium continuously monitors telemetry data to identify anomalies, then conducts root cause analysis via causal impact on recent software installations, in order to provide site reliability engineers (SRE) relevant information about an observed anomaly. In this paper, we demonstrate that Praxium is capable of effective anomaly detection and root cause inference, and we provide an analysis on effective anomaly detection hyperparameter tuning as needed in a practical setting. Across 75 total trials using four synthetic anomalies, anomaly detection consistently performs at >0.97 macro-F1. In addition, we show that causal impact analysis reliably infers the correct root cause of anomalies, even as package installations occur at increasingly shorter intervals.
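For readers unfamiliar with the headline metric, macro-F1 is the unweighted mean of the per-class F1 scores, so the rare anomalous class counts as much as the common normal class. The sketch below computes it from scratch; the label vectors are made up for illustration and are not data from the paper.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating `cls` as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    # Convention assumed here: a class with no support and no predictions scores 0.
    return 2 * tp / denom if denom else 0.0

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)

# Illustrative labels: 0 = normal, 1 = anomalous; one false alarm at index 5.
y_true = [0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1]
score = macro_f1(y_true, y_pred)
print(round(score, 4))
```

In this example the anomalous class scores 0.8 and the normal class 10/11, so a single false alarm among eight samples already pulls macro-F1 to roughly 0.85.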
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Praxium, a framework for anomaly detection and root cause inference in microservice cloud applications. It combines continuous monitoring of telemetry data to identify anomalies with causal impact analysis that leverages dependency installation information from the PraxiPaaS software discovery tool. The central empirical claims are that anomaly detection achieves macro-F1 scores above 0.97 across 75 trials on four synthetic anomalies and that causal impact analysis reliably identifies the correct root cause even as package installation intervals shorten.
Significance. If the synthetic results generalize, the integration of telemetry-based detection with installation-tied causal inference could offer a scalable, automated alternative to expert-driven diagnosis in CI/CD environments, potentially shortening resolution times for SRE teams. The approach addresses a practical gap in attributing anomalies amid frequent rollouts, but its significance is currently constrained by the absence of real production validation.
major comments (3)
- [§5] §5 (Evaluation): The reported >0.97 macro-F1 is obtained exclusively on four synthetic anomalies across 75 trials; the manuscript must show that these anomalies reproduce the telemetry distributions, correlation structures, and failure cascades observed in real microservice production data, or the generalization claim for root-cause inference cannot be supported.
- [§4.2] §4.2 (Causal Impact Analysis): The root-cause claims rest on the assumption that PraxiPaaS dependency data supplies all relevant causal signals without unmeasured confounders (network jitter, resource contention, external load); no sensitivity analysis or confounder discussion is provided, which is load-bearing for the reliability statement under shortening installation intervals.
- [Abstract and §3] Abstract and §3 (Method): No description of the anomaly detection algorithm, choice of AI model, or baseline methods is supplied, preventing assessment of whether the macro-F1 result constitutes an advance or is reproducible.
minor comments (2)
- [Figures] Figure captions and axis labels in the evaluation plots should explicitly state the synthetic anomaly types and trial counts to improve clarity.
- [§5] The hyperparameter tuning analysis mentioned in the abstract would benefit from a dedicated table summarizing the selected values and their effect on F1.
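The kind of summary the second minor comment asks for can be sketched as a small threshold sweep. The detector below is a plain z-score rule used as a stand-in, not the paper's detector, and the series, anomaly positions, and thresholds are hypothetical.

```python
from statistics import mean, stdev

def zscore_detect(series, train_len, threshold):
    """Flag samples whose z-score against a reference window exceeds threshold."""
    ref = series[:train_len]
    mu, sigma = mean(ref), stdev(ref)
    return [1 if abs(v - mu) / sigma > threshold else 0 for v in series]

def f1_anomaly(y_true, y_pred):
    """F1 with the anomalous class (label 1) as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

# Hypothetical metric oscillating between 98 and 102, with three injected anomalies.
series = [100 + (i % 5) - 2 for i in range(50)]
labels = [0] * 50
for idx, value in ((30, 110), (31, 110), (45, 104)):
    series[idx] = value
    labels[idx] = 1

results = {}
for threshold in (1.0, 2.0, 4.0):
    preds = zscore_detect(series, train_len=20, threshold=threshold)
    results[threshold] = f1_anomaly(labels, preds)
    print(f"threshold={threshold:.1f}  F1={results[threshold]:.2f}")
```

A threshold that is too low floods the output with false alarms, while one that is too high misses the mild anomaly at index 45. A table of this shape makes that trade-off visible at a glance.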
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and indicate the revisions we will make to strengthen the manuscript.
Point-by-point responses
- Referee: [§5] §5 (Evaluation): The reported >0.97 macro-F1 is obtained exclusively on four synthetic anomalies across 75 trials; the manuscript must show that these anomalies reproduce the telemetry distributions, correlation structures, and failure cascades observed in real microservice production data, or the generalization claim for root-cause inference cannot be supported.
  Authors: We agree that stronger evidence of fidelity between synthetic and real telemetry would better support generalization. Our evaluation is deliberately scoped to synthetic data with known ground truth to enable repeatable, controlled experiments. In the revision we will expand §5 with an explicit description of the synthetic anomaly generation procedure, detailing how the four anomaly types were constructed to match reported distributions, correlations, and cascade patterns from public microservice traces and the literature. We will also add a dedicated limitations subsection that states the current results do not claim direct generalization to production environments and identifies real-world validation as necessary future work.
  Revision: partial
- Referee: [§4.2] §4.2 (Causal Impact Analysis): The root-cause claims rest on the assumption that PraxiPaaS dependency data supplies all relevant causal signals without unmeasured confounders (network jitter, resource contention, external load); no sensitivity analysis or confounder discussion is provided, which is load-bearing for the reliability statement under shortening installation intervals.
  Authors: The concern about unmeasured confounders is well taken. We will revise §4.2 to include an explicit discussion of modeling assumptions and potential confounders. In addition, we will add a sensitivity analysis that injects controlled levels of network jitter, resource contention, and external load into the simulation while varying installation intervals, reporting how these factors affect causal-impact accuracy. This will directly test the robustness of the reliability claims.
  Revision: yes
- Referee: [Abstract and §3] Abstract and §3 (Method): No description of the anomaly detection algorithm, choice of AI model, or baseline methods is supplied, preventing assessment of whether the macro-F1 result constitutes an advance or is reproducible.
  Authors: We apologize for the omission of these details. In the revised manuscript we will expand both the Abstract and §3 to describe the anomaly detection algorithm in full, specify the AI model and its hyperparameters, explain the rationale for the chosen approach, and present comparisons against standard baselines (statistical thresholding and alternative ML detectors). These additions will allow readers to evaluate the advance and reproduce the reported macro-F1 scores.
  Revision: yes
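The confounder concern raised in the second exchange can be illustrated with a toy sensitivity check. In this sketch, which is an assumption-laden simplification and not the authors' planned analysis, a harmless installation coincides with an external load increase. A naive before/after comparison blames the installation, while subtracting the shift seen in a control series that the installation cannot affect removes the bias. All series and magnitudes are hypothetical.

```python
from statistics import mean

def simulate(n, install_at, install_effect, load_at, load_effect):
    """Return (target, control) metric series.

    Both series feel the external load; only the target feels the install.
    """
    target, control = [], []
    for t in range(n):
        load = load_effect if t >= load_at else 0.0
        inst = install_effect if t >= install_at else 0.0
        target.append(100.0 + load + inst)
        control.append(80.0 + load)
    return target, control

def naive_effect(series, install_at):
    """Post-install mean minus pre-install mean."""
    return mean(series[install_at:]) - mean(series[:install_at])

def controlled_effect(target, control, install_at):
    """Difference-in-differences: remove the shift shared with the control."""
    return naive_effect(target, install_at) - naive_effect(control, install_at)

results = []
for load_effect in (0.0, 5.0, 10.0):
    # The installation itself has zero true effect in every run.
    target, control = simulate(n=60, install_at=30, install_effect=0.0,
                               load_at=30, load_effect=load_effect)
    row = (load_effect,
           naive_effect(target, 30),
           controlled_effect(target, control, 30))
    results.append(row)
    print("load=%.1f  naive=%.1f  controlled=%.1f" % row)
```

The naive estimate grows in step with the injected load even though the installation does nothing, which is the failure mode the referee describes. The control-series correction only works when a series exists that shares the confounder but not the installation, an assumption that may not hold in a real deployment.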
Circularity Check
No circularity in derivation chain
The paper introduces Praxium as an empirical framework for anomaly detection via telemetry monitoring and root cause inference via causal impact analysis on installation events from PraxiPaaS. All reported results (>0.97 macro-F1 across 75 synthetic-anomaly trials and reliable root-cause inference) are presented as experimental outcomes from direct evaluation rather than as outputs of any mathematical derivation, equation, or first-principles reduction. No self-definitional loops, fitted parameters renamed as predictions, or load-bearing self-citations that collapse the central claims appear in the provided text. The approach therefore remains self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "anomaly detection ... variational autoencoder ... CausalImpact ... counterfactual metric data"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: "Across 75 total trials using four synthetic anomalies, anomaly detection consistently performs at >0.97 macro-F1"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.