Stochastic Connectivity as the Foundation of a Runtime Model for Microservice Availability Analysis
Pith reviewed 2026-07-02 08:34 UTC · model grok-4.3
The pith
A stochastic connectivity model computes microservice endpoint availability from traces and deployment data under explicit fault scenarios.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The model treats endpoint availability under explicit fault scenarios as a measurable facet of microservice resilience. It combines a typed service-dependency graph, a replication map, a probability measure over node and edge states, and request-specific success predicates. Its semantics separate computational failures of service replicas from communication failures of logical dependencies, which shows that replication cannot compensate for bottleneck dependencies.
What carries the argument
Typed service-dependency graph with replication map, probability measure over node and edge states, and request-specific success predicates that separate replica failures from dependency failures.
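To make these ingredients concrete, here is a minimal sketch of how they could fit together. It is not the paper's implementation. The names `Model`, `sample_state`, `reachable`, and `estimate_availability`, the example services, and all probabilities are hypothetical. The sketch assumes that all failures are independent, that a service is up when at least one replica is up, that each logical dependency has a single up/down state shared by all replicas, and that a request succeeds when every required service is reachable from the entry point.

```python
import random
from dataclasses import dataclass


@dataclass
class Model:
    replicas: dict    # replication map: service -> replica count
    p_replica: dict   # service -> P(one replica is up)
    p_edge: dict      # (caller, callee) -> P(logical dependency is up)


def sample_state(model, rng):
    """Draw one joint node/edge state, assuming all failures are independent."""
    node_up = {s: any(rng.random() < model.p_replica[s] for _ in range(k))
               for s, k in model.replicas.items()}
    edge_up = {e: rng.random() < q for e, q in model.p_edge.items()}
    return node_up, edge_up


def reachable(entry, node_up, edge_up):
    """Services reachable from the entry over live edges and live services."""
    if not node_up[entry]:
        return set()
    seen, stack = {entry}, [entry]
    while stack:
        u = stack.pop()
        for (a, b), up in edge_up.items():
            if a == u and up and node_up[b] and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen


def estimate_availability(model, entry, required, trials=10_000, seed=0):
    """Monte Carlo estimate of P(every required service is reachable)."""
    rng = random.Random(seed)
    ok = 0
    for _ in range(trials):
        node_up, edge_up = sample_state(model, rng)
        if required <= reachable(entry, node_up, edge_up):  # success predicate
            ok += 1
    return ok / trials


# Hypothetical three-service chain: gateway -> orders -> db.
model = Model(
    replicas={"gateway": 2, "orders": 3, "db": 1},
    p_replica={"gateway": 0.99, "orders": 0.95, "db": 0.98},
    p_edge={("gateway", "orders"): 0.97, ("orders", "db"): 0.90},
)
print(estimate_availability(model, "gateway", {"orders", "db"}))
```

Keeping edge states separate from replica states is what lets a what-if change to `replicas` leave the edge terms untouched.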
If this is right
- Replication cannot compensate for bottleneck dependencies.
- The model can be reconstructed from traces and deployment artifacts.
- Monte Carlo simulation supports what-if analysis of architectural changes.
- The model can run before or alongside fault-injection experiments.
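The first implication can be illustrated with a small worked bound. This is a simplified case, not the paper's theorem. Assume an always-available caller, one logical dependency that is up with probability q, and a callee with k independent replicas that are each up with probability p. The symbols A, q, p, and k are introduced here for illustration only.

```latex
\[
A(k) \;=\; \underbrace{q}_{\text{edge up}} \cdot
\underbrace{\bigl(1 - (1-p)^{k}\bigr)}_{\text{at least one replica up}}
\;\le\; q,
\qquad
\lim_{k \to \infty} A(k) = q .
\]
```

With q = 0.95 and p = 0.9, this gives A(1) = 0.855, A(2) = 0.9405, and A(3) = 0.94905. Adding replicas pushes the second factor toward 1, but the product never exceeds the edge availability of 0.95.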
Where Pith is reading between the lines
- Architects could parameterize the probability measure to explore different failure correlation patterns.
- Time-dependent failures would require extending the model with explicit timing information from traces.
- Missing traces for some dependencies would force conservative bounds on the computed availability.
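One way to read the first inference above is a common-cause shock model, sketched below. This failure pattern is an assumption made for illustration; the source does not say which correlation structures the paper considers. The parameter `shock` and the function names are hypothetical. With probability `shock`, a shared fault takes down every replica at once. Otherwise replicas fail independently.

```python
import random


def service_up(k, p_replica, shock, rng):
    """One sample of a k-replica service under a common-cause shock model."""
    if rng.random() < shock:          # shared fault hits all replicas together
        return False
    return any(rng.random() < p_replica for _ in range(k))


def estimate_service_availability(k, p_replica, shock, trials=20_000, seed=0):
    """Monte Carlo estimate of P(at least one replica is up)."""
    rng = random.Random(seed)
    return sum(service_up(k, p_replica, shock, rng) for _ in range(trials)) / trials


def exact_service_availability(k, p_replica, shock):
    """Closed form for the same assumed model."""
    return (1 - shock) * (1 - (1 - p_replica) ** k)


for shock in (0.0, 0.02, 0.05):
    print(shock,
          exact_service_availability(3, 0.9, shock),
          estimate_service_availability(3, 0.9, shock))
```

In this sketch the shock term caps what replication can achieve, in the same way a bottleneck edge does.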
Load-bearing premise
Traces and deployment artifacts contain enough detail to reconstruct the typed service-dependency graph, replication map, and probability measure so that Monte Carlo results generalize beyond the tested cases.
What would settle it
A synthetic test case in which Monte Carlo estimates of endpoint success probability differ from the closed-form oracle by more than sampling error.
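A check of this kind could look like the sketch below. It uses a deliberately simple oracle case: one dependency edge in series with a k-replica service, with independent failures. The functions `closed_form`, `monte_carlo`, and `within_sampling_error`, and the three-standard-error threshold `z=3.0`, are assumptions of this sketch, not the paper's procedure. The trial count of 10,000 matches the figure given in the simulated rebuttal.

```python
import math
import random


def closed_form(p_replica, k, q_edge):
    """Oracle: the edge is up AND at least one of k independent replicas is up."""
    return q_edge * (1 - (1 - p_replica) ** k)


def monte_carlo(p_replica, k, q_edge, trials, seed):
    """Estimate the same probability by sampling edge and replica states."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        edge_ok = rng.random() < q_edge
        service_ok = any(rng.random() < p_replica for _ in range(k))
        hits += edge_ok and service_ok
    return hits / trials


def within_sampling_error(estimate, oracle, trials, z=3.0):
    """True if the estimate lies within z binomial standard errors of the oracle."""
    se = math.sqrt(oracle * (1 - oracle) / trials)
    return abs(estimate - oracle) <= z * se


oracle = closed_form(0.9, 3, 0.95)
estimate = monte_carlo(0.9, 3, 0.95, trials=10_000, seed=42)
print(oracle, estimate, within_sampling_error(estimate, oracle, 10_000))
```

One unlucky seed can fall outside the band by chance. Only a discrepancy that persists across seeds and larger trial counts would count against the model.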
Original abstract
Microservice availability is commonly assessed by fault injection and chaos experiments, but such experiments are costly, operationally risky, and difficult to repeat for every architectural change. Distributed tracing and deployment metadata provide cheaper evidence, yet they usually remain descriptive: they show which services interacted, not what endpoint-level availability property follows. This paper proposes a formal runtime availability model based on stochastic connectivity for resilience-oriented analysis of microservice endpoints. It treats endpoint availability under explicit fault scenarios as a measurable facet of microservice resilience, combining a typed service-dependency graph, a replication map, a probability measure over node and edge states, and request-specific success predicates. Its semantics separates computational failures of service replicas from communication failures of logical dependencies, showing that replication cannot compensate for bottleneck dependencies. The model can be reconstructed from traces and deployment artifacts, parameterized for architectural what-if analysis, and analyzed by Monte Carlo simulation before or alongside fault injection. We define the model, its trace-to-model construction, elementary semantic properties, and a synthetic adequacy study. The study matches closed-form oracle cases within sampling error and exposes boundaries caused by edge bottlenecks, correlated failures, missing traces, and time-dependent failures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a formal runtime availability model for microservice endpoints grounded in stochastic connectivity. It combines a typed service-dependency graph, a replication map, a probability measure over node/edge states, and request-specific success predicates. The semantics separate computational failures of replicas from communication failures of dependencies. The model is claimed to be reconstructible from traces and deployment artifacts, support what-if parameterization, and be analyzable via Monte Carlo simulation. A synthetic adequacy study is reported to match closed-form oracles within sampling error while exposing boundaries from bottlenecks, correlated failures, missing traces, and time-dependent effects.
Significance. If the trace-to-model reconstruction generalizes beyond synthetics, the approach could supply a repeatable, lower-risk complement to fault injection for resilience analysis, enabling architectural what-if studies before deployment changes. The separation of failure modes and the demonstration that replication cannot compensate for bottleneck dependencies are potentially useful formal properties for the microservices community.
major comments (3)
- [synthetic adequacy study] Abstract and synthetic adequacy study section: the claim that Monte Carlo results match closed-form oracles 'within sampling error' is load-bearing for the adequacy argument, yet no error bars, number of trials, data exclusion rules, or estimation procedure for the probability measure from traces are supplied, leaving it impossible to judge whether the match is robust or merely an artifact of the oracle construction.
- [trace-to-model construction] Trace-to-model construction (described in the abstract and model definition): the central claim that the typed dependency graph, replication map, and state probability measure can be reconstructed from real traces and deployment artifacts such that simulation generalizes is unsupported beyond synthetic oracles; if real traces systematically omit correlated failures or under-sample bottlenecks, the semantics and what-if parameterization will not yield predictions that align with observed endpoint availability.
- [model semantics] Model semantics paragraph: the property that 'replication cannot compensate for bottleneck dependencies' is derived from the probability measure over states; this is only as reliable as the input measure reconstructed from traces, but no sensitivity analysis or real-trace validation is provided to show the property survives reconstruction noise.
minor comments (2)
- Notation for the probability measure and success predicates should be introduced with explicit definitions before use in the Monte Carlo procedure.
- The abstract mentions 'elementary semantic properties' but the manuscript should include a dedicated subsection or theorem list for them to aid readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below, indicating where revisions will be made to improve clarity and rigor. The manuscript's core contribution is the formal model and its synthetic validation; we will adjust claims to reflect the scope of the current evidence.
- Referee: [synthetic adequacy study] Abstract and synthetic adequacy study section: the claim that Monte Carlo results match closed-form oracles 'within sampling error' is load-bearing for the adequacy argument, yet no error bars, number of trials, data exclusion rules, or estimation procedure for the probability measure from traces are supplied, leaving it impossible to judge whether the match is robust or merely an artifact of the oracle construction.
  Authors: We agree that these methodological details are required to evaluate robustness. The revised manuscript will specify the Monte Carlo trial count (10,000 runs), report error bars as standard error of the mean, state that no data points were excluded, and describe the probability measure estimation as empirical frequencies computed directly from the generated synthetic traces. These additions will be placed in the adequacy study section and referenced from the abstract.
  Revision: yes
- Referee: [trace-to-model construction] Trace-to-model construction (described in the abstract and model definition): the central claim that the typed dependency graph, replication map, and state probability measure can be reconstructed from real traces and deployment artifacts such that simulation generalizes is unsupported beyond synthetic oracles; if real traces systematically omit correlated failures or under-sample bottlenecks, the semantics and what-if parameterization will not yield predictions that align with observed endpoint availability.
  Authors: The paper formally defines the trace-to-model construction and validates it under synthetic conditions where ground truth is controlled. We do not provide empirical evidence from production traces, and the current work does not claim such generalization. We will revise the abstract, introduction, and conclusion to explicitly limit the reconstruction claim to the defined procedure and its synthetic adequacy, while adding a limitations paragraph discussing risks from omitted correlated failures or under-sampled bottlenecks in real traces.
  Revision: partial
- Referee: [model semantics] Model semantics paragraph: the property that 'replication cannot compensate for bottleneck dependencies' is derived from the probability measure over states; this is only as reliable as the input measure reconstructed from traces, but no sensitivity analysis or real-trace validation is provided to show the property survives reconstruction noise.
  Authors: The property follows deductively from the model semantics once a probability measure is given. We acknowledge the absence of sensitivity analysis to reconstruction noise and of real-trace validation. The revision will include a new sensitivity subsection that perturbs the input measure (e.g., by adding noise to edge probabilities) and re-runs the Monte Carlo analysis to confirm the bottleneck property persists within the tested range. Real-trace validation of the property lies outside the present scope.
  Revision: partial
- Empirical validation of the trace-to-model reconstruction and the bottleneck property on real production traces (as opposed to synthetics)
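The sensitivity check described in the authors' third response could be sketched as follows. This is an illustration under assumed semantics, not the subsection the authors promise. It covers a single edge in series with a k-replica service, with independent failures and uniform noise on the edge probability. The names `simulate` and `sensitivity`, and every parameter value, are hypothetical.

```python
import random


def simulate(q_edge, p_replica, k, trials, rng):
    """Return (endpoint success frequency, edge-up frequency) from the same samples."""
    edge_hits = ok_hits = 0
    for _ in range(trials):
        edge_ok = rng.random() < q_edge
        service_ok = any(rng.random() < p_replica for _ in range(k))
        edge_hits += edge_ok
        ok_hits += edge_ok and service_ok
    return ok_hits / trials, edge_hits / trials


def sensitivity(q_edge, p_replica, noise, ks, runs, trials, seed=0):
    """Perturb the edge probability, then re-estimate for each replica count."""
    rng = random.Random(seed)
    rows = []
    for _ in range(runs):
        # Noisy edge probability, clamped to a valid range.
        q = min(1.0, max(0.0, q_edge + rng.uniform(-noise, noise)))
        for k in ks:
            endpoint, edge = simulate(q, p_replica, k, trials, rng)
            rows.append((q, k, endpoint, edge))
    return rows


for q, k, endpoint, edge in sensitivity(0.9, 0.8, 0.05, [1, 2, 4], runs=3, trials=10_000):
    print(f"q={q:.3f} k={k} endpoint={endpoint:.3f} edge={edge:.3f}")
```

Both frequencies come from the same samples, so the endpoint frequency can never exceed the edge frequency in any row. That is the bottleneck property at the sample level, and it holds however the edge probability is perturbed.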
Circularity Check
No circularity: model defined from external traces and validated against independent oracles
The paper constructs a stochastic connectivity model from a typed service-dependency graph, replication map, probability measure over states, and request-specific success predicates, with semantics separating replica computation failures from dependency communication failures. Reconstruction is specified from traces and deployment artifacts, and adequacy is checked by Monte Carlo simulation against closed-form synthetic oracles. No equations, fitted parameters, or self-citations are shown that would make any claimed availability measure or semantic property reduce by construction to quantities defined by the same inputs. The derivation chain remains self-contained against external benchmarks.