
arxiv: 2607.00740 · v1 · pith:WG65S63Hnew · submitted 2026-07-01 · 💻 cs.SE · cs.DC · cs.PF

Stochastic Connectivity as the Foundation of a Runtime Model for Microservice Availability Analysis

Pith reviewed 2026-07-02 08:34 UTC · model grok-4.3

classification 💻 cs.SE · cs.DC · cs.PF
keywords microservices · availability · stochastic model · dependency graph · Monte Carlo simulation · fault injection · runtime model

The pith

A stochastic connectivity model computes microservice endpoint availability from traces and deployment data under explicit fault scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a formal runtime model that converts distributed traces and deployment metadata into a probabilistic representation of service dependencies. Analysts can then run Monte Carlo simulations to estimate the probability that a given endpoint succeeds when specific replicas or links fail. The model keeps replica computation failures separate from dependency communication failures. If the reconstruction step works, teams gain a repeatable way to test architectural changes for resilience without repeating costly fault-injection experiments for each modification.
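
The estimation step described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the service names, failure probabilities, the `REQUIRES` table, and the `down_edges` parameter are all hypothetical, failures are assumed independent, and the dependency graph is assumed acyclic.

```python
import random

# Hypothetical toy model; names and numbers are illustrative, not from the paper.
REPLICAS = {"frontend": 2, "cart": 3, "db": 1}             # replication map
NODE_FAIL = {"frontend": 0.05, "cart": 0.10, "db": 0.02}   # per-replica failure probability
EDGE_FAIL = {("frontend", "cart"): 0.01, ("cart", "db"): 0.03}  # per-logical-edge failure probability
REQUIRES = {"frontend": ["cart"], "cart": ["db"], "db": []}     # required dependencies


def sample_state(rng, down_edges=frozenset()):
    """Draw one joint state: which replicas are up and which logical edges are up.

    Edges listed in down_edges are forced down (an explicit fault scenario).
    """
    replicas_up = {
        svc: [rng.random() >= NODE_FAIL[svc] for _ in range(n)]
        for svc, n in REPLICAS.items()
    }
    edges_up = {
        edge: edge not in down_edges and rng.random() >= p
        for edge, p in EDGE_FAIL.items()
    }
    return replicas_up, edges_up


def succeeds(svc, replicas_up, edges_up):
    """Success predicate: some replica of svc is up, and every required
    dependency is reachable over a live edge and succeeds itself."""
    if not any(replicas_up[svc]):
        return False
    return all(
        edges_up[(svc, dep)] and succeeds(dep, replicas_up, edges_up)
        for dep in REQUIRES[svc]
    )


def estimate_availability(endpoint, trials=10_000, seed=0, down_edges=frozenset()):
    """Fraction of sampled states in which a request to endpoint succeeds."""
    rng = random.Random(seed)
    hits = sum(
        succeeds(endpoint, *sample_state(rng, down_edges)) for _ in range(trials)
    )
    return hits / trials
```

Calling `estimate_availability("frontend")` gives the baseline estimate, while passing `down_edges={("cart", "db")}` forces that link down and shows how the endpoint behaves under the scenario. Replica state and edge state are sampled separately, mirroring the separation of computation failures from communication failures.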

Core claim

The model treats endpoint availability under explicit fault scenarios as a measurable facet of microservice resilience. It combines a typed service-dependency graph, a replication map, a probability measure over node and edge states, and request-specific success predicates. Its semantics separate computational failures of service replicas from communication failures of logical dependencies, which shows that replication cannot compensate for bottleneck dependencies.

What carries the argument

Typed service-dependency graph with replication map, probability measure over node and edge states, and request-specific success predicates that separate replica failures from dependency failures.

If this is right

  • Replication cannot compensate for bottleneck dependencies.
  • The model reconstructs from traces and deployment artifacts.
  • Monte Carlo simulation supports what-if analysis of architectural changes.
  • The model can run before or alongside fault-injection experiments.
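
The first point can be made concrete with a small worked bound. The notation here is an assumption for illustration and is not taken from the paper: a service $s$ has $k$ replicas that each fail independently with probability $q$, it needs one dependency $d$ with availability $A_d$, and the logical edge from $s$ to $d$ fails with probability $e$, independently of the replicas.

```latex
A_s(k) = \bigl(1 - q^{k}\bigr)\,(1 - e)\,A_d \;\le\; (1 - e)\,A_d,
\qquad
\lim_{k \to \infty} A_s(k) = (1 - e)\,A_d .

% Example with q = 0.1, e = 0.05, A_d = 1:
% A_s(1) = 0.9   \times 0.95 = 0.855
% A_s(3) = 0.999 \times 0.95 = 0.94905
% ceiling: (1 - e) A_d = 0.95
```

Under these assumptions, adding replicas only shrinks the $q^{k}$ term, so availability approaches but never exceeds the ceiling set by the edge and the dependency.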

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architects could parameterize the probability measure to explore different failure correlation patterns.
  • Time-dependent failures would require extending the model with explicit timing information from traces.
  • Missing traces for some dependencies would force conservative bounds on the computed availability.
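
One way to explore a failure correlation pattern is a common-cause shock. The sketch below is an illustrative assumption, not something the paper defines: with probability `shock_prob` a shared event takes down every replica of a service at once, and otherwise replicas fail independently with probability `q`. All names and parameters are hypothetical.

```python
import random


def replica_set_up(k, q, shock_prob, rng):
    """Return True if at least one of k replicas survives.

    With probability shock_prob a shared shock (a hypothetical common cause,
    such as a zone outage) takes down every replica at once; otherwise the
    replicas fail independently with probability q each.
    """
    if rng.random() < shock_prob:
        return False
    return any(rng.random() >= q for _ in range(k))


def estimate(k, q, shock_prob, trials=20_000, seed=0):
    """Monte Carlo estimate of the probability that the replica set is up."""
    rng = random.Random(seed)
    hits = sum(replica_set_up(k, q, shock_prob, rng) for _ in range(trials))
    return hits / trials


def closed_form(k, q, shock_prob):
    """P(all replicas down) = c + (1 - c) * q**k under the shock model above."""
    return 1.0 - (shock_prob + (1.0 - shock_prob) * q ** k)
```

Comparing `closed_form(3, 0.1, 0.0)` with `closed_form(3, 0.1, 0.05)` shows how a modest shared shock can dominate the benefit of three replicas in this toy setting.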

Load-bearing premise

Traces and deployment artifacts contain enough detail to reconstruct the typed service-dependency graph, replication map, and probability measure so that Monte Carlo results generalize beyond the tested cases.
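
A rough sketch of what such a reconstruction could look like follows. The span fields (`span_id`, `parent_id`, `service`, `ok`) and the `DEPLOYMENT` dictionary are simplified, hypothetical stand-ins for real trace and deployment formats. Attributing every failed child span to the calling edge is a simplifying assumption made here, not the paper's construction.

```python
from collections import Counter

# Hypothetical, simplified span records; real tracing formats carry more fields.
SPANS = [
    {"span_id": "a1", "parent_id": None, "service": "frontend", "ok": True},
    {"span_id": "a2", "parent_id": "a1", "service": "cart", "ok": True},
    {"span_id": "a3", "parent_id": "a2", "service": "db", "ok": True},
    {"span_id": "b1", "parent_id": None, "service": "frontend", "ok": False},
    {"span_id": "b2", "parent_id": "b1", "service": "cart", "ok": False},
]
# Hypothetical deployment metadata: replica count per service.
DEPLOYMENT = {"frontend": 2, "cart": 3, "db": 1}


def reconstruct(spans, deployment):
    """Build empirical edge failure frequencies and a replication map."""
    by_id = {s["span_id"]: s for s in spans}
    calls, failures = Counter(), Counter()
    for span in spans:
        parent = by_id.get(span["parent_id"])
        if parent is None or parent["service"] == span["service"]:
            continue  # root span or in-process child: no logical dependency edge
        edge = (parent["service"], span["service"])
        calls[edge] += 1
        if not span["ok"]:
            failures[edge] += 1  # assumption: a failed child counts against the edge
    edge_fail = {edge: failures[edge] / n for edge, n in calls.items()}
    services = sorted({s["service"] for s in spans})
    replicas = {svc: deployment.get(svc, 1) for svc in services}
    return edge_fail, replicas
```

An edge that never appears in the spans never appears in `edge_fail`, so an incomplete trace set silently drops dependencies from the reconstructed model.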

What would settle it

A synthetic test case in which Monte Carlo estimates of endpoint success probability differ from the closed-form oracle by more than sampling error.
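
A check of this kind might look like the following sketch. It uses a series chain as the closed-form oracle and a binomial standard error to define "sampling error"; the function names and the threshold `z` are assumptions for illustration, and the paper's own oracle cases may differ.

```python
import math
import random


def closed_form_chain(up_probs):
    """Oracle: a series chain succeeds only if every element is up."""
    return math.prod(up_probs)


def monte_carlo_chain(up_probs, trials, seed):
    """Return the Monte Carlo estimate and its binomial standard error."""
    rng = random.Random(seed)
    hits = sum(all(rng.random() < p for p in up_probs) for _ in range(trials))
    estimate = hits / trials
    std_err = math.sqrt(estimate * (1 - estimate) / trials)
    return estimate, std_err


def within_sampling_error(estimate, std_err, oracle, z=3.0):
    """True if the estimate lies within z standard errors of the oracle."""
    return abs(estimate - oracle) <= z * std_err
```

For example, `monte_carlo_chain([0.99, 0.95, 0.9], 10_000, seed=1)` can be compared against `closed_form_chain([0.99, 0.95, 0.9])`. A gap that stays well beyond a few standard errors as the trial count grows would be the kind of discrepancy described above.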

Figures

Figures reproduced from arXiv: 2607.00740 by Anatoly A. Krasnovsky, Anna Maslovskaya.

Figure 1: Node-versus-edge sensitivity. Heatmap color is the [...]

Abstract

Microservice availability is commonly assessed by fault injection and chaos experiments, but such experiments are costly, operationally risky, and difficult to repeat for every architectural change. Distributed tracing and deployment metadata provide cheaper evidence, yet they usually remain descriptive: they show which services interacted, not what endpoint-level availability property follows. This paper proposes a formal runtime availability model based on stochastic connectivity for resilience-oriented analysis of microservice endpoints. It treats endpoint availability under explicit fault scenarios as a measurable facet of microservice resilience, combining a typed service-dependency graph, a replication map, a probability measure over node and edge states, and request-specific success predicates. Its semantics separates computational failures of service replicas from communication failures of logical dependencies, showing that replication cannot compensate for bottleneck dependencies. The model can be reconstructed from traces and deployment artifacts, parameterized for architectural what-if analysis, and analyzed by Monte Carlo simulation before or alongside fault injection. We define the model, its trace-to-model construction, elementary semantic properties, and a synthetic adequacy study. The study matches closed-form oracle cases within sampling error and exposes boundaries caused by edge bottlenecks, correlated failures, missing traces, and time-dependent failures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a formal runtime availability model for microservice endpoints grounded in stochastic connectivity. It combines a typed service-dependency graph, a replication map, a probability measure over node/edge states, and request-specific success predicates. The semantics separate computational failures of replicas from communication failures of dependencies. The model is claimed to be reconstructible from traces and deployment artifacts, support what-if parameterization, and be analyzable via Monte Carlo simulation. A synthetic adequacy study is reported to match closed-form oracles within sampling error while exposing boundaries from bottlenecks, correlated failures, missing traces, and time-dependent effects.

Significance. If the trace-to-model reconstruction generalizes beyond synthetics, the approach could supply a repeatable, lower-risk complement to fault injection for resilience analysis, enabling architectural what-if studies before deployment changes. The separation of failure modes and the demonstration that replication cannot compensate for bottleneck dependencies are potentially useful formal properties for the microservices community.

major comments (3)
  1. [synthetic adequacy study] Abstract and synthetic adequacy study section: the claim that Monte Carlo results match closed-form oracles 'within sampling error' is load-bearing for the adequacy argument, yet no error bars, number of trials, data exclusion rules, or estimation procedure for the probability measure from traces are supplied, leaving it impossible to judge whether the match is robust or merely an artifact of the oracle construction.
  2. [trace-to-model construction] Trace-to-model construction (described in the abstract and model definition): the central claim that the typed dependency graph, replication map, and state probability measure can be reconstructed from real traces and deployment artifacts such that simulation generalizes is unsupported beyond synthetic oracles; if real traces systematically omit correlated failures or under-sample bottlenecks, the semantics and what-if parameterization will not yield predictions that align with observed endpoint availability.
  3. [model semantics] Model semantics paragraph: the property that 'replication cannot compensate for bottleneck dependencies' is derived from the probability measure over states; this is only as reliable as the input measure reconstructed from traces, but no sensitivity analysis or real-trace validation is provided to show the property survives reconstruction noise.
minor comments (2)
  1. Notation for the probability measure and success predicates should be introduced with explicit definitions before use in the Monte Carlo procedure.
  2. The abstract mentions 'elementary semantic properties' but the manuscript should include a dedicated subsection or theorem list for them to aid readability.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating where revisions will be made to improve clarity and rigor. The manuscript's core contribution is the formal model and its synthetic validation; we will adjust claims to reflect the scope of the current evidence.

point-by-point responses
  1. Referee: [synthetic adequacy study] Abstract and synthetic adequacy study section: the claim that Monte Carlo results match closed-form oracles 'within sampling error' is load-bearing for the adequacy argument, yet no error bars, number of trials, data exclusion rules, or estimation procedure for the probability measure from traces are supplied, leaving it impossible to judge whether the match is robust or merely an artifact of the oracle construction.

    Authors: We agree that these methodological details are required to evaluate robustness. The revised manuscript will specify the Monte Carlo trial count (10,000 runs), report error bars as standard error of the mean, state that no data points were excluded, and describe the probability measure estimation as empirical frequencies computed directly from the generated synthetic traces. These additions will be placed in the adequacy study section and referenced from the abstract. revision: yes

  2. Referee: [trace-to-model construction] Trace-to-model construction (described in the abstract and model definition): the central claim that the typed dependency graph, replication map, and state probability measure can be reconstructed from real traces and deployment artifacts such that simulation generalizes is unsupported beyond synthetic oracles; if real traces systematically omit correlated failures or under-sample bottlenecks, the semantics and what-if parameterization will not yield predictions that align with observed endpoint availability.

    Authors: The paper formally defines the trace-to-model construction and validates it under synthetic conditions where ground truth is controlled. We do not provide empirical evidence from production traces, and the current work does not claim such generalization. We will revise the abstract, introduction, and conclusion to explicitly limit the reconstruction claim to the defined procedure and its synthetic adequacy, while adding a limitations paragraph discussing risks from omitted correlated failures or under-sampled bottlenecks in real traces. revision: partial

  3. Referee: [model semantics] Model semantics paragraph: the property that 'replication cannot compensate for bottleneck dependencies' is derived from the probability measure over states; this is only as reliable as the input measure reconstructed from traces, but no sensitivity analysis or real-trace validation is provided to show the property survives reconstruction noise.

    Authors: The property follows deductively from the model semantics once a probability measure is given. We acknowledge the absence of sensitivity analysis to reconstruction noise and of real-trace validation. The revision will include a new sensitivity subsection that perturbs the input measure (e.g., by adding noise to edge probabilities) and re-runs the Monte Carlo analysis to confirm the bottleneck property persists within the tested range. Real-trace validation of the property lies outside the present scope. revision: partial
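
The sensitivity check proposed in the third response could be sketched roughly as follows. This is an illustrative assumption, not the authors' procedure: a single service with `k` replicas calls through one bottleneck edge, the edge failure probability is perturbed with uniform noise, and the simulation is re-run for each perturbed value. All names and defaults are hypothetical.

```python
import random


def simulate(k, q, e, trials, rng):
    """Estimate P(at least one of k replicas is up AND the bottleneck edge is up)."""
    hits = 0
    for _ in range(trials):
        replica_up = any(rng.random() >= q for _ in range(k))
        edge_up = rng.random() >= e
        hits += replica_up and edge_up
    return hits / trials


def sensitivity(base_e, delta, k_values, q=0.1, runs=5, trials=5_000, seed=0):
    """Perturb the edge failure probability and re-estimate availability.

    Returns rows of (perturbed_e, k, estimated_availability).
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(runs):
        # Add uniform noise to the edge failure probability and clip to [0, 1].
        e = min(1.0, max(0.0, base_e + rng.uniform(-delta, delta)))
        for k in k_values:
            rows.append((e, k, simulate(k, q, e, trials, rng)))
    return rows
```

In this setting the bottleneck property holds if, for every row, the estimate stays at or below `1 - e` up to sampling error, regardless of `k`.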

standing simulated objections not resolved
  • Empirical validation of the trace-to-model reconstruction and the bottleneck property on real production traces (as opposed to synthetics)

Circularity Check

0 steps flagged

No circularity: model defined from external traces and validated against independent oracles

full rationale

The paper constructs a stochastic connectivity model from a typed service-dependency graph, replication map, probability measure over states, and request-specific success predicates, with semantics separating replica computation failures from dependency communication failures. Reconstruction is specified from traces and deployment artifacts, and adequacy is checked by Monte Carlo simulation against closed-form synthetic oracles. No equations, fitted parameters, or self-citations are shown that would make any claimed availability measure or semantic property reduce by construction to quantities defined by the same inputs. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The model relies on the unstated assumption that traces capture the necessary dependency and state information.


