Stochastic Connectivity as the Foundation of a Runtime Model for Microservice Availability Analysis

Anatoly A. Krasnovsky; Anna Maslovskaya

arxiv: 2607.00740 · v1 · pith:WG65S63Hnew · submitted 2026-07-01 · 💻 cs.SE · cs.DC · cs.PF

Pith reviewed 2026-07-02 08:34 UTC · model grok-4.3

classification 💻 cs.SE · cs.DC · cs.PF

keywords microservices · availability · stochastic model · dependency graph · Monte Carlo simulation · fault injection · runtime model

The pith

A stochastic connectivity model computes microservice endpoint availability from traces and deployment data under explicit fault scenarios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a formal runtime model that converts distributed traces and deployment metadata into a probabilistic representation of service dependencies. Analysts can then run Monte Carlo simulations to estimate the probability that a given endpoint succeeds when specific replicas or links fail. The model keeps replica computation failures separate from dependency communication failures. If the reconstruction step works, teams gain a repeatable way to test architectural changes for resilience without repeating costly fault-injection experiments for each modification.
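As a rough illustration of the Monte Carlo step described above, the sketch below estimates endpoint success for a toy three-service chain, with an optional fault scenario that forces chosen links down. The service names, probabilities, independence of failures, and the "every dependency is required" predicate are all assumptions made for this example; they are not taken from the paper.

```python
import random

# Hypothetical toy system: names and probabilities are illustrative assumptions.
REPLICAS = {"gateway": 2, "orders": 3, "db": 1}                  # replication map
NODE_UP = {"gateway": 0.99, "orders": 0.95, "db": 0.98}           # P(a replica computes correctly)
EDGE_UP = {("gateway", "orders"): 0.97, ("orders", "db"): 0.96}  # P(a logical dependency communicates)


def sample_success(rng, forced_down_edges=frozenset()):
    """Draw one system state and evaluate an all-dependencies-required predicate."""
    # Computational side: a service is usable if at least one replica is up.
    for service, count in REPLICAS.items():
        if not any(rng.random() < NODE_UP[service] for _ in range(count)):
            return False
    # Communication side: every logical dependency on the path must be up.
    for edge, p_up in EDGE_UP.items():
        if edge in forced_down_edges or rng.random() >= p_up:
            return False
    return True


def estimate_availability(trials, seed=0, forced_down_edges=frozenset()):
    """Fraction of sampled states in which the endpoint request succeeds."""
    rng = random.Random(seed)
    hits = sum(sample_success(rng, forced_down_edges) for _ in range(trials))
    return hits / trials


if __name__ == "__main__":
    print("baseline:", estimate_availability(20_000))
    print("orders->db link forced down:",
          estimate_availability(20_000, forced_down_edges=frozenset({("orders", "db")})))
```

Replica states and link states are sampled separately here, which mirrors the separation of computation failures from communication failures in spirit only.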

Core claim

The model treats endpoint availability under explicit fault scenarios as a measurable facet of microservice resilience. It combines a typed service-dependency graph, a replication map, a probability measure over node and edge states, and request-specific success predicates. Its semantics separate computational failures of service replicas from communication failures of logical dependencies, which shows that replication cannot compensate for bottleneck dependencies.

What carries the argument

A typed service-dependency graph with a replication map, a probability measure over node and edge states, and request-specific success predicates, under semantics that separate replica failures from dependency failures.
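One way to picture these ingredients as data is sketched below. The class names, field names, edge type labels, and the path-based predicate are hypothetical choices for illustration; the paper's formal notation may differ, and the probability measure is left out so that the predicate can be shown on a single fixed state.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

Edge = Tuple[str, str]


@dataclass(frozen=True)
class RuntimeModel:
    # Hypothetical field names, not the paper's notation.
    edge_types: Dict[Edge, str]   # typed logical dependencies, e.g. "sync" or "async"
    replicas: Dict[str, int]      # replication map: service -> replica count


@dataclass(frozen=True)
class State:
    up_replicas: Dict[str, int]   # number of replicas currently up, per service
    up_edges: FrozenSet[Edge]     # logical dependencies that currently communicate


def request_succeeds(model: RuntimeModel, state: State, path: Tuple[str, ...]) -> bool:
    """Success predicate for one request type that traverses `path` in order."""
    # Computational check: each visited service needs at least one live replica.
    if any(state.up_replicas.get(service, 0) < 1 for service in path):
        return False
    # Communication check: each consecutive hop must be a known, live dependency.
    hops = list(zip(path, path[1:]))
    return all(hop in model.edge_types and hop in state.up_edges for hop in hops)
```

With this shape, a state in which all replicas of a callee are up but the calling edge is down still fails the request, which is the distinction the model is built around.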

If this is right

Replication cannot compensate for bottleneck dependencies.
The model can be reconstructed from traces and deployment artifacts.
Monte Carlo simulation supports what-if analysis of architectural changes.
The model can run before or alongside fault-injection experiments.
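The first consequence in the list above can be made concrete with a short worked bound. The symbols below are assumptions introduced for this illustration: $S$ is the event that the request succeeds, $E$ the event that a required dependency $e$ communicates, $q_e = \Pr(E)$, $p$ the probability that a single replica is up, and $k$ the replica count of the callee.

```latex
% If the success predicate requires dependency e, then S \subseteq E, so for any replica count
\Pr(S) \;\le\; \Pr(E) \;=\; q_e .

% Special case: one required edge, k replicas failing independently of each other and of the edge
\Pr(S) \;=\; q_e \,\bigl(1 - (1 - p)^{k}\bigr)
\;\xrightarrow[k \to \infty]{}\; q_e .
```

The first line needs no independence assumption. The second shows that adding replicas only moves availability toward the ceiling $q_e$ and never above it.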

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architects could parameterize the probability measure to explore different failure correlation patterns.
Time-dependent failures would require extending the model with explicit timing information from traces.
Missing traces for some dependencies would force conservative bounds on the computed availability.
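To illustrate why the choice of correlation pattern matters, the sketch below compares independent replica failures with a simple common-cause model in which one shared shock takes down every replica of a service. The common-cause form and all parameter names are assumptions for this example, not something the paper specifies.

```python
def marginal_failure(shock: float, local_fail: float) -> float:
    """Per-replica failure probability implied by the common-cause model."""
    return shock + (1.0 - shock) * local_fail


def service_up_independent(marginal_fail: float, replicas: int) -> float:
    """P(at least one replica up) if replicas fail independently."""
    return 1.0 - marginal_fail ** replicas


def service_up_common_cause(shock: float, local_fail: float, replicas: int) -> float:
    """P(at least one replica up) when a shared shock downs all replicas at once."""
    # With probability (1 - shock) there is no shock and replicas fail independently.
    return (1.0 - shock) * (1.0 - local_fail ** replicas)


if __name__ == "__main__":
    shock, local_fail, replicas = 0.02, 0.05, 3
    m = marginal_failure(shock, local_fail)
    print("independent, same marginal:", service_up_independent(m, replicas))
    print("common cause:", service_up_common_cause(shock, local_fail, replicas))
```

Both cases give each replica the same marginal failure probability, yet the common-cause service is capped at `1 - shock` regardless of replica count, so an independence assumption would overstate availability here.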

Load-bearing premise

Traces and deployment artifacts contain enough detail to reconstruct the typed service-dependency graph, replication map, and probability measure so that Monte Carlo results generalize beyond the tested cases.
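A minimal sketch of the kind of reconstruction this premise depends on is shown below: it derives service-level dependency edges from parent/child span links. The span dictionaries and their keys (`span_id`, `parent_id`, `service`) are a hypothetical simplified format, not the schema of any particular tracing system, and the paper's own trace-to-model construction may work differently.

```python
from collections import Counter


def dependency_edges(spans):
    """Count caller -> callee service edges from parent/child span links."""
    service_of = {span["span_id"]: span["service"] for span in spans}
    edges = Counter()
    for span in spans:
        parent = span.get("parent_id")
        if parent is None or parent not in service_of:
            continue  # root span, or parent missing from the sampled trace
        caller, callee = service_of[parent], span["service"]
        if caller != callee:  # skip in-process child spans
            edges[(caller, callee)] += 1
    return edges


if __name__ == "__main__":
    trace = [
        {"span_id": "a", "parent_id": None, "service": "gateway"},
        {"span_id": "b", "parent_id": "a", "service": "orders"},
        {"span_id": "c", "parent_id": "b", "service": "db"},
        {"span_id": "d", "parent_id": "zz", "service": "cache"},  # parent span not captured
    ]
    print(dict(dependency_edges(trace)))
```

The span whose parent was not captured contributes no edge at all, which is one concrete way that incomplete traces could hide a dependency from the reconstructed graph.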

What would settle it

A synthetic test case in which Monte Carlo estimates of endpoint success probability differ from the closed-form oracle by more than sampling error.
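Such a test can be sketched for the simplest oracle case, a single service with independent replicas, where the closed form is `1 - (1 - p)^k`. The function names, the 3-standard-error threshold, and the choice of oracle are assumptions for illustration; the paper's adequacy study may use different cases and criteria.

```python
import math
import random


def oracle_parallel(p_up: float, replicas: int) -> float:
    """Closed form: at least one of `replicas` independent replicas is up."""
    return 1.0 - (1.0 - p_up) ** replicas


def monte_carlo_parallel(p_up: float, replicas: int, trials: int, seed: int = 0) -> float:
    """Estimate the same probability by sampling replica states."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        # Draw every replica state explicitly, then apply the predicate.
        states = [rng.random() < p_up for _ in range(replicas)]
        hits += any(states)
    return hits / trials


def within_sampling_error(estimate: float, truth: float, trials: int, z: float = 3.0) -> bool:
    """Compare against the oracle using the binomial standard error at the true value."""
    se = math.sqrt(truth * (1.0 - truth) / trials)
    return abs(estimate - truth) <= z * se


if __name__ == "__main__":
    truth = oracle_parallel(0.9, 2)
    estimate = monte_carlo_parallel(0.9, 2, trials=20_000)
    print(truth, estimate, within_sampling_error(estimate, truth, 20_000))
```

A gap larger than the chosen multiple of the standard error, repeated across seeds, would be the kind of evidence the paragraph above describes.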

Figures

Figures reproduced from arXiv: 2607.00740 by Anatoly A. Krasnovsky, Anna Maslovskaya.

Original abstract

Microservice availability is commonly assessed by fault injection and chaos experiments, but such experiments are costly, operationally risky, and difficult to repeat for every architectural change. Distributed tracing and deployment metadata provide cheaper evidence, yet they usually remain descriptive: they show which services interacted, not what endpoint-level availability property follows. This paper proposes a formal runtime availability model based on stochastic connectivity for resilience-oriented analysis of microservice endpoints. It treats endpoint availability under explicit fault scenarios as a measurable facet of microservice resilience, combining a typed service-dependency graph, a replication map, a probability measure over node and edge states, and request-specific success predicates. Its semantics separates computational failures of service replicas from communication failures of logical dependencies, showing that replication cannot compensate for bottleneck dependencies. The model can be reconstructed from traces and deployment artifacts, parameterized for architectural what-if analysis, and analyzed by Monte Carlo simulation before or alongside fault injection. We define the model, its trace-to-model construction, elementary semantic properties, and a synthetic adequacy study. The study matches closed-form oracle cases within sampling error and exposes boundaries caused by edge bottlenecks, correlated failures, missing traces, and time-dependent failures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper defines a stochastic connectivity model for microservice endpoint availability that separates replica failures from dependency communication failures and can be reconstructed from traces. Its synthetic study matches the oracles, but there is no real-trace validation.

The core contribution is a formal runtime model that turns traces and deployment data into a typed dependency graph plus probability measure, then uses Monte Carlo to compute endpoint availability under explicit faults. It cleanly separates computational replica failures from logical dependency communication failures and shows replication cannot fix bottleneck links. That framing is new enough in the microservices space and the semantics look consistent on paper.

The synthetic adequacy study is the strongest part: it matches closed-form oracles within sampling error and surfaces the expected limits from missing traces, correlated failures, and time dependence. That gives a reproducible baseline.

The soft spot is exactly where the stress-test note flags it. The model is only shown to work when the input graph and probabilities are handed to it synthetically. There is no demonstration that real traces actually contain the typed dependencies, replication counts, and state probabilities at the granularity needed, nor that the resulting availability numbers align with observed behavior under faults. If reconstruction systematically under-samples bottlenecks or correlations, the what-if analysis will not generalize. The abstract acknowledges these boundaries but does not quantify them on production data.

This is for researchers building formal resilience models in software engineering who already work with tracing and graph-based reliability. It is not yet ready for practitioners who need validated predictions on live systems.

I would send it to peer review. The formal construction and synthetic check are solid enough to justify referee time, even though the real-data step is still missing.

Referee Report

3 major / 2 minor

Summary. The paper proposes a formal runtime availability model for microservice endpoints grounded in stochastic connectivity. It combines a typed service-dependency graph, a replication map, a probability measure over node/edge states, and request-specific success predicates. The semantics separate computational failures of replicas from communication failures of dependencies. The model is claimed to be reconstructible from traces and deployment artifacts, support what-if parameterization, and be analyzable via Monte Carlo simulation. A synthetic adequacy study is reported to match closed-form oracles within sampling error while exposing boundaries from bottlenecks, correlated failures, missing traces, and time-dependent effects.

Significance. If the trace-to-model reconstruction generalizes beyond synthetics, the approach could supply a repeatable, lower-risk complement to fault injection for resilience analysis, enabling architectural what-if studies before deployment changes. The separation of failure modes and the demonstration that replication cannot compensate for bottleneck dependencies are potentially useful formal properties for the microservices community.

major comments (3)

[synthetic adequacy study] Abstract and synthetic adequacy study section: the claim that Monte Carlo results match closed-form oracles 'within sampling error' is load-bearing for the adequacy argument, yet no error bars, number of trials, data exclusion rules, or estimation procedure for the probability measure from traces are supplied, leaving it impossible to judge whether the match is robust or merely an artifact of the oracle construction.
[trace-to-model construction] Trace-to-model construction (described in the abstract and model definition): the central claim that the typed dependency graph, replication map, and state probability measure can be reconstructed from real traces and deployment artifacts such that simulation generalizes is unsupported beyond synthetic oracles; if real traces systematically omit correlated failures or under-sample bottlenecks, the semantics and what-if parameterization will not yield predictions that align with observed endpoint availability.
[model semantics] Model semantics paragraph: the property that 'replication cannot compensate for bottleneck dependencies' is derived from the probability measure over states; this is only as reliable as the input measure reconstructed from traces, but no sensitivity analysis or real-trace validation is provided to show the property survives reconstruction noise.
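For context on the error bars requested in the first major comment, the standard error of a Monte Carlo availability estimate can be written down directly. The symbols are assumptions for this illustration: $X_i \in \{0, 1\}$ is the success indicator of trial $i$, $n$ the number of independent trials, and $\hat{A}$ the estimate.

```latex
\hat{A} \;=\; \frac{1}{n} \sum_{i=1}^{n} X_i ,
\qquad
\mathrm{SE}(\hat{A}) \;=\; \sqrt{\frac{\hat{A}\,(1 - \hat{A})}{n}}
\;\le\; \frac{1}{2\sqrt{n}} .

% Illustrative trial count n = 10^4:
\mathrm{SE}(\hat{A}) \;\le\; \frac{1}{2\sqrt{10^{4}}} \;=\; 0.005 ,
\qquad
\text{approximate 95\% interval: } \hat{A} \pm 1.96\,\mathrm{SE}(\hat{A}) .
```

The bound on the right is the worst case at $\hat{A} = 0.5$; for availabilities close to 1 the standard error is considerably smaller, so reporting it alongside $n$ is what makes "within sampling error" checkable.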

minor comments (2)

Notation for the probability measure and success predicates should be introduced with explicit definitions before use in the Monte Carlo procedure.
The abstract mentions 'elementary semantic properties' but the manuscript should include a dedicated subsection or theorem list for them to aid readability.

Simulated Author's Rebuttal

3 responses · 1 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating where revisions will be made to improve clarity and rigor. The manuscript's core contribution is the formal model and its synthetic validation; we will adjust claims to reflect the scope of the current evidence.

Referee: [synthetic adequacy study] Abstract and synthetic adequacy study section: the claim that Monte Carlo results match closed-form oracles 'within sampling error' is load-bearing for the adequacy argument, yet no error bars, number of trials, data exclusion rules, or estimation procedure for the probability measure from traces are supplied, leaving it impossible to judge whether the match is robust or merely an artifact of the oracle construction.

Authors: We agree that these methodological details are required to evaluate robustness. The revised manuscript will specify the Monte Carlo trial count (10,000 runs), report error bars as standard error of the mean, state that no data points were excluded, and describe the probability measure estimation as empirical frequencies computed directly from the generated synthetic traces. These additions will be placed in the adequacy study section and referenced from the abstract.
revision: yes

Referee: [trace-to-model construction] Trace-to-model construction (described in the abstract and model definition): the central claim that the typed dependency graph, replication map, and state probability measure can be reconstructed from real traces and deployment artifacts such that simulation generalizes is unsupported beyond synthetic oracles; if real traces systematically omit correlated failures or under-sample bottlenecks, the semantics and what-if parameterization will not yield predictions that align with observed endpoint availability.

Authors: The paper formally defines the trace-to-model construction and validates it under synthetic conditions where ground truth is controlled. We do not provide empirical evidence from production traces, and the current work does not claim such generalization. We will revise the abstract, introduction, and conclusion to explicitly limit the reconstruction claim to the defined procedure and its synthetic adequacy, while adding a limitations paragraph discussing risks from omitted correlated failures or under-sampled bottlenecks in real traces.
revision: partial

Referee: [model semantics] Model semantics paragraph: the property that 'replication cannot compensate for bottleneck dependencies' is derived from the probability measure over states; this is only as reliable as the input measure reconstructed from traces, but no sensitivity analysis or real-trace validation is provided to show the property survives reconstruction noise.

Authors: The property follows deductively from the model semantics once a probability measure is given. We acknowledge the absence of sensitivity analysis to reconstruction noise and of real-trace validation. The revision will include a new sensitivity subsection that perturbs the input measure (e.g., by adding noise to edge probabilities) and re-runs the Monte Carlo analysis to confirm the bottleneck property persists within the tested range. Real-trace validation of the property lies outside the present scope.
revision: partial
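The perturbation idea in the last response can be sketched as follows. To keep the example deterministic it uses a closed form for one replicated service behind one required dependency instead of re-running Monte Carlo, and it assumes independent failures; the function names and the symmetric noise band are hypothetical and not taken from the paper.

```python
def availability(edge_up: float, replica_up: float, replicas: int) -> float:
    """Closed form for one replicated service behind one required dependency."""
    return edge_up * (1.0 - (1.0 - replica_up) ** replicas)


def availability_band(edge_up: float, replica_up: float, replicas: int, noise: float):
    """Availability at the lowest and highest edge probability within +/- noise."""
    lo_q = max(0.0, edge_up - noise)  # clip the perturbed probability to [0, 1]
    hi_q = min(1.0, edge_up + noise)
    return availability(lo_q, replica_up, replicas), availability(hi_q, replica_up, replicas)


if __name__ == "__main__":
    for k in (1, 2, 5, 10):
        low, high = availability_band(edge_up=0.90, replica_up=0.80, replicas=k, noise=0.05)
        print(k, round(low, 4), round(high, 4))
```

Under these assumptions the upper end of the band never exceeds the perturbed edge probability, however many replicas are added. Noise in the reconstructed measure moves the ceiling, but it does not remove it.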

standing simulated objections not resolved

Empirical validation of the trace-to-model reconstruction and the bottleneck property on real production traces (as opposed to synthetics)

Circularity Check

0 steps flagged

No circularity: model defined from external traces and validated against independent oracles

The paper constructs a stochastic connectivity model from a typed service-dependency graph, replication map, probability measure over states, and request-specific success predicates, with semantics separating replica computation failures from dependency communication failures. Reconstruction is specified from traces and deployment artifacts, and adequacy is checked by Monte Carlo simulation against closed-form synthetic oracles. No equations, fitted parameters, or self-citations are shown that would make any claimed availability measure or semantic property reduce by construction to quantities defined by the same inputs. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The model relies on the unstated assumption that traces capture the necessary dependency and state information.
