Evaluating Asynchronous Semantics in Trace-Discovered Resilience Models: A Case Study on the OpenTelemetry Demo
Pith reviewed 2026-05-16 22:48 UTC · model grok-4.3
The pith
Adding asynchronous semantics for Kafka edges changes predicted HTTP availability by at most 0.001 percentage points in a trace-derived model of the OpenTelemetry demo.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The trace-derived connectivity model reproduces the overall availability degradation curve observed in chaos experiments. Introducing asynchronous semantics for Kafka edges changes the predicted availabilities by at most about 10^{-5}, or 0.001 percentage points. Therefore, for immediate HTTP availability in this case study, a simpler connectivity-only model is sufficient.
What carries the argument
Monte Carlo simulation over a service dependency graph extracted from raw OpenTelemetry traces, using endpoint-specific success predicates and optional non-blocking treatment of Kafka edges under a fail-stop failure model.
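The mechanism described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy graph, the function names, and the "every blocking dependency must be alive" success rule are all assumptions made for this example, whereas the paper attaches endpoint-specific predicates that are not reproduced here.

```python
import random

# Hypothetical toy graph: caller -> [(callee, kind)], kind is "sync" or "async".
GRAPH = {
    "frontend": [("checkout", "sync"), ("cart", "sync")],
    "checkout": [("payment", "sync"), ("kafka", "async")],
    "cart": [],
    "payment": [],
    "kafka": [],
}


def endpoint_succeeds(entry, alive, graph, async_nonblocking):
    """True if the entry service and every blocking dependency are alive.

    Assumed predicate: all transitively reachable dependencies are required.
    With async_nonblocking=True, "async" edges are not followed.
    """
    stack, seen = [entry], set()
    while stack:
        svc = stack.pop()
        if svc in seen:
            continue
        seen.add(svc)
        if svc not in alive:
            return False  # fail-stop: a dead service answers nothing
        for callee, kind in graph.get(svc, []):
            if kind == "async" and async_nonblocking:
                continue
            stack.append(callee)
    return True


def estimate_availability(entry, graph, fail_fraction, trials,
                          async_nonblocking, seed=0):
    """Monte Carlo estimate: kill a fixed fraction of services per trial."""
    rng = random.Random(seed)
    services = sorted(graph)
    kills = round(fail_fraction * len(services))
    successes = 0
    for _ in range(trials):
        dead = set(rng.sample(services, kills))
        alive = set(services) - dead
        successes += endpoint_succeeds(entry, alive, graph, async_nonblocking)
    return successes / trials


sync_only = estimate_availability("frontend", GRAPH, 0.2, 10_000, False)
with_async = estimate_availability("frontend", GRAPH, 0.2, 10_000, True)
print(sync_only, with_async)
```

Because both calls use the same seed, they sample identical failure sets, so any difference between the two estimates comes only from the treatment of the asynchronous edge.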
If this is right
- Availability estimates stay essentially unchanged across the tested failure fractions when async details are omitted.
- For HTTP-centric microservices, effort spent on timing semantics yields negligible improvement in immediate-success predictions.
- A connectivity-only graph extracted from traces is adequate for reproducing the observed degradation curve in this deployment.
- Computational cost of the resilience analysis can be reduced by dropping the asynchronous rules without loss of accuracy for the studied metric.
Where Pith is reading between the lines
- The same trace-to-graph pipeline could be applied to other observability-heavy systems to test whether the negligible async effect holds more broadly.
- Systems that rely on long-running asynchronous workflows rather than immediate HTTP replies might show larger differences once the same modeling choice is examined.
- Extending the predicates to include partial failure modes or latency bounds would be a direct next measurement to check the limits of the connectivity-only simplification.
Load-bearing premise
The trace-derived graph plus endpoint success predicates plus fail-stop failure model accurately represent the real behavior of the demo under the chaos experiments performed in Docker Compose.
What would settle it
Running the same random service-kill patterns on the live demo and recording endpoint success rates that differ by more than 0.001 percentage points between the connectivity-only and async versions of the model.
Original abstract
While distributed tracing and chaos engineering are becoming standard for microservices, resilience models remain largely manual and bespoke. We revisit a trace-discovered connectivity model that derives a service dependency graph from traces and uses Monte Carlo simulation to estimate endpoint availability under fail-stop service failures. Compared to earlier work, we (i) derive the graph directly from raw OpenTelemetry traces, (ii) attach endpoint-specific success predicates, and (iii) add a simple asynchronous semantics that treats Kafka edges as non-blocking for immediate HTTP success. We apply this model to the OpenTelemetry Demo ("Astronomy Shop") using a GitHub Actions workflow that discovers the graph, runs simulations, and executes chaos experiments that randomly kill microservices in a Docker Compose deployment. Across the studied failure fractions, the model reproduces the overall availability degradation curve, while asynchronous semantics for Kafka edges change predicted availabilities by at most about 10^{-5} (0.001 percentage points). This null result suggests that for immediate HTTP availability in this case study, explicitly modeling asynchronous dependencies is not warranted, and a simpler connectivity-only model is sufficient.
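As a rough illustration of step (i) in the abstract, the sketch below turns a flat list of spans into a set of service-to-service edges. The span dictionaries and their field names (`span_id`, `parent_span_id`, `service`, `kind`) are a simplified, hypothetical representation rather than an exporter format, and marking an edge as asynchronous when the child span kind is `CONSUMER` is an assumed heuristic; the paper's own rule for identifying Kafka edges is not given in this summary.

```python
def build_service_graph(spans):
    """Map (caller, callee) service pairs to "sync" or "async"."""
    by_id = {span["span_id"]: span for span in spans}
    edges = {}
    for span in spans:
        parent = by_id.get(span.get("parent_span_id"))
        if parent is None or parent["service"] == span["service"]:
            continue  # root span or in-process child: no cross-service edge
        # Assumed heuristic: a CONSUMER child means a message-queue hop.
        kind = "async" if span.get("kind") == "CONSUMER" else "sync"
        edges[(parent["service"], span["service"])] = kind
    return edges


# Hypothetical three-span trace, for illustration only.
spans = [
    {"span_id": "a", "parent_span_id": None, "service": "frontend", "kind": "SERVER"},
    {"span_id": "b", "parent_span_id": "a", "service": "checkout", "kind": "SERVER"},
    {"span_id": "c", "parent_span_id": "b", "service": "accounting", "kind": "CONSUMER"},
]
print(build_service_graph(spans))
```

Real traces often connect producers and consumers through span links rather than parent-child relations, which this sketch does not handle.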
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates a trace-derived resilience model for microservices using the OpenTelemetry Demo. It derives a service dependency graph directly from raw OpenTelemetry traces, attaches endpoint-specific success predicates, and employs Monte Carlo simulation under a fail-stop failure model to estimate endpoint availability. The model is compared against chaos experiments that randomly kill microservices in a Docker Compose deployment. The central claims are that the simulation reproduces the observed availability degradation curve across studied failure fractions and that introducing simple asynchronous semantics for Kafka edges alters predicted availabilities by at most 10^{-5} (0.001 percentage points), implying that a connectivity-only model suffices for immediate HTTP availability in this case study.
Significance. If the reproduction of the degradation curve holds under the stated assumptions, the work provides concrete evidence from a realistic open-source demo that explicit modeling of asynchronous dependencies is unnecessary for short-term availability predictions in HTTP-centric microservices. The GitHub Actions workflow for trace discovery, simulation, and chaos execution is a strength that supports reproducibility. The null result on async semantics could inform simpler modeling practices, though its generality is limited to the studied system and failure model.
major comments (2)
- Abstract: The claim that the model reproduces the availability degradation curve and that async semantics change predictions by at most 10^{-5} is presented without error bars, confidence intervals, number of Monte Carlo runs, or discussion of simulation variance; this information is load-bearing for interpreting whether the tiny delta is distinguishable from noise and for supporting the conclusion that async modeling is unwarranted.
- Methods (trace processing and predicate attachment): Endpoint-specific success predicates are invoked to determine simulation outcomes but their exact definitions, how they are extracted from traces, and their mapping to the discovered graph are not specified in sufficient detail to assess fidelity to the Docker Compose chaos experiments or to enable independent replication.
minor comments (3)
- The manuscript would benefit from a dedicated limitations section discussing the fail-stop assumption and potential mismatches with real partial-failure or timeout behaviors observed in the demo.
- Figure captions and axis labels for the degradation curves should explicitly state the number of simulation trials and any aggregation method used to produce the plotted points.
- A brief comparison table contrasting the connectivity-only model versus the async variant (e.g., per-endpoint availability at each failure fraction) would make the 10^{-5} delta easier to evaluate.
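The first major comment turns on whether a difference of 10^{-5} can be resolved by simulation at all. A back-of-the-envelope check, assuming each trial is an independent success/failure draw so that the binomial standard error sqrt(p(1-p)/N) applies, is sketched below. The availability value 0.9 and the trial count are purely illustrative and are not taken from the paper.

```python
import math


def binomial_standard_error(p, trials):
    """Standard error of an availability estimate from independent trials."""
    return math.sqrt(p * (1.0 - p) / trials)


def trials_needed(p, target_se):
    """Trials required for the standard error to fall to target_se."""
    return math.ceil(p * (1.0 - p) / target_se ** 2)


p = 0.9  # illustrative availability, not a value from the paper
print(binomial_standard_error(p, 1_000_000))  # on the order of 3e-4
print(trials_needed(p, 1e-5))                 # on the order of 9e8
```

Under these assumptions, independent runs of the two model variants would need on the order of 10^9 trials each to resolve a 10^{-5} gap. If instead both variants are evaluated on the same sampled failure sets, the paired difference can have far lower variance than either estimate alone, which is one reason the run count and sampling scheme matter for interpreting the reported delta.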
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the work's reproducibility strengths. We address each major comment below and will revise the manuscript accordingly to enhance clarity and support independent replication.
Point-by-point responses
- Referee: Abstract: The claim that the model reproduces the availability degradation curve and that async semantics change predictions by at most 10^{-5} is presented without error bars, confidence intervals, number of Monte Carlo runs, or discussion of simulation variance; this information is load-bearing for interpreting whether the tiny delta is distinguishable from noise and for supporting the conclusion that async modeling is unwarranted.
  Authors: We agree that reporting simulation parameters and variance is necessary to substantiate the claims. In the revised manuscript we will state the exact number of Monte Carlo runs used for each availability estimate, include error bars (or confidence intervals) derived from the simulation replicates, and add a short discussion showing that the maximum observed difference of 10^{-5} lies well below the estimated simulation variance, confirming it is indistinguishable from noise under the fail-stop model.
  Revision: yes
- Referee: Methods (trace processing and predicate attachment): Endpoint-specific success predicates are invoked to determine simulation outcomes but their exact definitions, how they are extracted from traces, and their mapping to the discovered graph are not specified in sufficient detail to assess fidelity to the Docker Compose chaos experiments or to enable independent replication.
  Authors: We acknowledge the need for greater detail. The revised Methods section will explicitly list the success predicate for each endpoint (defined from trace attributes such as HTTP status codes and span status), describe the automated extraction rules applied to the raw OpenTelemetry traces, and specify the mapping from predicates to nodes in the discovered dependency graph. These additions will allow readers to verify fidelity to the chaos experiments and to replicate the simulation outcomes.
  Revision: yes
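To make the second exchange concrete, here is one way a success predicate based on HTTP status codes and span status might look. It is a sketch under stated assumptions: the flat span dictionaries, the `status_code` field, the function names, and the choice to treat only 5xx responses as failures are all hypothetical, and the paper's actual predicates are not specified in this summary. The attribute key `http.response.status_code` follows the OpenTelemetry HTTP semantic conventions.

```python
def span_ok(span):
    """Assumed rule: a span fails on ERROR status or an HTTP 5xx response."""
    if span.get("status_code", "UNSET") == "ERROR":
        return False
    http_status = span.get("attributes", {}).get("http.response.status_code")
    return http_status is None or http_status < 500


def endpoint_ok(trace_spans, required_services):
    """Hypothetical endpoint predicate over one trace.

    The request counts as successful if every required service emitted
    at least one span and none of its spans failed.
    """
    seen, failed = set(), set()
    for span in trace_spans:
        seen.add(span["service"])
        if not span_ok(span):
            failed.add(span["service"])
    return required_services <= seen and not (required_services & failed)


# Hypothetical trace: the payment span reports a 503.
trace = [
    {"service": "frontend", "status_code": "UNSET",
     "attributes": {"http.response.status_code": 200}},
    {"service": "payment", "status_code": "ERROR",
     "attributes": {"http.response.status_code": 503}},
]
print(endpoint_ok(trace, {"frontend"}))
print(endpoint_ok(trace, {"frontend", "payment"}))
```

Keeping the required-service set per endpoint explicit is one possible way to express the mapping from predicates to graph nodes that the referee asks for.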
Circularity Check
Direct Monte Carlo simulation on trace-derived graph yields null result with no fitted inputs or self-referential reduction
Full rationale
The paper derives the dependency graph directly from raw OpenTelemetry traces, attaches endpoint-specific success predicates, and runs fail-stop Monte Carlo simulation to estimate availability under random service kills. These estimates are compared to separate chaos experiments in Docker Compose. The reported reproduction of the degradation curve and the ≤10^{-5} delta from adding non-blocking Kafka semantics are outputs of the simulation itself, not parameters fitted to the target curve or defined in terms of the result. The only self-reference is to prior model definition, which is not load-bearing for the null finding on asynchronous semantics. No step reduces by construction to its own inputs.
Axiom & Free-Parameter Ledger
free parameters (1)
- studied failure fractions
axioms (2)
- domain assumption: Services fail in a fail-stop manner
- domain assumption: Traces from the demo capture the complete dependency graph for the studied endpoints