Recognition: no theorem link
MIRAGE: Online LLM Simulation for Microservice Dependency Testing
Pith reviewed 2026-05-10 19:39 UTC · model grok-4.3
The pith
Online LLM simulation lets microservice tests generate dependency responses at runtime, reaching 99 percent status-code and response-shape fidelity where record-replay reaches only 62 and 16 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MIRAGE demonstrates online LLM simulation in which the model reads the dependency source code, caller code, and traces to answer each incoming request while preserving cross-request state. In white-box mode this produces 99 percent status-code fidelity and 99 percent response-shape fidelity, compared with 62 percent and 16 percent for record-replay. Caller integration tests yield identical pass or fail results with the simulated dependencies as with the real ones in all eight evaluated scenarios.
What carries the argument
Online LLM simulation: the LLM is prompted at runtime with the dependency's source code, caller code, and traces to generate each response on demand while maintaining state across requests in a test.
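As a deliberately simplified sketch of this mechanism, an online simulator reduces to a prompt builder plus a stateful request handler. Everything below is illustrative: `OnlineSimulator`, `build_prompt`, and the JSON response contract are assumptions made for exposition, not MIRAGE's actual interfaces, and `llm` stands in for whatever model client is used.

```python
import json

def build_prompt(dep_source, caller_source, traces, history, request):
    """Assemble the runtime prompt described above: dependency source,
    caller source, production traces, plus the requests and responses
    already seen in this scenario (the cross-request state)."""
    return "\n\n".join([
        "## Dependency source\n" + dep_source,
        "## Caller source\n" + caller_source,
        "## Production traces\n" + json.dumps(traces),
        "## Conversation so far\n" + json.dumps(history),
        "## Simulate a response to this request\n" + json.dumps(request),
    ])

class OnlineSimulator:
    """Answers each dependency request at runtime via one LLM call,
    keeping history so later responses can depend on earlier requests."""

    def __init__(self, llm, dep_source, caller_source, traces):
        self.llm = llm                    # callable: prompt -> JSON string
        self.dep_source = dep_source
        self.caller_source = caller_source
        self.traces = traces
        self.history = []                 # state for the current scenario

    def handle(self, request):
        prompt = build_prompt(self.dep_source, self.caller_source,
                              self.traces, self.history, request)
        # Assumed contract: the model returns {"status": ..., "body": ...}
        response = json.loads(self.llm(prompt))
        self.history.append({"request": request, "response": response})
        return response
```

The key design point carried by the paper is the `history` list: because each prompt includes the scenario so far, a second request can be answered consistently with the first, which a pre-generated static artifact cannot do.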
Load-bearing premise
That an LLM given source code and traces will generate responses that generalize to unseen error-handling inputs without hallucinations or inconsistencies that would change the pass or fail outcome of a test.
What would settle it
A caller integration test that passes against the real microservice dependency but fails when the same test is run against MIRAGE, or vice versa.
Original abstract
Existing approaches to microservice dependency simulation--record-replay, pattern-mining, and specification-driven stubs--generate static artifacts before test execution. These artifacts can only reproduce behaviors encoded at generation time; on error-handling and code-reasoning scenarios, which are underrepresented in typical trace corpora, record-replay achieves 0% and 12% fidelity in our evaluation. We propose online LLM simulation, a runtime approach where the LLM answers each dependency request as it arrives, maintaining cross-request state throughout a test scenario. The model reads the dependency's source code, caller code, and production traces, then simulates behavior on demand--trading latency (~3 s per request) and cost ($0.16-$0.82 per dependency) for coverage on scenarios that static artifacts miss. We instantiate this approach in MIRAGE and evaluate it on 110 test scenarios across three microservice systems (Google's Online Boutique, Weaveworks' Sock Shop, and a custom system). In white-box mode, MIRAGE achieves 99% status-code and 99% response-shape fidelity, compared to 62% / 16% for record-replay. A signal ablation shows dependency source code is often sufficient (100% alone); without it, the model retains error-code accuracy (94%) but loses response-structure fidelity (75%). Results are stable across three LLM families (within 3%) and deterministic across repeated runs. Caller integration tests produce the same pass/fail outcomes with MIRAGE as with real dependencies (8/8 scenarios).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MIRAGE, an approach to microservice dependency testing that uses large language models to simulate dependencies online during test execution. Fed the dependency's source code, caller code, and production traces, the LLM generates responses on demand while maintaining state across the multiple requests of a test scenario. This is contrasted with static methods such as record-replay, which fail on error-handling scenarios. The evaluation across 110 test scenarios in three systems (Online Boutique, Sock Shop, and a custom system) reports 99% status-code and 99% response-shape fidelity in white-box mode versus 62%/16% for record-replay, an ablation over input components, stability across three LLM families, and identical integration-test outcomes in 8 of 8 cases.
Significance. Should the empirical results prove robust, MIRAGE offers a promising direction for improving test coverage in microservice architectures, particularly for scenarios underrepresented in traces. The paper's strengths include a clear ablation study isolating the contribution of source code (100% alone), reported stability across three LLM families (within 3%), and consistent determinism on repeated runs. These elements provide a solid foundation for the claims, though the practical trade-offs in latency (~3 s) and cost are acknowledged.
major comments (2)
- §4 (Evaluation): The reported 99% status-code and 99% response-shape fidelity metrics lack a detailed description of the exact criteria used to determine a 'response-shape match,' how the 110 scenarios were selected, or any error analysis for state inconsistencies across requests, which is load-bearing for the direct comparison to record-replay (62%/16%) and the generalization claim.
- Integration test results (8/8 scenarios): The claim that caller integration tests produce identical pass/fail outcomes with MIRAGE as with real dependencies rests on only 8 scenarios; this sample is small relative to the 110 scenarios and does not explicitly validate cross-request state (e.g., tokens or counters) or coverage of underrepresented error/reasoning paths, leaving open the possibility that per-request shape matches mask divergences that would alter test outcomes.
minor comments (1)
- Abstract: The latency and cost figures (~3 s per request, $0.16-$0.82 per dependency) would benefit from explicit specification of the LLMs, hardware, and request volumes used in the measurements.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the evaluation and integration-test claims. We address each major comment below, providing clarifications and committing to targeted revisions that strengthen transparency without altering the reported results.
Point-by-point responses
- Referee: §4 (Evaluation): The reported 99% status-code and 99% response-shape fidelity metrics lack a detailed description of the exact criteria used to determine a 'response-shape match,' how the 110 scenarios were selected, or any error analysis for state inconsistencies across requests, which is load-bearing for the direct comparison to record-replay (62%/16%) and the generalization claim.
Authors: We agree that additional detail on the response-shape metric and scenario construction will improve reproducibility. In the revised manuscript we will expand §4.2 with: (i) an explicit definition stating that a shape match requires identical top-level JSON structure, key presence, and array cardinalities (value equality is required only for non-stateful fields); (ii) the scenario-selection protocol, which enumerated all documented API interactions plus trace-derived edge cases from the three systems, yielding the 110 scenarios; and (iii) a short error-analysis subsection confirming that the two observed mismatches were isolated per-request value errors with no cross-request state divergence. These additions directly support the record-replay comparison and will be marked as new text. revision: yes
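The shape-match criterion proposed in (i) can be made concrete as a small checker. This is an illustrative reading of that definition, not MIRAGE's implementation; `value_checked` names the non-stateful fields whose values must also match, and is an assumed parameter.

```python
def shape_match(expected, actual, value_checked=()):
    """Check the response-shape criterion sketched in the response:
    identical top-level key set, matching array cardinalities, and
    value equality only for explicitly listed non-stateful fields."""
    if set(expected) != set(actual):            # key presence
        return False
    for key, exp in expected.items():
        act = actual[key]
        if isinstance(exp, list):
            if not isinstance(act, list) or len(act) != len(exp):
                return False                     # array cardinality
        if key in value_checked and act != exp:
            return False                         # non-stateful value equality
    return True
```

Under this reading, two responses with the same keys and array lengths match even when stateful values (IDs, timestamps, counters) differ, which is what allows a 'shape match' to be scored separately from exact value equality.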
- Referee: Integration test results (8/8 scenarios): The claim that caller integration tests produce identical pass/fail outcomes with MIRAGE as with real dependencies rests on only 8 scenarios; this sample is small relative to the 110 scenarios and does not explicitly validate cross-request state (e.g., tokens or counters) or coverage of underrepresented error/reasoning paths, leaving open the possibility that per-request shape matches mask divergences that would alter test outcomes.
Authors: The eight integration-test scenarios were deliberately chosen as the complete end-to-end suites for the three systems; each exercises multiple sequential calls and stateful elements (session tokens, counters, and error paths). All eight produced the same pass/fail results with MIRAGE as with the real dependencies. While we acknowledge the modest count relative to the per-request fidelity set, the 110 scenarios already quantify shape and status fidelity, and the integration tests serve only to confirm that high per-request fidelity translates to unchanged test verdicts. In revision we will add an appendix table listing the state variables exercised in each of the eight scenarios and note the sample-size limitation for future work. We do not plan to enlarge the set, as the current evidence already demonstrates equivalence on the available integration suites. revision: partial
Circularity Check
No circularity: empirical results rest on direct external benchmarks
Full rationale
The paper reports measured fidelity (99% status-code, 99% response-shape) and integration-test equivalence (8/8) on 110 scenarios across three independent microservice systems, using direct comparison to real dependencies and a record-replay baseline. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the derivation chain. Claims are observational performance numbers, not reductions of outputs to inputs by construction. The evaluation is self-contained against external oracles.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: large language models can accurately simulate microservice dependency behavior from source code and production traces without additional fine-tuning.
Reference graph
Works this paper leans on
- [1] A. Iyer, "Why your microservice integration tests miss real problems," practitioner article, 2024.
- [2] AtomicJar, "Why integration testing is key to testing microservices," https://www.atomicjar.com, 2023.
- [3] SpectoLabs, "HoverFly: Lightweight service virtualization," https://hoverfly.io, 2023.
- [4] M. A. Hossain, J. Han, M. A. Kabir, S. Versteeg, J.-G. Schneider, and J. Jiang, "Mining service behavior for stateful service emulation," CoRR, vol. abs/2510.18519, 2025, arXiv preprint.
- [5] M. A. Kabir, J. Han, M. A. Hossain, and S. Versteeg, "SpecMiner: Heuristic-based mining of service behavioral models from interaction traces," Future Generation Computer Systems, vol. 117, pp. 59–71, 2021.
- [6] Stoplight, "Prism: API mock server from OpenAPI specifications," https://stoplight.io/open-source/prism, 2023.
- [7] WireMock, "WireMock: API mocking tool," https://wiremock.org, 2023.
- [8] B. Byars, "mountebank: Over the wire test doubles," https://www.mbtest.dev, 2023.
- [9] E. Masor, "How to use 'fake dependency' for testing microservices," practitioner article, 2022.
- [10] Kong Inc., "How to test microservices: Strategy and use cases," https://konghq.com, 2022.
- [11] Apiary, "Dredd: HTTP API testing framework," https://dredd.org, 2023.
- [12] Pact Foundation, "Pact: Contract testing for microservices," https://pact.io, 2023.
- [13] VMware, "Spring Cloud Contract," https://spring.io/projects/spring-cloud-contract, 2023.
- [14] I. Robinson, "Consumer-driven contracts: A service evolution pattern," https://martinfowler.com/articles/consumerDrivenContracts.html, 2006.
- [15] M. Chen et al., "Evaluating large language models trained on code," CoRR, vol. abs/2107.03374, 2021, arXiv preprint.
- [16] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone et al., "StarCoder: May the source be with you!" Transactions on Machine Learning Research (TMLR), 2023.
- [17] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat et al., "Code Llama: Open foundation models for code," CoRR, vol. abs/2308.12950, 2023, arXiv preprint.
- [18] Y. Chen, Z. Hu, C. Zhi, J. Han, S. Deng, and J. Yin, "ChatUniTest: A framework for LLM-based test generation," in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE Companion), 2024, pp. 572–576.
- [19] C. Lemieux, J. P. Inala, S. K. Lahiri, and S. Sen, "CodaMosa: Escaping coverage plateaus in test generation with pre-trained large language models," in 45th IEEE/ACM International Conference on Software Engineering (ICSE), 2023, pp. 919–931.
- [20] C. S. Xia, Y. Wei, and L. Zhang, "Automated program repair in the era of large pre-trained language models," in 45th IEEE/ACM International Conference on Software Engineering (ICSE), 2023, pp. 1482–1494.
- [21] R. Pan, R. Pavuluri, R. Huang, T. Stennett, R. Krishna, A. Orso, and S. Sinha, "SAINT: Service-level integration test generation with program analysis and LLM-based agents," in Proceedings of the IEEE/ACM International Conference on Software Engineering (ICSE), 2026, to appear.
- [22] R. Elumalai, "Automating Spring Boot integration tests with AI/LLM: A 5-prompt workflow," practitioner article, 2025.
- [23]
- [24] D. M. Yellin, "Evaluating LLMs on microservice-based applications: How complex is your specification?" CoRR, vol. abs/2508.20119, 2025, arXiv preprint.
- [25] J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, and Q. Wang, "Software testing with large language models: Survey, landscape, and vision," IEEE Transactions on Software Engineering, vol. 50, no. 4, pp. 911–936, 2024.
- [26] A. Iyer, "Why AI features break microservices testing," practitioner article, 2025.
- [27] Cloud Native Computing Foundation, "OpenTelemetry: An observability framework," https://opentelemetry.io, 2023.
- [28] CNCF, "Jaeger: Open-source distributed tracing," https://www.jaegertracing.io, 2023.
- [29] OpenZipkin, "Zipkin: A distributed tracing system," https://zipkin.io, 2023.
- [30] Y. Gan, Y. Zhang, K. Hu, D. Cheng, Y. He, M. Pancholi, and C. Delimitrou, "Seer: Leveraging big data to navigate the complexity of performance debugging in cloud microservices," in 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019, pp. 19–33.
- [31] R. R. Sambasivan, I. Shafer, J. Mace, B. H. Sigelman, R. Fonseca, and G. R. Ganger, "Principled workflow-centric tracing of distributed systems," in Proceedings of the 7th ACM Symposium on Cloud Computing (SoCC), 2016, pp. 401–414.
- [32] G. Yu, P. Chen, H. Chen, Z. Guan, Z. Huang, L. Jing, T. Weng, X. Sun, and X. Li, "MicroRank: End-to-end latency issue localization with extended spectrum analysis in microservice environments," in Proceedings of the Web Conference 2021 (WWW), 2021, pp. 3087–3098.
- [33] I. Beschastnikh, Y. Brun, S. Schneider, M. Sloan, and M. D. Ernst, "Leveraging existing instrumentation to automatically infer invariant-constrained models," in 19th ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE), 2011, pp. 267–277.
- [34] K. Honda, N. Yoshida, and M. Carbone, "Multiparty asynchronous session types," Journal of the ACM, vol. 63, no. 1, pp. 9:1–9:67, 2016.
- [35] G. Castagna, N. Gesbert, and L. Padovani, "A theory of contracts for web services," ACM Transactions on Programming Languages and Systems, vol. 31, no. 5, pp. 19:1–19:61, 2009.
- [36] S. Ramírez, "FastAPI: Modern web framework for building APIs with Python," https://fastapi.tiangolo.com, 2023.
- [37] Google Cloud, "Online Boutique: A cloud-native microservices demo application," https://github.com/GoogleCloudPlatform/microservices-demo, 2023.
- [38] Weaveworks, "Sock Shop: A microservices demo application," https://microservices-demo.github.io, 2023.