Recognition: no theorem link
MIRAGE: Online LLM Simulation for Microservice Dependency Testing
Pith reviewed 2026-05-10 19:39 UTC · model grok-4.3
The pith
Online LLM simulation lets microservice tests generate dependency responses at runtime, reaching 99 percent status-code and response-shape fidelity where record-replay reaches only 62 and 16 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MIRAGE demonstrates online LLM simulation in which the model reads the dependency source code, caller code, and traces to answer each incoming request while preserving cross-request state. In white-box mode this produces 99 percent status-code fidelity and 99 percent response-shape fidelity, compared with 62 percent and 16 percent for record-replay. Caller integration tests yield identical pass or fail results with the simulated dependencies as with the real ones in all eight evaluated scenarios.
What carries the argument
Online LLM simulation: the LLM is prompted at runtime with the dependency's source code, caller code, and traces to generate each response on demand while maintaining state across requests in a test.
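As a deliberately simplified sketch of this mechanism, an online simulator reduces to a prompt builder plus a stateful request handler. Everything below is illustrative: `OnlineSimulator`, `build_prompt`, and the JSON response contract are assumptions made for exposition, not MIRAGE's actual interfaces, and `llm` stands in for whatever model client is used.

```python
import json

def build_prompt(dep_source, caller_source, traces, history, request):
    """Assemble the runtime prompt described above: dependency source,
    caller source, production traces, plus the requests and responses
    already seen in this scenario (the cross-request state)."""
    return "\n\n".join([
        "## Dependency source\n" + dep_source,
        "## Caller source\n" + caller_source,
        "## Production traces\n" + json.dumps(traces),
        "## Conversation so far\n" + json.dumps(history),
        "## Simulate a response to this request\n" + json.dumps(request),
    ])

class OnlineSimulator:
    """Answers each dependency request at runtime via one LLM call,
    keeping history so later responses can depend on earlier requests."""

    def __init__(self, llm, dep_source, caller_source, traces):
        self.llm = llm                    # callable: prompt -> JSON string
        self.dep_source = dep_source
        self.caller_source = caller_source
        self.traces = traces
        self.history = []                 # state for the current scenario

    def handle(self, request):
        prompt = build_prompt(self.dep_source, self.caller_source,
                              self.traces, self.history, request)
        # Assumed contract: the model returns {"status": ..., "body": ...}
        response = json.loads(self.llm(prompt))
        self.history.append({"request": request, "response": response})
        return response
```

The key design point carried by the paper is the `history` list: because each prompt includes the scenario so far, a second request can be answered consistently with the first, which a pre-generated static artifact cannot do.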
Load-bearing premise
That an LLM given source code and traces will generate responses that generalize to unseen error-handling inputs without hallucinations or inconsistencies that would change the pass or fail outcome of a test.
What would settle it
A caller integration test that passes against the real microservice dependency but fails when the same test is run against MIRAGE, or vice versa.
Original abstract
Existing approaches to microservice dependency simulation--record-replay, pattern-mining, and specification-driven stubs--generate static artifacts before test execution. These artifacts can only reproduce behaviors encoded at generation time; on error-handling and code-reasoning scenarios, which are underrepresented in typical trace corpora, record-replay achieves 0% and 12% fidelity in our evaluation. We propose online LLM simulation, a runtime approach where the LLM answers each dependency request as it arrives, maintaining cross-request state throughout a test scenario. The model reads the dependency's source code, caller code, and production traces, then simulates behavior on demand--trading latency (~3 s per request) and cost ($0.16-$0.82 per dependency) for coverage on scenarios that static artifacts miss. We instantiate this approach in MIRAGE and evaluate it on 110 test scenarios across three microservice systems (Google's Online Boutique, Weaveworks' Sock Shop, and a custom system). In white-box mode, MIRAGE achieves 99% status-code and 99% response-shape fidelity, compared to 62% / 16% for record-replay. A signal ablation shows dependency source code is often sufficient (100% alone); without it, the model retains error-code accuracy (94%) but loses response-structure fidelity (75%). Results are stable across three LLM families (within 3%) and deterministic across repeated runs. Caller integration tests produce the same pass/fail outcomes with MIRAGE as with real dependencies (8/8 scenarios).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents MIRAGE, an approach to microservice dependency testing that uses large language models to simulate dependencies online during test execution. Fed the dependency's source code, caller code, and production traces, the LLM generates responses on demand while maintaining state across the multiple requests of a test scenario. This is contrasted with static methods such as record-replay, which fail on error-handling scenarios. The evaluation across 110 test scenarios in three systems (Online Boutique, Sock Shop, and a custom system) reports 99% status-code and 99% response-shape fidelity in white-box mode versus 62%/16% for record-replay, an ablation over input components, stability across three LLM families, and identical integration-test outcomes in 8 of 8 cases.
Significance. Should the empirical results prove robust, MIRAGE offers a promising direction for improving test coverage in microservice architectures, particularly for scenarios underrepresented in traces. The paper's strengths include a clear ablation study isolating the contribution of source code (100% alone), reported stability across three LLM families (within 3%), and consistent determinism on repeated runs. These elements provide a solid foundation for the claims, though the practical trade-offs in latency (~3 s) and cost are acknowledged.
major comments (2)
- §4 (Evaluation): The reported 99% status-code and 99% response-shape fidelity metrics lack a detailed description of the exact criteria used to determine a 'response-shape match,' how the 110 scenarios were selected, or any error analysis for state inconsistencies across requests, which is load-bearing for the direct comparison to record-replay (62%/16%) and the generalization claim.
- Integration test results (8/8 scenarios): The claim that caller integration tests produce identical pass/fail outcomes with MIRAGE as with real dependencies rests on only 8 scenarios; this sample is small relative to the 110 scenarios and does not explicitly validate cross-request state (e.g., tokens or counters) or coverage of underrepresented error/reasoning paths, leaving open the possibility that per-request shape matches mask divergences that would alter test outcomes.
minor comments (1)
- Abstract: The latency and cost figures (~3 s per request, $0.16-$0.82 per dependency) would benefit from explicit specification of the LLMs, hardware, and request volumes used in the measurements.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the evaluation and integration-test claims. We address each major comment below, providing clarifications and committing to targeted revisions that strengthen transparency without altering the reported results.
Point-by-point responses
- Referee: §4 (Evaluation): The reported 99% status-code and 99% response-shape fidelity metrics lack a detailed description of the exact criteria used to determine a 'response-shape match,' how the 110 scenarios were selected, or any error analysis for state inconsistencies across requests, which is load-bearing for the direct comparison to record-replay (62%/16%) and the generalization claim.
Authors: We agree that additional detail on the response-shape metric and scenario construction will improve reproducibility. In the revised manuscript we will expand §4.2 with: (i) an explicit definition stating that a shape match requires identical top-level JSON structure, key presence, and array cardinalities (value equality is required only for non-stateful fields); (ii) the scenario-selection protocol, which enumerated all documented API interactions plus trace-derived edge cases from the three systems, yielding the 110 scenarios; and (iii) a short error-analysis subsection confirming that the two observed mismatches were isolated per-request value errors with no cross-request state divergence. These additions directly support the record-replay comparison and will be marked as new text. revision: yes
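The shape-match criterion proposed in (i) can be made concrete as a small checker. This is an illustrative reading of that definition, not MIRAGE's implementation; `value_checked` names the non-stateful fields whose values must also match, and is an assumed parameter.

```python
def shape_match(expected, actual, value_checked=()):
    """Check the response-shape criterion sketched in the response:
    identical top-level key set, matching array cardinalities, and
    value equality only for explicitly listed non-stateful fields."""
    if set(expected) != set(actual):            # key presence
        return False
    for key, exp in expected.items():
        act = actual[key]
        if isinstance(exp, list):
            if not isinstance(act, list) or len(act) != len(exp):
                return False                     # array cardinality
        if key in value_checked and act != exp:
            return False                         # non-stateful value equality
    return True
```

Under this reading, two responses with the same keys and array lengths match even when stateful values (IDs, timestamps, counters) differ, which is what allows a 'shape match' to be scored separately from exact value equality.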
- Referee: Integration test results (8/8 scenarios): The claim that caller integration tests produce identical pass/fail outcomes with MIRAGE as with real dependencies rests on only 8 scenarios; this sample is small relative to the 110 scenarios and does not explicitly validate cross-request state (e.g., tokens or counters) or coverage of underrepresented error/reasoning paths, leaving open the possibility that per-request shape matches mask divergences that would alter test outcomes.
Authors: The eight integration-test scenarios were deliberately chosen as the complete end-to-end suites for the three systems; each exercises multiple sequential calls and stateful elements (session tokens, counters, and error paths). All eight produced the same pass/fail results with MIRAGE as with the real dependencies. While we acknowledge the modest count relative to the per-request fidelity set, the 110 scenarios already quantify shape and status fidelity, and the integration tests serve only to confirm that high per-request fidelity translates to unchanged test verdicts. In revision we will add an appendix table listing the state variables exercised in each of the eight scenarios and note the sample-size limitation for future work. We do not plan to enlarge the set, as the current evidence already demonstrates equivalence on the available integration suites. revision: partial
Circularity Check
No circularity: empirical results rest on direct external benchmarks
Full rationale
The paper reports measured fidelity (99% status-code, 99% response-shape) and integration-test equivalence (8/8) on 110 scenarios across three independent microservice systems, using direct comparison to real dependencies and a record-replay baseline. No equations, fitted parameters, self-definitional constructs, or load-bearing self-citations appear in the derivation chain. Claims are observational performance numbers, not reductions of outputs to inputs by construction. The evaluation is self-contained against external oracles.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: large language models can accurately simulate microservice dependency behavior from source code and production traces without additional fine-tuning.
Reference graph
Works this paper leans on
- [1] A. Iyer, "Why your microservice integration tests miss real problems," practitioner article, 2024.
- [2] AtomicJar, "Why integration testing is key to testing microservices," https://www.atomicjar.com, 2023.
- [3] SpectoLabs, "HoverFly: Lightweight service virtualization," https://hoverfly.io, 2023.
- [4] M. A. Hossain, J. Han, M. A. Kabir, S. Versteeg, J.-G. Schneider, and J. Jiang, "Mining service behavior for stateful service emulation," CoRR, vol. abs/2510.18519, 2025, arXiv preprint.
- [5] M. A. Kabir, J. Han, M. A. Hossain, and S. Versteeg, "SpecMiner: Heuristic-based mining of service behavioral models from interaction traces," Future Generation Computer Systems, vol. 117, pp. 59–71, 2021.
- [6] Stoplight, "Prism: API mock server from OpenAPI specifications," https://stoplight.io/open-source/prism, 2023.
- [7] WireMock, "WireMock: API mocking tool," https://wiremock.org, 2023.
- [8] B. Byars, "mountebank: Over the wire test doubles," https://www.mbtest.dev, 2023.
- [9] E. Masor, "How to use 'fake dependency' for testing microservices," practitioner article, 2022.
- [10] Kong Inc., "How to test microservices: Strategy and use cases," https://konghq.com, 2022.
- [11] Apiary, "Dredd: HTTP API testing framework," https://dredd.org, 2023.
- [12] Pact Foundation, "Pact: Contract testing for microservices," https://pact.io, 2023.
- [13] VMware, "Spring Cloud Contract," https://spring.io/projects/spring-cloud-contract, 2023.
- [14] I. Robinson, "Consumer-driven contracts: A service evolution pattern," https://martinfowler.com/articles/consumerDrivenContracts.html, 2006.
- [15] M. Chen et al., "Evaluating large language models trained on code," CoRR, vol. abs/2107.03374, 2021, arXiv preprint.
- [16] R. Li, L. B. Allal, Y. Zi, N. Muennighoff, D. Kocetkov, C. Mou, M. Marone et al., "StarCoder: May the source be with you!" Transactions on Machine Learning Research (TMLR), 2023.
- [17] B. Rozière, J. Gehring, F. Gloeckle, S. Sootla, I. Gat et al., "Code Llama: Open foundation models for code," CoRR, vol. abs/2308.12950, 2023, arXiv preprint.
- [18] Y. Chen, Z. Hu, C. Zhi, J. Han, S. Deng, and J. Yin, "ChatUniTest: A framework for LLM-based test generation," in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE Companion), 2024, pp. 572–576.
- [19] C. Lemieux, J. P. Inala, S. K. Lahiri, and S. Sen, "CodaMosa: Escaping coverage plateaus in test generation with pre-trained large language models," in 45th IEEE/ACM International Conference on Software Engineering (ICSE), 2023, pp. 919–931.
- [20] C. S. Xia, Y. Wei, and L. Zhang, "Automated program repair in the era of large pre-trained language models," in 45th IEEE/ACM International Conference on Software Engineering (ICSE), 2023, pp. 1482–1494.
- [21] R. Pan, R. Pavuluri, R. Huang, T. Stennett, R. Krishna, A. Orso, and S. Sinha, "SAINT: Service-level integration test generation with program analysis and LLM-based agents," in Proceedings of the IEEE/ACM International Conference on Software Engineering (ICSE), 2026, to appear.
- [22] R. Elumalai, "Automating Spring Boot integration tests with AI/LLM: A 5-prompt workflow," practitioner article, 2025.
- [23]
- [24] D. M. Yellin, "Evaluating LLMs on microservice-based applications: How complex is your specification?" CoRR, vol. abs/2508.20119, 2025, arXiv preprint.
- [25] J. Wang, Y. Huang, C. Chen, Z. Liu, S. Wang, and Q. Wang, "Software testing with large language models: Survey, landscape, and vision," IEEE Transactions on Software Engineering, vol. 50, no. 4, pp. 911–936, 2024.
- [26] A. Iyer, "Why AI features break microservices testing," practitioner article, 2025.
- [27] Cloud Native Computing Foundation, "OpenTelemetry: An observability framework," https://opentelemetry.io, 2023.
- [28] CNCF, "Jaeger: Open-source distributed tracing," https://www.jaegertracing.io, 2023.
- [29] OpenZipkin, "Zipkin: A distributed tracing system," https://zipkin.io, 2023.
- [30] Y. Gan, Y. Zhang, K. Hu, D. Cheng, Y. He, M. Pancholi, and C. Delimitrou, "Seer: Leveraging big data to navigate the complexity of performance debugging in cloud microservices," in 24th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2019, pp. 19–33.
- [31] R. R. Sambasivan, I. Shafer, J. Mace, B. H. Sigelman, R. Fonseca, and G. R. Ganger, "Principled workflow-centric tracing of distributed systems," in Proceedings of the 7th ACM Symposium on Cloud Computing (SoCC), 2016, pp. 401–414.
- [32] G. Yu, P. Chen, H. Chen, Z. Guan, Z. Huang, L. Jing, T. Weng, X. Sun, and X. Li, "MicroRank: End-to-end latency issue localization with extended spectrum analysis in microservice environments," in Proceedings of the Web Conference 2021 (WWW), 2021, pp. 3087–3098.
- [33] I. Beschastnikh, Y. Brun, S. Schneider, M. Sloan, and M. D. Ernst, "Leveraging existing instrumentation to automatically infer invariant-constrained models," in 19th ACM SIGSOFT Symposium on the Foundations of Software Engineering (FSE), 2011, pp. 267–277.
- [34] K. Honda, N. Yoshida, and M. Carbone, "Multiparty asynchronous session types," Journal of the ACM, vol. 63, no. 1, pp. 9:1–9:67, 2016.
- [35] G. Castagna, N. Gesbert, and L. Padovani, "A theory of contracts for web services," ACM Transactions on Programming Languages and Systems, vol. 31, no. 5, pp. 19:1–19:61, 2009.
- [36] S. Ramírez, "FastAPI: Modern web framework for building APIs with Python," https://fastapi.tiangolo.com, 2023.
- [37] Google Cloud, "Online Boutique: A cloud-native microservices demo application," https://github.com/GoogleCloudPlatform/microservices-demo, 2023.
- [38] Weaveworks, "Sock Shop: A microservices demo application," https://microservices-demo.github.io, 2023.