Duet instrumentation: An Agentic Approach to Improving Sensitivity in Cloud Service Benchmarking
Pith reviewed 2026-05-19 23:51 UTC · model grok-4.3
The pith
Duet instrumentation uses LLMs to target performance measurements at code changes, detecting regressions at up to 5 times lower severity than standard benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Duet instrumentation analyzes code changes between two consecutive application versions using large language models to identify performance-relevant modifications, then instruments those locations for direct performance comparison during a synchronized execution of both versions. This uncovers performance changes with higher sensitivity than traditional black-box application benchmarks. In experiments, the system achieves 58% precision, 93% recall, and 71% specificity for the instrumentation task, and detects injected regressions at up to 5x lower severity while maintaining similar A/A latency distributions.
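The three instrumentation figures are standard confusion-matrix ratios. The sketch below shows how they relate to one another. It is an illustration only: the `Confusion` class, the reading of a "negative" as a code change that is not performance-relevant, and the example counts are all assumptions. The counts are made up so that the ratios land near the reported values; the paper's actual counts are not given here, and its figures are averaged across tasks rather than computed from one pooled matrix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Outcome counts for one instrumentation task (hypothetical layout)."""
    tp: int  # probe placed at a performance-relevant change
    fp: int  # probe placed where no relevant change exists
    tn: int  # irrelevant change correctly left uninstrumented
    fn: int  # relevant change that received no probe


def _ratio(numerator: int, denominator: int) -> float:
    # Return 0.0 instead of raising when a class is empty.
    return numerator / denominator if denominator else 0.0


def precision(c: Confusion) -> float:
    return _ratio(c.tp, c.tp + c.fp)


def recall(c: Confusion) -> float:
    return _ratio(c.tp, c.tp + c.fn)


def specificity(c: Confusion) -> float:
    return _ratio(c.tn, c.tn + c.fp)


# Made-up counts, NOT taken from the paper.
EXAMPLE = Confusion(tp=14, fp=10, tn=25, fn=1)
```

With these invented counts, nearly every relevant change receives a probe (high recall) while a sizeable share of probes sit on irrelevant code (moderate precision), which is the trade-off the reported numbers describe.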
What carries the argument
Duet instrumentation: an LLM identifies the performance-relevant code changes, and those locations are then instrumented so that both versions can be compared directly during a synchronized dual-version benchmark.
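To make the mechanism concrete, here is a minimal sketch of what a targeted probe could look like once a performance-relevant location has been chosen. Everything in it is hypothetical: the `ProbeRecorder` class, the version labels `"v1"` and `"v2"`, and the pairing of samples by arrival order are assumptions for illustration, not the prototype's actual design.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class ProbeRecorder:
    """Collects per-probe durations for two versions run side by side."""

    def __init__(self, clock=time.perf_counter):
        self._clock = clock  # injectable so the recorder can be tested
        self.samples = defaultdict(list)

    @contextmanager
    def probe(self, version, probe_id):
        # Time only the instrumented region, not the whole request.
        start = self._clock()
        try:
            yield
        finally:
            self.samples[(version, probe_id)].append(self._clock() - start)

    def paired_differences(self, probe_id, old="v1", new="v2"):
        # Pair the i-th sample of each version; assumes both versions
        # handled the same request sequence in lockstep.
        old_samples = self.samples[(old, probe_id)]
        new_samples = self.samples[(new, probe_id)]
        return [n - o for o, n in zip(old_samples, new_samples)]
```

A call site would wrap the changed region, for example `with recorder.probe("v2", "checkout.total"): ...`, where the probe name is again made up. Measuring at the probe rather than end to end is what lets a small slowdown stand out from unrelated request-level noise.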
If this is right
- Performance regressions can be detected at lower severity levels than with traditional duet application benchmarks.
- Similar A/A latency distributions are preserved, indicating no increase in measurement noise.
- The automation reduces the manual effort needed to combine application and microbenchmark approaches.
- Continuous benchmarking becomes more effective for catching bugs before production deployment.
Where Pith is reading between the lines
- This approach might generalize to other types of software beyond cloud services if the LLM identification step works reliably across domains.
- Integrating this into CI/CD pipelines could allow more frequent performance checks without a proportional increase in testing time.
- Future work could explore using the same instrumentation for diagnosing the root cause of detected regressions rather than just detection.
Load-bearing premise
The large language model must correctly identify enough of the performance-relevant code changes to make the added instrumentation points actually increase the ability to spot small regressions.
What would settle it
A controlled experiment where known performance regressions are injected at specific code locations and the method's detection rate for low-severity cases is compared directly to a non-instrumented benchmark.
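Such an experiment needs a fixed rule for when a regression counts as detected. The sketch below uses one common choice, a bootstrap confidence interval on the median of paired latency differences, and flags a regression when the whole interval lies above zero. This rule, the function names, and the defaults (`n_resamples`, `alpha`, `seed`) are assumptions; the detection criterion the paper actually uses is not stated in this summary.

```python
import random
import statistics


def bootstrap_ci_median(diffs, n_resamples=2000, alpha=0.01, seed=0):
    """Percentile bootstrap interval for the median of paired differences."""
    rng = random.Random(seed)  # seeded so repeated runs agree
    n = len(diffs)
    medians = sorted(
        statistics.median(rng.choices(diffs, k=n))
        for _ in range(n_resamples)
    )
    lower = medians[int((alpha / 2) * n_resamples)]
    upper = medians[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def regression_detected(latencies_old, latencies_new, **kwargs):
    """True if the new version is slower with the chosen confidence."""
    # Differences are paired: sample i of both versions came from the
    # same synchronized request.
    diffs = [new - old for old, new in zip(latencies_old, latencies_new)]
    lower, _upper = bootstrap_ci_median(diffs, **kwargs)
    return lower > 0
```

Running the same rule on probe-level samples and on end-to-end samples, across a range of injected severities, would give the direct comparison described above. An A/A run, where both inputs come from the same version, should not trigger a detection.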
Original abstract
Continuous cloud service performance benchmarking is essential for detecting performance bugs early before deploying them to production. However, detecting performance regressions using application benchmarks, which usually treat the system under test as a black box, is challenging due to variable I/O calls or changing performance characteristics of the underlying cloud infrastructure. Microbenchmarks are often more sensitive and accurate, but also more time-consuming to implement and run. Further, they do not capture the performance of the integrated system as a whole. A comprehensive performance assessment therefore typically requires a combination of both approaches. To address the shortcomings of application benchmarks, we propose duet instrumentation, a novel benchmarking paradigm enabled by recent advancements in large language model (LLM) code understanding. The idea is to analyze code changes between two consecutive application versions and measure performance differences directly at performance-relevant changes during a synchronized benchmark of both application versions, uncovering performance changes with higher sensitivity. We design a system that reliably automates the assessment and instrumentation of performance-relevant code changes between the two application versions. In experiments with a realistic testbed application offering configurable performance regressions, we find that our prototype achieves 58% precision, 93% recall, and 71% specificity (averaged across tasks) when comparing the generated instrumentation against the ideal instrumentation with a line-distance threshold of five. In the downstream application benchmark, we find that our prototype can detect performance regressions at up to 5x lower injected severity compared to a traditional duet application benchmark while preserving similar A/A latency distributions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes 'duet instrumentation,' an agentic LLM-based method to analyze code changes between consecutive versions of a cloud service, identify performance-relevant sites, and apply targeted instrumentation. This enables synchronized benchmarking of both versions to detect performance regressions with greater sensitivity than traditional black-box application benchmarks. On a realistic testbed with configurable injected regressions, the prototype reports 58% precision, 93% recall, and 71% specificity (at line-distance threshold of five) versus ideal instrumentation, plus up to 5x lower detectable severity in downstream regression detection while preserving similar A/A latency distributions.
Significance. If the results hold, the work offers a practical way to improve early performance regression detection in cloud services by combining microbenchmark-like sensitivity with application-level integration. The concrete precision/recall/specificity numbers, realistic testbed, and reported 5x severity improvement are strengths that could influence benchmarking practices if the causal link between LLM-identified sites and actual latency changes is robustly demonstrated.
major comments (2)
- [Experimental evaluation / downstream benchmark results] The central 5x sensitivity claim in the downstream benchmark rests on the assumption that LLM-identified instrumentation sites align with locations where injected regressions alter latency. However, the reported 58% precision at line-distance threshold of five implies a high rate of false-positive probes; it is unclear whether these extra measurements on unrelated paths dilute the sensitivity gain or introduce overhead that offsets the benefit, even with preserved A/A distributions. This needs explicit analysis or ablation in the experimental section to support the causal claim.
- [Testbed and regression injection description] The evaluation uses a configurable testbed with localized injected regressions. It is not shown whether the performance-relevant changes detected by the LLM correspond to the actual regression sites or if the 5x improvement generalizes beyond this setup-specific localization, which could artificially favor line-proximate detection.
minor comments (2)
- [Abstract] The abstract states the 58% precision / 93% recall / 71% specificity figures but provides no statistical details, error bars, or variance across tasks; adding these would strengthen the instrumentation accuracy claim.
- [Instrumentation generation] The line-distance threshold of five is used without reported sensitivity analysis or justification for its selection; a brief ablation on threshold choice would clarify robustness.
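The threshold ablation requested in the last comment can be sketched as follows. The matching rule here, a greedy one-to-one assignment of each generated probe to the nearest unmatched ideal probe within the threshold, is an assumption; the summary does not say how the paper matches generated against ideal instrumentation. Line numbers in the usage note are invented.

```python
def match_probes(generated, ideal, threshold=5):
    """Greedily match generated probe lines to ideal probe lines.

    Returns (true_positives, false_positives, false_negatives).
    """
    unmatched_ideal = sorted(ideal)
    tp = fp = 0
    for line in sorted(generated):
        # Nearest ideal probe that has not been claimed yet.
        nearest = min(
            unmatched_ideal, key=lambda i: abs(i - line), default=None
        )
        if nearest is not None and abs(nearest - line) <= threshold:
            unmatched_ideal.remove(nearest)
            tp += 1
        else:
            fp += 1
    return tp, fp, len(unmatched_ideal)


def sweep_thresholds(generated, ideal, thresholds=(0, 1, 2, 5, 10)):
    """Precision and recall at each line-distance threshold."""
    results = {}
    for t in thresholds:
        tp, fp, fn = match_probes(generated, ideal, t)
        results[t] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        }
    return results
```

For example, `sweep_thresholds([10, 42, 100], [12, 40, 70])` shows both metrics rising once the threshold reaches the two-line offset of the first two probes, while the probe at line 100 never matches. A curve that is flat around five would suggest the reported accuracy does not hinge on that particular choice.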
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comments raise important points about the strength of our causal claims and the scope of our evaluation. We address each major comment below and indicate the revisions we will make to strengthen the paper.
- Referee: [Experimental evaluation / downstream benchmark results] The central 5x sensitivity claim in the downstream benchmark rests on the assumption that LLM-identified instrumentation sites align with locations where injected regressions alter latency. However, the reported 58% precision at line-distance threshold of five implies a high rate of false-positive probes; it is unclear whether these extra measurements on unrelated paths dilute the sensitivity gain or introduce overhead that offsets the benefit, even with preserved A/A distributions. This needs explicit analysis or ablation in the experimental section to support the causal claim.
  Authors: We agree that an explicit ablation would more directly support the causal relationship between the identified sites and the observed sensitivity improvement. Our current results already show that A/A latency distributions are preserved, which indicates that the additional probes from false positives do not introduce detectable overhead or variance in the testbed. To strengthen this, we will add an ablation study in the revised experimental section that compares regression detection sensitivity when using only the ideal instrumentation sites versus the full set of LLM-generated sites. This will quantify any dilution effect from false positives and report the corresponding benchmark runtime overhead.
  Revision: yes
- Referee: [Testbed and regression injection description] The evaluation uses a configurable testbed with localized injected regressions. It is not shown whether the performance-relevant changes detected by the LLM correspond to the actual regression sites or if the 5x improvement generalizes beyond this setup-specific localization, which could artificially favor line-proximate detection.
  Authors: The testbed injects regressions at specific, known code locations precisely to enable controlled measurement of how well the LLM identifies performance-relevant changes. The reported 93% recall shows that the majority of these actual regression sites are captured by the LLM analysis. The line-distance threshold of five is used to account for small structural differences in the generated instrumentation. We acknowledge that the controlled and localized nature of the injections may favor detection of proximate changes and that broader generalization to non-injected, production-like regressions remains to be validated. In the revision we will expand the testbed description to clarify the injection mechanism and add a dedicated limitations subsection discussing the scope of the 5x improvement and directions for future validation on real-world traces.
  Revision: partial
Circularity Check
No circularity; empirical evaluation is self-contained
full rationale
The paper presents an empirical prototype for LLM-assisted duet instrumentation, reporting measured precision (58%), recall (93%), and specificity (71%) against an ideal instrumentation baseline at line-distance threshold of five, plus downstream benchmark results showing up to 5x lower-severity regression detection while preserving A/A latency distributions. No mathematical derivations, equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the text. The central claims rest on direct experimental comparisons in a configurable testbed rather than reducing to self-referential definitions or prior author results by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- line-distance threshold = five
axioms (1)
- domain assumption: Large language models can analyze code changes and identify performance-relevant locations with usable accuracy.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Paper passage: "We propose duet instrumentation... analyze code changes between two consecutive application versions and measure performance differences directly at performance-relevant changes"
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Paper passage: "58% precision, 93% recall... detect performance regressions at up to 5× lower injected severity"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.