{"paper":{"title":"Duet instrumentation: An Agentic Approach to Improving Sensitivity in Cloud Service Benchmarking","license":"http://creativecommons.org/licenses/by-nc-nd/4.0/","headline":"Duet instrumentation uses LLMs to target performance measurements at code changes, detecting regressions at up to 5 times lower severity than standard benchmarks.","cross_cats":[],"primary_cat":"cs.DC","authors_text":"David Bermbach, Nils Japke, Sebastian Koch","submitted_at":"2026-05-18T13:43:10Z","abstract_excerpt":"Continuous cloud service performance benchmarking is essential for detecting performance bugs early before deploying them to production. However, detecting performance regressions using application benchmarks, which usually treat the system under test as a black box, is challenging due to variable I/O calls or changing performance characteristics of the underlying cloud infrastructure. Microbenchmarks are often more sensitive and accurate, but also more time-consuming to implement and run. Further, they do not capture the performance of the integrated system as a whole. 
A comprehensive perform"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"our prototype can detect performance regressions at up to 5x lower injected severity compared to a traditional duet application benchmark while preserving similar A/A latency distributions.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The LLM can reliably identify performance-relevant code changes between versions with enough accuracy (reported 58% precision, 93% recall at line-distance threshold of five) that the added instrumentation actually improves downstream regression detection sensitivity.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"Duet instrumentation uses LLM-driven code analysis to instrument performance-relevant changes between two app versions, detecting regressions at up to 5x lower severity than standard duet benchmarks in a testbed evaluation.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Duet instrumentation uses LLMs to target performance measurements at code changes, detecting regressions at up to 5 times lower severity than standard benchmarks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"3aef4b8de22be621bcffdfe5a5e52e5c5f9578f44b7ea337952e1a4c3f7db08e"},"source":{"id":"2605.18397","kind":"arxiv","version":1},"verdict":{"id":"0efabba9-c2f0-4957-b692-946e5a6d1a95","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-19T23:51:40.450089Z","strongest_claim":"our prototype can detect performance regressions at up to 5x lower injected severity compared to a traditional duet application benchmark while preserving similar A/A latency distributions.","one_line_summary":"Duet 
instrumentation uses LLM-driven code analysis to instrument performance-relevant changes between two app versions, detecting regressions at up to 5x lower severity than standard duet benchmarks in a testbed evaluation.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The LLM can reliably identify performance-relevant code changes between versions with enough accuracy (reported 58% precision, 93% recall at line-distance threshold of five) that the added instrumentation actually improves downstream regression detection sensitivity.","pith_extraction_headline":"Duet instrumentation uses LLMs to target performance measurements at code changes, detecting regressions at up to 5 times lower severity than standard benchmarks."},"integrity":{"clean":true,"summary":{"advisory":0,"critical":0,"by_detector":{},"informational":0},"endpoint":"/pith/2605.18397/integrity.json","findings":[],"available":true,"detectors_run":[{"name":"doi_compliance","ran_at":"2026-05-20T00:01:26.676718Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"doi_title_agreement","ran_at":"2026-05-20T00:01:20.336962Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"cited_work_retraction","ran_at":"2026-05-19T23:52:10.661667Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"citation_quote_validity","ran_at":"2026-05-19T23:50:03.288231Z","status":"skipped","version":"0.1.0","findings_count":0},{"name":"ai_meta_artifact","ran_at":"2026-05-19T23:33:29.750640Z","status":"skipped","version":"1.0.0","findings_count":0},{"name":"external_links","ran_at":"2026-05-19T23:31:43.303921Z","status":"completed","version":"1.0.0","findings_count":0},{"name":"claim_evidence","ran_at":"2026-05-19T23:21:58.728174Z","status":"completed","version":"1.0.0","findings_count":0}],"snapshot_sha256":"1d00c3a1e6b3a8ddf2d88088bd231d60367dafa8a9f8277bb0ce1809f1505086"},"references":{"count":33,"sample":[{"doi":"","year":2016,"title":"Bifrost: Supporting 
continuous deployment with automated enactment of multi-phase live testing strategies,","work_id":"550a7271-f224-4f42-8c70-b6f09774dbb6","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2019,"title":"Continuous benchmarking: Using system benchmarking in build pipelines,","work_id":"aabe20e1-14e3-47d1-97e7-6fff50822178","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1145/3427921.3450234","year":2021,"title":"Creating a virtuous cycle in performance testing at mongodb,","work_id":"da5e0257-ba3c-4716-8e27-868b3402f7a9","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"10.1145/2885497","year":2016,"title":"Patterns in the chaos - A study of performance variation and predictability in public iaas clouds,","work_id":"9c208d66-b054-4238-8846-49b03d0763b7","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2017,"title":"D. Bermbach, E. Wittern, and S. Tai, Cloud Service Benchmarking: Measuring Quality of Cloud Services from a Client Perspective, 1st ed. Springer Publishing Company, Incorporated, 2017","work_id":"8edacf17-469b-4f84-8e9c-eeb5aedbd84b","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":33,"snapshot_sha256":"ed9bffd848427b2b25a974702c993aa5b2662d9551a6c031fd3ed4fcd9215318","internal_anchors":2},"formal_canon":{"evidence_count":2,"snapshot_sha256":"c2e2f7e3fd5f7834c8f919ec8b86adcee71518e7bc188f336d4e12f5b735bc6b"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}