Auditing Empirical Comparisons in Quantum Software
Pith reviewed 2026-07-02 09:01 UTC · model grok-4.3
The pith
Only 8 of 455 reported quantum-software comparisons expose enough matched evidence to be audited under a locked design without proxy reconstruction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CLAIMSTAB-QC classifies strict scalar-directional comparisons as Sustained, Unresolved, or Reversed inside a locked audit scope. Evaluation on 455 claims yields a materialization gap in which only 8 records expose matched evidence without proxy reconstruction, producing 2 Sustained, 4 Unresolved, and 2 Reversed outcomes; diagnostics show simpler checks can retain apparent directions that locked designs weaken.
What carries the argument
CLAIMSTAB-QC, a source-bounded framework that records baselines, metric, relation, and admissible evidence, locks the comparison design, and reports a scoped relation outcome or explicit evidence boundary.
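The record-then-lock idea can be illustrated with a small data structure. This is a minimal sketch, not the paper's implementation: the class names `ComparisonDesign` and `AuditRecord`, their fields, and the example values are all hypothetical, and a frozen dataclass is only one possible way to make a design immutable before outcomes are computed.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Optional, Tuple


@dataclass(frozen=True)
class ComparisonDesign:
    """Hypothetical locked design: fixed before any outcome is computed."""
    baseline_a: str
    baseline_b: str
    metric: str
    relation: str  # reported direction, e.g. "a < b"
    admissible_evidence: Tuple[str, ...]


@dataclass
class AuditRecord:
    """Hypothetical audit record: a scoped outcome or an evidence boundary."""
    design: ComparisonDesign
    outcome: Optional[str] = None            # "Sustained", "Unresolved", "Reversed"
    evidence_boundary: Optional[str] = None  # why no outcome could be computed

    def report(self) -> str:
        d = self.design
        if self.outcome is not None:
            return f"{d.baseline_a} vs {d.baseline_b} on {d.metric}: {self.outcome}"
        return f"evidence boundary: {self.evidence_boundary or 'unspecified'}"


# Illustrative values only; none of these come from the audited papers.
design = ComparisonDesign(
    baseline_a="compiler_a",
    baseline_b="compiler_b",
    metric="cnot_count",
    relation="a < b",
    admissible_evidence=("table_2",),
)

try:
    design.metric = "depth"  # changing a locked design is rejected
except FrozenInstanceError:
    print("design is locked")

print(AuditRecord(design, evidence_boundary="no per-circuit data").report())
```

Keeping the design separate from the outcome means the outcome field can be filled in later without any way to adjust the baselines, metric, or relation after the fact.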
If this is right
- Most reported performance edges between compilers, optimizers, backends, or ansatzes cannot be verified from the evidence the papers expose.
- Reported directions often fail to hold once the audit scope is locked: of the 8 auditable records, 4 are Unresolved and 2 are Reversed.
- Benchmark-relevant comparisons require explicit recording of admissible evidence and locked designs before outcomes are computed.
- Simpler post-hoc checks tend to preserve directions whose support weakens under the stricter locked-audit procedure.
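The contrast between a simpler check and a locked design can be made concrete with a toy example. The sketch below is an assumption-laden illustration, not the decision rule used by CLAIMSTAB-QC, which this page does not state: it assumes paired per-benchmark values where lower is better, a claim of the form "a < b", and a percentile bootstrap interval on the mean paired difference. The function names and the data are hypothetical.

```python
import random
from statistics import mean


def simple_check(a, b):
    """Informal check: is the mean of a below the mean of b?"""
    return mean(a) < mean(b)


def locked_check(a, b, n_boot=2000, alpha=0.05, seed=0):
    """Classify the claim 'a < b' from paired differences.

    Assumed rule (not taken from the paper): Sustained if the bootstrap
    interval of the mean difference lies entirely below zero, Reversed if
    entirely above zero, Unresolved otherwise.
    """
    diffs = [x - y for x, y in zip(a, b)]
    rng = random.Random(seed)  # seed fixed up front, as part of the design
    boots = sorted(
        mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    lo = boots[int(alpha / 2 * n_boot)]
    hi = boots[int((1 - alpha / 2) * n_boot) - 1]
    if hi < 0:
        return "Sustained"
    if lo > 0:
        return "Reversed"
    return "Unresolved"


# Hypothetical metric values on five benchmarks (lower is better).
a = [10, 12, 9, 30, 11]
b = [11, 11, 10, 31, 12]

print(simple_check(a, b))  # the mean of a is lower, so the direction looks supported
print(locked_check(a, b))  # the interval straddles zero on this small sample
```

On this toy data the mean comparison keeps the reported direction while the interval-based rule does not resolve it, which is the kind of divergence the diagnostics describe.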
Where Pith is reading between the lines
- Journals and conferences could require authors to supply a locked-audit record alongside each comparative claim.
- The same framework could be applied to other empirical domains where tool comparisons depend on benchmark scope and noise assumptions.
- Reproducibility efforts would gain from treating the comparison design itself as an auditable artifact rather than only the code or data.
- Extending CLAIMSTAB-QC to multi-metric or non-directional relations would cover a larger fraction of the 455 claims.
Load-bearing premise
The 455 extracted comparative claims are representative of empirical comparisons in the quantum software literature and CLAIMSTAB-QC's evidence classification rules can be applied consistently from the information stated in the source papers.
What would settle it
Re-running the audit on the same 119 papers after authors supply the missing matched evidence for the 45 claims that reached lockable designs but lacked full evidence (53 lockable designs minus the 8 auditable records), and counting how many of the originally reported directions are Sustained.
Original abstract
Empirical quantum-software papers often report that one compiler, optimizer, backend, or ansatz outperforms another. Such comparisons are not properties of a tool alone: they can change with benchmark scope, circuit construction, compilation, sampling, backend or noise assumptions, optimizer choices, and resource budgets. Existing testing, benchmarking, and reproducibility methods help assess programs, tools, executions, and platforms, but they do not directly audit whether the reported comparison itself is supported by the evidence exposed in the source paper or accompanying materials. We present CLAIMSTAB-QC, a source-bounded framework for auditing empirical comparisons in quantum software. Given a reported comparison, the framework records the baselines, metric, relation, and admissible evidence; locks the comparison design before outcomes are computed; and reports either a scoped relation outcome or an explicit evidence boundary. For strict scalar-directional comparisons, the reported direction is classified as Sustained, Unresolved, or Reversed within the locked audit scope. We evaluate CLAIMSTAB-QC on 455 comparative claims from 119 quantum-software papers. The central finding is a materialization gap: 175 claims can be represented for audit planning, 79 become scalar-directional planning records, 53 yield lockable audit or diagnostic designs, and only 8 expose enough matched evidence to audit the original comparison without proxy reconstruction. These 8 records yield 2 Sustained, 4 Unresolved, and 2 Reversed outcomes. Controlled diagnostics over 24 benchmark-relevant comparisons further show that simpler checks can preserve apparent directions whose support weakens under locked audit designs.
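The materialization gap is a chain of counts, so the retention at each stage is simple arithmetic. The sketch below uses only the counts stated in the abstract; the stage labels are paraphrased and the `funnel` helper is a hypothetical name.

```python
# Counts as stated in the abstract; stage labels are paraphrased.
stages = [
    ("extracted claims", 455),
    ("representable for audit planning", 175),
    ("scalar-directional planning records", 79),
    ("lockable audit or diagnostic designs", 53),
    ("auditable without proxy reconstruction", 8),
]


def funnel(stages):
    """Return (name, count, share of previous stage, share of all claims)."""
    total = stages[0][1]
    prev = total
    rows = []
    for name, count in stages:
        rows.append((name, count, count / prev, count / total))
        prev = count
    return rows


rows = funnel(stages)
for name, count, step, overall in rows:
    print(f"{name:40s} {count:4d}  step {step:6.1%}  overall {overall:6.1%}")

# The three outcome classes partition the auditable records.
outcomes = {"Sustained": 2, "Unresolved": 4, "Reversed": 2}
assert sum(outcomes.values()) == stages[-1][1]
```

The last row shows that the 8 auditable records are under 2% of the 455 extracted claims, and that the largest proportional loss is at the final step, from 53 lockable designs to 8 auditable records.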
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CLAIMSTAB-QC, a source-bounded framework that records baselines, metrics, relations, and admissible evidence for empirical comparisons in quantum software, locks the audit design, and classifies strict scalar-directional outcomes as Sustained, Unresolved, or Reversed. Applied to 455 comparative claims extracted from 119 papers, it reports a materialization gap: 175 claims representable for planning, 79 scalar-directional, 53 lockable designs, and only 8 fully auditable without proxies, yielding 2 Sustained, 4 Unresolved, and 2 Reversed. Controlled diagnostics on 24 comparisons illustrate that simpler checks can preserve directions that weaken under locked audits.
Significance. If the sampled corpus is representative, the materialization gap would demonstrate that most reported comparisons in quantum software lack sufficient exposed evidence for direct verification, with implications for reproducibility and benchmarking practices in the field. The framework itself is a constructive contribution that separates planning from outcome computation and applies to external papers without circularity or self-referential parameters.
major comments (2)
- [Abstract and evaluation section] The selection of the 119 papers and extraction of the 455 claims is presented without any search strategy, inclusion/exclusion criteria, date bounds, database, or sampling justification (Abstract and the evaluation that produces the headline counts 175/79/53/8). This is load-bearing for the central claim of a literature-wide materialization gap, as the steep drop-off could be an artifact of an arbitrary or convenience corpus rather than a representative sample.
- [Abstract and framework application] The manuscript provides no information on claim selection criteria, inter-rater reliability, or how CLAIMSTAB-QC's evidence classification rules handle ambiguous cases when reducing 455 claims to the reported counts (Abstract). Without these details the precise materialization numbers cannot be independently verified or reproduced.
minor comments (1)
- [Abstract] The abstract uses the symbol 'o' in the chain 455 claims o 175 representable; this should be replaced by an explicit arrow ('→') for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify key gaps in methodological transparency. We address each major comment below and will revise the manuscript to incorporate the requested details.
Point-by-point responses
Referee: [Abstract and evaluation section] The selection of the 119 papers and extraction of the 455 claims is presented without any search strategy, inclusion/exclusion criteria, date bounds, database, or sampling justification (Abstract and the evaluation that produces the headline counts 175/79/53/8). This is load-bearing for the central claim of a literature-wide materialization gap, as the steep drop-off could be an artifact of an arbitrary or convenience corpus rather than a representative sample.
Authors: We agree that the absence of explicit corpus-construction details weakens the support for a literature-wide claim. The current manuscript describes the counts but does not document how the 119 papers were identified. In revision we will add a new subsection (likely 4.1) that reports: the database(s) queried (arXiv), the date range, the keyword combinations used to locate quantum-software papers containing empirical comparisons, the inclusion criteria applied to retain only papers with at least one explicit baseline-metric-relation statement, and any exclusion rules (e.g., purely theoretical or simulation-only works). We will also state that the sample is a convenience corpus of recent, publicly available papers rather than a probabilistically representative draw, and we will qualify the materialization-gap finding accordingly while retaining the illustrative value of the 8 fully auditable cases. revision: yes
Referee: [Abstract and framework application] The manuscript provides no information on claim selection criteria, inter-rater reliability, or how CLAIMSTAB-QC's evidence classification rules handle ambiguous cases when reducing 455 claims to the reported counts (Abstract). Without these details the precise materialization numbers cannot be independently verified or reproduced.
Authors: We concur that reproducibility of the headline counts requires documentation of the claim-extraction and classification process. The manuscript currently reports only the final tallies. In the revised version we will expand Section 4 to include: (i) the operational definition used to identify a “comparative claim” (explicit mention of two or more baselines, a scalar or directional metric, and a stated relation), (ii) whether extraction was performed by a single rater or multiple raters and, if the latter, any inter-rater agreement statistic, and (iii) concrete examples of ambiguous cases together with the exact rule from CLAIMSTAB-QC that resolved them (e.g., “when the paper states a direction but omits variance, the claim is classified as scalar-directional but not lockable”). These additions will allow an independent team to replicate the reduction from 455 to 8. revision: yes
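One common inter-rater agreement statistic that could be reported under item (ii) is Cohen's kappa for two raters. The rebuttal does not say which statistic would be used, so the sketch below is an assumption; the function name and the example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: from each rater's marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[label] * count_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels from two raters on six claims.
rater_1 = ["lockable", "lockable", "not_lockable", "lockable", "not_lockable", "not_lockable"]
rater_2 = ["lockable", "not_lockable", "not_lockable", "lockable", "not_lockable", "lockable"]
print(round(cohens_kappa(rater_1, rater_2), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, so it is more informative than a plain percentage when one label dominates.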
Circularity Check
No circularity: framework application to external corpus yields independent counts
The paper defines CLAIMSTAB-QC as a source-bounded auditing procedure and applies its classification rules (representable claims, scalar-directional records, lockable designs, matched evidence) directly to 455 claims extracted from 119 external quantum-software papers. The resulting materialization gap (175→79→53→8) is produced by those rule applications on outside data; no equations, fitted parameters, or self-citation chains reduce the reported outcomes to quantities defined inside the present work. The evaluation is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (1)
- CLAIMSTAB-QC: no independent evidence