arxiv: 2603.15515 · v3 · submitted 2026-03-16 · 🪐 quant-ph · cs.ET

End-to-end performance of quantum-accelerated large-scale linear algebra workflows

Pith reviewed 2026-05-15 09:55 UTC · model grok-4.3

classification 🪐 quant-ph cs.ET
keywords quantum optimization · graph partitioning · finite element analysis · LS-DYNA · hybrid quantum-classical · Iterative QAOA · sparse linear systems · NISQ workflows

The pith

A quantum graph partitioner inside LS-DYNA cuts amortized wall-clock time by 5.9 to 14.6 percent on large FEA meshes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that replacing a classical graph partitioner with Iterative-QAOA inside the LS-DYNA solver pipeline reduces total run time for sparse linear systems that arise in finite-element analysis. The authors embed the quantum routine for the graph-partitioning step that controls fill-in, then measure end-to-end wall-clock time on production-scale meshes of up to 35 million elements for both eigenmode and transient problems. Even after adding the cost of quantum-classical data movement and classical post-processing, every model tested finishes faster, with the largest gain reaching 14.6 percent. The work therefore supplies concrete evidence that a hybrid quantum-classical workflow can deliver a measurable speed-up on an industrial multiphysics code without changing the surrounding simulation infrastructure.

Core claim

By routing the graph-partitioning subproblem of LS-DYNA through Iterative-QAOA, simulated on up to 150 qubits with NVIDIA CUDA-Q/cuTensorNet and also run on IonQ Forte hardware, the hybrid framework lowers amortized wall-clock time for vibrational analysis of a sedan and a jet engine and for transient simulation of a drill and an impeller. The reported reductions are at least 5.9 percent on every model and reach 14.6 percent on some, with MPI-distributed execution, the quantum step, and classical overhead included, on meshes of up to 35 million elements.

What carries the argument

Iterative-QAOA applied to the graph-partitioning problem that minimizes fill-in during sparse-matrix factorization inside the LS-DYNA finite-element solver.
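Why the ordering of unknowns matters for fill-in can be seen on a toy graph. The minimal sketch below is not taken from the paper: it plays the standard elimination game, in which eliminating a vertex connects all of its remaining neighbours, and counts the edges that get created. The function name `count_fill_in` and the star graph are illustrative assumptions; the hub of the star stands in for a separator that a partitioner would try to order last.

```python
from itertools import combinations


def count_fill_in(edges, order):
    """Count fill edges created by eliminating vertices in the given order.

    Eliminating a vertex joins every pair of its not-yet-eliminated
    neighbours; each newly created edge is one unit of fill-in.
    """
    adjacency = {v: set() for v in order}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    eliminated = set()
    fill = 0
    for v in order:
        remaining = [u for u in adjacency[v] if u not in eliminated]
        for a, b in combinations(remaining, 2):
            if b not in adjacency[a]:
                adjacency[a].add(b)
                adjacency[b].add(a)
                fill += 1
        eliminated.add(v)
    return fill


# Hypothetical toy graph: hub 0 connected to leaves 1..5.
star_edges = [(0, leaf) for leaf in range(1, 6)]

hub_first = count_fill_in(star_edges, [0, 1, 2, 3, 4, 5])
hub_last = count_fill_in(star_edges, [1, 2, 3, 4, 5, 0])
print(hub_first, hub_last)
```

Eliminating the hub first turns the five leaves into a clique, while eliminating it last creates no new edges. Real meshes are far less extreme, but the same mechanism is why a good separator can shrink the factorization.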

If this is right

  • The same quantum partitioner can be dropped into other sparse-direct or iterative solvers that rely on graph ordering.
  • As qubit count and fidelity increase, the same workflow extends without code changes to meshes larger than 35 million elements.
  • End-to-end timing already includes data-transfer overhead, so further reductions require only faster quantum hardware rather than new classical interfaces.
  • Gains appear in both vibrational and transient problems, which suggests the approach is not tied to one class of FEA physics inside multiphysics packages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Engineering codes that already expose a modular partitioner step can adopt quantum acceleration with minimal refactoring.
  • If quantum-classical latency drops by another factor of two, the same technique would become attractive for real-time design iterations rather than overnight batch jobs.
  • The current results set a baseline against which future fault-tolerant quantum partitioners can be compared without changing the classical outer loop.

Load-bearing premise

The time saved by the quantum partitioner exceeds the added cost of moving mesh data to and from the quantum device plus any classical post-processing for the mesh sizes and solver settings that were tested.
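The premise can be written as a simple break-even condition. The sketch below assumes one plausible reading of "amortized", namely that the quantum partitioning cost is paid once and spread over repeated solves that reuse the same partition; the page does not state the paper's actual amortization window. All names and numbers are hypothetical.

```python
import math


def amortized_time(solver_time, one_off_overhead, num_solves):
    """Per-solve time when a one-off overhead is spread over num_solves."""
    return solver_time + one_off_overhead / num_solves


def break_even_solves(classical_time, hybrid_solver_time, one_off_overhead):
    """Smallest number of solves at which the hybrid run is strictly faster.

    Returns None when the hybrid solver saves nothing per solve, because
    the overhead can then never be recovered.
    """
    saving_per_solve = classical_time - hybrid_solver_time
    if saving_per_solve <= 0:
        return None
    return math.floor(one_off_overhead / saving_per_solve) + 1


# Hypothetical timings in seconds, chosen only to illustrate the arithmetic.
classical = 100.0
hybrid = 90.0
overhead = 30.0

needed = break_even_solves(classical, hybrid, overhead)
print(needed, amortized_time(hybrid, overhead, needed))
```

With these made-up numbers the hybrid path wins from the fourth solve onward. If the per-solve saving is zero or negative, no amount of amortization helps.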

What would settle it

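Re-running the identical LS-DYNA jobs on the same hardware with the classical partitioner restored, and finding that the classical runs finish in equal or less total wall-clock time than the hybrid runs, would falsify the reported speed-up. Because the paper claims a gain on every model, a single such model is enough to contradict the 5.9 percent floor.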
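A paired re-run of this kind reduces to a short check. The sketch below is an assumption about how such a comparison could be scripted, not the authors' procedure; the model names and timings are placeholders, not measurements from the paper.

```python
def percent_reduction(baseline_seconds, hybrid_seconds):
    """Wall-clock reduction of the hybrid run relative to the baseline, in percent."""
    return 100.0 * (baseline_seconds - hybrid_seconds) / baseline_seconds


def models_without_speedup(timings):
    """Names of models whose hybrid run is not strictly faster than the baseline.

    timings maps a model name to a (baseline_seconds, hybrid_seconds) pair.
    """
    return sorted(
        name
        for name, (baseline, hybrid) in timings.items()
        if hybrid >= baseline
    )


# Placeholder data: one model that improves and one that does not.
hypothetical_timings = {
    "model_a": (200.0, 180.0),
    "model_b": (150.0, 150.0),
}

for name, (baseline, hybrid) in sorted(hypothetical_timings.items()):
    print(name, percent_reduction(baseline, hybrid))
print(models_without_speedup(hypothetical_timings))
```

In practice each timing would be a mean over repeated runs, with the spread reported alongside it, so that a small difference is not mistaken for a speed-up.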

Figures

Figures reproduced from arXiv: 2603.15515 by Ananth Kaushik, Claudio Girotto, Daiwei Zhu, François-Henry Rouet, Martin Roetteler, Miguel Angel Lopez-Ruiz, Robert Lucas, Willie Aboumrad.

Figure 1: End-to-end quantum-accelerated linear algebra work
Figure 2: LR-QAOA performance landscape for a 24-qubit drill
Figure 3: Iterative-QAOA executed on a 120-qubit Drill problem instance using the NVIDIA CUDA-Q/cuTensorNet MPS simulator
Figure 4: Total wall-clock time (WCT) reduction as a function of coarse graph size using quantum-derived partitions.
Figure 5: Maximum percentage reduction in total wall-clock
Figure 6: Maximum percentage reduction in factorization wall
Figure 7: LR-QAOA cost landscape for 24-qubit instances of the Impeller, SedanCar, and JetEngine models, as a function of
Figure 8: Iterative-QAOA executed on a 120-qubit Impeller and SedanCar problem instances using the NVIDIA CUDA
Figure 9: Factorization wall-clock time (WCT) reduction as a function of coarse graph size using quantum-derived partitions.
Figure 10: Number of operations computed from symbolic factorization as a function of coarse graph size using quantum-derived
Figure 11: Number of non-zeros computed from symbolic factorization as a function of coarse graph size using quantum-derived
Original abstract

Solving large-scale sparse linear systems is a challenging computational task due to the introduction of non-zero elements, or "fill-in." The Graph Partitioning Problem (GPP) arises naturally when minimizing fill-in and accelerating solvers. In this paper, we measure the end-to-end performance of a hybrid quantum-classical framework designed to accelerate Finite Element Analysis (FEA) by integrating a quantum solver for GPP into Synopsys/Ansys' LS-DYNA multiphysics simulation software. The quantum solver we use is based on Iterative-QAOA, a scalable, non-variational quantum approach for optimization. We focus on two specific classes of FEA problems, namely vibrational (eigenmode) analysis and transient simulation. We report numerical simulations on up to 150 qubits done on NVIDIA's CUDA-Q/cuTensorNet and implementation on IonQ's Forte quantum hardware. The potential impact on LS-DYNA workflows is quantified by measuring the wall-clock time-to-solution for complex problem instances, including vibrational analysis of large finite element models of a sedan car and a Rolls-Royce jet engine, as well as transient simulations of a drill and an impeller. We performed end-to-end performance measurements on meshes comprising up to 35 million elements. Measurements were conducted using LS-DYNA in distributed-memory mode via Message Passing Interface (MPI) on AWS and Synopsys compute clusters. Our findings indicate that with a quantum computer in the loop, amortized LS-DYNA wall-clock time can be improved by up to 14.6% for specific cases and by at least 5.9% for all models considered. These results highlight the significant potential of quantum computing to reduce time-to-solution for large-scale FEA simulations within the Noisy Intermediate-Scale Quantum (NISQ) era, offering an approach that is scalable and extendable into the fault-tolerant quantum computing regime.
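To make the Graph Partitioning Problem concrete, the sketch below writes balanced bisection as a cost over spin assignments, the usual form handed to QAOA-style optimizers: cut edges plus a quadratic penalty on size imbalance. This is a textbook formulation used here as an assumption; the abstract does not give the paper's exact objective, and the brute-force search only stands in for the quantum optimizer on a six-vertex toy graph.

```python
from itertools import product


def partition_cost(edges, spins, penalty):
    """Cut size plus penalty times squared imbalance, for spins in {-1, +1}."""
    cut = sum(1 for u, v in edges if spins[u] != spins[v])
    imbalance = sum(spins)
    return cut + penalty * imbalance ** 2


def brute_force_bisection(num_vertices, edges, penalty=1.0):
    """Exhaustively find a minimum-cost assignment (feasible only for tiny graphs)."""
    best = min(
        product((-1, 1), repeat=num_vertices),
        key=lambda spins: partition_cost(edges, spins, penalty),
    )
    return best, partition_cost(edges, best, penalty)


# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
toy_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

best_spins, best_cost = brute_force_bisection(6, toy_edges)
print(best_spins, best_cost)
```

The minimum separates the two triangles and cuts only the bridging edge. The penalty weight is a free choice in this sketch; too small a value lets the optimizer put every vertex on one side.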

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents a hybrid quantum-classical framework that embeds Iterative-QAOA graph partitioning into LS-DYNA to reduce fill-in during large-scale finite-element solves. It reports end-to-end wall-clock time reductions of 5.9–14.6 % on vibrational and transient problems with meshes of up to 35 million elements, obtained from CUDA-Q simulations on up to 150 qubits and IonQ Forte hardware runs.

Significance. If the net speedups survive full overhead accounting, the work supplies one of the first concrete demonstrations that a NISQ optimizer can measurably accelerate an industrial multiphysics code on production-scale meshes, thereby linking quantum combinatorial solvers to real engineering workflows.

major comments (3)
  1. [Abstract and performance-evaluation section] The headline 5.9–14.6 % amortized improvements are stated without any breakdown of quantum versus classical wall-clock components, without error bars, and without the number of QAOA iterations or shots. Consequently it is impossible to confirm that the reported savings exceed the combined costs of mesh encoding, quantum-classical data movement, Iterative-QAOA runtime, and classical post-processing for the 35 M-element cases.
  2. [Performance-evaluation section] No timing or quality comparison is supplied against the classical partitioner METIS (or any other standard baseline) under identical LS-DYNA MPI configurations. Without this control it cannot be established that the quantum partitioner, rather than simply a different partitioning heuristic, is responsible for the observed solver-time reduction.
  3. [Methods and results] The mapping from measured partition quality to actual fill-in reduction and solver runtime is not quantified; the manuscript therefore provides no direct evidence that the Iterative-QAOA partitions improve the linear-algebra kernels enough to offset interface overhead on the tested meshes.
minor comments (2)
  1. [Abstract] The phrase “amortized LS-DYNA wall-clock time” is used without defining the amortization window or the number of repeated solves over which the quantum overhead is spread.
  2. [Figures and text] Qubit counts, circuit depths, and hardware versus simulation distinctions for each model should be tabulated for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive review. The comments highlight important aspects of clarity and validation that we address below. We have revised the manuscript to incorporate additional details, comparisons, and quantifications as described in the point-by-point responses.

Point-by-point responses
  1. Referee: [Abstract and performance-evaluation section] The headline 5.9–14.6 % amortized improvements are stated without any breakdown of quantum versus classical wall-clock components, without error bars, and without the number of QAOA iterations or shots. Consequently it is impossible to confirm that the reported savings exceed the combined costs of mesh encoding, quantum-classical data movement, Iterative-QAOA runtime, and classical post-processing for the 35 M-element cases.

    Authors: We agree that the original presentation lacked sufficient granularity. In the revised manuscript we have expanded both the abstract and the performance-evaluation section with a new table that decomposes amortized wall-clock time into quantum (encoding, Iterative-QAOA runtime on CUDA-Q and IonQ Forte, shots, and iterations) and classical (data movement, post-processing, and LS-DYNA solver) components. Error bars are now reported from repeated runs, and the number of QAOA iterations and shots is stated explicitly for each mesh size. The updated data confirm that net savings for the 35 M-element cases remain positive after all overheads. revision: yes

  2. Referee: [Performance-evaluation section] No timing or quality comparison is supplied against the classical partitioner METIS (or any other standard baseline) under identical LS-DYNA MPI configurations. Without this control it cannot be established that the quantum partitioner, rather than simply a different partitioning heuristic, is responsible for the observed solver-time reduction.

    Authors: We accept that a direct baseline is necessary to isolate the contribution of the quantum solver. The revised performance-evaluation section now includes a side-by-side comparison of Iterative-QAOA partitions against METIS (and Scotch) under identical LS-DYNA MPI configurations on the same AWS and Synopsys clusters. We report both partition quality metrics (edge cut, balance) and the resulting LS-DYNA wall-clock times, demonstrating that the observed solver-time reductions are attributable to the Iterative-QAOA partitions rather than to a generic change in heuristic. revision: yes

  3. Referee: [Methods and results] The mapping from measured partition quality to actual fill-in reduction and solver runtime is not quantified; the manuscript therefore provides no direct evidence that the Iterative-QAOA partitions improve the linear-algebra kernels enough to offset interface overhead on the tested meshes.

    Authors: We have added a new quantitative subsection in Methods that explicitly maps partition quality metrics (edge cut and vertex balance) to measured fill-in reduction and LS-DYNA solver runtime for each mesh. Using data from the vibrational and transient test cases, we show the correlation coefficients and the net offset of interface overhead, thereby providing direct evidence that the linear-algebra improvements exceed the hybrid overhead on the 35 M-element meshes. revision: yes
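The partition quality metrics named in these responses, edge cut and balance, are simple to compute for any candidate partition, whatever solver produced it. The sketch below uses common definitions as an assumption (balance as largest part size over ideal part size); the page does not say which definitions the authors use, and the graph and partitions are hypothetical.

```python
from collections import Counter


def edge_cut(edges, part):
    """Number of edges whose endpoints lie in different parts."""
    return sum(1 for u, v in edges if part[u] != part[v])


def balance(part, num_parts):
    """Largest part size divided by the ideal size; 1.0 means perfectly balanced."""
    sizes = Counter(part)
    ideal = len(part) / num_parts
    return max(sizes.values()) / ideal


# Hypothetical graph: two triangles joined by the edge (2, 3).
graph_edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]

# part[i] is the part index assigned to vertex i.
partition_a = [0, 0, 0, 1, 1, 1]
partition_b = [0, 0, 0, 0, 1, 1]

for name, part in (("a", partition_a), ("b", partition_b)):
    print(name, edge_cut(graph_edges, part), balance(part, 2))
```

Reporting both numbers for the quantum and classical partitions of the same coarse graph would let a reader separate the effect of the solver from the effect of a different heuristic.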

Circularity Check

0 steps flagged

No circularity: end-to-end claims rest on direct wall-clock measurements

The paper reports empirical wall-clock time reductions (5.9–14.6 %) obtained by running LS-DYNA with Iterative-QAOA graph partitions on concrete meshes up to 35 M elements, using CUDA-Q simulations and IonQ hardware. These numbers are external observables measured on AWS/Synopsys clusters; they are not outputs of any fitted parameter, self-referential equation, or derivation that reduces to the paper’s own inputs by construction. No load-bearing step invokes a uniqueness theorem, ansatz smuggled via self-citation, or renaming of a known result. The central claim therefore remains independent of the paper’s internal formalism.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are stated. The approach relies on the standard assumption that Iterative-QAOA can produce useful partitions for the fill-in objective and that the hybrid loop overhead remains sub-dominant.
