Scalable Quantum Reinforcement Learning on NISQ Devices with Dynamic-Circuit Qubit Reuse and Grover Optimization
Pith reviewed 2026-05-18 15:25 UTC · model grok-4.3
The pith
Dynamic circuits with mid-circuit resets reduce qubit needs for multi-step quantum reinforcement learning from O(T) to O(1) while preserving trajectory fidelity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the dynamic execution model for multi-step QMDPs employs mid-circuit measurement and reset to recycle a fixed physical quantum register across sequential interactions, generating identical state-action sequences to a static unrolled QMDP while reducing the physical qubit requirement from 7xT to a constant 7 independent of the interaction horizon T, thereby transforming qubit complexity from O(T) to O(1) while maintaining trajectory fidelity.
What carries the argument
The dynamic execution model with mid-circuit measurement and reset that recycles a fixed seven-qubit register across sequential interactions in the QMDP.
Load-bearing premise
Mid-circuit measurements and resets can be performed with low enough error that the generated state-action sequences remain functionally equivalent to a static unrolled circuit without cumulative decoherence altering the sampled trajectories.
What would settle it
Execute both the dynamic and static unrolled circuits for successively larger interaction horizons T on the same NISQ device and check whether the distribution of sampled trajectory returns begins to diverge beyond the level expected from hardware noise alone.
Figures
read the original abstract
A scalable and resource-efficient quantum reinforcement learning framework is presented that eliminates the linear qubit-scaling barrier in multi-step quantum Markov decision processes (QMDPs). The proposed framework integrates a QMDP formulation, dynamic-circuit execution, and Grover-based amplitude amplification into a unified quantum-native architecture. Environment dynamics are encoded entirely within quantum Hilbert space, enabling coherent superposition over state-action sequences and a direct quantum agent-environment interface without intermediate quantum-to-classical conversion. The central contribution is a dynamic execution model for multi-step QMDPs that employs mid-circuit measurement and reset to recycle a fixed physical quantum register across sequential interactions. This approach preserves trajectory fidelity relative to a static unrolled QMDP, generating identical state-action sequences while reducing the physical qubit requirement from 7xT to a constant 7, independent of the interaction horizon T. Thus, the qubit complexity of multi-step QMDPs is transformed from O(T) to O(1) while maintaining functional equivalence at the level of trajectory generation. Trajectory returns are evaluated via quantum arithmetic, and high-return trajectories are marked and amplified using amplitude amplification to increase their sampling probability. Simulations confirm preservation of trajectory fidelity with a 66% qubit reduction compared to a static design. Experimental execution on an IBM Heron-class processor demonstrates feasibility on noisy intermediate-scale quantum hardware, establishing a scalable and resource-efficient foundation for large-scale quantum-native reinforcement learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a quantum reinforcement learning framework for multi-step QMDPs that combines a quantum-native formulation with dynamic-circuit execution using mid-circuit measurement and reset to reuse a fixed 7-qubit register across T steps, reducing qubit count from 7T to 7 (O(T) to O(1)). It incorporates Grover amplitude amplification to boost sampling of high-return trajectories and claims that the dynamic model produces identical state-action sequences and preserves trajectory fidelity relative to a static unrolled circuit, supported by simulations showing 66% qubit reduction and experimental runs on an IBM Heron processor.
Significance. If the fidelity-preservation claim holds under realistic NISQ noise, the work would remove a major resource barrier for scaling quantum RL to longer horizons, enabling practical quantum-native agents on current hardware. The architectural synthesis of dynamic circuits, QMDP encoding, and Grover optimization is a concrete step toward resource-efficient quantum algorithms.
major comments (2)
- [Abstract] Abstract: The central claim that the dynamic execution model 'preserves trajectory fidelity relative to a static unrolled QMDP' and generates 'identical state-action sequences' is presented without any quantitative fidelity metrics, error bounds, or direct distributional comparisons (e.g., total variation distance or KL divergence between trajectory returns). This absence prevents verification that mid-circuit measurement/reset errors do not cumulatively alter sampled trajectories differently from the parallel noise channels in the 7T-qubit static circuit.
- [Results] Results/Experimental section: The simulation confirmation of fidelity preservation and the IBM Heron execution are described only qualitatively ('demonstrates feasibility'). No numerical values for per-step reset/measurement error rates, accumulated fidelity after T steps, success probabilities, or baseline comparisons against the static unrolled circuit are supplied, leaving the O(1) scaling claim without load-bearing empirical support.
minor comments (1)
- [Abstract] Abstract: The reported 66% qubit reduction is specific to a particular T (presumably T=3); stating the exact horizon used in the simulations would clarify how the general O(1) claim maps to the concrete result.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments highlight opportunities to strengthen the quantitative presentation of our fidelity-preservation results. We address each point below and will revise the manuscript to incorporate the requested metrics and comparisons.
read point-by-point responses
-
Referee: [Abstract] Abstract: The central claim that the dynamic execution model 'preserves trajectory fidelity relative to a static unrolled QMDP' and generates 'identical state-action sequences' is presented without any quantitative fidelity metrics, error bounds, or direct distributional comparisons (e.g., total variation distance or KL divergence between trajectory returns). This absence prevents verification that mid-circuit measurement/reset errors do not cumulatively alter sampled trajectories differently from the parallel noise channels in the 7T-qubit static circuit.
Authors: We agree that explicit quantitative metrics improve verifiability. By construction, the dynamic-circuit model with mid-circuit measurement and reset replicates the exact unitary evolution and measurement outcomes of the static unrolled circuit on a per-step basis; therefore the generated state-action sequences and trajectory returns are identical in the absence of hardware noise. In the revised manuscript we will add to the abstract and a new results subsection the total variation distance (which equals zero under ideal simulation) together with analytic error bounds on the cumulative deviation arising from realistic per-step measurement/reset error rates. These bounds will be compared directly against the parallel noise channels present in the 7T-qubit static circuit. revision: yes
-
Referee: [Results] Results/Experimental section: The simulation confirmation of fidelity preservation and the IBM Heron execution are described only qualitatively ('demonstrates feasibility'). No numerical values for per-step reset/measurement error rates, accumulated fidelity after T steps, success probabilities, or baseline comparisons against the static unrolled circuit are supplied, leaving the O(1) scaling claim without load-bearing empirical support.
Authors: The referee is correct that the current text is primarily qualitative. We will expand the Results section with concrete numerical data: per-step reset and measurement error rates extracted from the IBM Heron calibration data, accumulated fidelity after T = 5 and T = 10 steps for both dynamic and static circuits, success probabilities of the Grover-amplified high-return trajectories, and side-by-side distributional comparisons (including total variation distance and KL divergence) between the two implementations. Additional figures will display fidelity-decay curves and qubit-count scaling, thereby supplying the load-bearing empirical support for the O(1) claim. revision: yes
Circularity Check
No circularity: qubit scaling is an architectural construction, not a reduction to fitted inputs or self-citations.
full rationale
The paper introduces a dynamic-circuit execution model that recycles a fixed register via mid-circuit measurement and reset, claiming this transforms qubit complexity from O(T) to O(1) while preserving trajectory fidelity by construction of the circuit design. No equations, fitted parameters, or self-citations are presented that reduce the central claim back to its own inputs; the equivalence to the static unrolled QMDP is asserted as a property of the proposed architecture rather than derived from prior results or data fits. The framework remains self-contained against external benchmarks as a novel hardware-efficient implementation for QMDPs.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanLogicNat recovery; no qubit-reuse or dynamic-circuit theorem unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
reducing the physical qubit requirement from 7xT to a constant 7, independent of the interaction horizon T, transforming qubit complexity from O(T) to O(1)
-
IndisputableMonolith/Foundation/AlphaDerivationExplicit.leanphi-fixed-point and 44-slot structure unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Grover’s search is applied to the superposition of these evaluated trajectories to amplify the probability of measuring those with the highest return
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Quantum Return Calculation To evaluate trajectory performance in the quantum domain, the classical concept of return, defined as the discounted sum of rewards, is encoded into a dedicated quantum register |g⟩. The return register|g⟩ occupies a Hilbert spaceG that contains a sufficient number of qubits to represent all possible return values. Initially, th...
-
[2]
Optimal Policy Search via Grover’s Algorithm In classical RL, the objective is to learn an optimal policy, a strategy that prescribes the best action for each state to max- imize return. In quantum reinforcement learning (QRL), this objective can be reformulated as a search problem, in which the ensemble of length- T quantum trajectories, generated by the...
-
[3]
1 is implemented in the quantum domain by encoding its dynamics into quantum states
Quantum encoding of the classical MDP The classical MDP described in Fig. 1 is implemented in the quantum domain by encoding its dynamics into quantum states. This quantum realization, shown in Fig. 2, preserves the structure of the classical model while exploiting superpo- sition to evaluate all state–action transitions in parallel. The implementation us...
-
[4]
Implementation of multiple interactions on quantum hardware To evaluate the practicality of the proposed dynamic- circuit-based reusable QMDP for multiple interactions, we deployed the full three-timestep circuit on the 133-qubit IBM Heron-class quantum processor (ibm torino). This device rep- resents IBM quantum’s latest generation of superconducting har...
work page 2000
-
[5]
Trade-offs between static and dynamic implementations To better understand the practical implications of adopting a dynamic-circuit approach, we outline the trade-offs between static and dynamic implementations of QMDP. These differ- ences clarify the respective strengths and limitations of each method when deployed on near-term quantum devices. Table I s...
-
[6]
Implementation and validation of optimal trajectory search on quantum hardware To further evaluate the proposed QRL framework, Grover’s search–based optimal trajectory identification was executed entirely on a quantum device (IBM’s ibm torino processor), without reliance on quantum simulators or classical subrou- tines. Each run was configured for 32K sho...
-
[7]
G. Dulac-Arnold, N. Levine, D. J. Mankowitz, J. Li, C. Padu- raru, S. Gowal, and T. Hester, Challenges of real-world re- inforcement learning: definitions, benchmarks and analysis, Mach. Learn. 110, 2419–2468 (2021)
work page 2021
- [8]
- [9]
- [10]
-
[11]
Preskill, Quantum Computing in the NISQ era and beyond, Quantum 2, 79 (2018)
J. Preskill, Quantum Computing in the NISQ era and beyond, Quantum 2, 79 (2018)
work page 2018
-
[12]
S. Wiedemann, D. Hein, S. Udluft, and C. B. Mendl, Quantum policy iteration via amplitude estimation and grover search – towards quantum advantage for reinforcement learning, Trans- actions on Machine Learning Research (2023)
work page 2023
-
[13]
T. H. Su, S. Shresthamali, and M. Kondo, Quantum framework for reinforcement learning: Integrating the markov decision process, quantum arithmetic, and trajectory search, Phys. Rev. A 111, 062421 (2025)
work page 2025
-
[14]
A. Pawar, Y . Li, Z. Mo, Y . Guo, X. Tang, Y . Zhang, and J. Yang, QRCC: Evaluating large quantum circuits on small quantum computers through integrated qubit reuse and circuit cutting, in Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Oper- ating Systems, ASPLOS ‘24 (Association for Computing M...
work page 2025
-
[15]
P. Nation, Dynamic Bernstein–Vazirani using mid-circuit re- set and measurement, https://nonhermitian.org/ posts/2021/2021-10-27-dynamic_BV.html
work page 2021
-
[16]
J. M. Pino, J. M. Dreiling, C. Figgatt, J. P. Gaebler, S. A. Moses, M. S. Allman, C. H. Baldwin, M. Foss-Feig, D. Hayes, K. Mayer, C. Ryan-Anderson, and B. Neyenhuis, Demonstra- tion of the trapped-ion quantum CCD computer architecture, Nature 592, 209 (2021)
work page 2021
-
[17]
B. Johnson, Bringing the full power of dynamic circuits to Qiskit runtime, https://www.ibm.com/quantum/ blog/quantum-dynamic-circuits
-
[18]
D. Dong, C. Chen, H. Li, and T.-J. Tarn, Quantum reinforce- ment learning, IEEE Transactions on Systems, Man, and Cy- bernetics, Part B (Cybernetics) 38, 1207 (2008)
work page 2008
- [19]
-
[20]
C.-L. Chen and D.-Y . Dong, Superposition-inspired reinforce- ment learning and quantum reinforcement learning, in Rein- forcement Learning, edited by C. Weber, M. Elshaw, and N. M. Mayer (IntechOpen, Rijeka, 2008), Chap. 4
work page 2008
-
[21]
C. L. CHEN, D. Y . DONG, and Z. H. CHEN, Quantum compu- tation for action selection using reinforcement learning, Inter- national Journal of Quantum Information 04, 1071 (2006)
work page 2006
-
[22]
D. Dong, C. Chen, J. Chu, and T.-J. Tarn, Robust quantum-inspired reinforcement learning for robot navigation, IEEE/ASME Transactions on Mechatronics 17, 86 (2012)
work page 2012
-
[23]
M. Ganger and W. Hu, Quantum multiple q-learning, Interna- tional Journal of Intelligence Science 9, 1 (2019)
work page 2019
-
[24]
B. Cho, Y . Xiao, P. Hui, and D. Dong, Quantum bandit with amplitude amplification exploration in an adversarial environ- ment, IEEE Transactions on Knowledge and Data Engineering 36, 311 (2024)
work page 2024
-
[25]
Q. Wei, H. Ma, C. Chen, and D. Dong, Deep reinforcement learning with quantum-inspired experience replay, IEEE Trans- actions on Cybernetics 52, 9326 (2022)
work page 2022
-
[26]
Y . Li, A. H. Aghvami, and D. Dong, Intelligent trajectory plan- ning in UA V-mounted wireless networks: A quantum-inspired reinforcement learning perspective, IEEE Wireless Communi- cations Letters 10, 1994 (2021)
work page 1994
-
[27]
J.-A. Li, D. Dong, Z. Wei, Y . Liu, Y . Pan, F. Nori, and X. Zhang, Quantum reinforcement learning during human decision-making, Nature Human Behaviour 4, 294 (2020)
work page 2020
-
[28]
D. Niraula, J. Jamaluddin, M. M. Matuszak, R. K. T. Haken, and I. E. Naqa, Quantum deep reinforcement learning for clini- cal decision support in oncology: application to adaptive radio- therapy, Scientific reports 11, 23545 (2021)
work page 2021
-
[29]
A. Sequeira, L. P. Santos, and L. S. Barbosa, Policy gradients using variational quantum circuits, Quantum Machine Intelli- gence 5, 18 (2023)
work page 2023
-
[30]
S. Y .-C. Chen, C.-H. H. Yang, J. Qi, P.-Y . Chen, X. Ma, and H.- S. Goan, Variational quantum circuits for deep reinforcement learning, IEEE Access 8, 141007 (2020)
work page 2020
-
[31]
O. Lockwood and M. Si, Reinforcement learning with quantum variational circuits, in Proceedings of the Sixteenth AAAI Con- ference on Artificial Intelligence and Interactive Digital Enter- tainment, AIIDE’20 (AAAI Press, USA, 2020), V ol. 16, pp. 245-251
work page 2020
-
[32]
O. Lockwood and M. Si, Playing Atari with hybrid quantum- classical reinforcement learning, in NeurIPS 2020 Workshop on Pre-registration in Machine Learning (PMLR, USA, 2021), V ol. 148, pp. 285–301
work page 2020
-
[33]
S. Wu, S. Jin, D. Wen, D. Han, and X. Wang, Quantum re- inforcement learning in continuous action space, Quantum 9, 1660 (2025)
work page 2025
- [34]
-
[35]
S. Jerbi, C. Gyurik, S. C. Marshall, H. J. Briegel, and V . Dun- jko, Parametrized quantum policies for reinforcement learning, in Proceedings of the 35th International Conference on Neural Information Processing Systems, NIPS ‘21 (Curran Associates Inc., Red Hook, NY , 2021), pp. 28362–28375
work page 2021
-
[36]
Y . Kwak, W. J. Yun, S. Jung, J.-K. Kim, and J. Kim, In- troduction to quantum reinforcement learning: Theory and pennylane-based implementation, in 2021 International Con- ference on Information and Communication Technology Con- vergence (ICTC) (IEEE, Korea, 2021), pp. 416–420
work page 2021
-
[37]
Lan, Variational quantum soft actor-critic, arXiv:2112.11921
Q. Lan, Variational quantum soft actor-critic, arXiv:2112.11921
-
[38]
D. Wang, A. Sundaram, R. Kothari, A. Kapoor, and M. Roet- teler, Quantum algorithms for reinforcement learning with a generative model, inProceedings of the 38th International Con- ference on Machine Learning, Proceedings of Machine Learn- ing Research, V ol. 139, edited by M. Meila and T. Zhang (PMLR, 2021), pp. 10916–10926
work page 2021
-
[39]
E. A. Cherrat, I. Kerenidis, and A. Prakash, Quantum reinforce- ment learning via policy iteration, Quantum Machine Intelli- gence 5, 30 (2023)
work page 2023
-
[40]
F. Hua, Y . Jin, Y . Chen, S. Vittal, K. Krsulich, L. S. Bishop, J. Lapeyre, A. Javadi-Abhari, and E. Z. Zhang, CaQR: A compiler-assisted approach for qubit reuse through dynamic circuit, in Proceedings of the 28th ACM International Confer- 17 ence on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2023 (Association for Compu...
work page 2023
-
[41]
M. DeCross, E. Chertkov, M. Kohagen, and M. Foss-Feig, Qubit-reuse compilation with mid-circuit measurement and re- set, Phys. Rev. X 13, 041057 (2023)
work page 2023
-
[42]
R. S. Sutton and A. G. Barto, Reinforcement Learning: An In- troduction (The MIT Press, Cambridge, 2018)
work page 2018
-
[43]
L. Graesser and W. Keng, Foundations of Deep Reinforcement Learning: Theory and Practice in Python (Addison-Wesley, USA, 2020)
work page 2020
-
[44]
I. Goodfellow, Y . Bengio, and A. Courville, Deep Learning (The MIT Press, Cambridge, 2016)
work page 2016
-
[45]
Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving
S. Shalev-Shwartz, S. Shammah, and A. Shashua, Safe, multi-agent, reinforcement learning for autonomous driving, arXiv:1610.03295
work page internal anchor Pith review Pith/arXiv arXiv
- [46]
-
[47]
D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y . Chen, T. P. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, Mastering the game of go without human knowledge, Nature (London) 550, 354 (2017)
work page 2017
-
[48]
N. Brown and T. Sandholm, Superhuman AI for multiplayer poker, Science 365, 885 (2019)
work page 2019
-
[49]
Plaat, Deep Reinforcement Learning (Springer Nature, Sin- gapore, 2022)
A. Plaat, Deep Reinforcement Learning (Springer Nature, Sin- gapore, 2022)
work page 2022
-
[50]
E. Rieffel and W. Polak, Quantum Computing: A Gentle Intro- duction (The MIT Press, Cambridge, 2011)
work page 2011
-
[51]
M. A. Nielsen and I. L. Chuang, Quantum Computation and Quantum Information (Cambridge University Press, UK, 2011)
work page 2011
-
[52]
P. W. Shor, Algorithms for quantum computation: discrete log- arithms and factoring, in Proceedings 35th Annual Symposium on Foundations of Computer Science (IEEE, Piscataway, NJ,
-
[53]
P. W. Shor, Polynomial-time algorithms for prime factorization and discrete logarithms on a quantum computer, SIAM Journal on Computing 26, 1484 (1997)
work page 1997
-
[54]
A. Ekert and R. Jozsa, Quantum computation and Shor’s factor- ing algorithm, Rev. Mod. Phys. 68, 733 (1996)
work page 1996
-
[55]
L. K. Grover, A fast quantum mechanical algorithm for database search, in Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing (ACM, New York, 1996), pp. 212–219
work page 1996
-
[56]
P. Das, A. Locharla, and C. Jones, LILLIPUT: a lightweight low-latency lookup-table decoder for near-term quantum error correction, in Proceedings of the 27th ACM International Con- ference on Architectural Support for Programming Languages and Operating Systems , ASPLOS ‘22 (Association for Com- puting Machinery, USA, 2022), pp. 541–553
work page 2022
-
[57]
M. R. Jokar, R. Rines, G. Pasandi, H. Cong, A. Holmes, Y . Shi, M. Pedram, and F. T. Chong, DigiQ: A scalable digital con- troller for quantum computers using SFQ logic, in 2022 IEEE International Symposium on High-Performance Computer Ar- chitecture (HPCA) (IEEE Computer Society, USA, 2022), pp. 400-414
work page 2022
-
[58]
A. Wu, G. Li, H. Zhang, G. G. Guerreschi, Y . Ding, and Y . Xie, A synthesis framework for stitching surface code with super- conducting quantum devices, in Proceedings of the 49th An- nual International Symposium on Computer Architecture, ISCA ‘22 (Association for Computing Machinery, USA, 2022), pp. 337–350
work page 2022
-
[59]
Y . Huang and M. Martonosi, QDB: From Quantum Algorithms Towards Correct Quantum Programs, in9th Workshop on Eval- uation and Usability of Programming Languages and Tools (PLATEAU 2018), Open Access Series in Informatics (OA- SIcs), V ol. 67, edited by T. Barik, J. Sunshine, and S. Chasins (Schloss Dagstuhl – Leibniz-Zentrum f¨ur Informatik, Dagstuhl, Ger...
work page 2018
-
[60]
J. Liu, G. T. Byrd, and H. Zhou, Quantum circuits for dy- namic runtime assertions in quantum computation, in Proceed- ings of the Twenty-Fifth International Conference on Archi- tectural Support for Programming Languages and Operating Systems, ASPLOS ‘20 (Association for Computing Machinery, USA, 2020), pp. 1017–1030
work page 2020
-
[61]
P. Kaye, R. Laflamme, and M. Mosca,An Introduction to Quan- tum Computing (Oxford University Press, New York, 2007)
work page 2007
-
[62]
C. Guo, Grover’s algorithm – implementations and implica- tions, Highlights in Science, Engineering and Technology 38, 1071 (2023)
work page 2023
-
[63]
M. AbuGhanem, IBM quantum computers: evolution, perfor- mance, and future directions, J Supercomput 81, 687 (2025)
work page 2025
-
[64]
K. Rudinger, G. J. Ribeill, L. C. Govia, M. Ware, E. Nielsen, K. Young, T. A. Ohki, R. Blume-Kohout, and T. Proctor, Char- acterizing midcircuit measurements on a superconducting qubit using gate set tomography, Phys. Rev. Appl.17, 014014 (2022)
work page 2022
-
[65]
L. C. G. Govia, P. Jurcevic, C. J. Wood, N. Kanazawa, S. T. Merkel, and D. C. McKay, A randomized benchmarking suite for mid-circuit measurements, New Journal of Physics 25, 123016 (2023)
work page 2023
- [66]
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.