arxiv: 2412.18208 · v3 · submitted 2024-12-24 · 🪐 quant-ph · cs.LG

Quantum framework for Reinforcement Learning: Integrating Markov decision process, quantum arithmetic, and trajectory search

Pith reviewed 2026-05-23 06:34 UTC · model grok-4.3

classification 🪐 quant-ph cs.LG
keywords quantum reinforcement learning · Markov decision process · quantum arithmetic · trajectory search · quantum superposition · fully quantum RL

The pith

A fully quantum Markov decision process model enables reinforcement learning entirely within the quantum domain.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a quantum framework for reinforcement learning that creates a fully quantum version of the Markov decision process. It implements state transitions, return calculations, and trajectory searches using quantum arithmetic and a quantum search algorithm. The central aim is to perform all agent-environment interactions inside the quantum domain with no classical computations required. A sympathetic reader would care because this setup leverages superposition to potentially improve efficiency in RL decision-making tasks.

Core claim

The paper establishes that by modeling the MDP with quantum states and using quantum principles for transitions, rewards, and searching trajectories, the entire RL process can be realized through quantum phenomena without classical intervention, demonstrating quantum enhancement via superposition.

What carries the argument

Quantum model of the Markov decision process that uses quantum arithmetic for state transitions and return calculations together with quantum search for trajectory optimization.
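
The quantum search step can be illustrated with a small classical simulation. The sketch below is not the paper's circuit: it simulates Grover-style amplitude amplification on a plain Python list of real amplitudes. The function name `grover_search`, the marked index, and the 2-qubit register size (chosen to echo the 2-qubit circuit in Figure 2) are assumptions made for illustration only.

```python
import math

def grover_search(n_qubits, marked, iterations=None):
    """Simulate Grover's algorithm and return measurement probabilities.

    `marked` is a set of basis-state indices the oracle recognises
    (hypothetical here; the paper's oracle is not reproduced).
    """
    n = 2 ** n_qubits
    marked = set(marked)
    if iterations is None:
        # Near-optimal iteration count: floor(pi/4 * sqrt(N / M)), at least 1.
        iterations = max(1, math.floor(math.pi / 4 * math.sqrt(n / len(marked))))
    # Start in the uniform superposition over all basis states.
    amps = [1 / math.sqrt(n)] * n
    for _ in range(iterations):
        # Oracle: flip the sign of every marked amplitude.
        amps = [-a if i in marked else a for i, a in enumerate(amps)]
        # Diffusion: reflect every amplitude about the mean.
        mean = sum(amps) / n
        amps = [2 * mean - a for a in amps]
    return [a * a for a in amps]

probs = grover_search(2, marked={3})
best = max(range(len(probs)), key=probs.__getitem__)
print(best, probs)
```

With four basis states and one marked state, a single iteration concentrates the probability on the marked index. Larger registers need roughly (pi/4) * sqrt(N/M) iterations, and each iteration assumes the oracle is already available as a unitary.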

Load-bearing premise

A quantum model of the MDP can be realized with quantum arithmetic and search such that the full RL loop runs without any classical post-processing or measurement that would collapse the claimed advantage.

What would settle it

An implementation of a simple RL task on quantum hardware that requires intermediate measurements to extract actions or rewards, showing that the loop cannot complete without classical steps.

Figures

Figures reproduced from arXiv: 2412.18208 by Masaaki Kondo, Shaswot Shresthamali, Thet Htar Su.

Figure 1. The agent-environment interaction in a Markov decision pro…
Figure 2. Quantum circuit for Grover’s algorithm on 2 qubits, search…
Figure 3. Graphical representation of a classical MDP with four states…
Figure 4. Quantum circuit of the quantum Markov decision process (QMDP) simulating a single interaction between the agent and the environ…
Figure 5. State transition heat-map representing the probabilities of…
Figure 6. Quantum sample distribution of the QMDP circuit, display…
Figure 7. Quantum circuit implementation of agent-environment interactions across 3 time steps (t = 0, 1, 2). Each colored block represents a…
Figure 8. Quantum circuit for return calculation in the QMDP. The process simulates the overall outcome of the agent-environment interactions…
Figure 9. Distribution of quantum trajectories in the QMDP for 3 time steps. The x-axis shows trajectory numbers (see Table…
Figure 10. Distribution of quantum trajectories after executing Grover’s algorithm to search the trajectories starting at…
Figure 11. Comparison of total rewards for 4 unique trajectories from…
Figure 12. Optimal trajectory plot for an agent transitioning from…
Figure 13. Distribution of quantum trajectories after executing Grover’s algorithm to search the trajectories starting from any state and termi…
Figure 14. Comparison of total reward for each unique trajectory…
Figure 15. Optimal trajectories from 3 different starting states (1, 2,…

Original abstract

This paper introduces a quantum framework for addressing reinforcement learning (RL) tasks, grounded in the quantum principles and leveraging a fully quantum model of the classical Markov decision process (MDP). By employing quantum concepts and a quantum search algorithm, this work presents the implementation and optimization of the agent-environment interactions entirely within the quantum domain, eliminating reliance on classical computations. Key contributions include the quantum-based state transitions, return calculation, and trajectory search mechanism that utilize quantum principles to demonstrate the realization of RL processes through quantum phenomena. The implementation emphasizes the fundamental role of quantum superposition in enhancing computational efficiency for RL tasks. Results demonstrate the capacity of a quantum model to achieve quantum enhancement in RL, highlighting the potential of fully quantum implementations in decision-making tasks. This work not only underscores the applicability of quantum computing in machine learning but also contributes to the field of quantum reinforcement learning (QRL) by offering a robust framework for understanding and exploiting quantum computing in RL systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes a quantum framework for reinforcement learning that models the classical Markov decision process (MDP) using quantum principles. It claims to implement agent-environment interactions, state transitions, return calculations, and trajectory search entirely within the quantum domain via quantum arithmetic and a quantum search algorithm, thereby eliminating classical computations and achieving quantum enhancement through superposition.

Significance. If the central claim of a self-contained quantum MDP realization (with unitary encodings of transitions and rewards, coherent trajectory search, and policy extraction without measurement-induced collapse or classical post-processing) holds, the work would constitute a notable advance in quantum reinforcement learning by providing an end-to-end quantum RL loop. The emphasis on superposition for efficiency and the avoidance of hybrid classical-quantum interfaces would be a distinguishing contribution if demonstrated.

major comments (2)
  1. [Abstract] The assertion that 'the implementation and optimization of the agent-environment interactions [occur] entirely within the quantum domain, eliminating reliance on classical computations' is load-bearing for the central claim but is unsupported; no unitary construction, circuit, or encoding procedure is supplied showing how an arbitrary classical transition kernel P(s'|s,a) and reward function R(s,a) are embedded into quantum arithmetic operations without classical pre-processing to define the oracle or unitary.
  2. [Abstract] The statement that 'results demonstrate the capacity of a quantum model to achieve quantum enhancement' lacks any supporting data, benchmark comparisons, circuit diagrams, or verification steps, rendering the enhancement claim unverifiable and preventing assessment of whether the trajectory search (presumably amplitude amplification) preserves coherence across the full RL loop.
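
To make the first comment concrete, the sketch below shows one standard way a classical kernel P(s'|s,a) could be turned into state-preparation data: each row is mapped to amplitudes sqrt(P(s'|s,a)), so that measuring the next-state register reproduces the row by the Born rule. This is an assumption about a plausible encoding, not the authors' construction. The toy kernel `P` and the helpers `transition_amplitudes` and `ry_angle` are hypothetical.

```python
import math

# Hypothetical 2-state, 2-action kernel: P[(s, a)] lists P(s'|s,a) for s' = 0, 1.
P = {
    (0, 0): [0.9, 0.1],
    (0, 1): [0.2, 0.8],
    (1, 0): [0.5, 0.5],
    (1, 1): [0.0, 1.0],
}

def transition_amplitudes(kernel):
    """Map each row P(.|s,a) to amplitudes sqrt(P(s'|s,a))."""
    amps = {}
    for key, row in kernel.items():
        if abs(sum(row) - 1.0) > 1e-9:
            raise ValueError(f"row {key} is not a probability distribution")
        amps[key] = [math.sqrt(p) for p in row]
    return amps

def ry_angle(row):
    """Angle theta with RY(theta)|0> = sqrt(p0)|0> + sqrt(p1)|1> for a two-outcome row."""
    return 2 * math.atan2(math.sqrt(row[1]), math.sqrt(row[0]))

amps = transition_amplitudes(P)
angles = {key: ry_angle(row) for key, row in P.items()}

for key in P:
    # Squared amplitudes give back the classical transition probabilities.
    recovered = [round(a * a, 6) for a in amps[key]]
    print(key, recovered, round(angles[key], 4))
```

In this sketch the amplitudes and rotation angles are computed classically from P before any circuit is run, which is the kind of pre-processing the comment asks the authors to account for.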

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and for identifying key points where the manuscript's claims require stronger substantiation. We agree that the abstract assertions about a fully quantum implementation and demonstrated enhancement need explicit support. We will revise the manuscript to address these gaps by adding the requested constructions, circuits, and verification details.

Point-by-point responses
  1. Referee: [Abstract] The assertion that 'the implementation and optimization of the agent-environment interactions [occur] entirely within the quantum domain, eliminating reliance on classical computations' is load-bearing for the central claim but is unsupported; no unitary construction, circuit, or encoding procedure is supplied showing how an arbitrary classical transition kernel P(s'|s,a) and reward function R(s,a) are embedded into quantum arithmetic operations without classical pre-processing to define the oracle or unitary.

    Authors: We acknowledge that the current manuscript describes the quantum MDP at a conceptual level using quantum arithmetic for transitions and rewards but does not supply explicit unitary operators or circuits for arbitrary classical kernels P(s'|s,a) and R(s,a). This is a valid observation. In the revision we will add a dedicated section with the encoding procedure, including the unitary construction that embeds the transition kernel via quantum arithmetic without classical pre-processing of the oracle, and circuit diagrams showing how the agent-environment interaction remains coherent. revision: yes

  2. Referee: [Abstract] The statement that 'results demonstrate the capacity of a quantum model to achieve quantum enhancement' lacks any supporting data, benchmark comparisons, circuit diagrams, or verification steps, rendering the enhancement claim unverifiable and preventing assessment of whether the trajectory search (presumably amplitude amplification) preserves coherence across the full RL loop.

    Authors: The manuscript argues for quantum enhancement via superposition in the trajectory search step but indeed provides no numerical benchmarks, classical comparisons, or explicit circuit simulations to verify coherence preservation through the full loop. We agree this renders the claim unverifiable in its present form. The revision will incorporate simulation results on small MDPs, runtime comparisons against classical RL, circuit diagrams for the amplitude amplification step, and analysis confirming that measurements do not collapse the superposition before policy extraction. revision: yes
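
For reference, the kind of classical baseline such a runtime comparison might use can be sketched as a brute-force trajectory search. Everything below is hypothetical: a deterministic MDP with 4 states, 2 actions, and 3 time steps (sizes chosen to echo Figures 3 and 7), invented dynamics and rewards, and an undiscounted return. It is not the MDP studied in the paper.

```python
import itertools
import math

# Hypothetical toy problem sizes, not taken from the paper.
N_STATES, N_ACTIONS, HORIZON = 4, 2, 3

def step(state, action):
    """Toy dynamics: action 0 stays put, action 1 moves to the next state (cyclic)."""
    next_state = (state + action) % N_STATES
    reward = 10 if next_state == 3 else -1  # invented reward: state 3 is the goal
    return next_state, reward

def best_trajectory(start):
    """Enumerate every action sequence and keep the one with the highest return."""
    best, evaluated = None, 0
    for actions in itertools.product(range(N_ACTIONS), repeat=HORIZON):
        state, total, path = start, 0, [start]
        for action in actions:
            state, reward = step(state, action)
            total += reward  # undiscounted return (an assumption)
            path.append(state)
        evaluated += 1
        if best is None or total > best[0]:
            best = (total, actions, path)
    return best, evaluated

(best_return, best_actions, best_path), evaluated = best_trajectory(start=0)
# Idealised Grover query count for a single marked item among `evaluated` candidates.
grover_queries = math.ceil(math.pi / 4 * math.sqrt(evaluated))
print(best_return, best_actions, best_path, evaluated, grover_queries)
```

The classical loop evaluates all |A|^T action sequences, while the idealised Grover count scales with the square root of that number. The comparison ignores the cost of constructing the oracle and of identifying which return value to mark, so it is an upper bound on the possible advantage rather than a measured one.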

Circularity Check

0 steps flagged

No significant circularity detected; derivation remains self-contained.

The abstract and available description introduce a quantum MDP model via quantum arithmetic and trajectory search but supply no equations, parameter fits, or self-citations that reduce any claimed prediction or result to its own inputs by construction. No load-bearing step matches the enumerated patterns (self-definitional, fitted-input-called-prediction, etc.). The central claim of a fully quantum RL loop is presented at a high level without demonstrated reduction to classical pre-processing or renamed empirical patterns. This is the expected honest non-finding when the manuscript does not exhibit the specific reductions required for a positive circularity flag.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated in the provided text.

pith-pipeline@v0.9.0 · 5701 in / 1048 out tokens · 18026 ms · 2026-05-23T06:34:32.824767+00:00 · methodology
