pith. machine review for the scientific record.

arxiv: 2604.28009 · v1 · submitted 2026-04-30 · 🪐 quant-ph

Recognition: unknown

Learning quantum disentanglement scheduling from reduced states via modular hybrid policies

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 06:28 UTC · model grok-4.3

classification 🪐 quant-ph
keywords quantum control · hybrid quantum-classical policies · disentanglement scheduling · reduced density matrices · parameterized quantum circuits · partial observations · reinforcement learning for quantum

The pith

A modular hybrid policy learns multiqubit disentanglement scheduling from two-qubit reduced states alone, with classical preprocessing as the dominant factor.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies quantum control under restricted information by framing the task as scheduling which qubit pair to disentangle at each step, given only the two-qubit reduced density matrices. It introduces a modular hybrid policy that routes the reduced states through classical preprocessing, a parameterized quantum circuit acting as a nonlinear latent block, and classical postprocessing to output pair-selection probabilities. Benchmarks on 4-, 5-, and 6-qubit instances show that the choice and quality of preprocessing largely determine success, while the quantum module supplies a compact representation whose added value depends on the input features and available model budget. The work also maps a performance-efficiency trade-off across policy families and finds that widening the quantum circuit generally yields more benefit than deepening it.
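
For concreteness, the sketch below wires up the three-module pipeline just described with PyTorch and PennyLane: a classical network over flattened two-qubit RDM features, a parameterized quantum circuit as the latent block, and a classical head that outputs pair-selection probabilities. The encoding, ansatz, layer widths, and input dimension are illustrative assumptions, not the authors' reported configuration.

```python
# Minimal hybrid policy sketch (assumed architecture, not the paper's exact one).
import torch
import torch.nn as nn
import pennylane as qml

N_QUBITS, N_LAYERS = 4, 2
N_PAIRS = 6  # C(4, 2) candidate qubit pairs for a 4-qubit system

dev = qml.device("default.qubit", wires=N_QUBITS)

@qml.qnode(dev, interface="torch")
def pqc(inputs, weights):
    # Angle-encode the classically preprocessed features, then apply a layered
    # entangling ansatz; Pauli-Z expectations form the compact latent vector.
    qml.AngleEmbedding(inputs, wires=range(N_QUBITS))
    qml.StronglyEntanglingLayers(weights, wires=range(N_QUBITS))
    return [qml.expval(qml.PauliZ(w)) for w in range(N_QUBITS)]

class HybridPolicy(nn.Module):
    def __init__(self, in_dim: int):
        super().__init__()
        # Classical preprocessing of the flattened reduced-state features.
        self.pre = nn.Sequential(nn.Linear(in_dim, 32), nn.ReLU(),
                                 nn.Linear(32, N_QUBITS))
        # Parameterized quantum circuit as the nonlinear latent block.
        self.quantum = qml.qnn.TorchLayer(pqc, {"weights": (N_LAYERS, N_QUBITS, 3)})
        # Classical postprocessing into pair-selection probabilities.
        self.post = nn.Linear(N_QUBITS, N_PAIRS)

    def forward(self, rdm_features: torch.Tensor) -> torch.Tensor:
        angles = torch.tanh(self.pre(rdm_features))   # bounded inputs for angle encoding
        latent = self.quantum(angles).float()
        return torch.softmax(self.post(latent), dim=-1)

# Toy input: real and imaginary parts of all six two-qubit RDMs (4x4 each)
# for a 4-qubit system, flattened: 6 * 16 * 2 = 192 features.
policy = HybridPolicy(in_dim=192)
probs = policy(torch.randn(1, 192))   # one probability per candidate pair
```

The modular split is what makes the paper's comparison possible: the `pre` network can be swapped among the preprocessing families benchmarked in the figures (MLP, CNN, recurrent, Transformer) while the quantum latent block and postprocessing head stay fixed.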

Core claim

We introduce a modular hybrid quantum-classical policy framework consisting of classical preprocessing, a parameterized quantum circuit as a compact nonlinear latent block, and classical postprocessing for pair-selection probabilities. Benchmarking 4-, 5-, and 6-qubit tasks shows that preprocessing is the dominant factor governing performance under reduced-state observations, while the quantum module provides a conditional compact representation whose utility depends on the input features and model budget. Increasing circuit width is generally more useful than increasing depth.
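
As a back-of-envelope illustration of the width-versus-depth axis, the snippet below counts trainable circuit parameters for a layered ansatz with three rotation angles per qubit per layer; this counting rule is an assumption, since the paper's exact ansatz is not reproduced here. At a fixed parameter budget, widening buys a larger latent readout dimension, while deepening buys more entangling layers on fewer qubits.

```python
# Parameter accounting for a layered ansatz (assumed 3 angles per qubit per layer).
def pqc_params(n_qubits: int, n_layers: int, per_qubit: int = 3) -> int:
    return n_layers * n_qubits * per_qubit

configs = [
    ("wide-shallow", pqc_params(n_qubits=8, n_layers=2)),  # 48 params, 8-dim latent
    ("narrow-deep",  pqc_params(n_qubits=4, n_layers=4)),  # 48 params, 4-dim latent
]
for name, n_params in configs:
    print(f"{name:13s} -> {n_params} trainable PQC parameters")
```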

What carries the argument

The modular hybrid quantum-classical policy, which processes two-qubit reduced density matrices through classical preprocessing into a parameterized quantum circuit latent block and then classical postprocessing to produce action probabilities.

If this is right

  • Classical preprocessing steps must be prioritized when designing controllers that operate on partial quantum observations.
  • Parameterized quantum circuits can function as efficient latent feature extractors, but only when input features and model budget make their nonlinearity advantageous.
  • For hybrid policies of this type, circuit width should be favored over depth when balancing performance against resource cost.
  • Hybrid policies of this modular form provide a concrete route to quantum control on devices where full wave-function readout remains unavailable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dominance of preprocessing suggests that gains on larger systems may come more from improved classical feature engineering than from allocating additional qubits to the quantum module.
  • The conditional utility of the quantum block implies that adaptive policies could dynamically toggle the quantum component on or off depending on observed input statistics.
  • If reduced-state inputs remain informative at scale, the same modular architecture could be tested on hardware for tasks such as variational state preparation where full tomography is prohibitive.

Load-bearing premise

The simulated 4- to 6-qubit disentanglement tasks and the chosen reduced-state inputs are representative of real-world quantum control problems, and observed performance differences arise from policy architecture rather than unstated training details or task-specific biases.

What would settle it

A follow-up experiment in which a purely classical policy without the quantum module matches or exceeds hybrid performance on the same reduced-state tasks, or in which increasing circuit depth produces larger gains than increasing width.
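
One way to make that head-to-head fair is to match trainable-parameter budgets between the PQC latent block and its purely classical replacement. The sketch below sizes a small MLP to a given circuit's parameter count; the three-angles-per-qubit-per-layer counting rule and the layer shapes are assumptions for illustration, not the paper's configuration.

```python
# Sizing a classical stand-in for the quantum latent block (assumed counting rule).
import torch.nn as nn

def pqc_param_count(n_qubits: int, n_layers: int, per_qubit: int = 3) -> int:
    return n_layers * n_qubits * per_qubit

def matched_mlp(in_dim: int, out_dim: int, target_params: int) -> nn.Sequential:
    # Choose hidden width h so that h*(in_dim + 1) + out_dim*(h + 1) ~= target_params.
    h = max(1, (target_params - out_dim) // (in_dim + 1 + out_dim))
    return nn.Sequential(nn.Linear(in_dim, h), nn.Tanh(), nn.Linear(h, out_dim))

target = pqc_param_count(n_qubits=6, n_layers=3)           # 54 parameters
ablation = matched_mlp(in_dim=6, out_dim=6, target_params=target)
actual = sum(p.numel() for p in ablation.parameters())
print(f"PQC budget {target}, classical ablation uses {actual} parameters")
```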

Figures

Figures reproduced from arXiv: 2604.28009 by J. Li, J.-Z. Han, M. Xue, X. Lv, Y.-X. Xiao, Z.-H. Zhang, Z. Zheng.

Figure 1. Overall architecture of the proposed hybrid quantum–classical policy network within the partially observed quantum disentanglement framework. The lower panel shows the reinforcement-learning interaction loop between the quantum environment and the agent, where the policy receives reduced-state observations and outputs an action at each step. The upper panel expands the policy network into three modules: a…
Figure 2. Comparison of disentanglement performance across preprocessing models under fully entangled 4-, 5-, and 6-qubit settings. The statistics are computed over 500 randomly generated fully entangled initial states for each system size. The left panel shows the disentanglement success rate within a maximum budget of 128 steps. The right panel reports the average disentanglement step count over successful episodes…
Figure 3. Pattern-wise comparison on 6-qubit disentanglement tasks. Different preprocessing models are compared across representative 6-qubit initial-state patterns. The statistics and plotting conventions are the same as in…
Figure 4. Performance–parameter trade-off across preprocessing models and PQC configurations. Different preprocessing models and representative PQC configurations are compared in the success-rate–parameter plane. The horizontal axis shows the total number of trainable parameters of the complete model, while the vertical axis reports the disentanglement success rate on the 6-qubit fully entangled task. The statistic…
Figure 5. Training dynamics from single-qubit entropy evolution. Single-qubit average entanglement-entropy distributions are shown at representative training epochs over a total of 8000 iterations. The evaluation uses the same set of 500 initial states for all models, under the same testing setting as in the preceding experiments. Each column corresponds to one preprocessing architecture, and each row corresponds to…
Figure 6. Disentangling steps for the Transformer-based policy on a six-qubit test instance.
Figure 7. Disentangling steps for the LSTM-based policy on a six-qubit test instance.
Figure 8. Disentangling steps for the GRU-based policy on a six-qubit test instance.
Figure 9. Disentangling steps for the 1D-CNN-based policy on a six-qubit test instance.
Figure 10. Disentangling steps for the 2D-CNN-based policy on a six-qubit test instance.
Figure 11. Disentangling steps for the MLP-based policy on a six-qubit test instance.
read the original abstract

Quantum control with restricted state access is central to near-term quantum devices, where full wave-function information is unavailable. We study this problem through multiqubit disentanglement scheduling from partial observations, where a controller receives only two-qubit reduced density matrices and selects which qubit pair to disentangle at each step. We introduce a modular hybrid quantum-classical policy framework consisting of classical preprocessing, a parameterized quantum circuit as a compact nonlinear latent block, and classical postprocessing for pair-selection probabilities. Benchmarking 4-, 5-, and 6-qubit tasks, we find that preprocessing is the dominant factor governing performance under reduced-state observations, while the quantum module provides a conditional compact representation whose utility depends on the input features and model budget. We further identify a performance-efficiency trade-off across policy families and find that increasing circuit width is generally more useful than increasing depth. These results provide practical design principles for hybrid policies in reduced-information quantum control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript introduces a modular hybrid quantum-classical policy for multiqubit disentanglement scheduling under restricted observations consisting only of two-qubit reduced density matrices. The policy comprises classical preprocessing, a parameterized quantum circuit acting as a compact nonlinear latent block, and classical postprocessing to output pair-selection probabilities. Through benchmarking on simulated 4-, 5-, and 6-qubit disentanglement tasks, the authors conclude that classical preprocessing dominates performance, the quantum module offers only conditional utility that depends on input features and model budget, and that increasing circuit width is generally more beneficial than increasing depth for the observed performance-efficiency trade-offs.

Significance. If the empirical results prove robust under proper controls, the work supplies practical design heuristics for hybrid policies in near-term quantum control problems with partial state access. The modular separation of components is a clear strength that enables targeted study of the quantum block's contribution. The identification of preprocessing dominance and width-over-depth preference could inform resource allocation in hybrid RL architectures, though the small system sizes and simulated setting limit immediate applicability to larger or noisy hardware scenarios.

major comments (3)
  1. [Results section] Benchmarking on 4- to 6-qubit tasks: The central claim that preprocessing is the dominant factor and that the quantum module provides only conditional utility is not supported by ablations that replace the parameterized quantum circuit with a classical network of matched parameter count or capacity. Without such controls, observed performance gaps could be attributable to differences in optimization effort, hyperparameter tuning, or effective model capacity rather than the hybrid architecture itself.
  2. [Experimental setup] Experimental setup and benchmarking paragraphs: No details are supplied on the number of independent training runs, random seeds, error bars, data splits, or statistical significance testing for the reported performance metrics and trade-offs across policy families. This absence prevents evaluation of whether the performance-efficiency conclusions are statistically reliable or sensitive to training stochasticity.
  3. [Methods and results] The reduced-state observation model (two-qubit RDMs) and the specific choice of which pairs are provided at each step are not accompanied by controls showing that the tasks are not specially amenable to classical preprocessing; the paper does not report comparisons against stronger classical baselines or discuss potential task-specific biases that could inflate the reported preprocessing dominance.
minor comments (3)
  1. [Methods] Notation for the two-qubit reduced density matrices and the policy output probabilities should be defined more explicitly in the methods section to avoid ambiguity when describing the input features to the quantum circuit.
  2. [Figures] Figure captions for the performance plots should include the exact number of runs and any error-bar conventions used, as well as the precise hyperparameter settings for each policy family.
  3. [Abstract and introduction] The abstract and introduction would benefit from a brief statement of the total number of trainable parameters in each policy variant to make the width-versus-depth comparison more transparent.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comments identify key areas where additional controls and statistical details will strengthen the empirical claims about the hybrid policy's design principles. We address each major comment point by point below, indicating the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Results section] The central claim that preprocessing is the dominant factor and that the quantum module provides only conditional utility is not supported by ablations that replace the parameterized quantum circuit with a classical network of matched parameter count or capacity. Without such controls, observed performance gaps could be attributable to differences in optimization effort, hyperparameter tuning, or effective model capacity rather than the hybrid architecture itself.

    Authors: We agree that parameter-matched ablations are required to isolate the quantum module's contribution from capacity or optimization effects. Our existing comparisons include hybrid variants against classical preprocessing-plus-postprocessing policies, but these do not enforce exact parameter parity in the latent representation block. In the revised manuscript we will add explicit ablations in which the PQC is replaced by classical MLPs (and, where relevant, deeper feed-forward networks) possessing identical parameter counts; we will report the resulting performance deltas and optimization curves to substantiate the conditional utility claim. revision: yes

  2. Referee: [Experimental setup] No details are supplied on the number of independent training runs, random seeds, error bars, data splits, or statistical significance testing for the reported performance metrics and trade-offs across policy families. This absence prevents evaluation of whether the performance-efficiency conclusions are statistically reliable or sensitive to training stochasticity.

    Authors: We thank the referee for highlighting this omission. The original text omitted these details for space. We will expand the experimental-setup and benchmarking sections to state that every reported metric is the mean over 10 independent training runs using distinct random seeds (0-9). Standard-deviation error bars will be added to all plots, and paired t-tests will be performed and reported for the principal performance and efficiency comparisons. Because the environments are fully simulated, there are no fixed data splits; we will instead describe the episode-sampling procedure and confirm that training and evaluation episodes are drawn from independent random initializations (see the aggregation sketch after these responses). revision: yes

  3. Referee: [Methods and results] The reduced-state observation model (two-qubit RDMs) and the specific choice of which pairs are provided at each step are not accompanied by controls showing that the tasks are not specially amenable to classical preprocessing; the paper does not report comparisons against stronger classical baselines or discuss potential task-specific biases that could inflate the reported preprocessing dominance.

    Authors: We acknowledge that stronger controls are needed to rule out task-specific biases. The current experiments already compare the hybrid policy against purely classical preprocessing-plus-postprocessing baselines, which demonstrate preprocessing dominance. To address the referee's concern we will add, in the revision, results from stronger classical baselines (deeper MLPs, graph neural networks operating on the qubit-interaction graph, and a simple entanglement-heuristic policy derived directly from the RDMs). We will also insert a short discussion of possible observation-model biases and argue why the chosen disentanglement scheduling tasks remain representative of reduced-access quantum control problems. The observation model itself will be clarified: at each step the policy receives the full set of two-qubit RDMs for every pair and must select which pair to act upon. revision: yes
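
Picking up the statistical plan in response 2, the sketch below shows the kind of aggregation and paired testing described there, assuming per-seed success rates for two policy variants evaluated on the same ten seeds; the numbers are placeholders, not results from the paper.

```python
# Aggregation over seeds plus a paired t-test (placeholder data, illustrative only).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
hybrid = rng.uniform(0.80, 0.95, size=10)      # success rate per seed, hybrid policy
classical = rng.uniform(0.75, 0.92, size=10)   # same seeds, classical baseline

print(f"hybrid    {hybrid.mean():.3f} ± {hybrid.std(ddof=1):.3f}")
print(f"classical {classical.mean():.3f} ± {classical.std(ddof=1):.3f}")

t_stat, p_value = stats.ttest_rel(hybrid, classical)   # paired across seeds
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.3f}")
```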

Circularity Check

0 steps flagged

No circularity: empirical benchmarking of hybrid policies rests on explicit task simulations and modular architecture definitions without reduction to fitted inputs or self-citations.

full rationale

The paper introduces a modular hybrid policy (classical preprocessing + parameterized quantum circuit latent block + classical postprocessing) and reports performance on simulated 4-6 qubit disentanglement scheduling tasks under two-qubit reduced density matrix observations. Central claims identify preprocessing dominance and width-over-depth preference as outcomes of direct benchmarking comparisons across policy families. No derivation chain, uniqueness theorem, or first-principles prediction is asserted that could reduce by construction to the inputs; the framework components are defined explicitly and independently, with results presented as observed empirical patterns rather than renamed known results or self-referential fits. The approach is self-contained against external benchmarks because performance metrics derive from the stated task definitions and architecture choices without load-bearing self-citations or ansatz smuggling.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on standard quantum information assumptions about reduced states and on the trainability of the hybrid policy; no new physical entities are postulated.

free parameters (1)
  • PQC parameters
    Trainable parameters inside the parameterized quantum circuit are optimized during policy learning and directly affect the latent representation.
axioms (1)
  • domain assumption: Two-qubit reduced density matrices contain sufficient information to learn effective disentanglement policies
    This premise defines the partial-observation setting and is invoked throughout the problem formulation and benchmarking.
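
To make this premise concrete, the sketch below computes the two-qubit reduced density matrix for every pair of an n-qubit pure state by partial trace, which is exactly the observation the policy receives at each step; the "lowest purity first" rule at the end is our own illustrative heuristic, not a baseline reported in the paper.

```python
# Two-qubit reduced-state observations from an n-qubit pure state (illustrative).
import itertools
import numpy as np

def two_qubit_rdm(psi: np.ndarray, i: int, j: int, n: int) -> np.ndarray:
    """Partial trace of |psi><psi| down to qubits (i, j) of an n-qubit pure state."""
    t = psi.reshape([2] * n)
    rest = [k for k in range(n) if k not in (i, j)]
    t = np.transpose(t, [i, j] + rest)          # bring qubits i and j to the front
    t = t.reshape(4, 2 ** (n - 2))
    return t @ t.conj().T                        # 4 x 4 reduced density matrix

n = 6
rng = np.random.default_rng(0)
psi = rng.normal(size=2 ** n) + 1j * rng.normal(size=2 ** n)
psi /= np.linalg.norm(psi)                       # random (generically entangled) state

# The policy's observation: the full set of two-qubit RDMs, one per pair.
observations = {pair: two_qubit_rdm(psi, *pair, n)
                for pair in itertools.combinations(range(n), 2)}

# Illustrative heuristic (not the paper's): act on the most mixed pair first.
purities = {pair: float(np.real(np.trace(rho @ rho))) for pair, rho in observations.items()}
target_pair = min(purities, key=purities.get)
print(f"pair with lowest purity: {target_pair}, purity = {purities[target_pair]:.3f}")
```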

pith-pipeline@v0.9.0 · 5480 in / 1427 out tokens · 66295 ms · 2026-05-07T06:28:05.067896+00:00 · methodology

discussion (0)

