Fast State Stabilization using Deep Reinforcement Learning for Measurement-based Quantum Feedback Control

Chunxiang Song; Daoyi Dong; Hidehiro Yonezawa; Yanan Liu

arXiv: 2408.11328 · v2 · submitted 2024-08-21 · eess.SY · cs.SY

Pith reviewed 2026-05-23 21:48 UTC · model grok-4.3

keywords: quantum state stabilization · deep reinforcement learning · measurement-based feedback · quantum feedback control · entangled states · decoherence mitigation · multi-qubit systems

The pith

Deep reinforcement learning stabilizes random quantum states to target entangled states faster than Lyapunov control using measurement feedback.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that deep reinforcement learning can design measurement-based feedback controls for quantum systems without needing explicit mappings between measurements and actions. The approach drives random initial states to a desired entangled state more quickly than traditional Lyapunov methods or other DRL reward designs. Faster stabilization shortens the period of interaction with the environment and thereby limits decoherence. Simulations confirm success on two-qubit and three-qubit systems, together with robustness to imperfect measurements and time delays. A sympathetic reader would care because shorter exposure to decoherence helps preserve the entanglement that quantum technologies rely on.

Core claim

Applying a deep reinforcement learning algorithm to measurement information drives random initial quantum states to a target entangled state in two- and three-qubit systems with shorter convergence times than Lyapunov feedback control or several alternative DRL formulations, while retaining performance under imperfect measurements and delays in system evolution.

What carries the argument

A deep reinforcement learning policy, trained on measurement outcomes, that generates control signals for quantum feedback without an explicitly constructed measurement-to-control mapping.

Load-bearing premise

The simulated quantum dynamics, measurement model, and environmental interactions match physical hardware closely enough that a policy trained in simulation will work similarly on real devices.

What would settle it

An experiment on a physical two-qubit device that measures whether the learned DRL policy reaches the target state faster than Lyapunov control under actual noise, measurement error, and delay.

Figures

Figures reproduced from arXiv: 2408.11328 by Chunxiang Song, Daoyi Dong, Hidehiro Yonezawa, Yanan Liu.

**Figure 2.** Example Reward Curves for Different Parameter Combinations of …

**Figure 3.** Reward Function. (a) When Dρt is between 0 and d, we consider that the current system is almost approaching the target state. A smaller Dρt value corresponds to a larger positive reward. When Dρt = 0, the maximum reward r = RP is obtained. (b) When Dρt is between d and 1, we consider that there is still distance between the current state and the target state. At Dρt = 1, the maximum negative reward…

**Figure 4.** Evolution of the distance Dρt for 50 random initial states stabilized to the target Two-Qubit state under the control of the DRL agent (blue) and Lyapunov method (orange). The average stabilization time to the target under DRL control is 4.59 a.u., while the Lyapunov method requires an average time of 5.86 a.u. (The stabilization time is defined as the time when the distance Dρt ≤ 0.001.) The light blue (…

**Figure 5.** Evolution of the distance Dρt for 50 random initial states stabilized to the target GHZ state under the control of the DRL agent (blue) and Lyapunov method (orange). The average stabilization time to the target under DRL control is 10.41 a.u., while the Lyapunov method requires an average time of 12.33 a.u. (The stabilization time is defined as the time when the distance Dρt ≤ 0.001.) The light blue (oran…

**Figure 7.** One specific (not averaged) evolutionary trajectory of the distance …

**Figure 8.** With a measurement efficiency ηc = 0.8, under the control of the DRL agent and the Lyapunov-based method, 50 random initial states stably evolve to the target GHZ state. The light lines represent the average evolution trajectories for different initial states, and the dark blue line represents the average trajectory across all different initial states.

**Figure 10.** The effect of e and f parameters in the PNR reward function on the performance of the DRL algorithm. PNR: e = 2 and f = 10; PNR1: e = 10, f = 2.

**Figure 11.** Effect of reward function on DRL algorithm under partitioning

**Figure 12.** Effect of DRL algorithms with non-partitioned reward functions
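The partitioned reward described in the Figure 3 caption can be sketched as follows. This is a minimal illustration and not the paper's formula: the caption only fixes a threshold d, a maximum reward RP at Dρt = 0, and a maximum negative reward at Dρt = 1. The linear interpolation inside each region, the name `r_n` for the size of the largest penalty, and all default values are assumptions made for illustration.

```python
def partitioned_reward(distance, d=0.1, r_p=1.0, r_n=1.0):
    """Hypothetical piecewise reward on a distance in [0, 1].

    d   : threshold separating the "near target" and "far" regions (assumed value)
    r_p : maximum positive reward, obtained at distance 0
    r_n : magnitude of the largest penalty, obtained at distance 1 (assumed name)
    """
    if not 0.0 <= distance <= 1.0:
        raise ValueError("distance must lie in [0, 1]")
    if not 0.0 < d < 1.0:
        raise ValueError("threshold d must lie strictly between 0 and 1")
    if distance <= d:
        # Near-target region: reward falls linearly from r_p at 0 to 0 at d.
        return r_p * (1.0 - distance / d)
    # Far region: penalty grows linearly from 0 at d to -r_n at 1.
    return -r_n * (distance - d) / (1.0 - d)
```

Under these assumptions the reward is continuous at the threshold and changes sign there, so the agent is rewarded only once the state is within d of the target.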

Original abstract

The stabilization of quantum states is a fundamental problem for realizing various quantum technologies. Measurement-based feedback strategies have demonstrated powerful performance, and the construction of quantum control signals using measurement information has attracted great interest. However, the interaction between quantum systems and the environment is inevitable, especially when measurements are introduced, which leads to decoherence. To mitigate decoherence, it is desirable to stabilize quantum systems faster, thereby reducing the time of interaction with the environment. In this paper, we utilize information obtained from measurement and apply deep reinforcement learning (DRL) algorithms, without explicitly constructing specific complex measurement-control mappings, to rapidly drive random initial quantum states to the target state. The proposed DRL algorithm has the ability to speed up the convergence to a target state, which shortens the interaction between quantum systems and their environments to protect coherence. Simulations are performed on two-qubit and three-qubit systems, and the results show that our algorithm can successfully stabilize random initial quantum states to the target entangled state, with faster convergence than traditional methods such as Lyapunov feedback control and several DRL algorithms with different reward functions. Moreover, it exhibits robustness against imperfect measurements and delays in system evolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

DRL gives faster simulated stabilization on 2- and 3-qubit systems than Lyapunov or alternate rewards, but the methods section is too thin to judge reliability.

The paper's core result is that a DRL policy, trained on measurement outcomes, drives random initial states to target entangled states quicker than Lyapunov control and several other DRL reward choices in 2- and 3-qubit simulations. It also reports some tolerance to modeled measurement error and evolution delays. That ordering is the main new piece; prior work already used DRL for quantum feedback, but the explicit speed focus and side-by-side reward variants are a reasonable incremental step.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes applying deep reinforcement learning (DRL) to measurement-based quantum feedback control to stabilize random initial states to target entangled states in two- and three-qubit systems. Simulations are used to claim faster convergence times than Lyapunov feedback control and alternative DRL reward designs, along with robustness to modeled measurement imperfections and evolution delays.

Significance. If the simulation results hold under the reported conditions, the work supplies an empirical demonstration that DRL can outperform both classical Lyapunov control and other DRL variants on small-scale quantum stabilization tasks, potentially shortening interaction times with the environment. The explicit comparisons across reward functions and the inclusion of robustness tests constitute a strength of the empirical evaluation.

major comments (3)

[§4 and §5] §4 (Simulation Setup) and §5 (Results): The central claim of faster convergence rests on simulation outcomes, yet the manuscript supplies no quantitative specification of the DRL network architecture (layers, units), training procedure (optimizer, episode count, batch size), or statistical measures (mean and standard deviation of convergence time over repeated trials) needed to substantiate the reported performance ordering versus Lyapunov control and other DRL variants.
[§3.2] §3.2 (Reward Function Design): The paper states that several DRL algorithms with different reward functions were compared, but the explicit mathematical forms of those reward functions and the rationale for their selection are not provided; without these definitions it is impossible to assess whether the reported speed-up is attributable to the proposed reward or to other implementation choices.
[§5.3] §5.3 (Robustness Tests): Robustness against imperfect measurements and delays is asserted, but the specific ranges of measurement error probabilities and delay durations, together with the quantitative metrics used to quantify degradation, are not reported, leaving the scope of the robustness claim unclear.

minor comments (2)

[Figures in §5] Figure captions and axis labels in the simulation results should explicitly state the number of Monte Carlo runs and the precise definition of convergence time.
[§2] The system model section would benefit from a compact table listing the qubit Hamiltonians, measurement operators, and target states used for the two- and three-qubit cases.
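The convergence-time definition quoted in the captions of Figures 4 and 5 (the time when the distance Dρt ≤ 0.001) and the per-run statistics requested in the major comments can be sketched as below. This is a minimal sketch under stated assumptions: the function names are hypothetical, and reading the definition as the first time the threshold is crossed (rather than the time after which the distance stays below it) is an interpretation, not something the captions specify.

```python
from statistics import mean, stdev


def stabilization_time(times, distances, threshold=1e-3):
    """Return the first time at which distance <= threshold, or None if never."""
    for t, dist in zip(times, distances):
        if dist <= threshold:
            return t
    return None


def summarize(stab_times):
    """Mean and sample standard deviation over the runs that reached the threshold."""
    reached = [t for t in stab_times if t is not None]
    if not reached:
        return {"n_runs": len(stab_times), "n_reached": 0, "mean": None, "std": None}
    return {
        "n_runs": len(stab_times),
        "n_reached": len(reached),
        "mean": mean(reached),
        # Sample standard deviation needs at least two values.
        "std": stdev(reached) if len(reached) > 1 else 0.0,
    }
```

Reporting `n_reached` alongside the mean makes it visible when some trajectories never meet the threshold within the simulated horizon, which would otherwise bias the average downward.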

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which will help improve the reproducibility and clarity of the manuscript. We address each major comment below and will incorporate the requested details in the revised version.

Referee: [§4 and §5] §4 (Simulation Setup) and §5 (Results): The central claim of faster convergence rests on simulation outcomes, yet the manuscript supplies no quantitative specification of the DRL network architecture (layers, units), training procedure (optimizer, episode count, batch size), or statistical measures (mean and standard deviation of convergence time over repeated trials) needed to substantiate the reported performance ordering versus Lyapunov control and other DRL variants.

Authors: We agree that these implementation and statistical details are required for full reproducibility and to substantiate the performance claims. In the revised manuscript we will add a complete description of the DRL network architecture (number of layers and units), the training procedure (optimizer, episode count, batch size), and statistical measures (mean and standard deviation of convergence times over repeated independent trials) in §§4 and 5.

Revision: yes

Referee: [§3.2] §3.2 (Reward Function Design): The paper states that several DRL algorithms with different reward functions were compared, but the explicit mathematical forms of those reward functions and the rationale for their selection are not provided; without these definitions it is impossible to assess whether the reported speed-up is attributable to the proposed reward or to other implementation choices.

Authors: We acknowledge that the explicit mathematical expressions and design rationale are missing. In the revision we will insert the precise formulas for each reward function examined in §3.2 together with a paragraph explaining the motivation behind each choice, enabling readers to evaluate the contribution of the proposed reward.

Revision: yes

Referee: [§5.3] §5.3 (Robustness Tests): Robustness against imperfect measurements and delays is asserted, but the specific ranges of measurement error probabilities and delay durations, together with the quantitative metrics used to quantify degradation, are not reported, leaving the scope of the robustness claim unclear.

Authors: We agree that the tested ranges and evaluation metrics must be stated explicitly. The revised §5.3 will report the specific intervals of measurement-error probabilities and delay durations examined, as well as the quantitative metrics (e.g., mean convergence time and success rate) used to measure performance degradation.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper reports empirical simulation outcomes on 2- and 3-qubit systems in which a trained DRL policy reaches target entangled states faster than Lyapunov feedback and alternative DRL reward designs, plus robustness to modeled imperfections. These results are generated by executing the learned policy inside the same simulator used for training; the reported convergence times and robustness metrics are not algebraically forced by the reward function or by any self-citation chain, nor do they rename a known pattern as a derivation. The abstract and reader summary contain no load-bearing self-citation, uniqueness theorem, or fitted-parameter prediction that reduces to the input by construction; the performance ordering is therefore an independent empirical observation within the simulation protocol.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim depends on the fidelity of the quantum simulation model and on the ability of DRL training to discover effective policies; both rest on standard quantum mechanics plus standard RL assumptions plus many tunable hyperparameters whose values are not reported.

free parameters (2)

DRL network architecture and learning hyperparameters
Network depth, width, learning rate, discount factor, and exploration schedule are chosen to achieve the reported performance.
Reward function design parameters
The paper compares several reward functions, implying that the functional form and scaling constants were selected or tuned.

axioms (2)

Domain assumption: The quantum system obeys the standard Lindblad master equation or equivalent Markovian evolution under the chosen Hamiltonian and measurement operators.
Required for all simulation results; invoked implicitly throughout the abstract's description of stabilization dynamics.
Domain assumption: The control problem can be cast as a Markov decision process with the chosen observation and action spaces.
Necessary to apply standard DRL algorithms to the feedback task.
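To make the first assumption concrete, the sketch below takes one Euler-Maruyama step of a diffusive stochastic master equation, a common textbook form for continuously measured Markovian systems: dρ = −i[H, ρ]dt + (LρL† − ½{L†L, ρ})dt + √η (Lρ + ρL† − Tr[(L + L†)ρ]ρ)dW. Whether the paper uses exactly this form is not stated in this summary. The single-qubit setting, the choices H = u·σy and L = σz, the constant control amplitude `u`, and every function name are illustrative assumptions; only the measurement efficiency value 0.8 echoes the Figure 8 caption.

```python
import math
import random

# 2x2 complex matrices are stored as nested lists [[a, b], [c, d]].

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]

def dagger(a):
    return [[a[j][i].conjugate() for j in range(2)] for i in range(2)]

def add(*mats):
    return [[sum(m[i][j] for m in mats) for j in range(2)] for i in range(2)]

def scale(c, a):
    return [[c * a[i][j] for j in range(2)] for i in range(2)]

def trace(a):
    return a[0][0] + a[1][1]

def sme_step(rho, h, l, eta, dt, dw):
    """One Euler-Maruyama step of the diffusive stochastic master equation."""
    ld = dagger(l)
    ldl = matmul(ld, l)
    # Hamiltonian part: -i [H, rho]
    unitary = scale(-1j, add(matmul(h, rho), scale(-1, matmul(rho, h))))
    # Dissipator: L rho L^dagger - 1/2 {L^dagger L, rho}
    dissip = add(matmul(matmul(l, rho), ld),
                 scale(-0.5, matmul(ldl, rho)),
                 scale(-0.5, matmul(rho, ldl)))
    # Innovation: L rho + rho L^dagger - Tr[(L + L^dagger) rho] rho
    lr, rl = matmul(l, rho), matmul(rho, ld)
    expect = trace(add(lr, rl)).real
    innov = add(lr, rl, scale(-expect, rho))
    return add(rho, scale(dt, add(unitary, dissip)), scale(math.sqrt(eta) * dw, innov))

def simulate(n_steps=200, dt=1e-3, eta=0.8, u=0.5, seed=0):
    """Integrate one trajectory with a constant (open-loop) control amplitude u."""
    rng = random.Random(seed)
    sigma_y = [[0, -1j], [1j, 0]]
    sigma_z = [[1, 0], [0, -1]]
    h = scale(u, sigma_y)               # assumed control Hamiltonian
    rho = [[0.5, 0.5], [0.5, 0.5]]      # start in the |+> state
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Wiener increment with variance dt
        rho = sme_step(rho, h, sigma_z, eta, dt, dw)
    return rho
```

A feedback policy would replace the constant `u` with a value chosen at each step from the measurement record. The plain Euler step keeps the trace and Hermiticity of ρ but does not guarantee positivity, so a production simulator would typically use a smaller step or a positivity-preserving scheme.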

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning-Based Design of LQG Controllers in Quantum Coherent Feedback
quant-ph 2025-02 unverdicted novelty 5.0

A customized differential evolution algorithm designs LQG controllers for a quantum optical system, achieving lower performance indices while satisfying physical realizability constraints.


[1] [1]

Dong and I

D. Dong and I. R. Petersen, Learning and Robust Control in Quantum Technology. Springer Nature, 2023

work page 2023

[2] [2]

Quantum teleportation using three- particle entanglement,

A. Karlsson and M. Bourennane, “Quantum teleportation using three- particle entanglement,” Physical Review A , vol. 58, no. 6, p. 4394, 1998

work page 1998

[3] [3]

Quantum algorithms: entanglement–enhanced information processing,

A. Ekert and R. Jozsa, “Quantum algorithms: entanglement–enhanced information processing,” Philosophical Transactions of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences, vol. 356, no. 1743, pp. 1769–1782, 1998

work page 1998

[4] [4]

On the role of entanglement in quantum- computational speed-up,

R. Jozsa and N. Linden, “On the role of entanglement in quantum- computational speed-up,” Proceedings of the Royal Society of London. Series A: Mathematical, Physical and Engineering Sciences , vol. 459, no. 2036, pp. 2011–2032, 2003

work page 2036

[5] [5]

Quantum experiments and graphs. III. high-dimensional and multiparticle entanglement,

X. Gu, L. Chen, A. Zeilinger, and M. Krenn, “Quantum experiments and graphs. III. high-dimensional and multiparticle entanglement,” Physical Review A , vol. 99, no. 3, p. 032338, 2019

work page 2019

[6] [6]

Rapid Lyapunov control of finite-dimensional quantum systems,

S. Kuang, D. Dong, and I. R. Petersen, “Rapid Lyapunov control of finite-dimensional quantum systems,” Automatica, vol. 81, pp. 164– 175, 2017

work page 2017

[7] [7]

Two- step feedback preparation of entanglement for qubit systems with time delay,

Y . Liu, D. Dong, S. Kuang, I. R. Petersen, and H. Yonezawa, “Two- step feedback preparation of entanglement for qubit systems with time delay,” Automatica, vol. 125, p. 109174, 2021

work page 2021

[8] [8]

Lyapunov-based feedback preparation of GHZ entanglement of N-qubit systems,

Y . Liu, S. Kuang, and S. Cong, “Lyapunov-based feedback preparation of GHZ entanglement of N-qubit systems,” IEEE Transactions on Cybernetics, vol. 47, no. 11, pp. 3827–3839, 2016

work page 2016

[9] [9]

Teaching lasers to control molecules,

R. S. Judson and H. Rabitz, “Teaching lasers to control molecules,” Physical Review Letters , vol. 68, no. 10, p. 1500, 1992

work page 1992

[10] [10]

Quantum estimation, control and learn- ing: opportunities and challenges,

D. Dong and I. R. Petersen, “Quantum estimation, control and learn- ing: opportunities and challenges,” Annual Reviews in Control, vol. 54, pp. 243–251, 2022

work page 2022

[11] [11]

Learning control of quantum systems,

D. Dong, “Learning control of quantum systems,” in Encyclopedia of Systems and Control , J. Baillieul and T. Samad, Eds. Springer London, 2020, https://doi.org/10.1007/978-1-4471-5102-9 100161-1

work page doi:10.1007/978-1-4471-5102-9 2020

[12] [12]

Genetic algorithm optimization of laser pulses for molecular quantum state excitation,

S. Sharma, H. Singh, and G. G. Balint-Kurti, “Genetic algorithm optimization of laser pulses for molecular quantum state excitation,” The Journal of Chemical Physics , vol. 132, no. 6, p. 064108, 2010

work page 2010

[13] [13]

O. M. Shir, Niching in derandomized evolution strategies and its applications in quantum control . Leiden University, 2008

work page 2008

[14] [14]

Sampling-based learning control for quantum systems with uncertainties,

D. Dong, M. A. Mabrok, I. R. Petersen, B. Qi, C. Chen, and H. Rabitz, “Sampling-based learning control for quantum systems with uncertainties,” IEEE Transactions on Control Systems Technology , vol. 23, no. 6, pp. 2155–2166, 2015

work page 2015

[15] [15]

Quantum theory of continuous feedback,

H. M. Wiseman, “Quantum theory of continuous feedback,” Physical Review A, vol. 49, no. 3, p. 2133, 1994

work page 1994

[16] M. Bukov, A. G. Day, D. Sels, P. Weinberg, A. Polkovnikov, and P. Mehta, “Reinforcement learning in different phases of quantum control,” Physical Review X, vol. 8, no. 3, p. 031086, 2018

[17] V. Sivak, A. Eickbusch, H. Liu, B. Royer, I. Tsioutsios, and M. Devoret, “Model-free quantum control with reinforcement learning,” Physical Review X, vol. 12, no. 1, p. 011059, 2022

[18] M. M. Wauters, E. Panizon, G. B. Mbeng, and G. E. Santoro, “Reinforcement-learning-assisted quantum optimization,” Physical Review Research, vol. 2, no. 3, p. 033446, 2020

[19] J. Yao, L. Lin, and M. Bukov, “Reinforcement learning for many-body ground-state preparation inspired by counterdiabatic driving,” Physical Review X, vol. 11, no. 3, p. 031070, 2021

[20] S. Borah, B. Sarma, M. Kewming, G. J. Milburn, and J. Twamley, “Measurement-based feedback quantum control with deep reinforcement learning for a double-well nonlinear potential,” Physical Review Letters, vol. 127, no. 19, p. 190403, 2021

[21] R. Porotti, A. Essig, B. Huard, and F. Marquardt, “Deep reinforcement learning for quantum state preparation with weak nonlinear measurements,” Quantum, vol. 6, p. 747, 2022

[22] A. Perret and Y. Bérubé-Lauzière, “Preparation of cavity-Fock-state superpositions by reinforcement learning exploiting measurement backaction,” Physical Review A, vol. 109, no. 2, p. 022609, 2024

[23] A. C. Doherty, S. Habib, K. Jacobs, H. Mabuchi, and S. M. Tan, “Quantum feedback control and classical control theory,” Physical Review A, vol. 62, no. 1, p. 012105, 2000

[24] K. Jacobs and D. A. Steck, “A straightforward introduction to continuous quantum measurement,” Contemporary Physics, vol. 47, no. 5, pp. 279–303, 2006

[25] H. M. Wiseman and G. J. Milburn, Quantum Measurement and Control. Cambridge University Press, 2009

[26] M. Van Otterlo and M. Wiering, “Reinforcement learning and Markov decision processes,” Reinforcement Learning: State-of-the-Art, pp. 3–42, 2012

[27] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing Atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013

[28] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy gradient methods for reinforcement learning with function approximation,” Advances in Neural Information Processing Systems, vol. 12, 1999

[29] V. Konda and J. Tsitsiklis, “Actor-critic algorithms,” Advances in Neural Information Processing Systems, vol. 12, 1999

[30] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

[31] B. Qi, Z. Hou, L. Li, D. Dong, G. Xiang, and G. Guo, “Quantum state tomography via linear regression estimation,” Scientific Reports, vol. 3, no. 1, p. 3496, 2013

[32] J. A. Smolin, J. M. Gambetta, and G. Smith, “Efficient method for computing the maximum-likelihood quantum state from measurements with additive Gaussian noise,” Physical Review Letters, vol. 108, no. 7, p. 070502, 2012

[33] D. Hadfield-Menell, S. Milli, P. Abbeel, S. J. Russell, and A. Dragan, “Inverse reward design,” Advances in Neural Information Processing Systems, vol. 30, 2017

[34] A. Raffin, A. Hill, A. Gleave, A. Kanervisto, M. Ernestus, and N. Dormann, “Stable-baselines3: Reliable reinforcement learning implementations,” Journal of Machine Learning Research, vol. 22, no. 268, pp. 1–8, 2021

[35] M. Towers, A. Kwiatkowski, J. Terry, J. U. Balis, G. De Cola, T. Deleu, M. Goulão, A. Kallinteris, M. Krimmel, A. KG et al., “Gymnasium: A standard interface for reinforcement learning environments,” arXiv preprint arXiv:2407.17032, 2024

[36] M. Mirrahimi and R. van Handel, “Stabilizing feedback controls for quantum systems,” SIAM Journal on Control and Optimization, vol. 46, no. 2, pp. 445–467, 2007

[37] D. M. Greenberger, M. A. Horne, and A. Zeilinger, “Going beyond Bell’s theorem,” in Bell’s theorem, quantum theory and conceptions of the universe. Springer, 1989, pp. 69–72

[38] W. Dür, G. Vidal, and J. I. Cirac, “Three qubits can be entangled in two inequivalent ways,” Physical Review A, vol. 62, no. 6, p. 062314, 2000

[39] T. Monz, P. Schindler, J. T. Barreiro, M. Chwalla, D. Nigg, W. A. Coish, M. Harlander, W. Hänsel, M. Hennrich, and R. Blatt, “14-qubit entanglement: Creation and coherence,” Physical Review Letters, vol. 106, no. 13, p. 130506, 2011

[40] S. Kuang, G. Li, Y. Liu, X. Sun, and S. Cong, “Rapid feedback stabilization of quantum systems with application to preparation of multiqubit entangled states,” IEEE Transactions on Cybernetics, vol. 52, no. 10, pp. 11 213–11 225, 2021

[41] J. Zhang, Y.-x. Liu, R.-B. Wu, K. Jacobs, and F. Nori, “Quantum feedback: theory, experiments, and applications,” Physics Reports, vol. 679, pp. 1–60, 2017

[42] T. Xie, Y. Ma, and Y.-X. Wang, “Towards optimal off-policy evaluation for reinforcement learning with marginalized importance sampling,” Advances in Neural Information Processing Systems, vol. 32, 2019

[43] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018

[44] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, “High-dimensional continuous control using generalized advantage estimation,” arXiv preprint arXiv:1506.02438, 2015

[45] X. B. Peng, P. Abbeel, S. Levine, and M. Van de Panne, “Deepmimic: Example-guided deep reinforcement learning of physics-based character skills,” ACM Transactions on Graphics (TOG), vol. 37, no. 4, pp. 1–14, 2018