arxiv: 2604.19990 · v1 · submitted 2026-04-21 · 🪐 quant-ph

Recognition: unknown

Reinforcement Learning for Robust Calibration of Multi-Qudit Quantum Gates

Amine Jaouadi, Sahel Ashhab

Pith reviewed 2026-05-10 02:07 UTC · model grok-4.3

keywords: reinforcement learning, quantum gates, qutrits, optimal control, gate calibration, robustness, qudit systems, model mismatch

The pith

Reinforcement learning learns small corrections to optimal control pulses to produce robust controlled-phase gates on qutrits despite parameter uncertainties.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Qutrits offer advantages over qubits for quantum information but their dense energy levels make high-fidelity gates hard to achieve when real devices deviate from ideal models. The work first applies optimal control to create strong nominal pulses for a perfect system, then uses contextual reinforcement learning to find small adjustments that depend on specific parameter drifts. These adjustments keep the original pulse quality intact while adding resistance to mismatches. Readers would care because the method shows how to adapt sensitive theoretical designs for practical hardware without starting over each time parameters change.

Core claim

Optimal control is first used to design high-fidelity control pulses for a nominal system model. Reinforcement learning is then employed as a calibration stage that learns small residual corrections to these pulses in the presence of static model mismatch, thereby preserving good gate performance under realistic parameter uncertainties. By learning structured, low-dimensional residual corrections conditioned on device-specific parameter variations, reinforcement learning enhances the transfer robustness of nominally optimal but parameter-sensitive control solutions across ensembles of devices.

What carries the argument

Contextual deep reinforcement learning that learns low-dimensional residual corrections to the nominal optimal control pulses, conditioned on static parameter variations.
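
The structure described here can be written compactly. The following is only a sketch: the policy symbol $\pi_\theta$, the basis functions $\phi_k$, and the expansion itself are illustrative assumptions, not notation confirmed by the paper.

```latex
\epsilon^{\mathrm{tot}}_j(t;\,o) = \epsilon^{\mathrm{OCT}}_j(t) + \delta\epsilon_j(t;\,o),
\qquad
\delta\epsilon_j(t;\,o) = \sum_{k=1}^{K} c_{j,k}(o)\,\phi_k(t),
\qquad
c(o) = \alpha\,\pi_\theta(o), \quad j = 1, 2 .
```

Here $o$ is the context (the device's parameter deviations), a small $K$ keeps the correction low-dimensional, and a small scale $\alpha$ keeps it a perturbation of the nominal pulse rather than a replacement for it.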

If this is right

The reinforcement learning step complements rather than replaces the optimal control design.
Gate performance stays high under realistic static model mismatches and parameter uncertainties.
Transfer robustness improves across different device instances through the structured corrections.
Overall sensitivity to parameter fluctuations is reduced in a systematic way.
Reinforcement learning functions as a practical calibration tool for high-dimensional quantum gates.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same two-stage approach could extend to other multi-qudit gates and operations beyond controlled-phase.
Corrections pre-learned on simulations might let hardware teams deploy gates faster with only light on-device fine-tuning.
The low-dimensional correction idea points to a broader pattern for making any sensitive quantum control solution more adaptable.
Testing the method on systems with time-varying noise would reveal whether the current static-mismatch focus needs expansion.

Load-bearing premise

Small residual corrections learned by contextual reinforcement learning from simulated parameter variations will transfer to real hardware without extensive additional training or invalidating the original pulses.

What would settle it

Apply the RL-corrected pulses to physical two-qutrit hardware with measured static parameter drifts and check whether gate fidelity stays close to the simulated robust values; a large drop would show the corrections do not transfer.

Figures

Figures reproduced from arXiv: 2604.19990 by Amine Jaouadi, Sahel Ashhab.

**Figure 2.** Reinforcement-learning workflow for residual …

**Figure 3.** Convergence of the GRAPE optimization on …

**Figure 4.** Learning curves on the nominal device …

**Figure 5.** Average gate fidelity on the nominal device for …

**Figure 6.** Gate fidelity on a single static-noise device with …

**Figure 8.** Ensemble-averaged fidelity and standard …

**Figure 7.** Training curves for SAC, TD3, DDPG, and …

**Figure 9.** Ensemble-averaged gate fidelity under imperfect …

**Figure 10.** Same as in Fig …

**Figure 11.** Drive …

**Figure 12.** Top: OCT (GRAPE) convergence for …

**Figure 13.** Training curves for …

**Figure 15.** Optimized control drive …

**Figure 14.** Top: ensemble-averaged fidelity for …

Original abstract

Higher-dimensional quantum systems, such as qudits, offer architectural and algorithmic advantages over qubits, but their increased spectral crowding and limited controllability render high-fidelity quantum gates particularly challenging. We propose a hybrid optimization framework that integrates optimal control theory methods with contextual deep reinforcement learning to achieve robust controlled-phase gates on two qutrits. Optimal control is first used to design high-fidelity control pulses for a nominal system model. Reinforcement learning is then employed as a calibration stage that learns small residual corrections to these pulses in the presence of static model mismatch, thereby preserving good gate performance under realistic parameter uncertainties. By learning structured, low-dimensional residual corrections conditioned on device-specific parameter variations, reinforcement learning enhances the transfer robustness of nominally optimal but parameter-sensitive control solutions across ensembles of devices. Crucially, the reinforcement learning step in our framework does not compete with the optimal control step but provides the adaptability required for realistic hardware, systematically reducing the sensitivity to parameter fluctuations. Our results establish reinforcement learning as a practical and scalable ingredient for robust calibration of quantum gates in high-dimensional systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a simulation-only hybrid of optimal control plus contextual RL for two-qutrit controlled-phase gates that adds small residual corrections under static parameter mismatch.

The core contribution is a two-stage pipeline: first run optimal control on a nominal model to get a high-fidelity pulse, then train a contextual RL agent to learn low-dimensional corrections when device parameters deviate. The RL step is conditioned on the specific variations, so it does not overwrite the nominal solution but only patches it. This staged approach is the main concrete element that is not just a restatement of prior single-method work on quantum calibration. It is presented clearly for the controlled-phase gate on two qutrits and the authors show, in simulation, that the corrections reduce sensitivity to the modeled mismatches while keeping gate fidelity reasonable. That is useful as an existence proof that RL can be slotted in as a calibration layer rather than a full replacement. The writing stays focused on the engineering problem of parameter uncertainty in higher-dimensional systems.

The main limitation is that all results stay inside simulation. The training uses ensembles of static parameter variations, but there are no measurements on real superconducting or trapped-ion hardware, no time-dependent noise, and no drift. It is therefore open whether the learned corrections survive when the actual device deviates from the simulated ensemble in ways the model did not capture. Without that transfer data the robustness claim remains provisional.

The paper is aimed at researchers already working on qudit control or on RL-assisted quantum calibration; a reader looking for a ready-to-deploy method will find it short on experimental grounding. A serious editor should send it to review because the pipeline is specific, the problem is timely, and the simulation evidence is at least internally consistent, but the referees will need to press on the simulation-to-hardware step.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes a hybrid optimization framework that first applies optimal control theory to generate high-fidelity nominal pulses for controlled-phase gates on two qutrits, then uses contextual deep reinforcement learning to learn small residual corrections that compensate for static parameter mismatches. The RL stage is conditioned on device-specific variations and is intended to improve robustness across ensembles without replacing or competing with the optimal-control solution.

Significance. If the RL corrections prove transferable, the approach could supply a practical calibration layer for parameter-sensitive qudit gates, addressing a recognized bottleneck in high-dimensional quantum control. The separation of nominal OC design from low-dimensional RL adaptation is a conceptually clean division of labor that may generalize beyond the two-qutrit case examined.

major comments (2)

[Results and Discussion] The central claim that the learned corrections transfer to realistic hardware while preserving nominal pulses rests entirely on simulated parameter ensembles; no experimental data on superconducting or trapped-ion qudit devices are presented to test whether unmodeled noise, drift, or dynamics invalidate the corrections or require extensive retraining.
[Methods] Quantitative details on the RL training (reward function, number of episodes, network architecture, and how contextual parameter vectors are encoded) are insufficient to assess whether the reported robustness gains are reproducible or merely artifacts of the chosen simulation model.

minor comments (2)

Figure captions and axis labels should explicitly state the range of parameter variations used in the training and test ensembles.
The abstract would benefit from one or two concrete performance numbers (e.g., average fidelity improvement or sensitivity reduction) to substantiate the robustness claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript. We address each major comment below, providing clarifications and indicating planned revisions where the manuscript can be strengthened without altering its core simulation-based scope.

Referee: [Results and Discussion] The central claim that the learned corrections transfer to realistic hardware while preserving nominal pulses rests entirely on simulated parameter ensembles; no experimental data on superconducting or trapped-ion qudit devices are presented to test whether unmodeled noise, drift, or dynamics invalidate the corrections or require extensive retraining.

Authors: We acknowledge that the reported results rely exclusively on numerical simulations of static parameter ensembles, as the manuscript presents a theoretical framework for hybrid optimal-control and contextual RL calibration. This design choice enables controlled, systematic evaluation of robustness across mismatch distributions that would be difficult to access experimentally in a single study. We agree that hardware validation is ultimately required to assess unmodeled effects such as drift and dynamics. In the revised manuscript we will add a new subsection in the Discussion that outlines a concrete experimental roadmap for superconducting qutrit platforms, including protocols for initial pulse transfer, on-device RL fine-tuning, and monitoring for retraining triggers. This addition will clarify the intended transition from simulation to experiment without claiming current hardware results.

revision: partial

Referee: [Methods] Quantitative details on the RL training (reward function, number of episodes, network architecture, and how contextual parameter vectors are encoded) are insufficient to assess whether the reported robustness gains are reproducible or merely artifacts of the chosen simulation model.

Authors: We accept this criticism and will expand the Methods section substantially in the revision. The updated text will specify: (i) the reward function as a weighted sum of negative gate infidelity (computed via process fidelity) and an L2 pulse-energy penalty with explicit coefficients; (ii) training performed for 10^5 episodes using proximal policy optimization with a batch size of 256 and early stopping based on validation infidelity; (iii) the contextual policy network architecture consisting of a 3-layer MLP (128-128-64 units, ReLU activations) with the contextual parameter vector (normalized detuning and coupling deviations) concatenated to the state observation; and (iv) the precise encoding scheme and hyperparameter values used. These additions will allow independent reproduction and direct assessment of whether the robustness improvements are model-dependent.

revision: yes
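
To make the ingredients named in this simulated response concrete, here is a minimal NumPy sketch of a 128-128-64 ReLU policy with the context concatenated to the observation, plus a reward of the stated form. Everything not listed in the response is an assumption: the function names, the tanh output layer, the weight initialization, the observation and action sizes, and the coefficients `w_fid` and `w_energy` are all hypothetical.

```python
import numpy as np

def init_policy(obs_dim, ctx_dim, act_dim, rng, hidden=(128, 128, 64)):
    """Random weights for an MLP whose input is [observation, context]."""
    sizes = (obs_dim + ctx_dim, *hidden, act_dim)
    return [(rng.normal(0.0, np.sqrt(2.0 / m), (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def policy_forward(params, obs, ctx):
    """ReLU hidden layers; tanh output (an assumption) bounds actions to [-1, 1]."""
    x = np.concatenate([obs, ctx])
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return np.tanh(x @ W + b)

def reward(fidelity, pulses, dt, w_fid=1.0, w_energy=1e-3):
    """Weighted sum of negative infidelity and an L2 pulse-energy penalty.

    w_fid and w_energy are placeholder coefficients, not values from the paper.
    """
    infidelity = 1.0 - fidelity
    energy = np.sum(pulses ** 2) * dt
    return -w_fid * infidelity - w_energy * energy
```

A call such as `policy_forward(init_policy(4, 2, 8, np.random.default_rng(0)), np.zeros(4), np.array([0.5, -0.5]))` returns an 8-component action inside [-1, 1]. The sizes 4, 2, and 8 are arbitrary.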

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper describes a hybrid workflow: optimal control generates nominal pulses for a model, then contextual RL learns small residual corrections from independent ensembles of simulated parameter variations. No equations, fitted parameters, or predictions are shown that reduce by construction to the inputs (e.g., no self-definitional scaling or renaming of known results). The central claim rests on simulation outcomes for transfer robustness, which are externally falsifiable and not tautological. No load-bearing self-citations or uniqueness theorems imported from the authors' prior work appear in the provided text. This is the normal case of a methodological proposal whose results are not forced by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on standard quantum-control assumptions and the domain premise that static mismatches are learnable; no new free parameters or invented entities are introduced in the abstract.

axioms (2)

[standard math] Standard assumptions of quantum mechanics and optimal control theory hold for the nominal system model.
Used to generate the initial high-fidelity pulses.
[domain assumption] Static model mismatch can be represented by a low-dimensional set of parameter variations that an RL agent can learn to correct.
Central premise enabling the calibration stage.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Computational and physical complexity of synthesizing random multi-qudit quantum states and unitary operators
quant-ph 2026-05 unverdicted novelty 5.0

Computational complexity of random multi-qudit states and unitaries scales exponentially with qudit number, while physical complexity scales more slowly.

Reference graph

Works this paper leans on

51 extracted references · 42 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1]

Sample a device instance by drawing $(\delta\omega_1, \delta\omega_2, \delta g)$ from the noise distribution.
[2]

Construct the effective Hamiltonian $H_0(\lambda)$ with $\lambda = (\omega_1, \omega_2, \chi_1, \chi_2, g)$.
[3]

Compute the OCT baseline fidelity for this device, $F_{\mathrm{OCT}}(\delta\omega_1, \delta\omega_2, \delta g) = F_{\mathrm{avg}}\big(U[\epsilon_{\mathrm{OCT}}; \lambda],\, U_{\mathrm{CZ3}}\big)$. (12)
[4]

Provide the agent with a normalized context vector $o = \left(\frac{\delta\omega_1}{\sigma_\omega}, \frac{\delta\omega_2}{\sigma_\omega}, \frac{\delta g}{\sigma_g}\right)$ (13), which lies in a bounded subset of $\mathbb{R}^3$.
[5]

The agent outputs an action $a \in [-1, 1]^{2K}$, corresponding to scaled cosine coefficients for the two drives: $c_i = \alpha a_i$, $i = 1, 2$ (14), with a global coefficient scale $\alpha = 0.03$.
[6]

Build residual pulses via Eq. (8), form the total pulses $\epsilon_{\mathrm{tot}}$, propagate the system, and compute $F_{\mathrm{RL}}(\delta\omega_1, \delta\omega_2, \delta g) = F_{\mathrm{avg}}\big(U[\epsilon_{\mathrm{tot}}; \lambda],\, U_{\mathrm{CZ3}}\big)$. (15)
[7]

Return a scalar reward $r = F_{\mathrm{RL}} - F_{\mathrm{OCT}}$ (16), and terminate the episode. By construction, $r > 0$ if and only if the RL-corrected pulses outperform the OCT baseline on that particular device instance. This reward shaping makes the learning problem explicitly residual: the agent is incentivized to discover corrections that enhance robustness rather than to reproduc...
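
The single-step episode listed above can be sketched in NumPy as follows. This is a toy under stated assumptions, not the authors' implementation: the Hamiltonian form, the drive operators, the cosine basis standing in for Eq. (8), the clipped-Gaussian noise model, and every numerical value except α = 0.03 are hypothetical. `EPS_OCT` is a placeholder envelope rather than a GRAPE-optimized pulse, so the fidelities it produces carry no physical meaning.

```python
import numpy as np

D = 3                                        # levels per qutrit
a = np.diag(np.sqrt(np.arange(1, D)), 1)     # truncated lowering operator
I3, I9 = np.eye(D), np.eye(D * D)
A1, A2 = np.kron(a, I3), np.kron(I3, a)
N1, N2 = A1.conj().T @ A1, A2.conj().T @ A2
X1, X2 = A1 + A1.conj().T, A2 + A2.conj().T  # drive operators (assumed)

# Target: qutrit controlled-phase gate diag(w^(j*k)) with w = exp(2*pi*i/3).
levels = np.arange(D)
U_CZ3 = np.diag(np.exp(2j * np.pi * np.outer(levels, levels).ravel() / D))

# Hypothetical numbers in arbitrary units; only ALPHA = 0.03 comes from the text.
LAM_NOMINAL = np.array([5.0, 5.3, -0.3, -0.3, 0.02])  # (w1, w2, chi1, chi2, g)
SIGMA_W, SIGMA_G = 0.01, 0.002
T, N_STEPS, K, ALPHA = 20.0, 100, 4, 0.03
t = (np.arange(N_STEPS) + 0.5) * T / N_STEPS
EPS_OCT = 0.05 * np.sin(np.pi * t / T) * np.ones((2, 1))  # stand-in, not a GRAPE result

def hamiltonian(lam):
    """Assumed drift Hamiltonian H0(lambda) for two coupled anharmonic qutrits."""
    w1, w2, chi1, chi2, g = lam
    H = w1 * N1 + w2 * N2 + g * (A1.conj().T @ A2 + A2.conj().T @ A1)
    return H + 0.5 * chi1 * N1 @ (N1 - I9) + 0.5 * chi2 * N2 @ (N2 - I9)

def propagate(lam, pulses):
    """Piecewise-constant propagation; pulses has shape (2, N_STEPS)."""
    H0, dt = hamiltonian(lam), T / N_STEPS
    U = np.eye(D * D, dtype=complex)
    for e1, e2 in pulses.T:
        E, V = np.linalg.eigh(H0 + e1 * X1 + e2 * X2)
        U = (V * np.exp(-1j * E * dt)) @ V.conj().T @ U
    return U

def avg_fidelity(U, V):
    """Average gate fidelity between two unitaries of dimension d."""
    d = U.shape[0]
    return (abs(np.trace(V.conj().T @ U)) ** 2 + d) / (d * (d + 1))

def residual(c):
    """Cosine-series residual pulses (assumed stand-in for Eq. (8))."""
    k = np.arange(1, K + 1)[:, None]
    return c.reshape(2, K) @ np.cos(k * np.pi * t / T)

def episode(policy, rng):
    """One episode: returns (reward, F_OCT, F_RL) for a sampled device."""
    dw1, dw2 = np.clip(rng.normal(0, SIGMA_W, 2), -3 * SIGMA_W, 3 * SIGMA_W)
    dg = np.clip(rng.normal(0, SIGMA_G), -3 * SIGMA_G, 3 * SIGMA_G)
    lam = LAM_NOMINAL + np.array([dw1, dw2, 0.0, 0.0, dg])
    f_oct = avg_fidelity(propagate(lam, EPS_OCT), U_CZ3)
    o = np.array([dw1 / SIGMA_W, dw2 / SIGMA_W, dg / SIGMA_G])  # context vector
    action = np.clip(policy(o), -1.0, 1.0)                      # a in [-1, 1]^(2K)
    eps_tot = EPS_OCT + residual(ALPHA * action)
    f_rl = avg_fidelity(propagate(lam, eps_tot), U_CZ3)
    return f_rl - f_oct, f_oct, f_rl
```

Calling `episode(lambda o: np.zeros(2 * K), np.random.default_rng(0))` applies no correction and therefore yields zero reward, which illustrates why this reward measures only the improvement over the OCT baseline.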