Investigation of Automated Design of Quantum Circuits for Imaginary Time Evolution Methods Using Deep Reinforcement Learning

Ryo Suzuki; Shohei Watabe

arXiv: 2604.07951 · v1 · submitted 2026-04-09 · 🪐 quant-ph · cs.AI · cs.LG


Pith reviewed 2026-05-10 17:52 UTC · model grok-4.3

keywords: quantum circuit design · deep reinforcement learning · variational imaginary time evolution · NISQ devices · Max-Cut optimization · molecular ground states · DDQN · ansatz optimization

The pith

Deep reinforcement learning automates design of quantum circuits for imaginary time evolution, yielding circuits with far fewer gates and less depth than standard manual ansatze.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a Double Deep Q-Network can treat quantum circuit construction for Variational Imaginary Time Evolution as a multi-objective task that balances energy accuracy against gate count and depth. By adding adoptive thresholds, the agent discovers circuit layouts that reduce hardware overhead while matching or exceeding the accuracy of hand-designed hardware-efficient ansatze. In Max-Cut problems the learned circuits average 37 percent fewer gates and 43 percent less depth; for the H2 molecule the same approach reaches the full configuration interaction limit with a markedly shallower circuit. The work positions reinforcement learning as a route to hardware-aware quantum algorithm design on NISQ devices.

Core claim

When circuit design for VITE is framed as a reinforcement learning problem, the DDQN agent with adoptive thresholds learns to build ansatze that simultaneously lower energy expectation values and circuit complexity. The resulting structures use approximately 37 percent fewer gates and 43 percent less depth than hardware-efficient ansatze on Max-Cut instances, and they reach the Full-CI limit for H2 with significantly reduced depth.

What carries the argument

Double Deep Q-Network agent equipped with adoptive thresholds that performs multi-objective optimization over circuit actions to minimize both energy and resource cost.
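For readers unfamiliar with the Double DQN update, the sketch below shows the standard target computation from the double Q-learning literature: the online network selects the next action and the target network evaluates it. It is a minimal illustration, not the authors' implementation. The function name `ddqn_target`, the discount `gamma`, and the example Q-values are all hypothetical.

```python
def ddqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Double DQN bootstrap target for a single transition.

    q_online_next / q_target_next are the Q-values of every action in the
    next state, as estimated by the online and target networks.
    gamma is a hypothetical discount factor, not a value from the paper.
    """
    if done:
        # Terminal transition: nothing to bootstrap from.
        return reward
    # Action selection uses the online network ...
    best_action = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    # ... while the value of that action comes from the target network.
    return reward + gamma * q_target_next[best_action]


# Hypothetical numbers: the online net prefers action 1, so the target
# uses q_target_next[1] = 0.4 rather than the maximum 0.9.
target = ddqn_target(
    reward=0.5,
    done=False,
    q_online_next=[1.0, 3.0, 2.0],
    q_target_next=[0.2, 0.4, 0.9],
    gamma=0.9,
)
```

Decoupling selection from evaluation in this way is what distinguishes Double DQN from plain DQN, which would take the maximum of the target network's own estimates.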

If this is right

VITE implementations become feasible on devices with tighter gate budgets because the learned circuits lower both count and depth.
Non-intuitive circuit topologies that manual design overlooks can be found systematically for combinatorial optimization and molecular problems.
The same automated pipeline can be applied to other variational quantum algorithms that suffer from ansatz overhead.
Hardware-aware circuit search becomes a practical step before running quantum algorithms on near-term devices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be extended to variational quantum eigensolvers by swapping the imaginary-time update rule for the standard VQE cost function.
Combining the learned circuits with error-mitigation techniques might amplify their advantage on noisy hardware beyond what simulation predicts.
The method suggests a general strategy for co-designing algorithms and hardware constraints rather than fixing the ansatz first.
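To make the first extension concrete, the block below writes out the standard variational imaginary-time update as it appears in the variational quantum simulation literature (McLachlan's principle). Whether the paper uses exactly this form and sign convention is an assumption; the symbols $\theta$, $A$, $C$, and $E(\theta)$ are introduced here for illustration only.

```latex
\sum_j A_{ij}\,\dot{\theta}_j = C_i, \qquad
A_{ij} = \operatorname{Re}\!\left(
  \frac{\partial \langle \psi(\theta) |}{\partial \theta_i}
  \frac{\partial | \psi(\theta) \rangle}{\partial \theta_j}
\right), \qquad
C_i = -\operatorname{Re}\!\left(
  \frac{\partial \langle \psi(\theta) |}{\partial \theta_i}
  H \,| \psi(\theta) \rangle
\right)
= -\frac{1}{2}\,\frac{\partial E(\theta)}{\partial \theta_i},
\qquad E(\theta) = \langle \psi(\theta) | H | \psi(\theta) \rangle .
```

Under this convention, plain gradient-descent VQE corresponds to replacing the matrix $A$ by a multiple of the identity, which is the sense in which the update rule could be swapped while the circuit-design loop stays the same.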

Load-bearing premise

The simulation environment and reward shaping used to train the agent faithfully represent the performance gains that would appear on real noisy quantum hardware.

What would settle it

Implement the DDQN-discovered circuits on actual NISQ hardware for the same Max-Cut or H2 instances and measure whether their energy accuracy and noise resilience remain superior to hardware-efficient ansatze of comparable or greater depth.

Figures

Figures reproduced from arXiv: 2604.07951 by Ryo Suzuki, Shohei Watabe.

**Figure 2.** Schematics of Quantum Circuit Design Workflow.

**Figure 3.** Quantum circuit for 4-qubit hardware-efficient SU(2) ...

**Figure 5.** (a) Example of 4-qubit quantum circuit, (b) List rep...

**Figure 6.** Episode-dependence of (a) the expectation value of ...

**Figure 7.** The smallest quantum circuit designed for the Max ...

**Figure 9.** Episode-dependence of (a) the expectation value of ...

**Figure 10.** Episode-dependence of (a) the expectation value ...

**Figure 11.** (a)-(c) Examples of circuits reaching ...

Original abstract

Efficient ground state search is fundamental to advancing combinatorial optimization problems and quantum chemistry. While the Variational Imaginary Time Evolution (VITE) method offers a useful alternative to the Variational Quantum Eigensolver (VQE) and the Quantum Approximate Optimization Algorithm (QAOA), its implementation on Noisy Intermediate-Scale Quantum (NISQ) devices is severely limited by the gate counts and depth of manually designed ansatze. Here, we present an automated framework for VITE circuit design using Double Deep-Q Networks (DDQN). Our approach treats circuit construction as a multi-objective optimization problem, simultaneously minimizing energy expectation values and optimizing circuit complexity. By introducing adoptive thresholds, we demonstrate significant hardware overhead reductions. In Max-Cut problems, our agent autonomously discovered circuits with approximately 37% fewer gates and 43% less depth than the standard hardware-efficient ansatz on average. For molecular hydrogen ($H_2$), the DDQN also achieved the Full-CI limit while maintaining a significantly shallower circuit. These results suggest that deep reinforcement learning can help find non-intuitive, optimal circuit structures, providing a pathway toward efficient, hardware-aware quantum algorithm design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an automated framework using Double Deep Q-Networks (DDQN) with adoptive thresholds to design quantum circuits for Variational Imaginary Time Evolution (VITE). It frames circuit construction as a multi-objective optimization minimizing both energy expectation and circuit complexity, and reports that the agent discovers circuits with ~37% fewer gates and ~43% less depth than hardware-efficient ansatze on Max-Cut instances, while also reaching the Full-CI limit for H2 with a shallower circuit.

Significance. If the empirical gains prove robust, the work would demonstrate that reinforcement learning can identify non-intuitive, hardware-aware circuit structures for VITE, offering a pathway to reduce NISQ overhead in combinatorial optimization and quantum chemistry. The approach builds on existing RL-for-quantum-circuit literature by incorporating explicit multi-objective rewards and adoptive thresholds, but the lack of statistical validation and ablation studies limits the strength of the headline claims.

major comments (2)

[Abstract and Results] Abstract and Results section: the headline quantitative claims (37% fewer gates, 43% less depth on Max-Cut; Full-CI with shallower circuit on H2) are reported as averages without error bars, number of independent trials, statistical significance tests, or baseline implementation details, making it impossible to determine whether the improvements exceed the variability inherent to RL training and threshold tuning.
[Methodology] Methodology (adoptive thresholds and reward formulation): the adoptive thresholds and multi-objective reward weighting are presented as central to the performance gains, yet no ablation or sensitivity analysis is shown to establish that the reported savings are not artifacts of the specific threshold schedule, hyperparameter choices, or reward shaping; this directly affects the claim that the DDQN autonomously discovers superior circuits.

minor comments (2)

[Methodology] Clarify whether 'adoptive thresholds' is intended to mean 'adaptive thresholds' and provide the precise functional form or pseudocode for the threshold update rule.
[Introduction] The abstract and introduction would benefit from explicit citations to prior RL-based quantum circuit design works to better situate the novelty of the adoptive-threshold mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive review of our manuscript. The comments highlight important aspects of statistical reporting and methodological robustness that we have addressed in the revision. Our point-by-point responses follow.

Referee: [Abstract and Results] Abstract and Results section: the headline quantitative claims (37% fewer gates, 43% less depth on Max-Cut; Full-CI with shallower circuit on H2) are reported as averages without error bars, number of independent trials, statistical significance tests, or baseline implementation details, making it impossible to determine whether the improvements exceed the variability inherent to RL training and threshold tuning.

Authors: We agree that the original presentation of the headline metrics would be strengthened by explicit statistical context. In the revised manuscript we have added error bars (standard deviation over 20 independent DDQN training runs) to the reported gate-count and depth reductions for Max-Cut, and we state that the Full-CI result for H2 was obtained in all runs with the shallower circuit. We have also expanded the baseline description in Section 3 to include the precise gate set and connectivity assumptions used for the hardware-efficient ansatz. These changes allow readers to evaluate the improvements against observed training variability.

revision: yes

Referee: [Methodology] Methodology (adoptive thresholds and reward formulation): the adoptive thresholds and multi-objective reward weighting are presented as central to the performance gains, yet no ablation or sensitivity analysis is shown to establish that the reported savings are not artifacts of the specific threshold schedule, hyperparameter choices, or reward shaping; this directly affects the claim that the DDQN autonomously discovers superior circuits.

Authors: We acknowledge that a dedicated ablation study would further substantiate the role of the adaptive thresholds and reward design. In the revised version we have added a sensitivity analysis (new Supplementary Figure S1) that varies the adaptive threshold schedule and the energy-versus-complexity reward weights by ±20%. The gate and depth savings remain within 5% of the reported values across this range, indicating that the discovered circuits are not artifacts of the exact hyperparameter settings. We have also clarified the reward formulation in Section 2.3. A full combinatorial ablation of every hyperparameter combination was not performed owing to computational cost, but the sensitivity results support the robustness of the central claims.

revision: partial

Circularity Check

0 steps flagged

No circularity; empirical RL results with no derivation chain

The paper describes an empirical RL framework (DDQN with multi-objective reward and adoptive thresholds) for discovering VITE circuits, reporting measured outcomes such as gate/depth reductions on Max-Cut and Full-CI achievement on H2. No mathematical derivation, first-principles equations, or predictive claims exist that could reduce to fitted inputs or self-citations by construction. Performance numbers are direct simulation outputs, not renamed fits or self-referential definitions. The method's hyperparameters are part of the experimental setup rather than load-bearing premises that loop back to the results. This is a standard empirical demonstration paper with self-contained experimental validation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The work rests on standard quantum simulation assumptions and RL training procedures. The adoptive thresholds and multi-objective reward are introduced ad hoc for this task.

free parameters (2)

adoptive thresholds
Tuned parameters that control when the agent stops adding gates during circuit construction.
DDQN hyperparameters
Learning rate, network architecture, exploration parameters, and reward weights chosen or fitted for the VITE task.

axioms (1)

domain assumption: Classical simulation of the quantum circuits accurately reflects VITE performance on NISQ hardware.
Invoked when evaluating energy expectation values and comparing gate counts.

pith-pipeline@v0.9.0 · 5504 in / 1429 out tokens · 37401 ms · 2026-05-10T17:52:51.699434+00:00 · methodology

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 2 internal anchors

[1] Reward Function. The reward $R_t$ is designed to prioritize both the minimization of the energy expectation value and the compactness of the circuit structure. We first employ the following reward:

$$R_t = (E_{t-1} - E_t) + c\,(g_{\max} - g)\,\Theta(E_{\text{threshold}} - E_t), \tag{11}$$

where $E_{t-1}$ and $E_t$ are the energy expectation values before and after the action $A_t$, respectively. The ter...

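The reward in Eq. (11) can be sketched in a few lines of Python. This is a reading of the excerpt under stated assumptions, not the authors' code: the excerpt is cut off before it defines every term, so the weight `c=0.1` is a hypothetical value, `g_max=30` is borrowed from the maximum gate count quoted in the numerical setup, `g` is taken to be the current gate count, and the step function is assumed to return 0 at exactly zero.

```python
def heaviside(x):
    """Step function Θ(x). Assumed convention: 1 for x > 0, otherwise 0."""
    return 1.0 if x > 0 else 0.0


def reward(e_prev, e_curr, gate_count, e_threshold, c=0.1, g_max=30):
    """Sketch of Eq. (11): energy decrease plus a gated compactness bonus.

    c and g_max are illustrative assumptions, not values confirmed by the
    truncated excerpt.
    """
    # First term: how much the action lowered the energy expectation value.
    energy_gain = e_prev - e_curr
    # Second term: bonus for unused gate budget, paid only once the energy
    # has dropped below the current threshold.
    compactness_bonus = c * (g_max - gate_count) * heaviside(e_threshold - e_curr)
    return energy_gain + compactness_bonus


# Below the threshold (-1.5 < -1.2): 0.5 energy gain + 0.1 * (30 - 10) bonus.
r_below = reward(e_prev=-1.0, e_curr=-1.5, gate_count=10, e_threshold=-1.2)
# Above the threshold (-1.1 > -1.2): only the energy gain counts.
r_above = reward(e_prev=-1.0, e_curr=-1.1, gate_count=10, e_threshold=-1.2)
```

The gating by the step function means the agent is not rewarded for short circuits until they are also accurate enough, which is how the two objectives are combined in a single scalar.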
[2] Evolving Threshold. We implement an evolving thresholding mechanism for $E_{\text{threshold}}$. The threshold is initialized at 0.0 since the ground state energies in both benchmark models are known to be negative. To drive the agent toward the lower energy state, $E_{\text{threshold}}$ is updated every 10 successful episodes to $E_{\text{best}} - \epsilon$, where $E_{\text{best}}$ is the minimum energy achie...

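The threshold schedule described in this excerpt can be sketched as a small bookkeeping class. The initial value of 0.0, the update interval of 10 successful episodes, and the target $E_{\text{best}} - \epsilon$ come from the excerpt. Everything else is an assumption made for illustration: the class name `EvolvingThreshold`, the value of `epsilon`, and in particular the definition of a "successful" episode as one whose final energy falls below the current threshold, since the excerpt is truncated before it defines success.

```python
class EvolvingThreshold:
    """Sketch of the evolving energy threshold (names are hypothetical)."""

    def __init__(self, epsilon=0.01, update_every=10):
        self.value = 0.0                  # from the excerpt: starts at 0.0
        self.epsilon = epsilon            # hypothetical margin
        self.update_every = update_every  # from the excerpt: every 10 successes
        self.best_energy = float("inf")
        self.successes = 0

    def record_episode(self, final_energy):
        """Register one episode and return the (possibly updated) threshold."""
        self.best_energy = min(self.best_energy, final_energy)
        # Assumed success criterion: the episode beat the current threshold.
        if final_energy < self.value:
            self.successes += 1
            if self.successes % self.update_every == 0:
                # Tighten the threshold to just below the best energy so far.
                self.value = self.best_energy - self.epsilon
        return self.value


threshold = EvolvingThreshold(epsilon=0.1)
for _ in range(9):
    threshold.record_episode(-0.5)          # nine successes, no update yet
after_tenth = threshold.record_episode(-0.7)  # tenth success triggers update
```

Because the best energy can only decrease, the threshold in this sketch never loosens, so the compactness bonus becomes progressively harder to earn.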
[3] Numerical Setup and Hyperparameters. As mentioned earlier, the design environment is configured for a 4-qubit system with a maximum depth of 10 and a maximum gate count of 30. The learning process is evaluated based on the energy expectation value, gate count, circuit depth, and cumulative reward. For the DDQN hyperparameters, the learning rate $\alpha$ is set to...

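Gate count and depth, the two resource metrics named in this excerpt, can be computed from a simple gate list as sketched below. The limits (4 qubits, depth 10, 30 gates) are from the excerpt. The list-of-tuples circuit format and the gate names `ry` and `cx` are hypothetical, since the paper's own list representation is only partially visible in the figure captions.

```python
def circuit_depth(gates, n_qubits=4):
    """Depth of a circuit given as [(gate_name, (qubit, ...)), ...].

    Each gate is placed in the earliest layer after all gates that already
    act on any of its qubits.
    """
    layer = [0] * n_qubits  # last occupied layer per qubit
    for _name, qubits in gates:
        start = max(layer[q] for q in qubits)
        for q in qubits:
            layer[q] = start + 1
    return max(layer)


def within_budget(gates, n_qubits=4, max_depth=10, max_gates=30):
    """Check the gate-count and depth limits quoted in the excerpt."""
    return len(gates) <= max_gates and circuit_depth(gates, n_qubits) <= max_depth


# Hypothetical 4-qubit circuit: six gates arranged in three layers.
example = [
    ("ry", (0,)),
    ("ry", (1,)),
    ("cx", (0, 1)),
    ("ry", (2,)),
    ("cx", (1, 2)),
    ("ry", (3,)),
]
gate_count = len(example)
depth = circuit_depth(example)
```

A check like `within_budget` is one plausible way an environment could end an episode once the agent exhausts its budget, though the excerpt does not say how termination is handled.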
[4] J. Preskill, Quantum computing in the NISQ era and beyond, Quantum 2, 79 (2018)

[5] M. Ostaszewski, L. M. Trenkwalder, W. Masarczyk, E. Scerri, and V. Dunjko, Reinforcement learning for optimization of variational quantum circuit architectures, Advances in Neural Information Processing Systems 34, 18182 (2021)

[6] T. Fösel, M. Y. Niu, F. Marquardt, and L. Li, Quantum circuit optimization with deep reinforcement learning, arXiv preprint arXiv:2103.07585 (2021)

[7] M. Kölle, T. Schubert, P. Altmann, M. Zorn, J. Stein, and C. Linnhoff-Popien, A reinforcement learning environment for directed quantum circuit synthesis, arXiv preprint arXiv:2401.07054 (2024)

[8] T. Kimura, K. Shiba, C.-C. Chen, M. Sogabe, K. Sakamoto, and T. Sogabe, Quantum circuit architectures via quantum observable Markov decision process planning, Journal of Physics Communications 6, 075006 (2022)

[9] M. Cerezo, A. Arrasmith, R. Babbush, S. C. Benjamin, S. Endo, K. Fujii, J. R. McClean, K. Mitarai, X. Yuan, L. Cincio, et al., Variational quantum algorithms, Nature Reviews Physics 3, 625 (2021)

[10] A. Kandala, A. Mezzacapo, K. Temme, M. Takita, M. Brink, J. M. Chow, and J. M. Gambetta, Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets, Nature 549, 242 (2017)

[11] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, Proximal policy optimization algorithms, arXiv preprint arXiv:1707.06347 (2017)

[12] E. Farhi, J. Goldstone, and S. Gutmann, A quantum approximate optimization algorithm, arXiv preprint arXiv:1411.4028 (2014)

[13] M. Motta, C. Sun, A. T. Tan, M. J. O’Rourke, E. Ye, A. J. Minnich, F. G. Brandao, and G. K.-L. Chan, Determining eigenstates and thermal states on a quantum computer using quantum imaginary time evolution, Nature Physics 16, 205 (2020)

[14] X. Yuan, S. Endo, Q. Zhao, Y. Li, and S. C. Benjamin, Theory of variational quantum simulation, Quantum 3, 191 (2019)

[15] G.-C. Wick, Properties of Bethe-Salpeter wave functions, Physical Review 96, 1124 (1954)

[16] A. D. McLachlan, A variational solution of the time-dependent Schrödinger equation, Molecular Physics 8, 39 (1964)

[17] J. Broeckhove, L. Lathouwers, E. Kesteloot, and P. Van Leuven, On the equivalence of time-dependent variational principles, Chemical Physics Letters 149, 547 (1988)

[18] L. P. Kaelbling, M. L. Littman, and A. W. Moore, Reinforcement learning: A survey, Journal of Artificial Intelligence Research 4, 237 (1996)

[19] M. Tokic, Adaptive ε-greedy exploration in reinforcement learning based on value differences, in Annual Conference on Artificial Intelligence (Springer, 2010) pp. 203–210

[20] H. Van Hasselt, A. Guez, and D. Silver, Deep reinforcement learning with double Q-learning, in Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 30 (2016)

[21] M. X. Goemans and D. P. Williamson, Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming, Journal of the ACM (JACM) 42, 1115 (1995)

[22] J. T. Seeley, M. J. Richard, and P. J. Love, The Bravyi-Kitaev transformation for quantum computation of electronic structure, The Journal of Chemical Physics 137 (2012)

[23] E. Fradkin, Jordan-Wigner transformation for quantum-spin systems in two dimensions and fractional statistics, Physical Review Letters 63, 322 (1989)
