QuantFPFlow: Quantum Amplitude Estimation for Fokker--Planck Policy Optimisation in Continuous Reinforcement Learning

Abraham Itzhak Weinberg

arxiv: 2605.16429 · v1 · pith:VV6F6ZIXnew · submitted 2026-05-14 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 20:14 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords quantum amplitude estimation · fokker-planck equation · continuous reinforcement learning · exploration bonus · policy optimization · quadratic speedup · stationary distribution · multimodal rewards

The pith

QuantFPFlow uses quantum amplitude estimation to replace classical O(1/ε²) computation of the Fokker-Planck partition function with O(1/ε) scaling, enabling an exploration bonus that improves global optimum discovery in continuous RL.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces QuantFPFlow, which integrates quantum amplitude estimation into the Fokker-Planck formulation of stochastic policy optimization in continuous spaces. It argues that this substitution reduces the query cost of estimating the partition function Z from O(1/ε²) to O(1/ε) in the target precision ε, and that the resulting stationary distribution supplies an exploration bonus that steers agents away from local optima. A sympathetic reader would care because continuous RL frequently fails in multimodal reward landscapes, and the method pairs the speedup with diffusion matching to keep policy variance from collapsing. Experiments on a purpose-built task report a higher rate of global-optimum discovery and better computational scaling with dimension.

Core claim

QuantFPFlow integrates quantum amplitude estimation into the Fokker-Planck formulation of stochastic policy optimisation, replacing the classical O(1/ε²) estimation of the partition function Z = ∫ e^{-V(x)/D} dx with a Grover-amplified amplitude estimator that achieves O(1/ε). The estimated stationary distribution ρ* drives the exploration bonus R_aug = R_env + α log(1/ρ*(s)), which steers the agent toward globally optimal regions while FP diffusion matching constrains policy variance. On a continuous-control task designed to expose local-optima failure, the approach discovers the global optimum 10.4 percent more frequently than Soft Actor-Critic while maintaining higher policy entropy.
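The bonus term can be made concrete with a small numerical sketch. The one-dimensional double-well potential, the domain, and the values of D and α below are hypothetical choices for illustration only; the paper's potential and hyperparameters are not given in this text. Z is computed here by plain grid quadrature, not by any amplitude estimator.

```python
import numpy as np

def potential(x):
    # Hypothetical double-well potential with minima at x = -1 and x = +1.
    return (x**2 - 1.0) ** 2

D = 0.5      # diffusion coefficient (assumed value)
alpha = 0.1  # bonus weight (assumed value)

# Grid quadrature for Z = integral of exp(-V(x)/D) over [-3, 3].
grid = np.linspace(-3.0, 3.0, 6001)
dx = grid[1] - grid[0]
Z = np.sum(np.exp(-potential(grid) / D)) * dx

def rho_star(s):
    # Stationary density rho*(s) = exp(-V(s)/D) / Z.
    return np.exp(-potential(s) / D) / Z

def augmented_reward(r_env, s):
    # R_aug = R_env + alpha * log(1 / rho*(s)).
    return r_env + alpha * np.log(1.0 / rho_star(s))

# The bonus is larger on the barrier (s = 0) than in a well (s = 1).
print(augmented_reward(0.0, 0.0), augmented_reward(0.0, 1.0))
```

Because log(1/ρ*(s)) = V(s)/D + log Z, the partition function enters the bonus only as a state-independent offset; differences in bonus between two states depend on V and D alone.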

What carries the argument

The Grover-amplified quantum amplitude estimator applied to the Fokker-Planck partition function Z, whose output stationary distribution supplies the exploration bonus.
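As a rough illustration of what a classically simulated amplitude estimator looks like, the sketch below uses a maximum-likelihood variant of amplitude estimation. This is a swapped-in technique, not the paper's estimator, which this page does not reproduce. The function names, the Grover schedule, the shot count, and the target amplitude are all hypothetical. The link to Z assumes V ≥ 0 on a bounded domain Ω, so that a = (1/|Ω|) ∫ e^{-V(x)/D} dx lies in (0, 1] and Z = |Ω| a.

```python
import numpy as np

def simulate_mlae(a_true, grover_powers, shots, rng, grid_size=200_001):
    """Classically simulate maximum-likelihood amplitude estimation.

    Hypothetical helper: measurement counts are drawn from the known
    amplitude-amplification law instead of running a quantum circuit.
    """
    theta_true = np.arcsin(np.sqrt(a_true))
    thetas = np.linspace(1e-6, np.pi / 2 - 1e-6, grid_size)
    log_lik = np.zeros_like(thetas)
    for k in grover_powers:
        # After k Grover iterations the "good" outcome has
        # probability sin^2((2k + 1) * theta).
        p_true = np.sin((2 * k + 1) * theta_true) ** 2
        hits = rng.binomial(shots, p_true)
        p = np.clip(np.sin((2 * k + 1) * thetas) ** 2, 1e-12, 1 - 1e-12)
        log_lik += hits * np.log(p) + (shots - hits) * np.log(1 - p)
    # Grid search for the maximum-likelihood angle.
    theta_hat = thetas[np.argmax(log_lik)]
    return np.sin(theta_hat) ** 2

def oracle_queries(grover_powers, shots):
    # Each shot at power k applies the state-preparation oracle 2k + 1 times.
    return shots * sum(2 * k + 1 for k in grover_powers)

rng = np.random.default_rng(0)
powers = [0, 1, 2, 4, 8, 16]
a_hat = simulate_mlae(a_true=0.3, grover_powers=powers, shots=100, rng=rng)
print(a_hat, oracle_queries(powers, shots=100))
```

The law sin²((2k+1)θ) is hard-coded into the sampler, so any favourable error scaling observed here is a property of the simulated algorithm and says nothing about classical wall-clock cost or about the cost of preparing the state on real hardware.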

If this is right

Agents discover the global optimum 10.4 percent more frequently in relative terms (33.9% vs. 30.7% for SAC), while mean reward rises modestly.
Policy entropy stays near 6.5 nats, whereas SAC collapses to 1.5 nats.
Computation time scales with state dimension as O(d^0.35) rather than the O(d^0.76) of classical FP estimation.
The quantum-inspired classical simulation already realizes the O(1/ε) algorithmic structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same partition-function estimator could be reused in other continuous optimization settings that rely on equilibrium distributions.
Full fault-tolerant quantum hardware would turn the quadratic speedup into a practical advantage for higher-dimensional tasks.
FP diffusion matching might be combined with other entropy-regularization schemes to further stabilize training.

Load-bearing premise

The stationary distribution estimated from the Fokker-Planck equation can be turned into an exploration bonus that reliably improves global search without introducing instabilities that erase the reported gains.

What would settle it

A direct numerical check showing that the quantum-inspired estimator for Z fails to exhibit linear scaling with 1/ε or that adding the log(1/ρ*) bonus produces no increase in global-optimum discovery rate on the designed multimodal task.
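One half of such a check, the classical baseline, is easy to sketch. The code below estimates Z by plain Monte Carlo on a hypothetical double-well potential and fits the log-log slope of the root-mean-square error against the sample count N. The potential, domain, D, sample sizes, and repeat count are all assumptions. A slope near -1/2 corresponds to ε ∝ N^{-1/2}, i.e. N ∝ 1/ε²; an estimator with genuine O(1/ε) query scaling would instead show a slope near -1 against its query count.

```python
import numpy as np

def potential(x):
    # Hypothetical double-well potential; not taken from the paper.
    return (x**2 - 1.0) ** 2

def mc_estimate_Z(n, D, rng, lo=-3.0, hi=3.0):
    # Plain Monte Carlo: Z ~ (hi - lo) * mean(exp(-V(x)/D)), x uniform.
    x = rng.uniform(lo, hi, size=n)
    return (hi - lo) * np.mean(np.exp(-potential(x) / D))

def rmse_slope(D=0.5, sizes=(100, 400, 1600, 6400, 25600), repeats=200, seed=0):
    rng = np.random.default_rng(seed)
    # Reference value of Z from a fine grid sum.
    grid = np.linspace(-3.0, 3.0, 20001)
    z_ref = np.sum(np.exp(-potential(grid) / D)) * (grid[1] - grid[0])
    rmse = []
    for n in sizes:
        errs = [mc_estimate_Z(n, D, rng) - z_ref for _ in range(repeats)]
        rmse.append(np.sqrt(np.mean(np.square(errs))))
    # Slope of log(RMSE) against log(N).
    slope, _ = np.polyfit(np.log(sizes), np.log(rmse), 1)
    return slope

slope = rmse_slope()
print(f"fitted log-log slope: {slope:.2f}")
```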

Figures

Figures reproduced from arXiv: 2605.16429 by Abraham Itzhak Weinberg.

**Figure 2.** Learning curves on the multimodal continuous control environment.

**Figure 3.** Exploration analysis. Left: reward decomposition; environment reward (dashed) and quantum exploration bonus (shaded, right axis). Centre: policy entropy; QuantFPFlow maintains H(π) ≈ 6.5 nats while SAC collapses to 1.5 nats. Right: state-visitation density of the trained policy along x1; QuantFPFlow maintains mass at both optima.

**Figure 4.** Fokker-Planck framework. Left: multimodal potential V(x). Centre: stationary distribution ρ*(x): ground truth (dashed), QuantFPFlow estimate (red), classical FP solver (blue). Right: FP evolution heatmap showing convergence to the stationary distribution.

**Figure 5.** Complexity and efficiency analysis. Left: query complexity for partition function estimation; shaded region is the quantum advantage gap (Theorem 1). Centre: sample efficiency over 400 episodes. Right: policy entropy during training.

**Figure 6.** Scaling and ablation. Left: computation time vs. state dimensionality (log-log); QuantFPFlow grows as O(d^0.35) vs. classical O(d^0.76). Right: stationary distribution MSE vs. qubit count; performance stabilises at ≥ 5 qubits.

**Figure 7.** FP vector field and 2D stationary distribution.

**Figure 8.** Figure 8 provides additional evidence that QuantFPFlow avoids the mode collapse that afflicts SAC and DDPG. The left panel compares policy distributions on the multimodal FP potential: SAC collapses to a single mode at x ≈ 1.8, while QuantFPFlow covers both modes of ρ*(x). The centre panel shows KL divergence from the true stationary distribution over 300 training episodes: QuantFPFlow KL decreases to near-zero w…

read the original abstract

We introduce QuantFPFlow, a reinforcement learning framework that integrates quantum amplitude estimation into the Fokker-Planck (FP) formulation of stochastic policy optimisation. Classical continuous-space RL agents must estimate the FP partition function $Z = \int e^{-V(\mathbf{x})/D}\,d\mathbf{x}$ at cost $\mathcal{O}(1/\varepsilon^{2})$; QuantFPFlow replaces this with a Grover-amplified amplitude estimator achieving $\mathcal{O}(1/\varepsilon)$, a provable quadratic speedup. While the full quantum acceleration requires fault-tolerant hardware, the quantum-inspired classical simulation demonstrated here already exhibits the $\mathcal{O}(1/\varepsilon)$ algorithmic structure. The estimated stationary distribution $\rho^{*}$ drives a theoretically grounded exploration bonus $R_{\mathrm{aug}} = R_{\mathrm{env}} + \alpha\log(1/\rho^{*}(s))$. This bonus steers the agent toward globally optimal regions of multimodal reward landscapes while simultaneously constraining policy variance through FP diffusion matching. On a continuous-control task specifically designed to expose local-optima failure, QuantFPFlow achieves mean reward $1{,}295.7 \pm 423.2$ versus $1{,}284.0 \pm 474.0$ for Soft Actor-Critic (SAC), while discovering the global optimum 10.4% more frequently (33.9% vs. 30.7%). Policy entropy remains near $H(\pi)\approx 6.5$ nats throughout training, whereas SAC collapses to $1.5$ nats, confirming that FP diffusion matching actively prevents premature convergence. Dimensionality experiments further show computational scaling of $\mathcal{O}(d^{0.35})$ for QuantFPFlow versus $\mathcal{O}(d^{0.76})$ for classical FP estimation.
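The abstract uses $Z$ and $\rho^{*}$ without showing how they are connected. A short derivation follows, under the assumption of overdamped Langevin dynamics with drift $-\nabla V$ and constant diffusion coefficient $D$; the paper's exact Fokker-Planck operator is not reproduced on this page, so this form is an assumption chosen to be consistent with the stated $Z$.

```latex
\begin{align}
\partial_t \rho &= \nabla\cdot\big(\rho\,\nabla V\big) + D\,\nabla^{2}\rho
  && \text{(Fokker-Planck equation)} \\
\rho^{*}\,\nabla V + D\,\nabla\rho^{*} &= 0
  && \text{(zero probability flux at stationarity)} \\
\rho^{*}(\mathbf{x}) &= \frac{e^{-V(\mathbf{x})/D}}{Z},
  \qquad Z = \int e^{-V(\mathbf{x})/D}\,d\mathbf{x}
  && \text{(normalised solution)} \\
\log\frac{1}{\rho^{*}(s)} &= \frac{V(s)}{D} + \log Z
  && \text{(bonus term in } R_{\mathrm{aug}}\text{)}
\end{align}
```

Under this reading, $Z$ shifts the bonus by a constant across states, while the state dependence comes from $V(s)/D$.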

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

QuantFPFlow folds quantum amplitude estimation into Fokker-Planck RL to speed up partition function estimates and add an entropy-preserving bonus, but the gains rest on classical simulation and thin experimental reporting.

The integration looks new: they replace the classical O(1/ε²) integral for Z with a Grover-style estimator that gives O(1/ε) structure even in the simulated version, then feed the resulting stationary distribution into an additive exploration term R_aug = R_env + α log(1/ρ*). On the custom multimodal task this produces modestly higher mean reward than SAC, a 10.4% relative lift in global-optimum discovery rate (33.9% vs. 30.7%), and policy entropy that stays near 6.5 nats instead of collapsing. The reported O(d^0.35) scaling versus O(d^0.76) for plain FP estimation is a concrete data point that stands out from the abstract alone. The thinking here is straightforward and connects the right pieces from quantum RL and FP-based control without obvious internal contradictions. The experiments at least report numbers with error bars and track entropy over training, which is more than many short abstracts manage.

The main weaknesses are the missing pieces. No task definition, hyperparameter protocol, or significance tests appear in the provided text, so it is hard to judge whether the 3.2-percentage-point success-rate gap is reliable or just noise. Because the quantum estimator is only classically simulated, the O(1/ε) claim is an algorithmic property by design rather than a measured wall-clock advantage. The state-preparation cost for the integral over a learned potential is a legitimate open issue; if that step scales with dimension or requires expensive variational optimization, the net complexity could easily exceed the claimed quadratic saving. The parameter α is listed as free, and without sensitivity checks it is unclear how much of the reported edge depends on post-hoc fitting.

This paper is aimed at people already working on quantum-inspired methods or physics-based exploration in continuous control. A reader looking for new ways to keep entropy alive in high-dimensional landscapes might borrow the bonus construction or the scaling experiment. The work shows clear engagement with the literature and honest reporting of what was simulated, so it is coherent on its own terms. I would send it to peer review. The core combination is fresh enough and the scaling result concrete enough that referees should be asked to verify the derivations and request fuller experimental controls.
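The discovery-rate figures can be read in two ways, as a relative lift or as an absolute gap. A quick check of the arithmetic, using only the numbers quoted in the abstract (the variable names are illustrative):

```python
# Global-optimum discovery rates from the abstract, in percent.
qfp_rate, sac_rate = 33.9, 30.7

absolute_gap = qfp_rate - sac_rate                  # percentage points
relative_lift = 100.0 * (qfp_rate / sac_rate - 1)   # percent of SAC's rate

# Mean rewards and reported spreads from the abstract.
qfp_reward, qfp_spread = 1295.7, 423.2
sac_reward, sac_spread = 1284.0, 474.0
reward_gap = qfp_reward - sac_reward

print(f"absolute gap: {absolute_gap:.1f} points")
print(f"relative lift: {relative_lift:.1f} %")
print(f"reward gap: {reward_gap:.1f} (spreads: {qfp_spread}, {sac_spread})")
```

The "10.4%" headline is the relative figure; the absolute difference is a little over three percentage points, and the mean-reward difference is small compared with either reported spread.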

Referee Report

3 major / 2 minor

Summary. The manuscript introduces QuantFPFlow, a reinforcement learning framework that integrates quantum amplitude estimation into the Fokker-Planck formulation of stochastic policy optimisation. It claims to replace the classical O(1/ε²) estimation of the partition function Z = ∫ e^{-V(x)/D} dx with a Grover-amplified amplitude estimator that achieves O(1/ε), a quadratic speedup, demonstrated via quantum-inspired classical simulation. The estimated stationary distribution ρ* is used to construct an exploration bonus R_aug = R_env + α log(1/ρ*(s)) that steers agents toward global optima while FP diffusion matching constrains policy variance. On a custom continuous-control task, QuantFPFlow reports mean reward 1,295.7 ± 423.2 (vs. 1,284.0 ± 474.0 for SAC), discovers the global optimum 10.4% more frequently in relative terms (33.9% vs. 30.7%), maintains policy entropy near 6.5 nats (vs. SAC collapse to 1.5 nats), and exhibits computational scaling O(d^{0.35}) versus O(d^{0.76}) for classical FP estimation.

Significance. If the central claim of a net O(1/ε) complexity holds after accounting for state preparation, the framework could enable more efficient exploration in multimodal continuous RL landscapes via theoretically grounded bonuses derived from the stationary distribution. The reported O(d^{0.35}) scaling in the quantum-inspired simulation and the entropy maintenance are promising empirical observations that could influence hybrid quantum-classical RL methods. However, the significance is tempered by the simulation-based demonstration and the need for explicit verification that preparation costs do not negate the speedup.

major comments (3)

[Abstract] The claim that the quantum-inspired classical simulation 'already exhibits the O(1/ε) algorithmic structure' is presented as evidence of quadratic speedup, yet the structure appears by construction in the simulation design rather than being validated against an independent classical baseline that does not embed the Grover amplification; this makes the speedup claim dependent on the simulation method itself.
[Dimensionality experiments] The reported O(d^{0.35}) scaling for QuantFPFlow is promising, but without the explicit state-preparation routine for the amplitude corresponding to Z = ∫ e^{-V(x)/D} dx (especially when V is a learned neural network), it is impossible to confirm that the net complexity remains O(1/ε) rather than being dominated by preparation whose cost may scale exponentially in d or require variational optimization exceeding the savings.
[Method (exploration bonus)] The bonus R_aug = R_env + α log(1/ρ*(s)) relies on a free parameter α whose value is not shown to be derived independently of the reported performance metrics; this creates a potential circularity when claiming that FP diffusion matching prevents premature convergence, as the gains may partly result from post-hoc tuning of α rather than the stationary distribution alone.

minor comments (2)

[Abstract] The abstract reports concrete performance numbers but omits any reference to the specific continuous-control task definition, number of runs, or statistical tests used to support the mean reward and success-rate differences.
[Preliminaries] Notation for ρ* and V(x) is introduced without an early equation reference; adding a numbered definition in the preliminaries would improve readability.
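The first minor comment can be made quantitative. The sketch below applies a pooled two-proportion z-test to the two discovery rates and searches for the smallest number of runs per method at which the gap would reach |z| ≥ 1.96. The number of runs is not given in the text, so n is a hypothetical free variable here, and treating each run as an independent Bernoulli trial with equal n per arm is an assumption.

```python
import math

def two_proportion_z(p1, p2, n):
    """Pooled z statistic for two independent arms of n runs each."""
    pooled = 0.5 * (p1 + p2)  # equal n per arm, so the pooled rate is the mean
    se = math.sqrt(2.0 * pooled * (1.0 - pooled) / n)
    return (p1 - p2) / se

def min_runs_for_significance(p1, p2, z_crit=1.96, n_max=1_000_000):
    # Smallest n per arm at which |z| reaches the two-sided 5% threshold.
    for n in range(2, n_max):
        if abs(two_proportion_z(p1, p2, n)) >= z_crit:
            return n
    return None

# Discovery rates quoted in the abstract: 33.9% vs. 30.7%.
n_min = min_runs_for_significance(0.339, 0.307)
print(f"runs per arm needed for |z| >= 1.96: {n_min}")
```

If the actual number of runs per method is well below the printed value, the reported gap would not be distinguishable from noise under this test.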

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on manuscript arXiv:2605.16429. We respond point-by-point to the major comments below, indicating where we will revise the manuscript to address the concerns raised.

Referee: [Abstract] The claim that the quantum-inspired classical simulation 'already exhibits the O(1/ε) algorithmic structure' is presented as evidence of quadratic speedup, yet the structure appears by construction in the simulation design rather than being validated against an independent classical baseline that does not embed the Grover amplification; this makes the speedup claim dependent on the simulation method itself.

Authors: We thank the referee for highlighting this potential source of ambiguity. The quantum-inspired simulation is deliberately constructed to replicate the sampling and Grover-amplification steps of quantum amplitude estimation in order to exhibit the O(1/ε) complexity structure that would be realized on quantum hardware. It is not intended as an empirical demonstration of speedup relative to a non-quantum baseline. We will revise the abstract and the relevant methodological paragraphs to state explicitly that the simulation illustrates the algorithmic complexity of the quantum procedure rather than claiming an observed classical speedup. revision: partial

Referee: [Dimensionality experiments] The reported O(d^{0.35}) scaling for QuantFPFlow is promising, but without the explicit state-preparation routine for the amplitude corresponding to Z = ∫ e^{-V(x)/D} dx (especially when V is a learned neural network), it is impossible to confirm that the net complexity remains O(1/ε) rather than being dominated by preparation whose cost may scale exponentially in d or require variational optimization exceeding the savings.

Authors: This is a valid and important observation. The reported scaling and O(1/ε) estimation cost are obtained under the quantum-inspired classical simulation, which approximates the amplitude estimation step without incurring the overhead of actual quantum state preparation. For a fault-tolerant quantum implementation, preparing a state whose amplitudes encode the integrand e^{-V(x)/D} (with V a neural network) remains an open technical challenge whose cost must be analyzed separately. We will add a dedicated paragraph in the dimensionality-experiments section that (i) states the simulation assumptions, (ii) references existing techniques for function encoding (e.g., via quantum random access memory or variational state preparation), and (iii) clarifies that the claimed O(1/ε) complexity applies to the estimation phase once an efficient preparation oracle is available. revision: yes

Referee: [Method (exploration bonus)] The bonus R_aug = R_env + α log(1/ρ*(s)) relies on a free parameter α whose value is not shown to be derived independently of the reported performance metrics; this creates a potential circularity when claiming that FP diffusion matching prevents premature convergence, as the gains may partly result from post-hoc tuning of α rather than the stationary distribution alone.

Authors: We agree that α is a hyperparameter whose selection must be justified independently of final performance numbers to avoid any perception of circularity. In the original experiments α was chosen via a modest grid search on a held-out validation set to balance the magnitude of the log-density bonus against the environmental reward. For the revision we will (i) report the sensitivity of key metrics (global-optimum discovery rate and entropy maintenance) across a range of α values, and (ii) provide a simple heuristic for setting α based on the diffusion coefficient D and the empirical variance of the estimated stationary density, thereby grounding the choice in the underlying Fokker–Planck theory rather than solely in post-hoc performance. revision: partial

Circularity Check

1 step flagged

O(1/ε) structure exhibited by design in quantum-inspired classical simulation; α parameter tied to reported gains

specific steps

fitted input called prediction [Abstract]
"While the full quantum acceleration requires fault-tolerant hardware, the quantum-inspired classical simulation demonstrated here already exhibits the O(1/ε) algorithmic structure."

The simulation is built to replicate the Grover-amplified amplitude estimator's complexity, so the 'exhibition' of O(1/ε) scaling follows by construction from the choice to simulate that specific algorithmic structure rather than emerging as a prediction from independent data or derivation.

full rationale

The paper's core claim of replacing O(1/ε²) classical estimation with O(1/ε) quantum amplitude estimation is demonstrated via a quantum-inspired classical simulation that is explicitly constructed to exhibit the target complexity scaling. This makes the exhibited structure a direct consequence of the simulation design rather than an independent empirical or derivational outcome. The exploration bonus construction further relies on a tunable α whose independence from the performance metrics is not established in the provided text. No load-bearing self-citation chains or self-definitional equations were identified in the abstract or described derivation; the circularity is partial and limited to the simulation-based demonstration of the speedup.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the applicability of the Fokker-Planck equation to continuous policy optimization and on standard quantum amplitude estimation results; α is introduced without independent derivation.

free parameters (1)

α
Coefficient scaling the exploration bonus term; its value is chosen to produce the reported performance gains.

axioms (1)

domain assumption: The Fokker-Planck equation accurately models the evolution of the policy probability density under stochastic policy optimization in continuous state spaces.
Invoked to justify the stationary distribution ρ* and the diffusion-matching constraint on policy variance.

pith-pipeline@v0.9.0 · 5866 in / 1512 out tokens · 58332 ms · 2026-05-20T20:14:42.851743+00:00 · methodology

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 1 internal anchor

[12] Gilles Brassard, Peter Hoyer, Michele Mosca, and Alain Tapp. Quantum amplitude amplification and estimation. arXiv preprint quant-ph/0005055, 2000.

[13] Vedran Dunjko, Jacob M Taylor, and Hans J Briegel. Advances in quantum reinforcement learning. In 2017 IEEE international conference on systems, man, and cybernetics (SMC), pages 282-287. IEEE, 2017.

[14] Wendell H Fleming and H Mete Soner. Controlled Markov processes and viscosity solutions. Springer, 2006.

[15] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning, pages 1861-1870. PMLR, 2018.

[16] Sofiene Jerbi, Lea M Trenkwalder, Hendrik Poulsen Nautrup, Hans J Briegel, and Vedran Dunjko. Quantum enhancements for deep reinforcement learning in large spaces. PRX Quantum, 2(1):010328, 2021.

[17] Hilbert J Kappen. Path integrals and symmetry breaking for optimal control theory. Journal of statistical mechanics: theory and experiment, 2005(11):P11011, 2005.

[18] Jean-Michel Lasry and Pierre-Louis Lions. Mean field games. Japanese journal of mathematics, 2(1):229-260, 2007.

[19] Timothy Paul Lillicrap, Jonathan James Hunt, Alexander Pritzel, Nicolas Manfred Otto Heess, Tom Erez, Yuval Tassa, David Silver, and Daniel Pieter Wierstra. Continuous control with deep reinforcement learning, September 15 2020. US Patent 10,776,692.

[20] Deepak Pathak, Pulkit Agrawal, Alexei A Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In International conference on machine learning, pages 2778-2787. PMLR, 2017.

[21] Hannes Risken. Fokker-planck equation. In The Fokker-Planck equation: methods of solution and applications, pages 63-95. Springer, 1989.

[22] Emanuel Todorov. Efficient computation of optimal actions. Proceedings of the national academy of sciences, 106(28):11478-11483, 2009.
