pith. machine review for the scientific record.

arxiv: 2604.05398 · v1 · submitted 2026-04-07 · 🧮 math.OC · cs.LG

Recognition: 2 theorem links · Lean Theorem

An Actor-Critic Framework for Continuous-Time Jump-Diffusion Controls with Normalizing Flows

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:40 UTC · model grok-4.3

classification: 🧮 math.OC · cs.LG
keywords: actor-critic · jump-diffusion · normalizing flows · continuous-time control · stochastic games · policy gradient · entropy regularization · portfolio optimization

The pith

Actor-critic method with a time-inhomogeneous q-function and normalizing flows solves jump-diffusion control problems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an actor-critic approach as a mesh-free solver for entropy-regularized stochastic control problems involving time-dependent jump-diffusion dynamics. It derives a policy gradient from a time-inhomogeneous little q-function and an occupation measure, then uses conditional normalizing flows to parameterize flexible stochastic policies with exact likelihoods. This targets settings like portfolio optimization and multi-agent games where explicit time dependence and discontinuous jumps make classical methods impractical. A reader would care because these dynamics model real financial and economic shocks, and the framework aims to deliver stable learning and accurate policies that scale with dimension.

Core claim

The central claim is that the actor-critic framework built on a time-inhomogeneous little q-function and occupation measure yields a policy-gradient representation that accommodates time-dependent drift, volatility, and jump terms, while conditional normalizing flows parameterize the actor to enable expressive non-Gaussian policies with exact likelihood evaluation for entropy regularization and optimization.
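To fix ideas, the problem class behind this claim can be written schematically as follows; the notation is ours for orientation and may not match the paper's exact formulation.

\[
dX_s = b(s, X_s, a_s)\,ds + \sigma(s, X_s, a_s)\,dW_s + \int_{\mathbb{R}^d_0} \gamma(s, X_{s-}, a_{s-}, \zeta)\,\widetilde{N}(ds, d\zeta),
\]
\[
J(\pi) = \mathbb{E}\Big[\int_0^T \Big( r(s, X_s, a_s) + \lambda\,\mathcal{H}\big(\pi(\cdot \mid s, X_s)\big) \Big)\,ds + g(X_T)\Big],
\]

where \(\widetilde{N}\) is a compensated Poisson random measure, the coefficients \(b\), \(\sigma\), and \(\gamma\) carry the explicit time dependence, \(\mathcal{H}\) is the differential entropy of the stochastic policy \(\pi\), and \(\lambda > 0\) is the entropy temperature. The claimed policy-gradient representation is what allows this objective to be optimized from sampled trajectories, mesh-free, rather than by discretizing time and state.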

What carries the argument

The time-inhomogeneous little q-function paired with conditional normalizing flows for policy parameterization in the actor-critic loop.

Load-bearing premise

The policy-gradient representation derived from the little q-function and occupation measure remains valid and numerically stable when explicit time dependence and jump discontinuities are present, and normalizing flows approximate the optimal policies with negligible error.
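A minimal sketch of the flow half of this premise, assuming a RealNVP-style conditional affine coupling architecture in PyTorch: the policy samples an action by pushing Gaussian noise through couplings conditioned on (t, x) and returns the exact log-likelihood the entropy term needs. All names, sizes, and layer counts are illustrative assumptions, not the paper's implementation.

    import math
    import torch
    import torch.nn as nn


    class ConditionalCoupling(nn.Module):
        """One affine coupling layer: half of the action vector is rescaled and
        shifted by a network that sees the other half and the conditioner (t, x)."""

        def __init__(self, action_dim: int, cond_dim: int, hidden: int = 64):
            super().__init__()
            self.d = action_dim // 2
            self.net = nn.Sequential(
                nn.Linear(self.d + cond_dim, hidden), nn.Tanh(),
                nn.Linear(hidden, 2 * (action_dim - self.d)),
            )

        def forward(self, z, cond):
            z1, z2 = z[:, :self.d], z[:, self.d:]
            s, t = self.net(torch.cat([z1, cond], dim=-1)).chunk(2, dim=-1)
            s = torch.tanh(s)                  # bounded log-scale for stability
            a2 = z2 * torch.exp(s) + t
            log_det = s.sum(dim=-1)            # log |det Jacobian| of this layer
            return torch.cat([z1, a2], dim=-1), log_det


    class FlowPolicy(nn.Module):
        """pi(a | t, x): standard normal noise pushed through conditional couplings."""

        def __init__(self, action_dim: int = 2, cond_dim: int = 3, n_layers: int = 4):
            super().__init__()
            self.action_dim = action_dim
            self.layers = nn.ModuleList(
                [ConditionalCoupling(action_dim, cond_dim) for _ in range(n_layers)]
            )

        def sample_with_log_prob(self, t, x):
            cond = torch.cat([t, x], dim=-1)
            z = torch.randn(cond.shape[0], self.action_dim)
            log_prob = -0.5 * (z ** 2).sum(-1) - 0.5 * self.action_dim * math.log(2 * math.pi)
            a = z
            for i, layer in enumerate(self.layers):
                if i % 2 == 1:                 # alternate which half gets transformed
                    a = a.flip(dims=[-1])
                a, log_det = layer(a, cond)
                log_prob = log_prob - log_det  # change-of-variables formula
            return a, log_prob                 # exact log pi(a | t, x)


    if __name__ == "__main__":
        policy = FlowPolicy()
        t = torch.rand(8, 1)                   # time conditioner
        x = torch.randn(8, 2)                  # state conditioner
        actions, log_pi = policy.sample_with_log_prob(t, x)
        entropy_bonus = -log_pi.mean()         # plugs into the entropy-regularized objective
        print(actions.shape, float(entropy_bonus))

Because every transformation is invertible with a triangular Jacobian, log π(a | t, x) is exact rather than approximated, which is the property the premise leans on; whether such flows approximate the optimal jump-diffusion policies with negligible error is exactly what the numerical tests must establish.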

What would settle it

Numerical results on the Merton portfolio optimization problem showing large deviation from the known explicit optimal policy under jumps would falsify the accuracy claim.

Figures

Figures reproduced from arXiv: 2604.05398 by Liya Guo, Ruimeng Hu, Xu Yang, Yi Zhu.

Figure 1: Predicted (a) optimal control, (b) value function, and (c) state trajectory for the standard LQ problem.
Figure 2: Predicted (a) mean of stochastic policy, (b) value function, and (c) state trajectory for the entropy-…
Figure 3: Predicted (a) optimal control / mean of stochastic policy, (b) value function, and (c) state trajectory for …
Figure 4: Predicted (a) optimal control, (b) value function, and (c) state trajectories for the standard Merton …
Figure 5: Plots of (a) densities p(u | t, x) of the stochastic policy π(u | t, x) at t = T/4, T/2, 3T/4, T and some x, (b) value function, and (c) state trajectories for the entropy-regularized Merton problem (γ = 0.05) on horizon T = 10. Total iterations N_itr = 2,000, δt = 0.05.
Figure 6: Comparison of (a) optimal control, (b) value function and (c) state trajectories vs. benchmarks for …
Figure 7: RMSEs of the learned control and value functions across agents (…
read the original abstract

Continuous-time stochastic control with time-inhomogeneous jump-diffusion dynamics is central in finance and economics, but computing optimal policies is difficult under explicit time dependence, discontinuous shocks, and high dimensionality. We propose an actor-critic framework that serves as a mesh-free solver for entropy-regularized control problems and stochastic games with jumps. The approach is built on a time-inhomogeneous little q-function and an appropriate occupation measure, yielding a policy-gradient representation that accommodates time-dependent drift, volatility, and jump terms. To represent expressive stochastic policies in continuous-action spaces, we parameterize the actor using conditional normalizing flows, enabling flexible non-Gaussian policies while retaining exact likelihood evaluation for entropy regularization and policy optimization. We validate the method on time-inhomogeneous linear-quadratic control, Merton portfolio optimization, and a multi-agent portfolio game, using explicit solutions or high-accuracy benchmarks. Numerical results demonstrate stable learning under jump discontinuities, accurate approximation of optimal stochastic policies, and favorable scaling with respect to dimension and number of agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes an actor-critic framework for entropy-regularized continuous-time stochastic control and games with time-inhomogeneous jump-diffusion dynamics. It derives a policy-gradient representation from a time-inhomogeneous little q-function and occupation measure, parameterizes stochastic policies via conditional normalizing flows for flexible non-Gaussian actions with exact likelihoods, and reports numerical validation on time-inhomogeneous linear-quadratic control, Merton portfolio optimization, and a multi-agent portfolio game against explicit solutions or benchmarks.

Significance. If the central derivation holds and the numerical claims are substantiated, the work would provide a practical mesh-free solver for high-dimensional jump-diffusion control problems common in finance, with the normalizing-flow actor enabling expressive policies and entropy regularization. The extension to multi-agent games is a positive feature. The approach builds on standard stochastic-control objects without obvious circularity, but its impact depends on rigorous confirmation that the gradient estimator remains unbiased under explicit time dependence and jumps.

major comments (2)
  1. [Policy-gradient derivation] The policy-gradient representation (derived from the little q-function and occupation measure) must be shown to remain valid under explicit time dependence in the drift, volatility, and jump intensity together with discontinuous jumps. The derivation needs to confirm that the time-inhomogeneous Dynkin formula is applied correctly to both diffusion and jump components of the generator, with no residual boundary or compensator terms in the occupation-measure integral; any tacit assumption of time-homogeneity or Lipschitz continuity violated by the jumps would bias the actor-critic updates. (A schematic of the generator in question, in illustrative notation, appears after these comments.)
  2. [Numerical experiments and validation] The abstract reports validation against explicit solutions or high-accuracy benchmarks on three problems, yet no quantitative error tables, convergence rates, ablation studies, or stability metrics under jump discontinuities are referenced. This absence prevents assessment of whether the observed performance reflects correctness of the representation or implicit regularization effects.
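For orientation, the generator the first comment is about can be written in the standard Lévy-type form (our notation, consistent with the sketch under the core claim above and with textbook treatments such as Øksendal and Sulem; not necessarily the paper's statement):

\[
(\mathcal{L}^{a}_{t} f)(x) = b(t, x, a)\cdot\nabla f(x) + \tfrac{1}{2}\,\mathrm{tr}\big(\sigma\sigma^{\top}(t, x, a)\,\nabla^{2} f(x)\big) + \int_{\mathbb{R}^d_0}\big[f\big(x + \gamma(t, x, a, \zeta)\big) - f(x) - \gamma(t, x, a, \zeta)\cdot\nabla f(x)\big]\,\nu(d\zeta).
\]

The request is that the Dynkin/occupation-measure argument be shown to handle the integral (jump) term, including its compensator, with the same care as the drift and diffusion terms.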
minor comments (2)
  1. [Abstract] The abstract is information-dense; splitting the contribution list into bullet points would improve readability.
  2. [Preliminaries] Notation for the little q-function and occupation measure should be introduced with explicit definitions and contrasted with the standard Q-function to avoid reader confusion. (A schematic of the usual continuous-time definition is sketched below.)
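For the requested contrast, here is a schematic of how the little q-function is usually defined in the continuous-time q-learning literature this paper builds on (Jia and Zhou; Gao, Li, and Zhou), written in the illustrative notation above and not necessarily matching the paper's time-inhomogeneous version: roughly,

\[
q(t, x, a) \;=\; \partial_t J(t, x) + (\mathcal{L}^{a}_{t} J)(t, x) + r(t, x, a),
\]

i.e. an instantaneous rate built from the value function and the generator, rather than the cumulative expected return measured by the classical Q-function.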

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below and outline the revisions that will be incorporated to enhance the rigor and clarity of the work.

read point-by-point responses
  1. Referee: [Policy-gradient derivation] The policy-gradient representation (derived from the little q-function and occupation measure) must be shown to remain valid under explicit time dependence in the drift, volatility, and jump intensity together with discontinuous jumps. The derivation needs to confirm that the time-inhomogeneous Dynkin formula is applied correctly to both diffusion and jump components of the generator, with no residual boundary or compensator terms in the occupation-measure integral; any tacit assumption of time-homogeneity or Lipschitz continuity violated by the jumps would bias the actor-critic updates.

    Authors: We appreciate the referee's emphasis on rigorously establishing the validity of the policy-gradient representation under explicit time dependence and jumps. The derivation in Section 3 applies the time-inhomogeneous Dynkin formula to the little q-function, with the generator explicitly including both the diffusion terms (time-dependent drift and volatility) and the jump component via the compensator of the Poisson random measure. The occupation measure is defined such that boundary and residual compensator terms integrate to zero by construction. Nevertheless, to eliminate any ambiguity regarding regularity conditions or potential tacit assumptions, we will add a dedicated appendix that provides a complete, self-contained derivation. This appendix will step through the application of the Dynkin formula to both continuous and discontinuous parts, verify cancellation of all residual terms in the occupation-measure integral, and state the precise Lipschitz and integrability conditions under which the estimator remains unbiased. We believe this addition will fully address the concern. revision: yes

  2. Referee: [Numerical experiments and validation] The abstract reports validation against explicit solutions or high-accuracy benchmarks on three problems, yet no quantitative error tables, convergence rates, ablation studies, or stability metrics under jump discontinuities are referenced. This absence prevents assessment of whether the observed performance reflects correctness of the representation or implicit regularization effects.

    Authors: We agree that quantitative metrics are essential for a thorough assessment of the numerical results. While the manuscript presents figures illustrating performance on the time-inhomogeneous LQ control, Merton problem, and multi-agent game, we acknowledge that explicit error tables, convergence rates, ablation studies, and stability metrics under jumps are not included. In the revised manuscript we will add: (i) tables reporting L2 policy errors and value-function errors relative to the explicit solutions or high-accuracy benchmarks for each example; (ii) convergence plots showing the decay of the actor-critic loss and gradient variance; (iii) ablation studies isolating the effects of jump intensity and time-inhomogeneity; and (iv) stability metrics such as the variance of the policy-gradient estimator across runs with varying jump discontinuities. These additions will allow readers to evaluate the method's accuracy and robustness directly. (A sketch of one such error metric follows these responses.) revision: yes
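A minimal sketch (ours, not the authors' evaluation code) of the kind of error metric promised in (i): the RMSE of a learned feedback control against an explicit benchmark on a (t, x) grid. The callables learned_control and benchmark_control are hypothetical placeholders.

    import numpy as np

    def control_rmse(learned_control, benchmark_control, t_grid, x_grid):
        # Evaluate both controls on the tensor-product grid and compare pointwise.
        T, X = np.meshgrid(t_grid, x_grid, indexing="ij")
        pts_t, pts_x = T.reshape(-1, 1), X.reshape(-1, 1)
        err = learned_control(pts_t, pts_x) - benchmark_control(pts_t, pts_x)
        return float(np.sqrt(np.mean(err ** 2)))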

Circularity Check

0 steps flagged

No circularity: policy-gradient derived from standard objects; normalizing flows are independent parameterization.

full rationale

The paper presents the time-inhomogeneous little q-function and occupation-measure policy gradient as obtained by applying the Dynkin formula to the entropy-regularized objective under jump-diffusion dynamics; this is an explicit derivation from classical stochastic-control identities rather than a re-statement of the target result. Conditional normalizing flows are introduced solely as a flexible, exact-likelihood policy class for continuous actions and entropy terms, with no claim that the optimal policy is defined in terms of the flow architecture. No load-bearing self-citation chain appears; the central representation is not obtained by fitting parameters to the numerical examples or by renaming a known empirical pattern. The validations against explicit LQ solutions and Merton benchmarks therefore constitute external checks rather than tautological confirmation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard stochastic-control and reinforcement-learning assumptions with no new free parameters or invented entities declared in the abstract.

axioms (1)
  • domain assumption: Existence and differentiability of a time-inhomogeneous little q-function for the entropy-regularized jump-diffusion control problem.
    Invoked to obtain the policy-gradient representation that accommodates time-dependent drift, volatility, and jumps.

pith-pipeline@v0.9.0 · 5480 in / 1269 out tokens · 28306 ms · 2026-05-10T19:40:19.138627+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 5 canonical work pages

  1. [1]

    Using large language models to automate and expedite reinforcement learning with reward machine

    Shayan Meshkat Alsadat, Jean-Raphaël Gaglione, Daniel Neider, Ufuk Topcu, and Zhe Xu. Using large language models to automate and expedite reinforcement learning with reward machine. In 2025 American Control Conference (ACC), pages 206–211. IEEE, 2025

  2. [2]

    Multi-agent reinforcement learning in non-cooperative stochastic games using large language models. IEEE Control Systems Letters, 8:2757–2762, 2024

    Shayan Meshkat Alsadat and Zhe Xu. Multi-agent reinforcement learning in non-cooperative stochastic games using large language models. IEEE Control Systems Letters, 8:2757–2762, 2024

  3. [3]

    Lévy Processes and Stochastic Calculus

    David Applebaum. Lévy Processes and Stochastic Calculus. Cambridge University Press, 2009

  4. [4]

    Optimal investment strategies under the relative performance in jump-diffusion markets. Decisions in Economics and Finance, 48(1):179–204, 2025

    Burcu Aydoğan and Mogens Steffensen. Optimal investment strategies under the relative performance in jump-diffusion markets. Decisions in Economics and Finance, 48(1):179–204, 2025

  5. [5]

    Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5):834–846, 2012

    Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5):834–846, 2012

  6. [6]

    Entropy-regularized mean-variance portfolio optimization with jumps. arXiv preprint arXiv:2312.13409, 2023

    Christian Bender and Nguyen Tran Thuan. Entropy-regularized mean-variance portfolio optimization with jumps. arXiv preprint arXiv:2312.13409, 2023

  7. [7]

    The periodic Riccati equation

    Sergio Bittanti, Patrizio Colaneri, and Giuseppe De Nicolao. The periodic Riccati equation. In The Riccati Equation, pages 127–162. Springer, 1991

  8. [8]

    Continuous-time q-learning for jump-diffusion models under Tsallis entropy. arXiv preprint arXiv:2407.03888, 2024

    Lijun Bo, Yijie Huang, Xiang Yu, and Tingting Zhang. Continuous-time q-learning for jump-diffusion models under Tsallis entropy. arXiv preprint arXiv:2407.03888, 2024

  9. [9]

    FlowPG: action-constrained policy gradient with normalizing flows. Advances in Neural Information Processing Systems, 36:20118–20132, 2023

    Janaka Brahmanage, Jiajing Ling, and Akshat Kumar. FlowPG: action-constrained policy gradient with normalizing flows. Advances in Neural Information Processing Systems, 36:20118–20132, 2023

  10. [10]

    Martingale deep learning for very high dimensional quasi-linear partial differential equations and stochastic optimal controls

    Wei Cai, Shuixin Fang, Wenzhong Zhang, and Tao Zhou. Martingale deep learning for very high dimensional quasi-linear partial differential equations and stochastic optimal controls. arXiv preprint arXiv:2408.14395, 2024

  11. [11]

    Deep learning for continuous-time stochastic control with jumps

    Patrick Cheridito, Jean-Loup Dupret, and Donatien Hainaut. Deep learning for continuous-time stochastic control with jumps. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  12. [12]

    Learning equilibrium mean-variance strategy. Mathematical Finance, 33(4):1166–1212, 2023

    Min Dai, Yuchao Dong, and Yanwei Jia. Learning equilibrium mean-variance strategy. Mathematical Finance, 33(4):1166–1212, 2023

  13. [13]

    Control randomisation approach for policy gradient and application to reinforcement learning in optimal switching. Applied Mathematics & Optimization, 91(1):9, 2025

    Robert Denkert, Huyên Pham, and Xavier Warin. Control randomisation approach for policy gradient and application to reinforcement learning in optimal switching. Applied Mathematics & Optimization, 91(1):9, 2025

  14. [14]

    Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245, 2000

    Kenji Doya. Reinforcement learning in continuous time and space. Neural Computation, 12(1):219–245, 2000

  15. [15]

    An Introduction to Stochastic Dynamics

    Jinqiao Duan. An Introduction to Stochastic Dynamics. Cambridge University Press, Cambridge, 2015

  16. [16]

    Neural spline flows. Advances in Neural Information Processing Systems, 32, 2019

    Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. Advances in Neural Information Processing Systems, 32, 2019

  17. [17]

    Reinforcement learning for jump-diffusions, with financial applications. Mathematical Finance, 2026

    Xuefeng Gao, Lingfei Li, and Xun Yu Zhou. Reinforcement learning for jump-diffusions, with financial applications. Mathematical Finance, 2026

  18. [18]

    Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls. SIAM Journal on Control and Optimization, 61(2):755–787, 2023

    Xin Guo, Anran Hu, and Yufei Zhang. Reinforcement learning for linear-convex models with jumps via stability analysis of feedback controls. SIAM Journal on Control and Optimization, 61(2):755–787, 2023

  19. [19]

    Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor

    Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870. PMLR, 2018

  20. [20]

    Solving ordinary differential equations I: Nonstiff problems

    Ernst Hairer, Gerhard Wanner, and Syvert P Nørsett. Solving ordinary differential equations I: Nonstiff problems. Springer, 1993

  21. [21]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016

  22. [22]

    Accuracy of discretely sampled stochastic policies in continuous-time reinforcement learning. arXiv preprint arXiv:2503.09981, 2025

    Yanwei Jia, Du Ouyang, and Yufei Zhang. Accuracy of discretely sampled stochastic policies in continuous-time reinforcement learning. arXiv preprint arXiv:2503.09981, 2025

  23. [23]

    Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. Journal of Machine Learning Research, 23(154):1–55, 2022

    Yanwei Jia and Xun Yu Zhou. Policy evaluation and temporal-difference learning in continuous time and space: A martingale approach. Journal of Machine Learning Research, 23(154):1–55, 2022

  24. [24]

    Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. Journal of Machine Learning Research, 23(275):1–50, 2022

    Yanwei Jia and Xun Yu Zhou. Policy gradient and actor-critic learning in continuous time and space: Theory and algorithms. Journal of Machine Learning Research, 23(275):1–50, 2022

  25. [25]

    q-learning in continuous time. Journal of Machine Learning Research, 24(161):1–61, 2023

    Yanwei Jia and Xun Yu Zhou. q-learning in continuous time. Journal of Machine Learning Research, 24(161):1–61, 2023

  26. [26]

    Erratum to “q-learning in continuous time”

    Yanwei Jia and Xun Yu Zhou. Erratum to “q-learning in continuous time”. Journal of Machine Learning Research, 2025

  27. [27]

    Robust reinforcement learning under diffusion models for data with jumps. arXiv preprint arXiv:2411.11697, 2024

    Chenyang Jiang, Donggyu Kim, Alejandra Quintos, and Yazhen Wang. Robust reinforcement learning under diffusion models for data with jumps. arXiv preprint arXiv:2411.11697, 2024

  28. [28]

    Multiagent relative investment games in a jump diffusion market with deep reinforcement learning algorithm. SIAM Journal on Financial Mathematics, 16(2):707–746, 2025

    Liwei Lu, Ruimeng Hu, Xu Yang, and Yi Zhu. Multiagent relative investment games in a jump diffusion market with deep reinforcement learning algorithm. SIAM Journal on Financial Mathematics, 16(2):707–746, 2025

  29. [29]

    Robert C. Merton. Optimum consumption and portfolio rules in a continuous-time model. Journal of Economic Theory, 3(4):373–413, 1971

  30. [30]

    Learning optimal deterministic policies with stochastic policy gradients

    Alessandro Montenegro, Marco Mussi, Alberto Maria Metelli, and Matteo Papini. Learning optimal deterministic policies with stochastic policy gradients. In Ruslan Salakhutdinov, Zico Kolter, Katherine Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning, v...

  31. [31]

    Applied Stochastic Control of Jump Diffusions

    Bernt Øksendal and Agnes Sulem. Applied Stochastic Control of Jump Diffusions, volume 3. Springer, 2007

  32. [32]

    Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021

    George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. Journal of Machine Learning Research, 22(57):1–64, 2021

  33. [33]

    Continuous-Time Stochastic Control and Optimization with Financial Applications

    Huyên Pham. Continuous-Time Stochastic Control and Optimization with Financial Applications, volume 61. Springer Science & Business Media, 2009

  34. [34]

    Maziar Raissi, Paris Perdikaris, and George E Karniadakis. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational Physics, 378:686–707, 2019

  35. [35]

    High-dimensional continuous control using generalized advantage estimation

    John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. In Proceedings of the International Conference on Learning Representations (ICLR), 2016. Poster

  36. [36]

    Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999

    Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999

  37. [37]

    Making deep q-learning methods robust to time discretization

    Corentin Tallec, Léonard Blier, and Yann Ollivier. Making deep q-learning methods robust to time discretization. In International Conference on Machine Learning, pages 6096–6104. PMLR, 2019

  38. [38]

    Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research, 21(198):1–34, 2020

    Haoran Wang, Thaleia Zariphopoulou, and Xun Yu Zhou. Reinforcement learning in continuous time and space: A stochastic control approach. Journal of Machine Learning Research, 21(198):1–34, 2020

  39. [39]

    Continuous-time mean–variance portfolio selection: A reinforcement learning framework. Mathematical Finance, 30(4):1273–1308, 2020

    Haoran Wang and Xun Yu Zhou. Continuous-time mean–variance portfolio selection: A reinforcement learning framework. Mathematical Finance, 30(4):1273–1308, 2020

  40. [40]

    Continuous time q-learning for mean-field control problems. Applied Mathematics & Optimization, 91(1):10, 2025

    Xiaoli Wei and Xiang Yu. Continuous time q-learning for mean-field control problems. Applied Mathematics & Optimization, 91(1):10, 2025

  41. [41]

    Policy optimization for continuous reinforcement learning

    Hanyang Zhao, Wenpin Tang, and David Yao. Policy optimization for continuous reinforcement learning. Advances in Neural Information Processing Systems, 36:13637–13663, 2023

  42. [42]

    Mo Zhou, Jiequn Han, and Jianfeng Lu. Actor-critic method for high dimensional static Hamilton–Jacobi–Bellman partial differential equations based on neural networks. SIAM Journal on Scientific Computing, 43(6):A4043–A4066, 2021