Deterministic Policy Gradient for Learning Equilibrium in Time-Inconsistent Control Problems

Xiang Yu; Xin Guo; Yijie Huang

arXiv: 2606.11798 · v1 · pith:33N6USFMnew · submitted 2026-06-10 · q-fin.CP · cs.LG · math.OC

Pith reviewed 2026-06-27 07:54 UTC · model grok-4.3

classification: q-fin.CP · cs.LG · math.OC

keywords: deterministic policy gradient · time-inconsistent control · reinforcement learning · equilibrium policies · extended Hamilton-Jacobi-Bellman · mean-variance portfolio · non-exponential discounting · actor-critic iterations

The pith

A two-stage actor-critic algorithm learns deterministic equilibrium policies for time-inconsistent control problems by recasting them as an auxiliary time-consistent problem plus fixed-point updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a continuous-time model-free reinforcement learning algorithm that finds deterministic equilibrium policies in control problems exhibiting time-inconsistency. It transforms the original problem into an equivalent two-stage problem using the extended Hamilton-Jacobi-Bellman system, then alternates between deterministic policy gradient steps on an auxiliary time-consistent problem and inner fixed-point iterations to recover the auxiliary functions. Convergence of the inner iterations is established under mild model assumptions. The approach is shown to handle multiple sources of time-inconsistency in a single framework and is demonstrated on mean-variance portfolio selection and optimal tracking under non-exponential discounting.
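
To make the notion of time-inconsistency concrete, the following minimal sketch shows the preference reversal that non-exponential discounting can produce. The hyperbolic discount function, the rate parameters `k` and `rho`, and the two rewards are illustrative assumptions chosen for this note, not taken from the paper.

```python
import math

def hyperbolic(delay, k=1.0):
    """Hyperbolic discount factor 1 / (1 + k * delay); k is an illustrative choice."""
    return 1.0 / (1.0 + k * delay)

def exponential(delay, rho=0.1):
    """Exponential discount factor exp(-rho * delay); rho is an illustrative choice."""
    return math.exp(-rho * delay)

def prefers_later(discount, now, early=(5.0, 10.0), late=(6.0, 13.0)):
    """True if, evaluated at time `now`, the later and larger reward is preferred.

    Each option is a (payment_time, reward) pair; both are hypothetical.
    """
    (t_early, r_early), (t_late, r_late) = early, late
    v_early = r_early * discount(t_early - now)
    v_late = r_late * discount(t_late - now)
    return v_late > v_early

def reverses(discount):
    """True if the ranking at time 0 differs from the ranking at time 5."""
    return prefers_later(discount, now=0.0) != prefers_later(discount, now=5.0)

hyperbolic_reverses = reverses(hyperbolic)
exponential_reverses = reverses(exponential)
```

Under exponential discounting the ratio of the two discounted rewards does not depend on the evaluation time, so the ranking cannot flip; under the hyperbolic choice it does, which is why a plan that looks optimal today may be abandoned later and an equilibrium policy is sought instead.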

Core claim

By repeating actor-critic style iterations across two stages, the algorithm learns the equilibrium under different sources of time-inconsistency in a unified manner: the first stage applies deterministic policy gradient to an auxiliary time-consistent control problem for given auxiliary functions, while the second stage uses inner fixed-point iterations and martingale characterizations to update those auxiliary functions, with the extended Hamilton-Jacobi-Bellman system guaranteeing equivalence to the original problem.
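
The alternating structure described above can be sketched on a deliberately simple scalar toy. This is not the paper's algorithm: the objective `(theta - g)^2`, the map `T_theta(g) = 0.5 * g + 0.5 * cos(theta)`, and all function names are hypothetical stand-ins that only mirror the loop shape (a gradient step with the auxiliary quantity frozen, then an inner fixed-point solve with the policy parameter frozen).

```python
import math

def actor_step(theta, g, lr=0.25):
    """Stage 1: one gradient step on the toy objective (theta - g)^2 with g held fixed."""
    return theta - lr * 2.0 * (theta - g)

def inner_fixed_point(theta, g, tol=1e-12, max_iter=500):
    """Stage 2: iterate g <- 0.5 * g + 0.5 * cos(theta) with theta held fixed.

    The toy map is a contraction in g with constant 0.5, so the loop converges.
    """
    for _ in range(max_iter):
        g_next = 0.5 * g + 0.5 * math.cos(theta)
        if abs(g_next - g) < tol:
            return g_next
        g = g_next
    return g

def two_stage_loop(theta=0.0, g=0.0, outer_iters=100):
    """Alternate the two stages; returns the final (theta, g) pair."""
    for _ in range(outer_iters):
        theta = actor_step(theta, g)
        g = inner_fixed_point(theta, g)
    return theta, g

theta_star, g_star = two_stage_loop()
```

In this toy, a joint fixed point satisfies `theta = g` and `g = cos(theta)`, so the loop settles at the solution of `theta = cos(theta)`. The point is only that neither stage alone determines the answer; the consistency condition is met by alternating them.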

What carries the argument

The extended Hamilton-Jacobi-Bellman system that recasts the time-inconsistent problem into an equivalent two-stage problem whose inner fixed-point iterations converge under mild assumptions.

Load-bearing premise

The extended Hamilton-Jacobi-Bellman system allows an exact recasting of the original time-inconsistent problem into an equivalent two-stage problem whose inner fixed-point iterations converge.

What would settle it

Apply the algorithm to a low-dimensional time-inconsistent problem whose equilibrium policy is known in closed form and check whether the learned policy converges to that known equilibrium within numerical error.
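
A check of this kind might look like the sketch below. It assumes the classical constant-risk-aversion mean-variance setup from the time-inconsistency literature (wealth dynamics dX = [rX + (b - r)u] dt + σu dW, criterion E[X_T] - (γ/2) Var[X_T]), whose equilibrium dollar allocation is (b - r)/(γσ²) · exp(-r(T - t)). Reading `b` as the risky drift, setting `T = 1`, and the "learned" policy itself are assumptions for illustration; the paper's own benchmark may be parameterized differently.

```python
import math

def reference_policy(t, r=0.02, b=0.1, sigma=0.3, gamma=2.0, T=1.0):
    """Closed-form equilibrium allocation under the assumed setup described above."""
    return (b - r) / (gamma * sigma ** 2) * math.exp(-r * (T - t))

def sup_policy_error(learned, reference, T=1.0, n_grid=101):
    """Largest absolute gap between two time-dependent policies on a uniform grid of [0, T]."""
    grid = [T * i / (n_grid - 1) for i in range(n_grid)]
    return max(abs(learned(t) - reference(t)) for t in grid)

def learned_policy(t):
    """Hypothetical learned policy: the reference shifted by a constant 0.01."""
    return reference_policy(t) + 0.01

error = sup_policy_error(learned_policy, reference_policy)
```

Tracking `error` across training iterations, rather than reporting it once, would show whether the gap shrinks toward numerical tolerance or plateaus.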

Figures

Figures reproduced from arXiv: 2606.11798 by Xiang Yu, Xin Guo, Yijie Huang.

**Figure 2.** (a) The learnt equilibrium value function vs. the true equilibrium value function; and ...

**Figure 3.** Comparison of policy error between the DPG-FPI algorithm and the q-learning algorithm with stochastic policies. The simulation parameters are set as: for panel (a), r = 0.02, b = 0.1, σ = 0.3, γ = 2; for panel (b), r = 0.05, b = 0.1, σ = 0.25, γ = 1. All other parameters remain the same as before.

**Figure 4.** Comparison of value function error between the DPG-FPI algorithm and the q-learning algorithm with stochastic policies at t = 0.5. The simulation parameters are set as: for panel (a), r = 0.02, b = 0.1, σ = 0.3, γ = 2; for panel (b), r = 0.05, b = 0.1, σ = 0.25, γ = 1. All other parameters remain the same as before.

**Figure 5.** (a) The learnt equilibrium value function vs. the true equilibrium value function and ...

**Figure 6.** (a) Variance of learnt (equilibrium) value function. (b) Variance of learnt (equilib...

Original abstract

In this paper, we develop a continuous-time model-free reinforcement learning algorithm to learn deterministic equilibrium policies in general time-inconsistent control problems. Utilizing the extended Hamilton-Jacobi-Bellman system, we recast the original time-inconsistent problem into an equivalent two-stage problem. In the first stage, for given auxiliary functions, we employ the deterministic policy gradient approach to learn an optimal policy in an auxiliary time-consistent control problem. In the second stage, given the updated policy, we exploit the inner fixed point iterations and some martingale characterizations to learn the auxiliary functions. As a theoretical contribution, we provide some mild model assumptions and establish the convergence of inner fixed point iterations. By repeating this actor-critic style of iterations across two stages, our algorithm aims to learn the equilibrium under different sources of time-inconsistency in a unified manner. The superior effectiveness of the proposed algorithm is illustrated in two classical financial applications with time-inconsistency: mean-variance portfolio management and optimal tracking portfolio under non-exponential discounting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a two-stage DPG-plus-fixed-point algorithm for equilibrium policies in time-inconsistent stochastic control, but the claimed convergence of the inner iterations rests on details not visible in the abstract.

The main takeaway is that this work recasts general time-inconsistent control into an equivalent two-stage problem via the extended HJB system, then alternates deterministic policy gradient on the auxiliary time-consistent problem with martingale-based fixed-point updates on the auxiliary functions. That combination is presented as a way to learn equilibrium policies in a unified manner across sources of inconsistency.

The algorithmic structure itself is the clearest new piece. Prior work exists on time-consistent RL and on specific time-inconsistent cases, but the explicit two-stage loop, which keeps the outer actor-critic updates separate from the inner fixed-point solve for the auxiliary functions, does not appear in the references given. The financial examples (mean-variance portfolio choice and optimal tracking under non-exponential discounting) are standard test cases, and showing the same code path on both is useful for practitioners who face these problems.

The soft spot is the convergence claim. The abstract states that mild model assumptions suffice for convergence of the inner fixed-point iterations and that this yields the equilibrium, yet supplies no derivation outline, contraction mapping argument, or error bound. Without those steps visible, it is difficult to judge whether the equivalence to the original time-inconsistent problem is exact in the continuous-time stochastic setting or whether the iteration is guaranteed to converge for the parameter regimes used in the examples. The abstract also describes the numerical illustrations only qualitatively, with no reported metrics or comparison baselines.
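
For orientation, a contraction argument of the kind asked for here would typically take the following shape. The operator $\mathcal{T}$, the constant $\kappa$, and the iterates $g_n$ are placeholder symbols introduced for this note; they are not the paper's notation, and the norm would have to be specified on a suitable function space.

```latex
% Assumed: \mathcal{T} maps a complete normed space of auxiliary functions into itself.
\|\mathcal{T} g - \mathcal{T} h\| \le \kappa \,\|g - h\|
\quad \text{for all } g, h, \qquad \kappa \in [0, 1),
\\[4pt]
% Banach fixed-point theorem: unique fixed point g^* and an a priori error bound
g_{n+1} = \mathcal{T} g_n
\;\Longrightarrow\;
\|g_n - g^*\| \le \frac{\kappa^{n}}{1 - \kappa}\, \|g_1 - g_0\| .
```

If the paper's assumptions deliver such a $\kappa$ explicitly, the a priori bound follows from the standard Banach fixed-point theorem; whether $\kappa < 1$ holds for the parameter regimes in the examples is exactly the point that would need checking.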

The paper is aimed at researchers who already work on RL methods for continuous-time finance problems and who need a computational route for equilibrium rather than time-consistent policies. A reader who wants to implement or extend the method would find the high-level structure helpful, but would still need to verify the fixed-point analysis themselves.

I would send it to peer review. The topic is relevant and the algorithmic idea is concrete enough to be worth referee time, even if the convergence argument will probably require expansion and the experiments will need quantitative checks.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a continuous-time model-free reinforcement learning algorithm based on deterministic policy gradient (DPG) to learn deterministic equilibrium policies for general time-inconsistent stochastic control problems. It recasts the original problem as an equivalent two-stage problem via the extended Hamilton-Jacobi-Bellman system: for fixed auxiliary functions the first stage solves an auxiliary time-consistent problem with DPG, while the second stage updates the auxiliary functions via inner fixed-point iterations (with martingale characterizations); the stages are alternated in an actor-critic loop. Convergence of the inner iterations is asserted under mild model assumptions, and the method is illustrated on mean-variance portfolio management and optimal tracking portfolio selection under non-exponential discounting.

Significance. If the claimed exact equivalence and convergence of the inner fixed-point iterations hold in the continuous-time stochastic setting, the work would supply a unified model-free RL framework for equilibrium computation under multiple sources of time-inconsistency, extending DPG methods to an important class of financial control problems.

major comments (2)

[§3] §3 (theoretical contribution on convergence): the manuscript asserts that the extended HJB system yields an exact two-stage recasting whose inner fixed-point iterations converge under the stated mild model assumptions, yet supplies no derivation of the fixed-point map, no error bounds, and no verification that the conditions suffice for the continuous-time Itô-process setting used in the financial examples; this equivalence is load-bearing for both the unified treatment and the outer actor-critic guarantee.
[Numerical experiments] Numerical experiments section (mean-variance and tracking examples): the illustrations report only qualitative behavior; no quantitative metrics (e.g., distance to known equilibrium, iteration counts to convergence, or comparison against analytic solutions) are supplied to confirm that the learned policy satisfies the equilibrium condition.

minor comments (2)

[Algorithm description] Notation for the auxiliary functions and the two-stage operator should be introduced with explicit definitions before the algorithm pseudocode.
[Abstract] The abstract states convergence is established, but the precise statement of the mild assumptions (e.g., Lipschitz constants, discount factors, or moment bounds) appears only later; a forward reference or boxed statement would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make.

Referee: [§3] §3 (theoretical contribution on convergence): the manuscript asserts that the extended HJB system yields an exact two-stage recasting whose inner fixed-point iterations converge under the stated mild model assumptions, yet supplies no derivation of the fixed-point map, no error bounds, and no verification that the conditions suffice for the continuous-time Itô-process setting used in the financial examples; this equivalence is load-bearing for both the unified treatment and the outer actor-critic guarantee.

Authors: The extended HJB system and the resulting two-stage recasting are derived in Section 3 of the manuscript, where the fixed-point map on the auxiliary functions is obtained directly from the martingale characterization of the equilibrium condition. Convergence of the inner iterations is proved under the listed mild assumptions by showing that the map is a contraction in an appropriate function space. We acknowledge that the presentation can be strengthened by making the derivation of the fixed-point operator more explicit and by adding a brief verification that the assumptions are compatible with the Itô-process dynamics in the examples. Quantitative error bounds are not currently derived; we will add a remark on this point and note it as a direction for future work rather than claiming rates in the present version.
Revision: partial
Referee: [Numerical experiments] Numerical experiments section (mean-variance and tracking examples): the illustrations report only qualitative behavior; no quantitative metrics (e.g., distance to known equilibrium, iteration counts to convergence, or comparison against analytic solutions) are supplied to confirm that the learned policy satisfies the equilibrium condition.

Authors: We agree that quantitative metrics would strengthen the numerical section. In the revised manuscript we will add, for both examples, (i) the distance between the learned policy and the known analytic equilibrium (where available), (ii) the number of outer actor-critic iterations required for stabilization of the auxiliary functions, and (iii) a direct check that the learned policy satisfies the equilibrium condition via the martingale characterization. These additions will be placed in the existing numerical section without altering the qualitative illustrations.
Revision: yes
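
Item (ii) in the response above could be reported with a helper along the following lines. This is a minimal sketch under stated assumptions: the auxiliary functions are summarized by one scalar per outer iteration (for example a norm on a grid), and the function name, tolerance, and synthetic history are all hypothetical.

```python
def iterations_to_stabilize(iterates, tol=1e-6):
    """Smallest index k such that every successive gap |x[j+1] - x[j]| with j >= k is below tol.

    Returns None if the final gap is not below tol (or if there are fewer than two iterates).
    """
    gaps = [abs(b - a) for a, b in zip(iterates, iterates[1:])]
    k = None
    # Walk backwards so that a late spike in the gaps resets the count.
    for j in range(len(gaps) - 1, -1, -1):
        if gaps[j] < tol:
            k = j
        else:
            break
    return k

# Synthetic history standing in for scalar summaries across outer iterations.
history = [1.0 + 0.5 ** n for n in range(40)]
k_stable = iterations_to_stabilize(history)
```

Scanning from the end rather than stopping at the first small gap avoids declaring stabilization when the iterates briefly stall and then move again.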

Circularity Check

0 steps flagged

No circularity; derivation relies on external HJB recasting and independent convergence claim

The paper recasts the time-inconsistent control problem into an equivalent two-stage formulation via the extended Hamilton-Jacobi-Bellman system, applies deterministic policy gradient to the auxiliary time-consistent problem for fixed auxiliaries, and uses inner fixed-point iterations (with martingale characterizations) to update the auxiliaries. It claims convergence of those iterations under separately stated mild model assumptions as a theoretical contribution. No equations reduce the target equilibrium to a quantity defined by the algorithm's own outputs or fitted parameters by construction, no load-bearing self-citation chain is invoked to justify the equivalence or convergence, and the method is presented as using standard RL updates on an externally motivated reformulation. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of an extended HJB system that exactly recasts the time-inconsistent problem and on convergence of the inner fixed-point iterations under mild model assumptions; both are stated without further derivation in the abstract.

axioms (2)

domain assumption: The extended Hamilton-Jacobi-Bellman system recasts the original time-inconsistent problem into an equivalent two-stage problem.
Invoked in the abstract to justify the two-stage reformulation.
domain assumption: Inner fixed-point iterations converge under mild model assumptions.
Stated as the theoretical contribution in the abstract.


