pith. sign in

arxiv: 2606.11798 · v1 · pith:33N6USFMnew · submitted 2026-06-10 · 💱 q-fin.CP · cs.LG· math.OC

Deterministic Policy Gradient for Learning Equilibrium in Time-Inconsistent Control Problems

Pith reviewed 2026-06-27 07:54 UTC · model grok-4.3

classification 💱 q-fin.CP cs.LGmath.OC
keywords deterministic policy gradienttime-inconsistent controlreinforcement learningequilibrium policiesextended Hamilton-Jacobi-Bellmanmean-variance portfolionon-exponential discountingactor-critic iterations
0
0 comments X

The pith

A two-stage actor-critic algorithm learns deterministic equilibrium policies for time-inconsistent control problems by recasting them as an auxiliary time-consistent problem plus fixed-point updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a continuous-time model-free reinforcement learning algorithm that finds deterministic equilibrium policies in control problems exhibiting time-inconsistency. It transforms the original problem into an equivalent two-stage problem using the extended Hamilton-Jacobi-Bellman system, then alternates between deterministic policy gradient steps on an auxiliary time-consistent problem and inner fixed-point iterations to recover the auxiliary functions. Convergence of the inner iterations is established under mild model assumptions. The approach is shown to handle multiple sources of time-inconsistency in a single framework and is demonstrated on mean-variance portfolio selection and optimal tracking under non-exponential discounting.

Core claim

By repeating actor-critic style iterations across two stages, the algorithm learns the equilibrium under different sources of time-inconsistency in a unified manner: the first stage applies deterministic policy gradient to an auxiliary time-consistent control problem for given auxiliary functions, while the second stage uses inner fixed-point iterations and martingale characterizations to update those auxiliary functions, with the extended Hamilton-Jacobi-Bellman system guaranteeing equivalence to the original problem.

What carries the argument

The extended Hamilton-Jacobi-Bellman system that recasts the time-inconsistent problem into an equivalent two-stage problem whose inner fixed-point iterations converge under mild assumptions.

Load-bearing premise

The extended Hamilton-Jacobi-Bellman system allows an exact recasting of the original time-inconsistent problem into an equivalent two-stage problem whose inner fixed-point iterations converge.

What would settle it

Apply the algorithm to a low-dimensional time-inconsistent problem whose equilibrium policy is known in closed form and check whether the learned policy converges to that known equilibrium within numerical error.

Figures

Figures reproduced from arXiv: 2606.11798 by Xiang Yu, Xin Guo, Yijie Huang.

Figure 1
Figure 1. Figure 1: Convergence of parameter iterations using Algorithm [PITH_FULL_IMAGE:figures/full_fig_p031_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: (a): The learnt equilibrium value function vs the true equilibrium value function; and [PITH_FULL_IMAGE:figures/full_fig_p032_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of policy error between the DPG-FPI algorithm and the q-learning algorithm with stochastic policies. The simulation parameters are set as: for panel (a), r = 0.02, b = 0.1, σ = 0.3, γ = 2; for panel (b), r = 0.05, b = 0.1, σ = 0.25, γ = 1. All other parameters remain the same as before. (a) (b) [PITH_FULL_IMAGE:figures/full_fig_p033_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Comparison of value function error between the DPG-FPI algorithm and the q-learning algo￾rithm with stochastic policies at t = 0.5. The simulation parameters are set as: for panel (a), r = 0.02, b = 0.1, σ = 0.3, γ = 2; for panel (b), r = 0.05, b = 0.1, σ = 0.25, γ = 1. All other parameters remain the same as before. 33 [PITH_FULL_IMAGE:figures/full_fig_p033_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: (a): The learnt equilibrium value function vs the true equilibrium value function and [PITH_FULL_IMAGE:figures/full_fig_p038_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: (a): Variance of learnt (equilibrium) value function. (b): Variance of learnt (equilib [PITH_FULL_IMAGE:figures/full_fig_p039_6.png] view at source ↗
read the original abstract

In this paper, we develop a continuous-time model-free reinforcement learning algorithm to learn deterministic equilibrium policies in general time-inconsistent control problems. Utilizing the extended Hamilton-Jacobi-Bellman system, we recast the original time-inconsistent problem into an equivalent two-stage problem. In the first stage, for given auxiliary functions, we employ the deterministic policy gradient approach to learn an optimal policy in an auxiliary time-consistent control problem. In the second stage, given the updated policy, we exploit the inner fixed point iterations and some martingale characterizations to learn the auxiliary functions. As a theoretical contribution, we provide some mild model assumptions and establish the convergence of inner fixed point iterations. By repeating this actor-critic style of iterations across two stages, our algorithm aims to learn the equilibrium under different sources of time-inconsistency in a unified manner. The superior effectiveness of the proposed algorithm are illustrated in two classical financial applications with time-inconsistency: mean-variance portfolio management and optimal tracking portfolio under non-exponential discounting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a continuous-time model-free reinforcement learning algorithm based on deterministic policy gradient (DPG) to learn deterministic equilibrium policies for general time-inconsistent stochastic control problems. It recasts the original problem as an equivalent two-stage problem via the extended Hamilton-Jacobi-Bellman system: for fixed auxiliary functions the first stage solves an auxiliary time-consistent problem with DPG, while the second stage updates the auxiliary functions via inner fixed-point iterations (with martingale characterizations); the stages are alternated in an actor-critic loop. Convergence of the inner iterations is asserted under mild model assumptions, and the method is illustrated on mean-variance portfolio management and optimal tracking portfolio selection under non-exponential discounting.

Significance. If the claimed exact equivalence and convergence of the inner fixed-point iterations hold in the continuous-time stochastic setting, the work would supply a unified model-free RL framework for equilibrium computation under multiple sources of time-inconsistency, extending DPG methods to an important class of financial control problems.

major comments (2)
  1. [§3] §3 (theoretical contribution on convergence): the manuscript asserts that the extended HJB system yields an exact two-stage recasting whose inner fixed-point iterations converge under the stated mild model assumptions, yet supplies neither the derivation of the fixed-point map, error bounds, nor verification that the conditions suffice for the continuous-time Itô-process setting used in the financial examples; this equivalence is load-bearing for both the unified treatment and the outer actor-critic guarantee.
  2. [Numerical experiments] Numerical experiments section (mean-variance and tracking examples): the illustrations report only qualitative behavior; no quantitative metrics (e.g., distance to known equilibrium, iteration counts to convergence, or comparison against analytic solutions) are supplied to confirm that the learned policy satisfies the equilibrium condition.
minor comments (2)
  1. [Algorithm description] Notation for the auxiliary functions and the two-stage operator should be introduced with explicit definitions before the algorithm pseudocode.
  2. [Abstract] The abstract states convergence is established, but the precise statement of the mild assumptions (e.g., Lipschitz constants, discount factors, or moment bounds) appears only later; a forward reference or boxed statement would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [§3] §3 (theoretical contribution on convergence): the manuscript asserts that the extended HJB system yields an exact two-stage recasting whose inner fixed-point iterations converge under the stated mild model assumptions, yet supplies neither the derivation of the fixed-point map, error bounds, nor verification that the conditions suffice for the continuous-time Itô-process setting used in the financial examples; this equivalence is load-bearing for both the unified treatment and the outer actor-critic guarantee.

    Authors: The extended HJB system and the resulting two-stage recasting are derived in Section 3 of the manuscript, where the fixed-point map on the auxiliary functions is obtained directly from the martingale characterization of the equilibrium condition. Convergence of the inner iterations is proved under the listed mild assumptions by showing that the map is a contraction in an appropriate function space. We acknowledge that the presentation can be strengthened by making the derivation of the fixed-point operator more explicit and by adding a brief verification that the assumptions are compatible with the Itô-process dynamics in the examples. Quantitative error bounds are not currently derived; we will add a remark on this point and note it as a direction for future work rather than claiming rates in the present version. revision: partial

  2. Referee: [Numerical experiments] Numerical experiments section (mean-variance and tracking examples): the illustrations report only qualitative behavior; no quantitative metrics (e.g., distance to known equilibrium, iteration counts to convergence, or comparison against analytic solutions) are supplied to confirm that the learned policy satisfies the equilibrium condition.

    Authors: We agree that quantitative metrics would strengthen the numerical section. In the revised manuscript we will add, for both examples, (i) the distance between the learned policy and the known analytic equilibrium (where available), (ii) the number of outer actor-critic iterations required for stabilization of the auxiliary functions, and (iii) a direct check that the learned policy satisfies the equilibrium condition via the martingale characterization. These additions will be placed in the existing numerical section without altering the qualitative illustrations. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation relies on external HJB recasting and independent convergence claim

full rationale

The paper recasts the time-inconsistent control problem into an equivalent two-stage formulation via the extended Hamilton-Jacobi-Bellman system, applies deterministic policy gradient to the auxiliary time-consistent problem for fixed auxiliaries, and uses inner fixed-point iterations (with martingale characterizations) to update the auxiliaries. It claims convergence of those iterations under separately stated mild model assumptions as a theoretical contribution. No equations reduce the target equilibrium to a quantity defined by the algorithm's own outputs or fitted parameters by construction, no load-bearing self-citation chain is invoked to justify the equivalence or convergence, and the method is presented as using standard RL updates on an externally motivated reformulation. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of an extended HJB system that exactly recasts the time-inconsistent problem and on convergence of the inner fixed-point iterations under mild model assumptions; both are stated without further derivation in the abstract.

axioms (2)
  • domain assumption The extended Hamilton-Jacobi-Bellman system recasts the original time-inconsistent problem into an equivalent two-stage problem.
    Invoked in the abstract to justify the two-stage reformulation.
  • domain assumption Inner fixed-point iterations converge under mild model assumptions.
    Stated as the theoretical contribution in the abstract.

pith-pipeline@v0.9.1-grok · 5708 in / 1377 out tokens · 18004 ms · 2026-06-27T07:54:36.447899+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

26 extracted references · 2 canonical work pages

  1. [1]

    and Murgoci, A

    Bj¨ ork, T. and Murgoci, A. (2014). A theory of Markovian time-inconsistent stochastic control in discrete time.Finance and Stochastics,18(3): 545-592. Bj¨ ork, T., Khapko, M. and Murgoci, A. (2017). On time-inconsistent stochastic control in continuous time. Finance and Stochastics,21(2): 331-360. Bj¨ ork, T., Khapko, M. and Murgoci, A. (2021).Time-Incon...

  2. [2]

    and Yu, X

    Bo, L., Huang, Y. and Yu, X. (2025). On optimal tracking portfolio in incomplete markets: The reinforce- ment learning approach.SIAM Journal on Control and Optimization,63(1): 321-348

  3. [3]

    and Zhang, T

    Bo, L., Huang, Y., Yu, X. and Zhang, T. (2024). Continuous-time q-learning for jump-diffusion models under Tsallis entropy.Preprint, available at arXiv:2407.03888

  4. [4]

    and Yang, Z

    Cao, H., Dong, Y. and Yang, Z. (2025). A two-fold randomization framework for impulse control problems. Preprint, available at arXiv:2509.12018

  5. [5]

    and Zhang, Y

    Cheng, Z., Guo, X. and Zhang, Y. (2025). Deterministic policy gradient for reinforcement learning with continuous time and space.Preprint, available at arXiv:2509.23711

  6. [6]

    and Jia, Y

    Dai, M., Dong, Y. and Jia, Y. (2023). Learning equilibrium mean-variance strategy.Mathematical Finance, 33(4), 1166-1212

  7. [7]

    and Li, L

    Dai, M., Dong, Y. and Li, L. (2025). Reinforcement learning for arbitrage strategies in stock index futures. Preprint, available at SSRN 5403455

  8. [8]

    Dai, M., Sun, Y., Xu, Z. Q. and Zhou, X. Y. (2026). Learning to optimally stop diffusion processes, with financial applications.Management Science, available at:https://doi.org/10.1287/mnsc.2024. 07614

  9. [9]

    and Xu, R

    Dianetti, J., Ferrari, G. and Xu, R. (2024). Exploratory optimal stopping: A singular control formulation. Preprint, available at arXiv:2408.09335

  10. [10]

    Dong, Y. (2024). Randomized optimal stopping problem in continuous time and reinforcement learning algorithm.SIAM Journal on Control and Optimization,62(3): 1590-1614. 39

  11. [11]

    and Lazrak, A

    Ekeland, I. and Lazrak, A. (2006). Being serious about non-commitment: subgame perfect equilibrium in continuous time.Preprint, available at arXiv:math/0604264

  12. [12]

    and Zhou, X

    Gao, X., Li, L. and Zhou, X. Y. (2026). Reinforcement learning for jump-diffusions, with financial appli- cations.Mathematical Finance, available at:https://doi.org/10.1111/mafi.70027

  13. [13]

    and Zariphopoulou, T

    Guo, X., Xu, R. and Zariphopoulou, T. (2022). Entropy regularization for mean field games with learning. Mathematics of Operations Research,47(4), 3239-3260

  14. [14]

    and Zhou, Z

    Huang, Y., Li, M., Yu, X. and Zhou, Z. (2025). Continuous-time reinforcement learning for optimal switch- ing over multiple regimes.Preprint, available at arXiv:2512.04697

  15. [15]

    and Zhang, K

    Huang, Y.-J., Yu, X. and Zhang, K. (2026). Policy iteration achieves regularized equilibrium under time inconsistency.Preprint, available at arXiv:2603.06145

  16. [16]

    and Zhang, Y

    Jia, Y., Ouyang, D. and Zhang, Y. (2025). Accuracy of discretely sampled stochastic policies in continuous- time reinforcement learning.Preprint, available at arXiv:2503.09981

  17. [17]

    and Zhou, X

    Jia, Y. and Zhou, X. Y. (2023). q-Learning in continuous time.Journal of Machine Learning Research, 24(161): 1-61

  18. [18]

    (2019).Stochastic Flows and Jump-Diffusions

    Kunita, K. (2019).Stochastic Flows and Jump-Diffusions. Springer-Verlag, New York

  19. [19]

    and Zhang, Y

    Sethi, D., ˇSiˇ ska, D. and Zhang, Y. (2025). Entropy annealing for policy mirror descent in continuous time and space.SIAM Journal on Control and Optimization,63(4), 3006-3041

  20. [20]

    Strotz, R. H. (1955). Myopia and inconsistency in dynamic utility maximization.Review of Economic Studies,23(3): 165-180

  21. [21]

    and Zhang, Y

    Szpruch, L., Treetanthiploet, T. and Zhang, Y. (2024): Optimal scheduling of entropy regularization for continuous-time linear-quadratic reinforcement learning.SIAM Journal on Control and Optimization, 62(1), 135-166

  22. [22]

    Tang, W., Zhang, Y. P. and Zhou, X. Y. (2022). Exploratory hjb equations and their convergence.SIAM Journal on Control and Optimization,60(6), 3191-3216

  23. [23]

    and Zhou, X

    Wang, H., Zariphopoulou, T. and Zhou, X. Y. (2020). Reinforcement learning in continuous time and space: A stochastic control approach.Journal of Machine Learning Research,21(198): 1-34

  24. [24]

    and Zhou, Z

    Wang, Z., Yu, X., Zhang, J. and Zhou, Z. (2026). Equilibrium under time-inconsistency: A new existence theory by vanishing entropy regularization. Preprint, available at arXiv:2603.10321

  25. [25]

    and Yu, X

    Wei, X. and Yu, X. (2025). Continuous-Time q-learning for mean-field control problems.Applied Mathe- matics&Optimization, 91(1):10

  26. [26]

    and Zhou, X

    Yao, D., Zhang, S. and Zhou, X. Y. (2006). Tracking a financial benchmark using a few assets.Operations Research,54(2): 232-246. 40