
arxiv: 2209.07059 · v5 · pith:WKUF75MJnew · submitted 2022-09-15 · 🧮 math.OC

Convergence of Policy Iteration for Entropy-Regularized Stochastic Control Problems

Pith reviewed 2026-05-24 11:23 UTC · model grok-4.3

classification 🧮 math.OC
keywords policy iteration · entropy regularization · stochastic control · convergence · relaxed control · Hamilton-Jacobi-Bellman equation · Sobolev estimates · optimal consumption

The pith

A policy iteration algorithm converges to an optimal relaxed control for entropy-regularized stochastic control on infinite horizons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves convergence of a policy iteration algorithm for general entropy-regularized stochastic control problems over infinite time horizons. Classical Hölder estimates on value functions fail due to the entropy term, so the authors introduce new Sobolev estimates specific to policy iteration plus a technique that bounds entropy growth. These steps together deliver a uniform Hölder bound on the sequence of value functions, which closes the convergence argument to an optimal relaxed control. A byproduct is that the optimal value function is the unique solution of an exploratory Hamilton-Jacobi-Bellman equation. The algorithm is demonstrated numerically on an optimal consumption example.

Core claim

For a general entropy-regularized stochastic control problem on an infinite horizon, the policy iteration algorithm converges to an optimal relaxed control. This is achieved by moving between Hölder and Sobolev spaces to obtain a uniform Hölder bound on the generated value functions, using new Sobolev estimates designed for policy iteration and a method to contain entropy growth, even though standard Hölder estimates are insufficient.
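
The iteration behind this claim can be written schematically as follows. This is a sketch under a generic formulation common in the exploratory-control literature, not a transcription of the paper: the symbols $b$, $\sigma$, $r$, $\rho$, $\lambda$, $U$ and the Hamiltonian $H$ are assumptions introduced here, and the paper's standing hypotheses (for instance, whether the diffusion coefficient depends on the control, or the sign convention of the objective) are not stated on this page.

```latex
% Assumed notation (not taken from the paper):
%   b, \sigma : drift and diffusion coefficients,  r : running reward,
%   \rho > 0 : discount rate,  \lambda > 0 : entropy weight,
%   U : action set,  \mathcal{P}(U) : probability densities on U.
\[
  H(x,u,p,M) := r(x,u) + b(x,u)\cdot p
              + \tfrac12 \operatorname{tr}\!\big(\sigma\sigma^{\top}(x,u)\,M\big).
\]
% Policy improvement: Gibbs-type relaxed control built from the previous value function.
\[
  \pi^{n}(x,u) =
  \frac{\exp\!\big(\tfrac1\lambda H(x,u,Dv^{n-1}(x),D^2v^{n-1}(x))\big)}
       {\int_U \exp\!\big(\tfrac1\lambda H(x,u',Dv^{n-1}(x),D^2v^{n-1}(x))\big)\,du'} .
\]
% Policy evaluation: v^n solves a linear elliptic equation under \pi^n.
\[
  \rho\, v^{n}(x) = \int_U \Big[ H\big(x,u,Dv^{n}(x),D^2v^{n}(x)\big)
                   - \lambda \ln \pi^{n}(x,u) \Big]\, \pi^{n}(x,u)\,du .
\]
% Formal limit: the exploratory HJB equation and its closed-form supremum.
\[
  \rho\, v(x)
  = \sup_{\pi \in \mathcal{P}(U)} \int_U \Big[ H\big(x,u,Dv(x),D^2v(x)\big)
      - \lambda \ln \pi(u) \Big]\, \pi(u)\,du
  = \lambda \ln \int_U \exp\!\big(\tfrac1\lambda H(x,u,Dv(x),D^2v(x))\big)\,du .
\]
```

In this form the entropy term enters the evaluation step through $\ln \pi^{n}$, which depends on derivatives of $v^{n-1}$. That dependence suggests why a bound on the entropy growth is needed before a uniform estimate on the sequence $(v^n)$ can be closed.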

What carries the argument

The policy iteration algorithm (PIA), whose convergence is secured by new Sobolev estimates tailored to the iteration and a technique that controls entropy growth to produce a uniform Hölder bound on value functions.

If this is right

  • The value functions produced by the policy iteration algorithm remain uniformly bounded in the Hölder norm.
  • Convergence holds to an optimal relaxed control for the entropy-regularized problem.
  • The optimal value function is characterized as the unique solution to the exploratory Hamilton-Jacobi-Bellman equation.
  • The algorithm can be implemented numerically on concrete problems such as optimal consumption.
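
To make the last point concrete, here is a minimal sketch of entropy-regularized policy iteration on a finite, discrete-time Markov decision process. It is a discrete analogue chosen for illustration only, not the paper's continuous-time algorithm or its finite difference solver; the function name `soft_policy_iteration` and the parameters `lam` (entropy weight) and `gamma` (discount factor) are hypothetical.

```python
import numpy as np

def soft_policy_iteration(P, r, lam=0.5, gamma=0.9, n_iter=50):
    """Entropy-regularized policy iteration on a finite MDP (illustrative analogue).

    P: array (S, A, S) of transition probabilities, r: array (S, A) of rewards,
    lam: entropy weight, gamma: discount factor.
    Returns the final value vector, the final policy, and the sup-norm
    change between successive value vectors.
    """
    S, A = r.shape
    v = np.zeros(S)
    history = []
    for _ in range(n_iter):
        # Policy improvement: Gibbs (softmax) policy built from the current value.
        q = r + gamma * (P @ v)                              # shape (S, A)
        logits = (q - q.max(axis=1, keepdims=True)) / lam    # shift for stability
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
        # Policy evaluation: linear solve for the entropy-regularized value of pi.
        entropy = -(pi * np.log(pi + 1e-300)).sum(axis=1)
        r_pi = (pi * r).sum(axis=1) + lam * entropy
        P_pi = np.einsum("sa,sat->st", pi, P)
        v_new = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        history.append(float(np.max(np.abs(v_new - v))))
        v = v_new
    return v, pi, history

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P = rng.random((4, 3, 4))
    P /= P.sum(axis=2, keepdims=True)   # normalize rows into probabilities
    r = rng.random((4, 3))
    v, pi, history = soft_policy_iteration(P, r)
    print("last sup-norm change:", history[-1])
```

At a fixed point the value satisfies the discrete counterpart of an exploratory Bellman equation, `v = lam * log(sum_a exp(q / lam))`, which plays the role of the closed-form supremum in the continuous setting.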

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same Sobolev estimates and entropy-control technique could be tested on finite-horizon versions of the problem.
  • The method supplies a route to prove convergence for other regularized control problems where classical estimates break.
  • Numerical stability of the policy iteration may improve in practice once the uniform Hölder bound is available.

Load-bearing premise

New Sobolev estimates designed for policy iteration, combined with a technique to contain entropy growth, produce a uniform Hölder bound on the sequence of value functions where classical estimates fail.

What would settle it

A concrete counter-example in which the sequence of value functions generated by the policy iteration algorithm fails to remain uniformly Hölder continuous, or in which the algorithm does not converge to the optimal relaxed control in the optimal consumption problem.
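
A numerical experiment cannot settle the question, but it can flag suspicious growth. The sketch below computes a discrete Hölder seminorm of sampled function values on a one-dimensional grid, so that it could be tracked across iterates. It is a diagnostic under simplifying assumptions: one space dimension, the seminorm of the function itself rather than of its derivatives, and a hypothetical helper name `holder_seminorm`.

```python
import numpy as np

def holder_seminorm(x, f, alpha):
    """Discrete Hölder seminorm: max over i != j of |f_i - f_j| / |x_i - x_j|**alpha.

    x: 1-D grid points, f: function values on the grid, alpha: exponent in (0, 1].
    """
    x = np.asarray(x, dtype=float)
    f = np.asarray(f, dtype=float)
    dx = np.abs(x[:, None] - x[None, :])
    df = np.abs(f[:, None] - f[None, :])
    mask = dx > 0                       # skip the diagonal (and repeated points)
    return float(np.max(df[mask] / dx[mask] ** alpha))

if __name__ == "__main__":
    grid = np.linspace(0.0, 1.0, 101)
    # sqrt is Hölder continuous with exponent 1/2 on [0, 1] but not Lipschitz,
    # so the alpha = 1 value grows as the grid is refined.
    print(holder_seminorm(grid, np.sqrt(grid), 0.5))
    print(holder_seminorm(grid, np.sqrt(grid), 1.0))
```

Applied to grid samples of successive iterates, a seminorm that keeps growing with the iteration index would be a warning sign worth investigating analytically; a bounded one proves nothing on its own.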

Figures

Figures reproduced from arXiv: 2209.07059 by Yu-Jui Huang, Zhenhua Wang, Zhou Zhou.

Figure 1. Difference between $V^*$ and $v^n$ for $n = 1, 2, \dots, 10$ with the initial guess $v^0(x) = \sin(x)$. The y-axis represents $\|V^* - v^n\|_{L^\infty([-50,50])}$ (left panel) and $\ln \|V^* - v^n\|_{L^\infty([-50,50])}$ (right panel). The algorithm is set to stop once it reaches the tolerance of the finite difference solver.
Figure 2. Difference between $\pi^*$ and $\pi^n$ for $n = 1, 2, \dots, 10$ with the initial guess $v^0(x) = \sin(x)$. The y-axis represents $\|\pi^* - \pi^n\|_{L^\infty([-50,50]\times[0.1,0.9])}$ (left panel) and $\ln \|\pi^* - \pi^n\|_{L^\infty([-50,50]\times[0.1,0.9])}$ (right panel).
Figure 3. Difference between $V^*$ and $v^n$ for $n = 1, 2, \dots, 10$ with the initial guess $v^0(x) = \frac{1}{1+x^2}$. The y-axis represents $\|V^* - v^n\|_{L^\infty([-50,50])}$ (left panel) and $\ln \|V^* - v^n\|_{L^\infty([-50,50])}$ (right panel). The algorithm is set to stop once it reaches the tolerance of the finite difference solver.
Figure 4. Difference between $\pi^*$ and $\pi^n$ for $n = 1, 2, \dots, 10$ with the initial guess $v^0(x) = \frac{1}{1+x^2}$. The y-axis represents $\|\pi^* - \pi^n\|_{L^\infty([-50,50]\times[0.1,0.9])}$ (left panel) and $\ln \|\pi^* - \pi^n\|_{L^\infty([-50,50]\times[0.1,0.9])}$ (right panel).
Figure 5. The left panel shows $V^*$ and $v^{10}$ on $[-10, 10]$ for both initial guesses, $v^0(x) = \sin(x)$ and $v^0(x) = \frac{1}{1+x^2}$. The right panel displays the graph of $\pi^*(x, u)$ on $[-10, 10] \times [0.1, 0.9]$.
Original abstract

For a general entropy-regularized stochastic control problem on an infinite horizon, we prove that a policy iteration algorithm (PIA) converges to an optimal relaxed control. Contrary to the standard stochastic control literature, classical Hölder estimates of value functions do not ensure the convergence of the PIA, due to the added entropy-regularizing term. To circumvent this, we carry out a delicate estimation by moving back and forth between appropriate Hölder and Sobolev spaces. This requires new Sobolev estimates designed specifically for the purpose of policy iteration and a nontrivial technique to contain the entropy growth. Ultimately, we obtain a uniform Hölder bound for the sequence of value functions generated by the PIA, thereby achieving the desired convergence result. Characterization of the optimal value function as the unique solution to an exploratory Hamilton-Jacobi-Bellman equation comes as a by-product. The PIA is numerically implemented in an example of optimal consumption.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proves convergence of a policy iteration algorithm (PIA) to an optimal relaxed control for general entropy-regularized infinite-horizon stochastic control problems. Classical Hölder estimates on value functions fail due to the entropy term, so the authors derive new Sobolev estimates tailored to the PIA sequence together with an entropy-growth containment argument; these yield a uniform Hölder bound that permits extraction of a convergent subsequence whose limit is identified as optimal. As a byproduct the optimal value function is characterized as the unique solution of an exploratory Hamilton-Jacobi-Bellman equation. A numerical illustration is given for an optimal consumption problem.

Significance. The result supplies a rigorous convergence theory for policy iteration under entropy regularization, a setting that appears in robust control and reinforcement learning. The construction of Sobolev estimates specifically adapted to the policy-iteration iterates, together with the entropy-control technique that restores uniform Hölder regularity, constitutes a technical contribution that may be reusable in other regularized control problems where classical Hölder estimates are insufficient. The argument is a direct analytic proof with no free parameters, no circular definitions, and no fitted quantities.

minor comments (3)
  1. [Introduction] The introduction should list the precise standing assumptions on the drift, diffusion, running cost, and entropy parameter (including any growth or boundedness conditions) before the statement of the main theorem, so that the uniformity of the Hölder bound is immediately traceable to those hypotheses.
  2. [Exploratory HJB section] In the statement of the exploratory HJB equation, clarify whether the entropy term appears inside or outside the supremum and whether the equation is understood in the classical or viscosity sense; this affects the uniqueness claim.
  3. [Numerical section] The numerical example would benefit from a brief description of the discretization scheme used for the PIA and from reporting the observed convergence rate or residual norm, even if only qualitatively.
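
One simple way to report an observed convergence rate, as the third comment requests, is to fit a line to the logarithm of the sup-norm errors, which is what the right panels of Figures 1 to 4 plot. The sketch below assumes geometric decay $e_n \approx C q^n$ and estimates $q$ by least squares; the helper name `empirical_linear_rate` is hypothetical and the error values are synthetic, made up for illustration, not taken from the paper.

```python
import numpy as np

def empirical_linear_rate(errors):
    """Estimate q in e_n ~ C * q**n from a sequence of positive errors.

    Fits ln(e_n) = ln(C) + n * ln(q) by least squares over n = 1, ..., N
    and returns the fitted q. Values of q below 1 indicate geometric decay.
    """
    e = np.asarray(errors, dtype=float)
    n = np.arange(1, len(e) + 1)
    slope, _intercept = np.polyfit(n, np.log(e), 1)
    return float(np.exp(slope))

if __name__ == "__main__":
    # Synthetic errors with exact geometric decay, for illustration only.
    synthetic = [2.0 * 0.5 ** n for n in range(1, 11)]
    print(empirical_linear_rate(synthetic))
```

If the iteration stops at the solver tolerance, as the figure captions indicate, the final plateau should be excluded from the fit, since it reflects discretization error rather than the behavior of the iteration.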

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their careful reading and positive recommendation to accept the manuscript. The report accurately captures the main contributions, including the novel Sobolev estimates adapted to the policy-iteration sequence and the entropy-growth control argument that restores uniform Hölder regularity.

Circularity Check

0 steps flagged

No significant circularity; the derivation is a self-contained analytic proof.

full rationale

The paper establishes convergence of policy iteration via new Sobolev estimates and an entropy-growth containment argument that produce a uniform Hölder bound on value functions. These estimates are derived directly from the problem coefficients and entropy parameter under the standing assumptions; the limit identification and exploratory HJB characterization follow from the extracted convergent subsequence. No step reduces a claimed result to a quantity defined by the result itself, no fitted parameters are relabeled as predictions, and no load-bearing uniqueness is imported via self-citation. The argument is therefore independent of its own output.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on standard background assumptions of stochastic control (Lipschitz or growth conditions on coefficients, existence of relaxed controls) that are typical but not enumerated in the abstract; no free parameters or invented entities are introduced. The new Sobolev estimates are analytic tools rather than additional axioms.

axioms (1)
  • domain assumption: Standard technical assumptions on the controlled diffusion and running cost that guarantee well-posedness of the entropy-regularized problem (e.g., Lipschitz continuity, linear growth).
    These are invoked implicitly to make the exploratory HJB equation and the policy iteration well-defined; they are standard in the field but not listed explicitly in the abstract.

pith-pipeline@v0.9.0 · 5696 in / 1450 out tokens · 26023 ms · 2026-05-24T11:23:08.996749+00:00 · methodology

