arxiv: 1907.04214 · v2 · pith:YF3PJTJBnew · submitted 2019-07-06 · 💻 cs.LG · stat.ML

Entropic Regularization of Markov Decision Processes

Pith reviewed 2026-05-25 01:32 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords f-divergences, alpha-divergences, entropic regularization, actor-critic methods, policy optimization, Markov decision processes, reinforcement learning, Pearson chi-squared divergence

The pith

f-divergences generalize KL regularization to produce a family of actor-critic methods with closed-form policy updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that replacing the Kullback-Leibler divergence with a broader class of f-divergences for entropic regularization in Markov decision processes preserves the closed-form policy improvement step while supplying matching dual objectives for policy evaluation. This creates a unified account in which the standard combination of least-squares value function estimation and advantage-weighted maximum likelihood policy improvement corresponds exactly to the Pearson χ²-divergence penalty. A sympathetic reader would care because the same construction generates other actor-critic pairs simply by selecting different penalty-generating functions f, and the authors illustrate the effects through asymptotic analysis of the α-divergence family on common reinforcement learning problems.

Core claim

Entropic proximal policy optimization with f-divergences yields a unified perspective on compatible actor-critic architectures. In particular, least-squares value function estimation coupled with advantage-weighted maximum likelihood policy improvement corresponds to the Pearson χ²-divergence penalty, while other pairs arise for various choices of the penalty-generating function f. Asymptotic analysis of solutions for different values of α in the α-divergence family demonstrates the effects of the divergence choice on standard reinforcement learning problems.

What carries the argument

The penalty-generating function f of an f-divergence, which determines both the closed-form policy improvement step and the corresponding dual objective for policy evaluation.
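To make the role of the generator concrete, here is a minimal NumPy sketch of one common parameterization of the α-divergence generator, assumed here to be f_α(u) = [(u^α − 1) − α(u − 1)] / (α(α − 1)), with the KL and reverse-KL generators as the α → 1 and α → 0 limits. The function names `f_alpha` and `f_divergence` are illustrative and not taken from the paper. Under this assumption, α = 2 gives f(u) = (u − 1)²/2, the Pearson χ² generator.

```python
import numpy as np

def f_alpha(u, alpha):
    """Assumed alpha-divergence generator; alpha = 1 and alpha = 0 use the limits."""
    u = np.asarray(u, dtype=float)
    if np.isclose(alpha, 1.0):
        return u * np.log(u) - (u - 1.0)      # KL generator (limit alpha -> 1)
    if np.isclose(alpha, 0.0):
        return -np.log(u) + (u - 1.0)         # reverse-KL generator (limit alpha -> 0)
    return ((u**alpha - 1.0) - alpha * (u - 1.0)) / (alpha * (alpha - 1.0))

def f_divergence(p, q, alpha):
    """D_f(p || q) = sum_a q(a) f(p(a) / q(a)) for strictly positive p and q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(q * f_alpha(p / q, alpha)))

# Two hypothetical discrete policies over three actions.
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])

pearson = f_divergence(p, q, alpha=2.0)   # equals 0.5 * sum((p - q)**2 / q)
kl = f_divergence(p, q, alpha=1.0)        # equals sum(p * log(p / q))
```

Because both arguments are normalized, the linear term α(u − 1) contributes nothing to the divergence; it only fixes f(1) = 0 and f′(1) = 0.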

If this is right

  • Least-squares value estimation with advantage-weighted maximum likelihood policy improvement is exactly the actor-critic pair induced by the Pearson χ²-divergence penalty.
  • Different choices of the function f generate different compatible actor-critic method pairs.
  • The α-divergence family supplies a parameterized set of regularizers that all admit closed-form updates and dual evaluation objectives.
  • The divergence choice alters the asymptotic character of the learned solutions on standard reinforcement learning tasks.
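The closed-form update behind these points can be sketched numerically. The code below is an illustration under stated assumptions, not the authors' implementation: it uses the generator f_α(u) = [(u^α − 1) − α(u − 1)] / (α(α − 1)), for which the stationarity condition gives π(a) = π_old(a) · [1 + (α − 1)(A(a) − λ)/η]^{1/(α−1)}, clips the bracket at zero for α > 1, and finds the normalizer λ by bisection. The name `alpha_policy_update` and the bisection bounds are choices made for this sketch.

```python
import numpy as np

def alpha_policy_update(q, adv, alpha, eta=1.0, iters=200):
    """One regularized policy-improvement step for a single state (sketch)."""
    q = np.asarray(q, dtype=float)
    adv = np.asarray(adv, dtype=float)
    if np.isclose(alpha, 1.0):
        # KL limit: softmax reweighting of the old policy.
        w = q * np.exp((adv - adv.max()) / eta)
        return w / w.sum()
    power = 1.0 / (alpha - 1.0)

    def unnormalized(lam):
        base = 1.0 + (alpha - 1.0) * (adv - lam) / eta
        if alpha > 1.0:
            # Actions whose bracket would be negative get zero probability.
            return q * np.maximum(base, 0.0) ** power
        # For alpha < 1 the bracket is positive inside the search interval.
        return q * base ** power

    # The total mass decreases monotonically in lam; bracket the root.
    if alpha > 1.0:
        lo, hi = adv.min(), adv.max() + eta / (alpha - 1.0)
    else:
        lo, hi = adv.max() - eta / (1.0 - alpha), adv.max()
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if unnormalized(mid).sum() > 1.0:
            lo = mid
        else:
            hi = mid
    pi = unnormalized(0.5 * (lo + hi))
    return pi / pi.sum()

# Bandit setting loosely modeled on Figure 1: 10 arms, Q(a) ~ N(0, 1), eta = 2.
rng = np.random.default_rng(0)
arm_values = rng.normal(size=10)
uniform = np.full(10, 0.1)
for alpha in (-10.0, 0.0, 1.0, 2.0, 10.0):
    pi = alpha_policy_update(uniform, arm_values, alpha, eta=2.0)
    print(alpha, np.round(pi, 3))
```

In the bandit case the arm values stand in for the advantage, since any constant baseline is absorbed into λ. At α = 2 and without clipping, the update is linear in the advantage, which is the χ² case singled out in the first bullet.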

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Selecting f-divergences other than KL could allow explicit control over exploration-exploitation balance or robustness properties in the learned policy.
  • The framework offers a diagnostic lens for existing actor-critic instabilities by identifying the implicit divergence each method employs.
  • The same derivation pattern could be applied to derive regularized methods for partially observable or continuous-time MDPs.

Load-bearing premise

That the broader family of f-divergences, including α-divergences, admits closed-form policy improvement steps together with corresponding dual objectives for policy evaluation, extending the KL case without new instabilities.

What would settle it

A derivation for some α value in which the policy update lacks a closed-form expression, or an experiment in which α-divergence regularization produces divergence or instability where KL regularization remains stable.

Figures

Figures reproduced from arXiv: 1907.04214 by Boris Belousov, Jan Peters.

Figure 1. Effects of the α-divergence choice on policy updates. We consider a 10-armed bandit problem with arm values Q(a) ∼ N(0, 1) and keep the temperature fixed at η = 2 for all values of α. Several iterations starting from an initial uniform policy are shown in the figure for comparison. Extremely large positive and negative values of α result in ε-elimination and ε-greedy policies, respectively. Smal…
Figure 2. Average regret for various values of α.
Figure 3. Average regret after a given number of time steps as a function of the divergence type α. As can be seen from the figure, smaller values of α result in lower regret. Large negative α’s correspond to ε-greedy policies, which oftentimes prematurely converge to a sub-optimal action, failing to discover the optimal action for a long time if the exploration probability ε is small. Large positive α’s c…
Figure 4. Effects of α-divergence on policy iteration. Each row corresponds to a given environment. Results for different values of α are split into three subplots within each row, from the more extreme α’s on the left to the more refined values on the right. In all cases, more negative values α < 0 initially show faster improvement because they immediately jump to the mode and keep the exploration level low; howeve…
Original abstract

An optimal feedback controller for a given Markov decision process (MDP) can in principle be synthesized by value or policy iteration. However, if the system dynamics and the reward function are unknown, a learning agent must discover an optimal controller via direct interaction with the environment. Such interactive data gathering commonly leads to divergence towards dangerous or uninformative regions of the state space unless additional regularization measures are taken. Prior works proposed bounding the information loss measured by the Kullback-Leibler (KL) divergence at every policy improvement step to eliminate instability in the learning dynamics. In this paper, we consider a broader family of $f$-divergences, and more concretely $\alpha$-divergences, which inherit the beneficial property of providing the policy improvement step in closed form at the same time yielding a corresponding dual objective for policy evaluation. Such entropic proximal policy optimization view gives a unified perspective on compatible actor-critic architectures. In particular, common least-squares value function estimation coupled with advantage-weighted maximum likelihood policy improvement is shown to correspond to the Pearson $\chi^2$-divergence penalty. Other actor-critic pairs arise for various choices of the penalty-generating function $f$. On a concrete instantiation of our framework with the $\alpha$-divergence, we carry out asymptotic analysis of the solutions for different values of $\alpha$ and demonstrate the effects of the divergence function choice on common standard reinforcement learning problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper extends entropic regularization of MDPs from the KL divergence to the broader family of f-divergences (with emphasis on α-divergences). It claims that this family inherits closed-form policy improvement steps together with tractable dual objectives for policy evaluation. Specific actor-critic correspondences are derived, including that the Pearson χ²-divergence penalty recovers least-squares value-function estimation paired with advantage-weighted maximum-likelihood policy improvement. Asymptotic analysis of the solutions for varying α and empirical results on standard RL benchmarks are also presented.

Significance. If the claimed correspondences and closed-form derivations hold without gaps, the work supplies a unified perspective on compatible actor-critic architectures via choice of the penalty-generating function f. The explicit reduction of common least-squares + advantage-weighted MLE to the χ² case and the asymptotic analysis for α-divergences constitute concrete strengths that could inform the design of new regularized RL methods.

major comments (2)
  1. [§3.2, Eq. (9)] The stationarity condition obtained from argmax_π [E_π[A] − D_f(π‖π_old)] is asserted to remain solvable in closed form for general f (including the α-family). For the Pearson χ² case f(u)=(u−1)²/2 this reduces to a linear relation between π and the advantage, but the manuscript must explicitly exhibit the algebraic solution for the α-divergence family and confirm that no non-convexity or measure-theoretic restrictions appear that are absent in the KL case. The inheritance claim is load-bearing for the unification, so it cannot be left implicit.
  2. [§4.1] In the derivation linking least-squares value estimation plus advantage-weighted MLE to the χ² penalty, the dual objective obtained after substituting the closed-form policy must be shown to be identical (not merely analogous) to the ordinary least-squares Bellman residual. If the equivalence holds only after post-hoc reparameterization, the claimed correspondence is weaker than stated, which affects the central actor-critic unification.
minor comments (2)
  1. [§2] Notation for the f-divergence and its conjugate should be introduced once in §2 and used consistently thereafter; several later equations reuse D_f without re-stating the generating function.
  2. [Figure 3] The plot of average regret as a function of α lacks error bars or confidence intervals; adding them would clarify whether observed differences across α are statistically meaningful.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight opportunities to strengthen the explicitness of our derivations. We address each point below and will revise the manuscript accordingly where the presentation can be improved without altering the core claims.

Point-by-point responses
  1. Referee: [§3.2, Eq. (9)] The stationarity condition obtained from argmax_π [E_π[A] − D_f(π‖π_old)] is asserted to remain solvable in closed form for general f (including the α-family). For the Pearson χ² case f(u)=(u−1)²/2 this reduces to a linear relation between π and the advantage, but the manuscript must explicitly exhibit the algebraic solution for the α-divergence family and confirm that no non-convexity or measure-theoretic restrictions appear that are absent in the KL case. The inheritance claim is load-bearing for the unification, so it cannot be left implicit.

    Authors: We agree that an explicit algebraic derivation for the α-family improves clarity. For the generator f_α(u) = [(u^α − 1) − α(u − 1)] / (α(α−1)), setting the functional derivative of the regularized objective to zero gives A(s,a) − f_α′(π/π_old) − λ = 0, and solving for π yields the closed-form policy π*(a|s) = π_old(a|s) ⋅ [1 + (α−1)(A(s,a) − λ)]^{1/(α−1)}, where λ is the Lagrange multiplier that enforces normalization. This recovers the softmax π_old(a|s) exp(A(s,a) − λ) in the KL limit (α→1) and the linear χ² relation at α = 2. Because f_α is strictly convex for u > 0, the objective remains strictly concave in π, introducing no extra non-convexity or measure-theoretic issues beyond the KL case. We will insert this explicit solution and the accompanying concavity argument into the revised §3.2. revision: yes

  2. Referee: [§4.1] In the derivation linking least-squares value estimation plus advantage-weighted MLE to the χ² penalty, the dual objective obtained after substituting the closed-form policy must be shown to be identical (not merely analogous) to the ordinary least-squares Bellman residual. If the equivalence holds only after post-hoc reparameterization, the claimed correspondence is weaker than stated, which affects the central actor-critic unification.

    Authors: After substituting the closed-form χ² policy into the dual objective, the resulting expression is algebraically identical to the standard least-squares Bellman residual (i.e., the squared TD error). The advantage-weighted maximum-likelihood policy update emerges directly from the same substitution without any auxiliary reparameterization. The claimed correspondence is therefore exact rather than merely analogous. To address the concern, we will expand the algebraic steps in §4.1 (and add an appendix if space is limited) to display the identity explicitly. revision: partial
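The stationarity argument at issue in the first point can be written out in a few lines. The following is a sketch under stated assumptions rather than the paper's own derivation: unit temperature (as in the referee's objective), an inactive nonnegativity constraint on π, and the generator f_α(u) = [(u^α − 1) − α(u − 1)] / (α(α − 1)). The symbol λ(s) denotes the per-state normalization multiplier.

```latex
% Per-state problem (assumption: unit temperature, pi > 0 at the optimum)
\begin{align*}
&\max_{\pi(\cdot\mid s)} \;
  \sum_a \pi(a\mid s)\,A(s,a)
  - \sum_a \pi_{\mathrm{old}}(a\mid s)\,
    f\!\left(\frac{\pi(a\mid s)}{\pi_{\mathrm{old}}(a\mid s)}\right)
  \quad\text{s.t.}\quad \sum_a \pi(a\mid s) = 1, \\[4pt]
% Stationarity with multiplier lambda(s)
&A(s,a) - f'\!\left(\frac{\pi^*(a\mid s)}{\pi_{\mathrm{old}}(a\mid s)}\right) - \lambda(s) = 0
  \;\Longrightarrow\;
  \pi^*(a\mid s) = \pi_{\mathrm{old}}(a\mid s)\,(f')^{-1}\!\big(A(s,a) - \lambda(s)\big), \\[4pt]
% Alpha family
&f_\alpha'(u) = \frac{u^{\alpha-1} - 1}{\alpha - 1}
  \;\Longrightarrow\;
  \pi^*(a\mid s) = \pi_{\mathrm{old}}(a\mid s)\,
  \big[\,1 + (\alpha - 1)\big(A(s,a) - \lambda(s)\big)\big]^{\frac{1}{\alpha-1}}, \\[4pt]
% Special cases
&\alpha = 2:\quad
  \pi^*(a\mid s) = \pi_{\mathrm{old}}(a\mid s)\,\big(1 + A(s,a) - \lambda(s)\big),
  \qquad \lambda(s) = \mathbb{E}_{\pi_{\mathrm{old}}}\!\big[A(s,\cdot)\big], \\
&\alpha \to 1:\quad
  \pi^*(a\mid s) = \pi_{\mathrm{old}}(a\mid s)\,\exp\!\big(A(s,a) - \lambda(s)\big).
\end{align*}
```

The α = 2 line is the linear relation between π and the advantage that the referee mentions for the Pearson χ² case. For α > 1 the bracket can become negative for low-advantage actions, in which case the nonnegativity constraint becomes active and the sketch above no longer applies as written.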

Circularity Check

0 steps flagged

No significant circularity; generalization from KL to f-divergences is derived independently

full rationale

The paper's central derivation extends the closed-form policy improvement and dual evaluation objective from the KL case to general f-divergences (including α-divergences) by direct mathematical construction of the stationarity conditions and duals for arbitrary penalty-generating functions f. The specific mapping of Pearson χ² to least-squares value estimation plus advantage-weighted MLE is exhibited as one concrete instance of this derivation rather than a fitted input renamed as output or a self-citation load-bearing premise. No self-definitional loops, uniqueness theorems imported from the authors' prior work, or ansatzes smuggled via citation appear in the abstract or claimed chain; the framework remains self-contained against external benchmarks with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard MDP interaction assumptions and the mathematical closure properties of f-divergences; no new entities are postulated and no parameters are fitted to data.

axioms (2)
  • domain assumption When dynamics and rewards are unknown, interactive data gathering requires regularization to avoid divergence to dangerous regions.
    Stated as the motivating setting for the regularization approach.
  • domain assumption f-divergences admit closed-form policy improvement and dual policy-evaluation objectives.
    Invoked as the inherited beneficial property that enables the framework.
