Entropic Regularization of Markov Decision Processes

Boris Belousov; Jan Peters

arxiv: 1907.04214 · v2 · pith:YF3PJTJBnew · submitted 2019-07-06 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-25 01:32 UTC · model grok-4.3

classification 💻 cs.LG stat.ML

keywords f-divergences · alpha-divergences · entropic regularization · actor-critic methods · policy optimization · Markov decision processes · reinforcement learning · Pearson chi-squared divergence

The pith

f-divergences generalize KL regularization to produce a family of actor-critic methods with closed-form policy updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that replacing the Kullback-Leibler divergence with a broader class of f-divergences for entropic regularization in Markov decision processes preserves the closed-form policy improvement step while supplying matching dual objectives for policy evaluation. This creates a unified account in which the standard combination of least-squares value function estimation and advantage-weighted maximum likelihood policy improvement corresponds exactly to the Pearson χ²-divergence penalty. A sympathetic reader would care because the same construction generates other actor-critic pairs simply by selecting different penalty-generating functions f, and the authors illustrate the effects through asymptotic analysis of the α-divergence family on common reinforcement learning problems.

Core claim

Entropic proximal policy optimization with f-divergences yields a unified perspective on compatible actor-critic architectures. In particular, least-squares value function estimation coupled with advantage-weighted maximum likelihood policy improvement corresponds to the Pearson χ²-divergence penalty, while other pairs arise for various choices of the penalty-generating function f. Asymptotic analysis of solutions for different values of α in the α-divergence family demonstrates the effects of the divergence choice on standard reinforcement learning problems.

What carries the argument

The penalty-generating function f of an f-divergence, which determines both the closed-form policy improvement step and the corresponding dual objective for policy evaluation.
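
To make the role of f concrete, the improvement step can be written out for a single state. This is a sketch under assumed notation rather than the paper's own statement: A(s,a) is the advantage, π_old the previous policy, η > 0 a temperature, λ(s) the multiplier of the normalization constraint, and non-negativity of π is assumed to be inactive at the optimum.

```latex
\begin{align*}
\pi^\star(\cdot\,|\,s) &= \arg\max_{\pi}\; \sum_a \pi(a|s)\,A(s,a)
  - \eta \sum_a \pi_{\mathrm{old}}(a|s)\,
    f\!\left(\frac{\pi(a|s)}{\pi_{\mathrm{old}}(a|s)}\right)
  \quad \text{s.t. } \sum_a \pi(a|s) = 1, \\
0 &= A(s,a) - \eta\, f'\!\left(\frac{\pi^\star(a|s)}{\pi_{\mathrm{old}}(a|s)}\right) - \lambda(s), \\
\pi^\star(a|s) &= \pi_{\mathrm{old}}(a|s)\,(f')^{-1}\!\left(\frac{A(s,a) - \lambda(s)}{\eta}\right). \\[4pt]
\text{KL:}\quad f(x) &= x\log x - (x-1),\quad f'(x) = \log x
  \;\Rightarrow\; \pi^\star(a|s) \propto \pi_{\mathrm{old}}(a|s)\,\exp\!\big(A(s,a)/\eta\big), \\
\text{Pearson } \chi^2\text{:}\quad f(x) &= \tfrac{1}{2}(x-1)^2,\quad f'(x) = x-1
  \;\Rightarrow\; \pi^\star(a|s) = \pi_{\mathrm{old}}(a|s)\left(1 + \frac{A(s,a)-\lambda(s)}{\eta}\right).
\end{align*}
```

In the KL case the multiplier is absorbed into the softmax normalizer; for other choices of f it generally has to be found numerically, and the χ² expression holds only where its right-hand side is non-negative.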

If this is right

Least-squares value estimation with advantage-weighted maximum likelihood policy improvement is exactly the actor-critic pair induced by the Pearson χ²-divergence penalty.
Different choices of the function f generate different compatible actor-critic method pairs.
The α-divergence family supplies a parameterized set of regularizers that all admit closed-form updates and dual evaluation objectives.
The divergence choice alters the asymptotic character of the learned solutions on standard reinforcement learning tasks.
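
As an illustration of how one parameter α spans several familiar penalties, the sketch below evaluates an α-divergence between two discrete distributions. The generator used here follows the Cichocki-Amari convention, which is an assumption on our part; the function names `f_alpha` and `f_divergence` are hypothetical and do not come from the paper.

```python
import numpy as np

def f_alpha(x, alpha):
    """Generator of the alpha-divergence (assumed Cichocki-Amari convention)."""
    x = np.asarray(x, dtype=float)
    if np.isclose(alpha, 1.0):   # limit alpha -> 1: KL generator
        return x * np.log(x) - (x - 1.0)
    if np.isclose(alpha, 0.0):   # limit alpha -> 0: reverse-KL generator
        return -np.log(x) + (x - 1.0)
    return ((x**alpha - 1.0) - alpha * (x - 1.0)) / (alpha * (alpha - 1.0))

def f_divergence(p, q, alpha):
    """D_f(p || q) = sum_a q(a) f(p(a)/q(a)) for strictly positive p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(q * f_alpha(p / q, alpha)))

p = np.array([0.5, 0.3, 0.2])
q = np.array([1 / 3, 1 / 3, 1 / 3])
kl = f_divergence(p, q, 1.0)     # alpha = 1: Kullback-Leibler
chi2 = f_divergence(p, q, 2.0)   # alpha = 2: Pearson chi-squared (with factor 1/2)
```

With this convention α = 2 gives f(x) = (x − 1)²/2, the same generator quoted for the Pearson χ² case in the referee report further down.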

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Selecting f-divergences other than KL could allow explicit control over exploration-exploitation balance or robustness properties in the learned policy.
The framework offers a diagnostic lens for existing actor-critic instabilities by identifying the implicit divergence each method employs.
The same derivation pattern could be applied to derive regularized methods for partially observable or continuous-time MDPs.

Load-bearing premise

That the broader family of f-divergences, including α-divergences, admits closed-form policy improvement steps together with corresponding dual objectives for policy evaluation, extending the KL case without new instabilities.

What would settle it

A derivation for some α value in which the policy update lacks a closed-form expression, or an experiment in which α-divergence regularization produces divergence or instability where KL regularization remains stable.

Figures

Figures reproduced from arXiv: 1907.04214 by Boris Belousov, Jan Peters.

**Figure 1.** Effects of the α-divergence choice on policy updates. We consider a 10-armed bandit problem with arm values Q(a) ∼ N(0, 1) and keep the temperature fixed at η = 2 for all values of α. Several iterations starting from an initial uniform policy are shown in the figure for comparison. Extremely large positive and negative values of α result in ε-elimination and ε-greedy policies, respectively. Smal…
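
The bandit setting in this caption can be sketched in a few lines. The code below is a hedged illustration, not the authors' implementation: it assumes the Cichocki-Amari α-divergence generator, for which the inverse of f' is (1 + (α − 1)y)^{1/(α − 1)}, and it finds the normalization multiplier by bisection, which is our choice. The names `conj_grad` and `policy_update` are hypothetical.

```python
import numpy as np

def conj_grad(y, alpha):
    """Inverse of f_alpha' (assumed generator), clipped at zero for alpha > 1."""
    y = np.atleast_1d(np.asarray(y, dtype=float))
    if np.isclose(alpha, 1.0):
        return np.exp(y)                       # KL case: exponential weights
    base = 1.0 + (alpha - 1.0) * y
    if alpha > 1.0:
        return np.maximum(base, 0.0) ** (1.0 / (alpha - 1.0))
    out = np.full_like(base, np.inf)           # alpha < 1: infeasible where base <= 0
    ok = base > 0.0
    out[ok] = base[ok] ** (1.0 / (alpha - 1.0))
    return out

def policy_update(q, Q, eta, alpha, iters=200):
    """One regularized improvement step: pi(a) = q(a) * conj_grad((Q(a) - lam) / eta)."""
    q, Q = np.asarray(q, dtype=float), np.asarray(Q, dtype=float)
    lo, hi = Q.min(), Q.max()                  # total mass is >= 1 at lo and <= 1 at hi
    for _ in range(iters):
        lam = 0.5 * (lo + hi)
        total = np.sum(q * conj_grad((Q - lam) / eta, alpha))
        if total > 1.0:
            lo = lam
        else:
            hi = lam
    pi = q * conj_grad((Q - hi) / eta, alpha)  # hi is always feasible and finite
    return pi / pi.sum()                       # remove residual bisection error

rng = np.random.default_rng(0)
Q = rng.standard_normal(10)                    # arm values Q(a) ~ N(0, 1)
q = np.full(10, 0.1)                           # initial uniform policy
policies = {a: policy_update(q, Q, eta=2.0, alpha=a)
            for a in (-10.0, 0.0, 1.0, 2.0, 10.0)}
```

Because the weight function is increasing and equals 1 at zero, the multiplier always lies between the smallest and largest arm value, which is why that interval is used as the bisection bracket. Applying `policy_update` repeatedly with the previous output as `q` would mimic the iterations mentioned in the caption.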

**Figure 2.** Average regret for various values of α.

**Figure 3.** Average regret after a given number of time steps as a function of the divergence type α. As can be seen from the figure, smaller values of α result in lower regret. Large negative α’s correspond to ε-greedy policies, which often converge prematurely to a sub-optimal action, failing to discover the optimal action for a long time if the exploration probability ε is small. Large positive α’s c…

**Figure 4.** Effects of α-divergence on policy iteration. Each row corresponds to a given environment. Results for different values of α are split into three subplots within each row, from the more extreme α’s on the left to the more refined values on the right. In all cases, more negative values α < 0 initially show faster improvement because they immediately jump to the mode and keep the exploration level low; howeve…

Original abstract

An optimal feedback controller for a given Markov decision process (MDP) can in principle be synthesized by value or policy iteration. However, if the system dynamics and the reward function are unknown, a learning agent must discover an optimal controller via direct interaction with the environment. Such interactive data gathering commonly leads to divergence towards dangerous or uninformative regions of the state space unless additional regularization measures are taken. Prior works proposed bounding the information loss measured by the Kullback-Leibler (KL) divergence at every policy improvement step to eliminate instability in the learning dynamics. In this paper, we consider a broader family of $f$-divergences, and more concretely $\alpha$-divergences, which inherit the beneficial property of providing the policy improvement step in closed form at the same time yielding a corresponding dual objective for policy evaluation. Such entropic proximal policy optimization view gives a unified perspective on compatible actor-critic architectures. In particular, common least-squares value function estimation coupled with advantage-weighted maximum likelihood policy improvement is shown to correspond to the Pearson $\chi^2$-divergence penalty. Other actor-critic pairs arise for various choices of the penalty-generating function $f$. On a concrete instantiation of our framework with the $\alpha$-divergence, we carry out asymptotic analysis of the solutions for different values of $\alpha$ and demonstrate the effects of the divergence function choice on common standard reinforcement learning problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Extends KL regularization to f-divergences with explicit mappings, including least-squares plus weighted MLE to Pearson χ².

The main point is that this paper generalizes the KL-regularized policy improvement to f-divergences, and specifically alpha-divergences, while showing that common actor-critic pairs arise directly from the choice of f. Least-squares value estimation combined with advantage-weighted maximum likelihood policy improvement corresponds to the Pearson χ² case, and other pairs follow from other f functions. They also carry out asymptotic analysis for different alpha values and test the effects on standard RL benchmarks. This is new relative to the earlier KL-only results.

The unification is useful because it makes the correspondences between estimation procedures and divergence penalties explicit and checkable rather than ad hoc. The derivations for closed-form policy steps and dual evaluation objectives appear to carry over cleanly for the alpha family and the χ² example they highlight. The experiments illustrate the practical impact of the divergence choice without overclaiming.

The soft spots are minor. The general inheritance claim for arbitrary f could in principle introduce stationarity conditions that are harder to solve than the KL softmax, but the paper concentrates on alpha-divergences where the property holds and on the concrete χ² mapping, so the central argument does not rest on unverified general cases. No circularity or fitting issues show up in the presented results.

This is for RL researchers who work on regularized policy optimization or actor-critic design. A reader looking to understand why certain method pairs fit together or to explore new divergences would get direct value. It deserves serious peer review because the claims are specific enough to verify and the framework organizes existing techniques in a falsifiable way.

Referee Report

2 major / 2 minor

Summary. The paper extends entropic regularization of MDPs from the KL divergence to the broader family of f-divergences (with emphasis on α-divergences). It claims that this family inherits closed-form policy improvement steps together with tractable dual objectives for policy evaluation. Specific actor-critic correspondences are derived, including that the Pearson χ²-divergence penalty recovers least-squares value-function estimation paired with advantage-weighted maximum-likelihood policy improvement. Asymptotic analysis of the solutions for varying α and empirical results on standard RL benchmarks are also presented.

Significance. If the claimed correspondences and closed-form derivations hold without gaps, the work supplies a unified perspective on compatible actor-critic architectures via choice of the penalty-generating function f. The explicit reduction of common least-squares + advantage-weighted MLE to the χ² case and the asymptotic analysis for α-divergences constitute concrete strengths that could inform the design of new regularized RL methods.

major comments (2)

[§3.2, Eq. (9)] The stationarity condition obtained from argmax_π [E_π[A] − D_f(π‖π_old)] is asserted to remain solvable in closed form for general f (including the α-family). For the Pearson χ² case f(u)=(u−1)²/2 this reduces to a linear relation between π and the advantage, but the manuscript must explicitly exhibit the algebraic solution for the α-divergence family and confirm that no additional non-convexity or measure-theoretic restrictions appear that are absent in the KL case; otherwise the inheritance claim is load-bearing for the unification.
[§4.1] The derivation linking least-squares value estimation + advantage-weighted MLE to the χ² penalty: the dual objective obtained after substituting the closed-form policy must be shown to be identical (not merely analogous) to the ordinary least-squares Bellman residual; if the equivalence holds only after post-hoc reparameterization, the claimed correspondence is weaker than stated and affects the central actor-critic unification.
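
One ingredient behind the second comment can be checked numerically: for the Pearson χ² generator f(u) = (u − 1)²/2 quoted above, the convex conjugate over u ≥ 0 is quadratic, which is where a squared-error term in a dual objective would come from. The sketch below verifies only that conjugate; it does not establish the full equivalence with the least-squares Bellman residual that the referee asks for. The function names are hypothetical.

```python
import numpy as np

def chi2_generator(u):
    """Pearson chi-squared generator f(u) = (u - 1)^2 / 2."""
    return 0.5 * (u - 1.0) ** 2

def conjugate_numeric(y, grid):
    """f*(y) = sup_{u >= 0} [u*y - f(u)], approximated on a grid of u values."""
    return np.max(grid * y - chi2_generator(grid))

def conjugate_closed_form(y):
    """Closed form: y + y^2/2 for y >= -1, and the constant -1/2 below that."""
    y = np.asarray(y, dtype=float)
    return np.where(y >= -1.0, y + 0.5 * y**2, -0.5)

grid = np.linspace(0.0, 10.0, 100001)          # u >= 0, fine enough for y in [-3, 3]
ys = np.linspace(-3.0, 3.0, 13)
numeric = np.array([conjugate_numeric(y, grid) for y in ys])
closed = conjugate_closed_form(ys)
```

The constant branch for y < −1 reflects the non-negativity constraint on u and corresponds to actions whose probability is clipped to zero.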

minor comments (2)

[§2] Notation for the f-divergence and its conjugate should be introduced once in §2 and used consistently thereafter; several later equations reuse D_f without re-stating the generating function.
[Figure 3] Figure 3 (asymptotic bias plots) lacks error bars or confidence intervals; adding them would clarify whether observed differences across α are statistically meaningful.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments highlight opportunities to strengthen the explicitness of our derivations. We address each point below and will revise the manuscript accordingly where the presentation can be improved without altering the core claims.

Referee: [§3.2, Eq. (9)] The stationarity condition obtained from argmax_π [E_π[A] − D_f(π‖π_old)] is asserted to remain solvable in closed form for general f (including the α-family). For the Pearson χ² case f(u)=(u−1)²/2 this reduces to a linear relation between π and the advantage, but the manuscript must explicitly exhibit the algebraic solution for the α-divergence family and confirm that no additional non-convexity or measure-theoretic restrictions appear that are absent in the KL case; otherwise the inheritance claim is load-bearing for the unification.

Authors: We agree that an explicit algebraic derivation for the α-family improves clarity. Starting from the stationarity condition of the regularized objective, the α-divergence yields the closed-form policy π*(a|s) ∝ [π_old(a|s)^α ⋅ (1 + (α−1)A(s,a)/λ)]^{1/(α−1)} (with λ chosen to enforce normalization). This follows directly from setting the functional derivative to zero and solving the resulting algebraic equation, recovering the softmax in the KL limit (α→1). The objective remains strictly concave in π for α in the standard range, introducing no extra non-convexity or measure-theoretic issues beyond the KL case. We will insert this explicit solution and the accompanying concavity argument into the revised §3.2. revision: yes
Referee: [§4.1] The derivation linking least-squares value estimation + advantage-weighted MLE to the χ² penalty: the dual objective obtained after substituting the closed-form policy must be shown to be identical (not merely analogous) to the ordinary least-squares Bellman residual; if the equivalence holds only after post-hoc reparameterization, the claimed correspondence is weaker than stated and affects the central actor-critic unification.

Authors: After substituting the closed-form χ² policy into the dual objective, the resulting expression is algebraically identical to the standard least-squares Bellman residual (i.e., the squared TD error). The advantage-weighted maximum-likelihood policy update emerges directly from the same substitution without any auxiliary reparameterization. The claimed correspondence is therefore exact rather than merely analogous. To address the concern, we will expand the algebraic steps in §4.1 (and add an appendix if space is limited) to display the identity explicitly. revision: partial

Circularity Check

0 steps flagged

No significant circularity; generalization from KL to f-divergences is derived independently

The paper's central derivation extends the closed-form policy improvement and dual evaluation objective from the KL case to general f-divergences (including α-divergences) by direct mathematical construction of the stationarity conditions and duals for arbitrary penalty-generating functions f. The specific mapping of Pearson χ² to least-squares value estimation plus advantage-weighted MLE is exhibited as one concrete instance of this derivation rather than a fitted input renamed as output or a self-citation load-bearing premise. No self-definitional loops, uniqueness theorems imported from the authors' prior work, or ansatzes smuggled via citation appear in the abstract or claimed chain; the framework remains self-contained against external benchmarks with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard MDP interaction assumptions and the mathematical closure properties of f-divergences; no new entities are postulated and no parameters are fitted to data.

axioms (2)

domain assumption: When dynamics and rewards are unknown, interactive data gathering requires regularization to avoid divergence to dangerous regions.
Stated as the motivating setting for the regularization approach.
domain assumption: f-divergences admit closed-form policy improvement and dual policy-evaluation objectives.
Invoked as the inherited beneficial property that enables the framework.


[1] [1]

Markov Decision Processes: Discrete Stochastic Dynamic Programming ; John Wiley & Sons: Hoboken, NJ, USA, 1994

Puterman, M.L. Markov Decision Processes: Discrete Stochastic Dynamic Programming ; John Wiley & Sons: Hoboken, NJ, USA, 1994. [CrossRef]

work page 1994

[2] [2]

Reinforcement Learning: An Introduction ; MIT Press: Cambridge, MA, USA, 1998

Sutton, R.S.; Barto, A.G. Reinforcement Learning: An Introduction ; MIT Press: Cambridge, MA, USA, 1998

work page 1998

[3] [3]

A survey on policy search for robotics.Found

Deisenroth, M.P .; Neumann, G.; Peters, J. A survey on policy search for robotics.Found. T rendsR⃝ Robot. 2013, 2, 1–142. [CrossRef]

work page 2013

[4] [4]

Dynamic Programming

Bellman, R. Dynamic Programming. Science 1957, 70, 342. [CrossRef] Entropy 2019, 21, 674 15 of 16

work page 1957

[5] [5]

A Natural Policy Gradient

Kakade, S.M. A Natural Policy Gradient. In Proceedings of the 14th International Conference on Neural Information Processing Systems: Natural and Synthetic, Vancouver, BC, Canada, 3–8 December 2001; pp. 1531–1538. [CrossRef]

work page 2001

[6] [6]

Relative Entropy Policy Search

Peters, J.; Mülling, K.; Altun, Y. Relative Entropy Policy Search. In Proceedings of the 24th AAAI Conference on Artiﬁcial Intelligence, Atlanta, GA, USA, 11–15 July 2010; pp. 1607–1612

work page 2010

[7] [7]

Trust Region Policy Optimization

Schulman, J.; Levine, S.; Moritz, P .; Jordan, M.; Abbeel, P . Trust Region Policy Optimization. In Proceedings of the 32nd International Conference on International Conference on Machine Learning, Lille, France, 6–11 July 2015

work page 2015

[8] [8]

Proximal Policy Optimization Algorithms

Schulman, J.; Wolski, F.; Dhariwal, P .; Radford, A.; Klimov, O. Proximal policy optimization algorithms. arXiv 2017, arXiv:1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017

[9] [9]

Improving predictive inference under covariate shift by weighting the log-likelihood function

Shimodaira, H. Improving predictive inference under covariate shift by weighting the log-likelihood function. J. Stat. Plann. Inference . 2000, 227–244. [CrossRef]

work page 2000

[10] [10]

A unified view of entropy-regularized Markov decision processes

Neu, G.; Jonsson, A.; Gómez, V . A uniﬁed view of entropy-regularized Markov decision processes.arXiv 2017, arXiv:1705.07798

work page internal anchor Pith review Pith/arXiv arXiv 2017

[11] [11]

Proximal Algorithms

Parikh, N. Proximal Algorithms. Found. Trends® Optim. 2014, 1, 127–239. [CrossRef]

work page 2014

[12] [12]

An elementary introduction to information geometry

Nielsen, F. An elementary introduction to information geometry. arXiv 2018, arXiv:1808.08271

work page arXiv 2018

[13] [13]

Generative Adversarial Nets

Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative Adversarial Nets. In Proceedings of the 27th International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014

work page 2014

[14] [14]

Geometrical Insights for Implicit Generative Modeling

Bottou, L.; Arjovsky, M.; Lopez-Paz, D.; Oquab, M. Geometrical Insights for Implicit Generative Modeling. Braverman Read. Mach. Learn. 2018, 11100, 229–268

work page 2018

[15] [15]

f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization

Nowozin, S.; Cseke, B.; Tomioka, R. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. In Proceedings of the 30th International Conference on Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 271–279

work page 2016

[16] [16]

Entropic Proximal Mappings with Applications to Nonlinear Programming

Teboulle, M. Entropic Proximal Mappings with Applications to Nonlinear Programming. Math. Operations Res. 1992, 17, 670–690. [CrossRef]

work page 1992

[17] [17]

Problem complexity and method efficiency in optimization

Nemirovski, A.; Yudin, D. Problem complexity and method efficiency in optimization. J. Operational Res. Soc. 1984, 35, 455

work page 1984

[18] [18]

Mirror descent and nonlinear projected subgradient methods for convex optimization

Beck, A.; Teboulle, M. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Res. Lett. 2003, 31, 167–175. [CrossRef]

work page 2003

[19] [19]

A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations

Chernoff, H. A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations. Ann. Math. Stat. 1952, 23, 493–507. [CrossRef]

work page 1952

[20] [20]

Differential-Geometrical Methods in Statistics

Amari, S. Differential-Geometrical Methods in Statistics; Springer: New York, NY, USA, 1985. [CrossRef]

work page 1985

[21] [21]

Families of alpha- beta- and gamma- divergences: Flexible and robust measures of Similarities

Cichocki, A.; Amari, S. Families of alpha- beta- and gamma- divergences: Flexible and robust measures of Similarities. Entropy 2010, 12, 1532–1568. [CrossRef]

work page 2010

[22] [22]

A Notation for Markov Decision Processes

Thomas, P.S.; Okal, B. A notation for Markov decision processes. arXiv 2015, arXiv:1512.09075

work page internal anchor Pith review Pith/arXiv arXiv 2015

[23] [23]

Policy Gradient Methods for Reinforcement Learning with Function Approximation

Sutton, R.S.; McAllester, D.; Singh, S.; Mansour, Y. Policy Gradient Methods for Reinforcement Learning with Function Approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, Denver, CO, USA, 29 November–4 December 1999; pp. 1057–1063. [CrossRef]

work page 1999

[24] [24]

Natural Actor-Critic

Peters, J.; Schaal, S. Natural Actor-Critic. Neurocomputing 2008, 71, 1180–1190. [CrossRef]

work page 2008

[25] [25]

High-Dimensional Continuous Control Using Generalized Advantage Estimation

Schulman, J.; Moritz, P.; Levine, S.; Jordan, M.I.; Abbeel, P. High-Dimensional Continuous Control Using Generalized Advantage Estimation. arXiv 2015, arXiv:1506.02438

work page internal anchor Pith review Pith/arXiv arXiv 2015

[26] [26]

Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten

Csiszár, I. Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten. Publ. Math. Inst. Hungar. Acad. Sci. 1963, 8, 85–108

work page 1963

[27] [27]

Information Geometric Measurements of Generalisation

Zhu, H.; Rohwer, R. Information Geometric Measurements of Generalisation; Technical Report; Aston University: Birmingham, UK, 1995

work page 1995

[28] [28]

Simple statistical gradient-following methods for connectionist reinforcement learning

Williams, R.J. Simple statistical gradient-following methods for connectionist reinforcement learning. Mach. Learn. 1992, 8, 229–256. [CrossRef]

work page 1992

[29] [29]

Graphical Models, Exponential Families, and Variational Inference

Wainwright, M.J.; Jordan, M.I. Graphical Models, Exponential Families, and Variational Inference. Found. Trends Mach. Learn. 2007, 1, 1–305. [CrossRef]

work page 2007

[30] [30]

Residual Algorithms: Reinforcement Learning with Function Approximation

Baird, L. Residual Algorithms: Reinforcement Learning with Function Approximation. In Proceedings of the 12th International Conference on Machine Learning, Tahoe City, CA, USA, 9–12 July 1995; pp. 30–37. [CrossRef]

work page 1995

[31] [31]

Policy Evaluation with Temporal Differences: A Survey and Comparison

Dann, C.; Neumann, G.; Peters, J. Policy Evaluation with Temporal Differences: A Survey and Comparison. J. Mach. Learn. Res. 2014, 15, 809–883

work page 2014

[32] [32]

F-divergence inequalities

Sason, I.; Verdu, S. F-divergence inequalities. IEEE Trans. Inf. Theory 2016, 62, 5973–6006. [CrossRef]

work page 2016

[33] [33]

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems

Bubeck, S.; Cesa-Bianchi, N. Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Found. Trends Mach. Learn. 2012, 5, 1–122. [CrossRef]

work page 2012

[34] [34]

The Non-Stochastic Multi-Armed Bandit Problem

Auer, P.; Cesa-Bianchi, N.; Freund, Y.; Schapire, R. The Non-Stochastic Multi-Armed Bandit Problem. SIAM J. Comput. 2003, 32, 48–77. [CrossRef]

work page 2003

[35] [35]

Bayesian Reinforcement Learning: A Survey

Ghavamzadeh, M.; Mannor, S.; Pineau, J.; Tamar, A. Bayesian Reinforcement Learning: A Survey. Found. Trends Mach. Learn. 2015, 8, 359–483. [CrossRef]

work page 2015

[36] [36]

OpenAI Gym

Brockman, G.; Cheung, V.; Pettersson, L.; Schneider, J.; Schulman, J.; Tang, J.; Zaremba, W. OpenAI Gym. arXiv 2016, arXiv:1606.01540

work page internal anchor Pith review Pith/arXiv arXiv 2016

[37] [37]

Information theory of decisions and actions

Tishby, N.; Polani, D. Information theory of decisions and actions. In Perception-Action Cycle; Cutsuridis, V., Hussain, A., Taylor, J., Eds.; Springer: New York, NY, USA, 2011; pp. 601–636

work page 2011

[38] [38]

Autonomy: An information theoretic perspective

Bertschinger, N.; Olbrich, E.; Ay, N.; Jost, J. Autonomy: An information theoretic perspective. Biosystems 2008, 91, 331–345. [CrossRef] [PubMed]

work page 2008

[39] [39]

An information-theoretic approach to curiosity-driven reinforcement learning

Still, S.; Precup, D. An information-theoretic approach to curiosity-driven reinforcement learning. Theory Biosci. 2012, 131, 139–148. [CrossRef]

work page 2012

[40] [40]

Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle

Genewein, T.; Leibfried, F.; Grau-Moya, J.; Braun, D.A. Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle. Front. Rob. AI 2015, 2, 27. [CrossRef]

work page 2015

[41] [41]

Information theory—the bridge connecting bounded rational game theory and statistical physics

Wolpert, D.H. Information theory—the bridge connecting bounded rational game theory and statistical physics. In Complex Engineered Systems; Braha, D., Minai, A., Bar-Yam, Y., Eds.; Springer: Berlin, Germany, 2006; pp. 262–290

work page 2006

[42] [42]

A Theory of Regularized Markov Decision Processes

Geist, M.; Scherrer, B.; Pietquin, O. A Theory of Regularized Markov Decision Processes. arXiv 2019, arXiv:1901.11275

work page internal anchor Pith review Pith/arXiv arXiv 2019

[43] [43]

A Unified Framework for Regularized Reinforcement Learning

Li, X.; Yang, W.; Zhang, Z. A Unified Framework for Regularized Reinforcement Learning. arXiv 2019, arXiv:1903.00725

work page arXiv 2019

[44] [44]

Path Consistency Learning in Tsallis Entropy Regularized MDPs

Nachum, O.; Chow, Y.; Ghavamzadeh, M. Path consistency learning in Tsallis entropy regularized MDPs. arXiv 2018, arXiv:1802.03501

work page internal anchor Pith review Pith/arXiv arXiv 2018

[45] [45]

Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning

Lee, K.; Kim, S.; Lim, S.; Choi, S.; Oh, S. Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning. arXiv 2019, arXiv:1902.00137

work page internal anchor Pith review Pith/arXiv arXiv 2019

[46] [46]

Sparse Markov decision processes with causal sparse Tsallis entropy regularization for reinforcement learning

Lee, K.; Choi, S.; Oh, S. Sparse Markov decision processes with causal sparse Tsallis entropy regularization for reinforcement learning. IEEE Rob. Autom. Lett. 2018, 3, 1466–1473. [CrossRef]

work page 2018

[47] [47]

Maximum Causal Tsallis Entropy Imitation Learning

Lee, K.; Choi, S.; Oh, S. Maximum Causal Tsallis Entropy Imitation Learning. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, Montreal, QC, Canada, 3–8 December 2018; pp. 4408–4418

work page 2018

[48] [48]

Proximal Reinforcement Learning: A New Theory of Sequential Decision Making in Primal-Dual Spaces

Mahadevan, S.; Liu, B.; Thomas, P.; Dabney, W.; Giguere, S.; Jacek, N.; Gemp, I.; Liu, J. Proximal reinforcement learning: A new theory of sequential decision making in primal-dual spaces. arXiv 2014, arXiv:1405.6757

work page internal anchor Pith review Pith/arXiv arXiv 2014

[49] [49]

Markov processes and the H-theorem

Morimoto, T. Markov processes and the H-theorem. J. Phys. Soc. Jpn. 1963, 18, 328–331. [CrossRef]

work page 1963

[50] [50]

A General Class of Coefficients of Divergence of One Distribution from Another

Ali, S.M.; Silvey, S.D. A General Class of Coefficients of Divergence of One Distribution from Another. J. R. Stat. Soc. Ser. B (Methodol.) 1966, 28, 131–142. [CrossRef]

work page 1966

[51] [51]

Convex Optimization

Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004; 487p. [CrossRef]

work page 2004

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).