arXiv: 1805.00909 · v3 · submitted 2018-05-02 · cs.LG · cs.AI · cs.RO · stat.ML

Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review

Sergey Levine

Pith reviewed 2026-05-13 18:23 UTC · model grok-4.3

keywords: reinforcement learning, maximum entropy, probabilistic inference, variational inference, optimal control, policy optimization

The pith

Maximum entropy reinforcement learning is equivalent to exact probabilistic inference for deterministic dynamics and variational inference for stochastic dynamics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a maximum-entropy version of the reinforcement learning or optimal control objective can be rewritten as a probabilistic inference problem. When the system dynamics are deterministic, the optimal policy corresponds exactly to the posterior distribution over actions; when dynamics are stochastic, the same objective yields a variational inference problem. This rewriting matters because it lets researchers import tools from approximate inference, such as variational methods and message passing, directly into policy optimization. The resulting perspective also clarifies how to incorporate uncertainty, partial observability, and compositional structure into control problems without changing the underlying decision-making formalism.

Core claim

A generalization of the reinforcement learning or optimal control problem, which is sometimes termed maximum entropy reinforcement learning, is equivalent to exact probabilistic inference in the case of deterministic dynamics, and variational inference in the case of stochastic dynamics.
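
The deterministic half of this claim can be checked numerically on a toy problem. The sketch below is illustrative only and is not taken from the paper: the three-state cyclic dynamics, the reward function, the horizon, and all function names are assumptions made for the example, and a uniform prior over actions is assumed. It computes the posterior over the first action by brute-force enumeration of trajectories weighted by exp(sum of rewards), then compares it with the policy obtained from a soft (log-sum-exp) Bellman backup.

```python
import itertools
import math

# Hypothetical toy problem (not from the paper): 3 states, 2 actions.
N_STATES, N_ACTIONS, HORIZON = 3, 2, 4


def step(s, a):
    """Deterministic dynamics: action 0 stays put, action 1 moves right (cyclic)."""
    return s if a == 0 else (s + 1) % N_STATES


def reward(s, a):
    """Illustrative non-positive reward, so exp(reward) is a valid probability."""
    return -0.5 * abs(s - 2) - 0.1 * a


def posterior_first_action(s0):
    """p(a_1 | s_1 = s0, all steps optimal), by enumerating every action sequence."""
    weights = [0.0] * N_ACTIONS
    for actions in itertools.product(range(N_ACTIONS), repeat=HORIZON):
        s, total = s0, 0.0
        for a in actions:
            total += reward(s, a)
            s = step(s, a)
        # Each trajectory is weighted by exp(sum of rewards); the uniform
        # action prior is a constant factor and cancels on normalization.
        weights[actions[0]] += math.exp(total)
    z = sum(weights)
    return [w / z for w in weights]


def soft_policy_first_action(s0):
    """pi(a | s0) = exp(Q(s0, a) - V(s0)) from a backward soft Bellman recursion."""
    v = [0.0] * N_STATES  # value after the final step
    q = []
    for _ in range(HORIZON):
        q = [[reward(s, a) + v[step(s, a)] for a in range(N_ACTIONS)]
             for s in range(N_STATES)]
        v = [math.log(sum(math.exp(x) for x in row)) for row in q]
    return [math.exp(q[s0][a] - v[s0]) for a in range(N_ACTIONS)]


if __name__ == "__main__":
    for s0 in range(N_STATES):
        print(s0, posterior_first_action(s0), soft_policy_first_action(s0))
```

Under these assumptions the two distributions agree up to floating-point error for every start state, which is the sense in which the maximum-entropy policy and the exact posterior coincide when transitions are deterministic.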

What carries the argument

The maximum-entropy reinforcement learning objective, which augments the usual expected reward with an entropy term over the policy and thereby converts the control problem into one of inferring a distribution over trajectories.

If this is right

Any approximate inference algorithm can be repurposed as a reinforcement learning algorithm by substituting the appropriate energy function.
Problems with partial observability become standard filtering or smoothing tasks once cast as inference.
Compositionality in tasks can be handled by composing the underlying probabilistic models rather than hand-designing reward functions.
Uncertainty over dynamics or goals is represented directly as uncertainty in the inferred posterior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same equivalence may allow transfer of inference scaling techniques, such as amortized variational inference, to high-dimensional continuous control.
It suggests that multi-task reinforcement learning can be viewed as joint inference over a shared prior and task-specific posteriors.
Future work could test whether inference-based regularization improves sample efficiency in model-based control compared with standard entropy bonuses.

Load-bearing premise

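The claimed equivalence requires that the reinforcement learning objective is written in maximum-entropy form. It is exact only when the dynamics are deterministic; under stochastic dynamics it holds as variational inference in which the approximate posterior is constrained to use the true dynamics.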
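
A small numerical sketch may help show why the two dynamics regimes are treated separately. Everything in it is a made-up illustration rather than an example from the paper: the action names, the transition probabilities, and the terminal rewards are assumptions. With stochastic transitions, exact inference conditioned on optimality backs up values with log E[exp V(s')] and shifts the inferred next-state distribution toward lucky outcomes, whereas keeping the true dynamics fixed backs up the plain expectation E[V(s')].

```python
import math

# Hypothetical two-step problem: choose "safe" or "risky" in the start state.
# Transition probabilities p(s' | start, a) and terminal rewards are invented
# for illustration; the first step carries zero reward.
TRANSITIONS = {
    "safe": {"A": 1.0},
    "risky": {"G": 0.5, "B": 0.5},
}
TERMINAL_REWARD = {"A": -1.0, "G": 0.0, "B": -4.0}


def optimistic_q(action):
    """Exact-inference backup: log E_{s'}[exp V(s')]."""
    return math.log(sum(p * math.exp(TERMINAL_REWARD[s])
                        for s, p in TRANSITIONS[action].items()))


def expected_q(action):
    """Backup with the true dynamics held fixed: E_{s'}[V(s')]."""
    return sum(p * TERMINAL_REWARD[s] for s, p in TRANSITIONS[action].items())


def posterior_next_state(action):
    """p(s' | start, action, optimality), which need not equal the true dynamics."""
    w = {s: p * math.exp(TERMINAL_REWARD[s])
         for s, p in TRANSITIONS[action].items()}
    z = sum(w.values())
    return {s: x / z for s, x in w.items()}


def softmax_policy(q_fn):
    """Policy proportional to exp(Q) over the two actions."""
    qs = {a: q_fn(a) for a in TRANSITIONS}
    z = sum(math.exp(q) for q in qs.values())
    return {a: math.exp(q) / z for a, q in qs.items()}


if __name__ == "__main__":
    print("exact-inference policy:", softmax_policy(optimistic_q))
    print("true-dynamics policy:  ", softmax_policy(expected_q))
    print("posterior over s' after 'risky':", posterior_next_state("risky"))
```

In this toy setup the exact posterior favours the risky action because it implicitly assumes the good outcome will occur, while the backup that respects the true dynamics favours the safe action. When every transition is deterministic the two backups are identical, so the distinction disappears.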

What would settle it

An explicit counter-example in which the policy that maximizes the maximum-entropy objective differs from the posterior obtained by exact or variational inference on the same trajectory distribution.

Original abstract

The framework of reinforcement learning or optimal control provides a mathematical formalization of intelligent decision making that is powerful and broadly applicable. While the general form of the reinforcement learning problem enables effective reasoning about uncertainty, the connection between reinforcement learning and inference in probabilistic models is not immediately obvious. However, such a connection has considerable value when it comes to algorithm design: formalizing a problem as probabilistic inference in principle allows us to bring to bear a wide array of approximate inference tools, extend the model in flexible and powerful ways, and reason about compositionality and partial observability. In this article, we will discuss how a generalization of the reinforcement learning or optimal control problem, which is sometimes termed maximum entropy reinforcement learning, is equivalent to exact probabilistic inference in the case of deterministic dynamics, and variational inference in the case of stochastic dynamics. We will present a detailed derivation of this framework, overview prior work that has drawn on this and related ideas to propose new reinforcement learning and control algorithms, and describe perspectives on future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a clear tutorial that derives the equivalence between maximum-entropy RL and probabilistic inference, but it synthesizes prior work without presenting new results.

This tutorial walks through the equivalence between maximum entropy reinforcement learning and probabilistic inference over trajectories. The core point is that the max-entropy objective matches exact inference when dynamics are deterministic and variational inference when they are stochastic, with rewards acting as the unnormalized potentials in the trajectory distribution. Levine presents the derivation in a direct, step-by-step way that makes the connection explicit for both cases. The overview of prior algorithms that use this view, such as soft Q-learning variants, is also straightforward and pulls the threads together without unnecessary complication. The math checks out against standard formulations and does not introduce circular definitions or hidden constraints beyond the usual variational families.

The main limitation is that the paper is explicitly a review and tutorial, so it contains no new empirical results, algorithms, or theoretical extensions. Anyone who has read the original papers on this connection will not find surprises in the content, only a cleaner single-source presentation. The citation pattern is appropriate and points to the relevant earlier work.

This paper is for readers who want a single, accessible reference for the inference perspective on RL, particularly those coming from variational methods who need to see how the objectives line up. It would be worth bringing to a reading group for that bridging value. I would send it to peer review rather than desk reject because a well-executed tutorial on this established link can still be useful for the community.

Referee Report

0 major / 2 minor

Summary. The manuscript claims that maximum-entropy reinforcement learning is equivalent to exact probabilistic inference under deterministic dynamics and to variational inference under stochastic dynamics. It supplies a detailed derivation of the equivalence, surveys prior algorithms that exploit the connection, and outlines perspectives for future research.

Significance. The equivalence supplies a principled route for importing approximate-inference machinery into reinforcement learning and control, thereby supporting more flexible handling of uncertainty, compositionality, and partial observability. Because the derivation is standard and the review synthesizes an already influential line of work, the tutorial consolidates a useful conceptual bridge that has demonstrably aided algorithm design.

minor comments (2)

[Abstract] The abstract states the central equivalence but does not explicitly label the manuscript as a tutorial and review; adding this phrase would help readers set expectations for the scope and depth of the material.
[Section 3] Notation for the trajectory distribution p(τ) and the reward-augmented potential is introduced early but is not cross-referenced in the later algorithmic survey; a brief reminder table or consistent equation numbering would improve readability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending acceptance. The referee's summary correctly identifies the core contribution: a detailed derivation showing that maximum-entropy reinforcement learning corresponds to exact probabilistic inference for deterministic dynamics and to variational inference for stochastic dynamics, together with a survey of prior algorithms and future perspectives.

Circularity Check

0 steps flagged

No significant circularity identified

The paper derives the equivalence between maximum-entropy RL and exact/variational inference by defining the trajectory distribution p(τ) ∝ exp(∑ r_t) directly from the reward function and showing that the maximum-entropy objective is a lower bound on the log-partition function of this distribution, attained by the optimal policy when the dynamics are deterministic. This construction follows immediately from the given probabilistic model and the max-ent objective without any fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations that themselves require the target result. Prior literature is reviewed for context, but the central derivation chain remains independent and self-contained.
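
For readers who want the relationship between the objective and the log-partition function written out, the following is a sketch of the standard evidence lower bound argument. It is a reconstruction, not a quotation from the paper. The symbols q(τ) for the approximating trajectory distribution, π for the policy, and O_{1:T} for the joint optimality event are notation assumed here, and a uniform prior over actions is assumed so that it contributes only a constant.

```latex
\begin{align*}
p(\tau \mid \mathcal{O}_{1:T})
  &\propto p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\,
     \exp\big(r(s_t, a_t)\big), \\
q(\tau)
  &= p(s_1) \prod_{t=1}^{T} p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t), \\
\log p(\mathcal{O}_{1:T})
  &\ge \mathbb{E}_{q(\tau)}\big[\log p(\mathcal{O}_{1:T}, \tau) - \log q(\tau)\big] \\
  &= \mathbb{E}_{q(\tau)}\Big[\sum_{t=1}^{T} r(s_t, a_t)
       - \log \pi(a_t \mid s_t)\Big] + \text{const}.
\end{align*}
```

The initial-state and dynamics factors cancel between numerator and denominator because q(τ) reuses the true dynamics, leaving expected reward plus policy entropy. The gap in the inequality is the KL divergence from q(τ) to the posterior over trajectories, which can be driven to zero when the dynamics are deterministic.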

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

As a review paper, the work relies on standard axioms from probabilistic inference and reinforcement learning without introducing new free parameters or invented entities.

axioms (1)

standard math: Standard axioms of probabilistic inference and variational methods
The equivalence derivations invoke core principles of exact and variational inference as background.

pith-pipeline@v0.9.0 · 5469 in / 1029 out tokens · 38326 ms · 2026-05-13T18:23:50.151277+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.LawOfExistence defect_zero_iff_one echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

the graphical model for control as inference... optimality variables... p(O_t = 1 | s_t, a_t) = exp(r(s_t, a_t))

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

From Generic Correlation to Input-Specific Credit in On-Policy Self Distillation
cs.LG 2026-05 conditional novelty 7.0

Self-distillation token rewards measure input-response-feedback pointwise mutual information, and CREDIT extracts the input-specific component with contrastive baselines to improve LLM reasoning performance.
Fast Rates for Offline Contextual Bandits with Forward-KL Regularization under Single-Policy Concentrability
cs.LG 2026-05 unverdicted novelty 7.0

The paper establishes the first tilde O(epsilon^{-1}) upper bounds and matching lower bounds for forward-KL-regularized offline contextual bandits under single-policy concentrability in both tabular and general functi...
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
cs.AI 2026-05 unverdicted novelty 7.0

An exploration-aware RL framework lets LLM agents adaptively explore only under high uncertainty via variational rewards and action grouping, yielding consistent gains on text and GUI agent benchmarks.
Generative Actor-Critic with Soft Bridge Policies
cs.LG 2026-05 unverdicted novelty 7.0

SoftGAC defines a stochastic bridge from base to action latent that converts the MaxEnt objective into a tractable relative-entropy term reducible to control energy, achieving competitive returns with one-pass sampling.
Reference-Sampled Boltzmann Projection for KL-Regularized RLVR: Target-Matched Weighted SFT, Finite One-Shot Gaps, and Policy Mirror Descent
cs.LG 2026-05 unverdicted novelty 7.0

Reference-sampled weighted SFT with prompt-normalized Boltzmann weights induces the same policy as fixed-reference KL-regularized RLVR, with BOLT as the estimator and a finite one-shot error decomposition separating c...
Advantage-Guided Diffusion for Model-Based Reinforcement Learning
cs.AI 2026-04 unverdicted novelty 7.0

Advantage-guided diffusion (SAG and EAG) steers sampling in diffusion world models to higher-advantage trajectories, enabling policy improvement and better sample efficiency on MuJoCo tasks.
Receding-Horizon Control via Drifting Models
cs.AI 2026-04 unverdicted novelty 7.0

Drifting MPC produces a unique distribution over trajectories that trades off data support against optimality and enables efficient receding-horizon planning under unknown dynamics.
DiffusionNFT: Online Diffusion Reinforcement with Forward Process
cs.LG 2025-09 unverdicted novelty 7.0

DiffusionNFT performs online RL for diffusion models on the forward process via flow matching and positive-negative contrasts, delivering up to 25x efficiency gains and rapid benchmark improvements over prior reverse-...
Mutual Information Optimal Density Control of Linear Systems and Generalized Schrödinger Bridges with Reference Refinement
math.OC 2026-05 unverdicted novelty 6.0

Alternating optimization for MI-optimal density control of linear systems coincides with that for generalized Schrödinger bridges.
Drifting Field Policy: A One-Step Generative Policy via Wasserstein Gradient Flow
cs.LG 2026-05 unverdicted novelty 6.0

DFP is a one-step generative policy using Wasserstein gradient flow on a drifting model backbone, with a top-K behavior cloning surrogate, that reaches SOTA on Robomimic and OGBench manipulation tasks.
Experience Sharing in Mutual Reinforcement Learning for Heterogeneous Language Models
cs.LG 2026-05 unverdicted novelty 6.0

Mutual Reinforcement Learning allows heterogeneous LLMs to exchange experience through mechanisms like Peer Rollout Pooling, Cross-Policy GRPO Advantage Sharing, and Success-Gated Transfer, with outcome-level sharing ...
PISTO: Proximal Inference for Stochastic Trajectory Optimization
cs.RO 2026-05 unverdicted novelty 6.0

PISTO augments stochastic trajectory optimization with proximal KL regularization, yielding closed-form mean updates via importance sampling that outperform STOMP, CHOMP, CEM, and MPPI on robot arm and MuJoCo benchmarks.
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
cs.LG 2026-05 unverdicted novelty 6.0

LPO reframes group-based RLVR as explicit target-projection on the LLM response simplex and performs exact divergence minimization to achieve monotonic listwise improvement with bounded gradients.
Power Distribution Bridges Sampling, Self-Reward RL, and Self-Distillation
cs.LG 2026-05 unverdicted novelty 6.0

The power distribution is the target of power sampling, the closed-form solution to self-reward KL-regularized RL, and the basis for power self-distillation that matches sampling performance at lower cost.
RL Token: Bootstrapping Online RL with Vision-Language-Action Models
cs.LG 2026-04 unverdicted novelty 6.0

RL Token enables sample-efficient online RL fine-tuning of large VLAs, delivering up to 3x speed gains and higher success rates on real-robot manipulation tasks within minutes to hours.
Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics
cs.LG 2026-04 unverdicted novelty 6.0

Tempered sequential Monte Carlo samples efficiently from a temperature-annealed distribution over controller parameters to solve trajectory and policy optimization under differentiable dynamics.
Tempered Sequential Monte Carlo for Trajectory and Policy Optimization with Differentiable Dynamics
cs.LG 2026-04 unverdicted novelty 6.0

Tempered sequential Monte Carlo samples from a Boltzmann-tilted distribution over controllers to optimize trajectories and policies under differentiable dynamics.
DAG-STL: A Hierarchical Framework for Zero-Shot Trajectory Planning under Signal Temporal Logic Specifications
cs.RO 2026-04 unverdicted novelty 6.0

DAG-STL decomposes long-horizon STL planning into decomposition, timed waypoint allocation, and diffusion-based trajectory generation to enable zero-shot planning under unknown dynamics.
Reinforcement Learning, Optimal Control, and Bayesian Filtering in Data Assimilation
math.DS 2026-04 unverdicted novelty 6.0

A variational hierarchy unifies Bayesian filtering, variational data assimilation, KL-regularized control, and Kalman methods by proving that posteriors minimize a likelihood-plus-KL objective with evidence as the glo...
Learning to Explore: Scaling Agentic Reasoning via Exploration-Aware Policy Optimization
cs.AI 2026-05 unverdicted novelty 5.0

An exploration-aware policy optimization method lets LLM agents explore selectively via a variational-inference reward and action grouping, yielding consistent gains on text and GUI agent benchmarks.
On Distinguishing Capability Elicitation from Capability Creation in Post-Training: A Free-Energy Perspective
cs.AI 2026-05 unverdicted novelty 5.0

Post-training reweights a pretrained model's behavior distribution either within its existing accessible support (elicitation) or by expanding that support (creation), with both SFT and RL acting as free-energy minimi...
RAD-2: Scaling Reinforcement Learning in a Generator-Discriminator Framework
cs.CV 2026-04 unverdicted novelty 5.0

RAD-2 uses a diffusion generator and RL discriminator to cut collision rates by 56% in closed-loop autonomous driving planning.
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
cs.LG 2020-05 unverdicted novelty 2.0

Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.
