Deep Active Inference as Variational Policy Gradients

Beren Millidge

arxiv: 1907.03876 · v1 · pith:HEM62SNUnew · submitted 2019-07-08 · 💻 cs.LG · cs.NE

Pith reviewed 2026-05-25 00:55 UTC · model grok-4.3

keywords: active inference · deep active inference · variational free energy · policy gradients · reinforcement learning · function approximation · variational inference

The pith

Active inference scales to complex tasks by using deep neural networks to approximate the densities in its variational free energy objective.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a deep active inference algorithm that replaces hand-specified densities with neural network approximators. This change removes the previous restrictions to small discrete spaces and known dynamics. The resulting method is evaluated on standard OpenAI Gym control tasks and reaches performance levels comparable to established reinforcement learning algorithms. The derivation also shows that the active inference objective is closely related to maximum-entropy reinforcement learning and policy-gradient methods.

Core claim

Parameterizing the state and policy densities with deep neural networks allows the variational free energy to be minimized end-to-end on larger and more complex tasks without requiring the environmental dynamics to be known in advance, producing an algorithm that performs comparably to reinforcement learning baselines and closely resembles maximum-entropy reinforcement learning and policy gradients.

What carries the argument

Neural-network approximation of the variational densities inside the free-energy objective, optimized directly from experience.
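
For orientation, the textbook form of the variational free energy is sketched below. This is the standard definition, not a transcription of the paper: the symbols o (observation), s (hidden state), q (approximate posterior) and p (generative model) are assumed here, and the paper may use different notation or a different set of densities.

```latex
F[q] = \mathbb{E}_{q(s)}\big[\ln q(s) - \ln p(o, s)\big]
     = D_{\mathrm{KL}}\big[q(s) \,\|\, p(s \mid o)\big] - \ln p(o)
     \;\geq\; -\ln p(o)
```

Because the KL term is non-negative, minimizing F over q tightens an upper bound on the surprise -ln p(o). In a deep variant, q and the components of p would be the outputs of neural networks and F would be minimized by gradient descent on their weights.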

If this is right

Active inference can now be applied to significantly larger and more complex tasks than earlier discrete implementations allowed.
Performance on OpenAI Gym benchmarks becomes comparable to common reinforcement learning baselines.
Active inference is formally connected to maximum-entropy reinforcement learning and the policy-gradient algorithm.
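
The maximum-entropy objective mentioned in the last point can be illustrated on a toy problem. The sketch below is not the paper's algorithm: it assumes a three-armed bandit with made-up rewards, a softmax policy, and a temperature `alpha`, all of which are hypothetical. It performs gradient ascent on E_pi[r] + alpha * H(pi) and compares the result with the Boltzmann policy pi(a) ∝ exp(r(a) / alpha), which is the known maximizer of that objective.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def maxent_gradient(logits, rewards, alpha):
    """Exact gradient of E_pi[r] + alpha * H(pi) w.r.t. the softmax logits.

    This is the expected gradient computed in closed form; a policy-gradient
    method would instead estimate it from sampled actions.
    """
    pi = softmax(logits)
    # Entropy-augmented reward for each action.
    signal = [r - alpha * math.log(p) for r, p in zip(rewards, pi)]
    # Subtracting the policy-weighted mean plays the role of a baseline.
    baseline = sum(p * s for p, s in zip(pi, signal))
    return [p * (s - baseline) for p, s in zip(pi, signal)]

def train(rewards, alpha, lr=0.5, steps=5000):
    """Plain gradient ascent from uniform logits; returns the learned policy."""
    logits = [0.0] * len(rewards)
    for _ in range(steps):
        grad = maxent_gradient(logits, rewards, alpha)
        logits = [x + lr * g for x, g in zip(logits, grad)]
    return softmax(logits)

rewards = [1.0, 0.0, -1.0]  # hypothetical rewards, chosen for illustration
alpha = 0.5                 # hypothetical entropy temperature
learned = train(rewards, alpha)
boltzmann = softmax([r / alpha for r in rewards])
```

Here `learned` and `boltzmann` should agree closely, which shows why an entropy bonus leads to a policy that stays stochastic rather than collapsing onto the single best action.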

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The equivalence suggests active inference supplies a Bayesian interpretation of policy-gradient updates.
The same architecture could be tested on partially observable environments where active inference's inference step may confer an advantage.
Extensions that retain the free-energy formulation while adding explicit planning horizons remain open for investigation.

Load-bearing premise

The variational free energy objective can be optimized end-to-end with neural-network function approximators without requiring explicit knowledge of environmental dynamics.
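
A minimal numerical sketch of the objective in question, under strong simplifying assumptions: two discrete hidden states, one fixed observation, and hand-picked probabilities (all hypothetical, none taken from the paper). It evaluates F[q] = sum_s q(s) * (ln q(s) - ln p(o, s)) and checks the standard property that F is bounded below by the surprise -ln p(o), with equality when q is the exact posterior. A neural-network version would replace the lists with network outputs.

```python
import math

def free_energy(q, prior, likelihood):
    """F[q] = sum_s q(s) * (ln q(s) - ln p(o, s)) for one fixed observation o.

    prior[s] is p(s) and likelihood[s] is p(o | s), so p(o, s) is their product.
    """
    total = 0.0
    for q_s, p_s, p_o_given_s in zip(q, prior, likelihood):
        if q_s > 0.0:  # the term 0 * ln 0 is taken as 0
            total += q_s * (math.log(q_s) - math.log(p_s * p_o_given_s))
    return total

def exact_posterior(prior, likelihood):
    """Bayes' rule for a discrete state; also returns the evidence p(o)."""
    joint = [p_s * l for p_s, l in zip(prior, likelihood)]
    evidence = sum(joint)
    return [j / evidence for j in joint], evidence

prior = [0.5, 0.5]       # hypothetical p(s)
likelihood = [0.9, 0.2]  # hypothetical p(o | s) for the observed o

posterior, evidence = exact_posterior(prior, likelihood)
surprise = -math.log(evidence)

f_uniform = free_energy([0.5, 0.5], prior, likelihood)  # a poor guess for q
f_posterior = free_energy(posterior, prior, likelihood)  # the best possible q
```

With these numbers `f_posterior` equals `surprise`, while `f_uniform` is strictly larger; the gap is the KL divergence between the guessed q and the true posterior.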

What would settle it

If the resulting agent fails to reach performance comparable to standard policy-gradient or Q-learning baselines on OpenAI Gym benchmark tasks when the dynamics are not given in advance, the central claim does not hold.

Original abstract

Active Inference is a theory of action arising from neuroscience which casts action and planning as a Bayesian inference problem to be solved by minimizing a single quantity, the variational free energy. Active Inference promises a unifying account of action and perception coupled with a biologically plausible process theory. Despite these potential advantages, current implementations of Active Inference can only handle small, discrete policy and state-spaces and typically require the environmental dynamics to be known. In this paper we propose a novel deep Active Inference algorithm which approximates key densities using deep neural networks as flexible function approximators, which enables Active Inference to scale to significantly larger and more complex tasks. We demonstrate our approach on a suite of OpenAI Gym benchmark tasks and obtain performance comparable with common reinforcement learning baselines. Moreover, our algorithm shows similarities with maximum entropy reinforcement learning and the policy gradients algorithm, which reveals interesting connections between the Active Inference framework and reinforcement learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives a workable deep-net version of active inference that runs on Gym tasks and notes a link to policy gradients.

This paper takes active inference and scales it with deep networks so it can handle standard RL benchmarks without needing known dynamics upfront. They approximate the key densities with neural nets, run on OpenAI Gym tasks, and match common RL baselines while pointing out overlaps with maximum-entropy RL and policy gradients. The core advance is replacing the small discrete-space restriction with flexible function approximators, which directly tackles the main practical barrier mentioned in prior work. The Gym experiments are a useful concrete step that shows the approach is runnable end-to-end. The observed similarity to policy gradients is presented as an interesting side note rather than a formal proof, which keeps expectations reasonable. A soft spot is that the abstract supplies no equations or training details, so it is difficult to confirm whether the variational free energy is preserved under the approximations or whether the performance numbers reflect a fair comparison. If the full paper supplies clean derivations and ablations, that concern shrinks. This work is aimed at people trying to connect active inference to mainstream RL or test neuroscience-derived ideas on the same suites. It is coherent enough on its own terms to deserve a serious referee, even if revisions will likely be needed on the exact reduction and experimental controls.

Referee Report

2 major / 2 minor

Summary. The paper proposes a deep active inference algorithm that approximates the key densities in the variational free energy objective using deep neural networks as function approximators. This removes the requirement for known environmental dynamics that limited prior active inference implementations, enabling application to OpenAI Gym benchmark tasks. The author reports performance comparable to standard RL baselines and observes similarities between the objective and maximum-entropy RL / policy gradients.

Significance. If the claimed equivalence to variational policy gradients is rigorously derived and the Gym results hold under standard controls, the work would usefully connect the active-inference framework to scalable RL methods and demonstrate that biologically motivated free-energy minimization can be made practical with modern function approximators. The empirical demonstration on benchmark tasks is a concrete step toward that unification.

major comments (2)

[§4] Derivation of the policy-gradient equivalence: the manuscript presents the similarity to maximum-entropy RL as an observation after the fact rather than showing an explicit reduction of the active-inference objective to the policy-gradient estimator; without the intermediate steps (e.g., the precise form of the variational posterior and the reparameterization used), it is impossible to verify whether the claimed equivalence is exact or only approximate under additional assumptions.
[§5.2] Experimental protocol: the paper reports performance “comparable with common reinforcement learning baselines” on Gym tasks, yet provides neither the exact baselines (e.g., DDPG, SAC, PPO), nor the number of random seeds, nor statistical significance tests; without these details the central empirical claim cannot be assessed.

minor comments (2)

Notation for the variational densities (q, p, etc.) is introduced without a consolidated table; a single reference table would improve readability.
[Abstract] The abstract states that the method “approximates key densities,” but the precise set of densities (state, policy, transition, etc.) is only clarified later; an early explicit list would help.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of the policy-gradient connection and strengthen the empirical evaluation. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

Referee: [§4] Derivation of the policy-gradient equivalence: the manuscript presents the similarity to maximum-entropy RL as an observation after the fact rather than showing an explicit reduction of the active-inference objective to the policy-gradient estimator; without the intermediate steps (e.g., the precise form of the variational posterior and the reparameterization used), it is impossible to verify whether the claimed equivalence is exact or only approximate under additional assumptions.

Authors: We agree that an explicit derivation would make the connection more rigorous. The current manuscript notes the similarities after deriving the active-inference objective, but does not walk through the reduction step by step. In the revised version we will expand §4 to include the intermediate steps: we will state the precise form of the variational posterior over policies, show how the free-energy objective reduces to an expectation under that posterior, and detail the reparameterization that yields the policy-gradient estimator, thereby clarifying whether the equivalence is exact or holds under the stated assumptions.
Revision: yes

Referee: [§5.2] Experimental protocol: the paper reports performance “comparable with common reinforcement learning baselines” on Gym tasks, yet provides neither the exact baselines (e.g., DDPG, SAC, PPO), nor the number of random seeds, nor statistical significance tests; without these details the central empirical claim cannot be assessed.

Authors: We acknowledge that the experimental section is missing these protocol details, which are required to evaluate the empirical claims. In the revised manuscript we will update §5.2 (and the associated tables/figures) to list the exact baselines that were implemented, report the number of random seeds for each task, and include statistical significance tests (e.g., Welch’s t-test or Wilcoxon rank-sum) comparing our method against the baselines.
Revision: yes
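
As a rough illustration of the kind of per-seed comparison the response refers to, the sketch below computes Welch's t statistic and the Welch-Satterthwaite degrees of freedom using only the Python standard library. The two input lists are placeholder numbers invented for the example; they are not returns from the paper or from any experiment. Turning the statistic into a p-value needs the t-distribution CDF, which is left out here.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's unequal-variance t statistic and its approximate degrees of freedom.

    a and b are per-seed scores for two methods; each needs at least two values.
    """
    # Squared standard error of each sample mean (sample variance / n).
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df

# Placeholder per-seed scores, made up purely to exercise the function.
method_a = [1.0, 2.0, 3.0, 4.0, 5.0]
method_b = [2.0, 4.0, 6.0, 8.0, 10.0]

t_stat, dof = welch_t(method_a, method_b)
```

Welch's form is used in the sketch because it does not assume the two methods have equal variance across seeds, which is rarely safe for reinforcement learning runs.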

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes approximating key densities in active inference via deep neural networks to enable scaling without known dynamics, then demonstrates performance on Gym tasks comparable to RL baselines. The abstract and description present connections to maximum-entropy RL and policy gradients as observational similarities rather than definitional equivalences or fitted inputs renamed as predictions. No equations, self-citations, or derivation steps are provided that reduce the central claim to its own inputs by construction. The method is self-contained against external benchmarks with independent empirical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim implicitly rests on the unstated premise that neural networks can faithfully approximate the required densities and that the free-energy objective remains well-behaved under this approximation.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Information as Maximum-Caliber Deviation: A bridge between Integrated Information Theory and the Free Energy Principle
q-bio.NC · 2026-05 · unverdicted · novelty 6.0

Information defined as maximum-caliber deviation derives IIT 3.0 cause-effect repertoires from constrained entropy maximization and equates to prediction error under CLT and LDT.