pith. sign in

arxiv: 1907.03876 · v1 · pith:HEM62SNUnew · submitted 2019-07-08 · 💻 cs.LG · cs.NE

Deep Active Inference as Variational Policy Gradients

Pith reviewed 2026-05-25 00:55 UTC · model grok-4.3

classification 💻 cs.LG cs.NE
keywords active inferencedeep active inferencevariational free energypolicy gradientsreinforcement learningfunction approximationvariational inference
0
0 comments X

The pith

Active inference scales to complex tasks by using deep neural networks to approximate the densities in its variational free energy objective.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a deep active inference algorithm that replaces hand-specified densities with neural network approximators. This change removes the previous restrictions to small discrete spaces and known dynamics. The resulting method is evaluated on standard OpenAI Gym control tasks and reaches performance levels comparable to established reinforcement learning algorithms. The derivation also shows that the active inference objective is closely related to maximum-entropy reinforcement learning and policy-gradient methods.

Core claim

Parameterizing the state and policy densities with deep neural networks allows the variational free energy to be minimized end-to-end for large continuous state-action spaces without an explicit model of environmental dynamics, producing an algorithm that matches reinforcement learning baselines while remaining formally equivalent to variational policy gradients.

What carries the argument

Neural-network approximation of the variational densities inside the free-energy objective, optimized directly from experience.

If this is right

  • Active inference can now be applied to significantly larger and more complex tasks than earlier discrete implementations allowed.
  • Performance on OpenAI Gym benchmarks becomes comparable to common reinforcement learning baselines.
  • Active inference is formally connected to maximum-entropy reinforcement learning and the policy-gradient algorithm.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The equivalence suggests active inference supplies a Bayesian interpretation of policy-gradient updates.
  • The same architecture could be tested on partially observable environments where active inference's inference step may confer an advantage.
  • Extensions that retain the free-energy formulation while adding explicit planning horizons remain open for investigation.

Load-bearing premise

The variational free energy objective can be optimized end-to-end with neural-network function approximators without requiring explicit knowledge of environmental dynamics.

What would settle it

If the resulting agent fails to reach performance comparable to standard policy-gradient or Q-learning baselines on continuous-control Gym tasks when dynamics are unknown, the central claim does not hold.

read the original abstract

Active Inference is a theory of action arising from neuroscience which casts action and planning as a bayesian inference problem to be solved by minimizing a single quantity - the variational free energy. Active Inference promises a unifying account of action and perception coupled with a biologically plausible process theory. Despite these potential advantages, current implementations of Active Inference can only handle small, discrete policy and state-spaces and typically require the environmental dynamics to be known. In this paper we propose a novel deep Active Inference algorithm which approximates key densities using deep neural networks as flexible function approximators, which enables Active Inference to scale to significantly larger and more complex tasks. We demonstrate our approach on a suite of OpenAIGym benchmark tasks and obtain performance comparable with common reinforcement learning baselines. Moreover, our algorithm shows similarities with maximum entropy reinforcement learning and the policy gradients algorithm, which reveals interesting connections between the Active Inference framework and reinforcement learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a deep active inference algorithm that approximates the key densities in the variational free energy objective using deep neural networks as function approximators. This removes the requirement for known environmental dynamics that limited prior active inference implementations, enabling application to continuous control tasks in OpenAI Gym. The authors report performance comparable to standard RL baselines and observe similarities between their objective and maximum-entropy RL / policy gradients.

Significance. If the claimed equivalence to variational policy gradients is rigorously derived and the Gym results hold under standard controls, the work would usefully connect the active-inference framework to scalable RL methods and demonstrate that biologically motivated free-energy minimization can be made practical with modern function approximators. The empirical demonstration on benchmark tasks is a concrete step toward that unification.

major comments (2)
  1. [§4] §4, the derivation of the policy-gradient equivalence: the manuscript presents the similarity to maximum-entropy RL as an observation after the fact rather than showing an explicit reduction of the active-inference objective to the policy-gradient estimator; without the intermediate steps (e.g., the precise form of the variational posterior and the reparameterization used), it is impossible to verify whether the claimed equivalence is exact or only approximate under additional assumptions.
  2. [§5.2] §5.2, experimental protocol: the paper reports performance “comparable with common reinforcement learning baselines” on Gym tasks, yet provides neither the exact baselines (e.g., DDPG, SAC, PPO), nor the number of random seeds, nor statistical significance tests; without these details the central empirical claim cannot be assessed.
minor comments (2)
  1. Notation for the variational densities (q, p, etc.) is introduced without a consolidated table; a single reference table would improve readability.
  2. [Abstract] The abstract states that the method “approximates key densities,” but the precise set of densities (state, policy, transition, etc.) is only clarified later; an early explicit list would help.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the presentation of the policy-gradient connection and strengthen the empirical evaluation. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses
  1. Referee: [§4] §4, the derivation of the policy-gradient equivalence: the manuscript presents the similarity to maximum-entropy RL as an observation after the fact rather than showing an explicit reduction of the active-inference objective to the policy-gradient estimator; without the intermediate steps (e.g., the precise form of the variational posterior and the reparameterization used), it is impossible to verify whether the claimed equivalence is exact or only approximate under additional assumptions.

    Authors: We agree that an explicit derivation would make the connection more rigorous. The current manuscript notes the similarities after deriving the active-inference objective, but does not walk through the reduction step by step. In the revised version we will expand §4 to include the intermediate steps: we will state the precise form of the variational posterior over policies, show how the free-energy objective reduces to an expectation under that posterior, and detail the reparameterization that yields the policy-gradient estimator, thereby clarifying whether the equivalence is exact or holds under the stated assumptions. revision: yes

  2. Referee: [§5.2] §5.2, experimental protocol: the paper reports performance “comparable with common reinforcement learning baselines” on Gym tasks, yet provides neither the exact baselines (e.g., DDPG, SAC, PPO), nor the number of random seeds, nor statistical significance tests; without these details the central empirical claim cannot be assessed.

    Authors: We acknowledge that the experimental section is missing these protocol details, which are required to evaluate the empirical claims. In the revised manuscript we will update §5.2 (and the associated tables/figures) to list the exact baselines that were implemented, report the number of random seeds for each task, and include statistical significance tests (e.g., Welch’s t-test or Wilcoxon rank-sum) comparing our method against the baselines. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes approximating key densities in active inference via deep neural networks to enable scaling without known dynamics, then demonstrates performance on Gym tasks comparable to RL baselines. The abstract and description present connections to maximum-entropy RL and policy gradients as observational similarities rather than definitional equivalences or fitted inputs renamed as predictions. No equations, self-citations, or derivation steps are provided that reduce the central claim to its own inputs by construction. The method is self-contained against external benchmarks with independent empirical validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The central claim implicitly rests on the unstated premise that neural networks can faithfully approximate the required densities and that the free-energy objective remains well-behaved under this approximation.

pith-pipeline@v0.9.0 · 5670 in / 1070 out tokens · 17799 ms · 2026-05-25T00:55:31.509672+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Information as Maximum-Caliber Deviation: A bridge between Integrated Information Theory and the Free Energy Principle

    q-bio.NC 2026-05 unverdicted novelty 6.0

    Information defined as maximum-caliber deviation derives IIT 3.0 cause-effect repertoires from constrained entropy maximization and equates to prediction error under CLT and LDT.