Improving Exploration in Soft-Actor-Critic with Normalizing Flows Policies

Ariella Smofsky; Avishek Joey Bose; Patrick Nadeem Ward


Not yet reviewed by Pith; the record is open.


This paper has not been read by Pith yet. Machine review is queued; the pith claim, tier, and objections will appear here once it completes.


arXiv 1906.02771 v1 · pith:Q6FGGO6V · submitted 2019-06-06 · cs.LG cs.AI stat.ML


Patrick Nadeem Ward, Ariella Smofsky, Avishek Joey Bose

classification cs.LG cs.AI stat.ML

keywords policies, actor, continuous, critic, deep, exploration, factored, framework

verification ladder T0 review T1 audit T2 compute T3 formal T4 reserved


read the original abstract

Deep Reinforcement Learning (DRL) algorithms for continuous action spaces are known to be brittle toward hyperparameters as well as sample inefficient. Soft Actor Critic (SAC) proposes an off-policy deep actor critic algorithm within the maximum entropy RL framework which offers greater stability and empirical gains. The choice of policy distribution, a factored Gaussian, is motivated by its easy re-parametrization rather than its modeling power. We introduce Normalizing Flow policies within the SAC framework that learn more expressive classes of policies than simple factored Gaussians. We show empirically on continuous grid world tasks that our approach increases stability and is better suited to difficult exploration in sparse reward settings.
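The contrast the abstract draws between a factored Gaussian and a normalizing-flow policy can be illustrated with a small NumPy sketch. This is not the authors' implementation. The abstract does not say which flow architecture is used, so the planar-flow layer below is an assumption chosen for brevity, and every name in it (`gaussian_log_prob`, `constrain_u`, `planar_layer`, `sample_flow_policy`) is hypothetical. The sketch draws a reparametrized sample from a factored Gaussian, pushes it through invertible layers, and corrects the log-density with the change-of-variables formula, which is the quantity an entropy-regularized objective such as SAC's would need.

```python
import numpy as np


def gaussian_log_prob(z, mu, log_std):
    """Log-density of a factored (diagonal) Gaussian, summed over action dims."""
    var = np.exp(2.0 * log_std)
    per_dim = -0.5 * ((z - mu) ** 2 / var + 2.0 * log_std + np.log(2.0 * np.pi))
    return np.sum(per_dim, axis=-1)


def constrain_u(u, w):
    """Adjust u so that w.u > -1, which keeps the planar layer invertible."""
    wu = np.dot(w, u)
    m = -1.0 + np.log1p(np.exp(wu))  # -1 + softplus(w.u)
    return u + (m - wu) * w / np.dot(w, w)


def planar_layer(z, u, w, b):
    """Planar flow f(z) = z + u_hat * tanh(w.z + b) (assumed layer type).

    Returns the transformed batch and log|det df/dz| for each sample.
    """
    u_hat = constrain_u(u, w)
    h = np.tanh(z @ w + b)                    # shape (batch,)
    z_new = z + h[:, None] * u_hat            # shape (batch, dim)
    psi = (1.0 - h ** 2)[:, None] * w         # derivative of tanh times w
    log_det = np.log(np.abs(1.0 + psi @ u_hat))
    return z_new, log_det


def sample_flow_policy(mu, log_std, layers, rng):
    """Sample actions and their log-probabilities from a flow policy.

    mu, log_std: (batch, dim) outputs of a hypothetical policy network.
    layers: list of (u, w, b) planar-flow parameters.
    """
    # Reparametrized sample from the factored Gaussian base distribution.
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(log_std) * eps
    log_prob = gaussian_log_prob(z, mu, log_std)
    # Change of variables: subtract each layer's log|det Jacobian|.
    for u, w, b in layers:
        z, log_det = planar_layer(z, u, w, b)
        log_prob = log_prob - log_det
    return z, log_prob


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mu = np.zeros((4, 2))
    log_std = np.zeros((4, 2))
    layers = [(np.array([0.5, -0.3]), np.array([1.0, 2.0]), 0.1)]
    actions, log_probs = sample_flow_policy(mu, log_std, layers, rng)
    print(actions.shape, log_probs.shape)
```

With an empty `layers` list the sketch reduces to the plain factored Gaussian policy, which makes the added expressiveness of each flow layer easy to isolate. Bounded action spaces would also need a squashing step with its own log-density correction, which is omitted here.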


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

NFTR: From Provable Mode-Averaging to Geodesic Subgoal Selection in Offline Goal-Conditioned RL
cs.LG 2026-07 conditional novelty 6.5

Normalizing-flow subgoal policies plus triangle-slack reweighting provably avoid Gaussian mode-averaging and filter lucky transitions in offline hierarchical GCRL.