Actor-Critic with Active Importance Sampling

Alberto Maria Metelli; Gabor Paczolay; Majid Molaei; Marcello Restelli; Matteo Papini

arxiv: 2605.07094 · v1 · submitted 2026-05-08 · 💻 cs.LG

Majid Molaei, Gabor Paczolay, Matteo Papini, Alberto Maria Metelli, Marcello Restelli

Pith reviewed 2026-05-11 01:08 UTC · model grok-4.3

classification 💻 cs.LG

keywords actor-critic · importance sampling · variance reduction · policy gradients · reinforcement learning · behavior policy optimization · continuous action spaces

The pith

Actor-critic can cut gradient variance by optimizing a separate behavior policy for sampling while keeping estimates unbiased.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes AISAC, which improves actor-critic by actively optimizing the behavior policy that collects data. The optimization uses importance sampling to reduce the variance of the policy-gradient estimates for the target policy. For continuous actions, a Gaussian behavior policy is fitted by cross-entropy minimization. Experiments on the Inverted Pendulum and Half Cheetah tasks report faster learning and better stability than standard actor-critic.

Core claim

AISAC optimizes the behavior policy to minimize gradient variance while preserving unbiased gradient estimates. Using importance sampling principles, the algorithm adapts the behavior policy toward efficient data collection distributions aligned with target policy gradients. For continuous action spaces, AISAC employs Gaussian behavior policies optimized through cross-entropy minimization. We provide theoretical analysis demonstrating variance reduction and unbiasedness. Experiments on Inverted Pendulum and Half Cheetah tasks show improved learning speed, sample efficiency, and training stability compared to standard Actor-Critic methods.
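The unbiasedness half of this claim follows from a standard importance-sampling identity, sketched below for a single state. The notation is an assumption made for illustration and is not taken from the paper: pi_theta is the target policy, beta_phi the behavior policy, and g_theta the per-sample gradient term (score function times action value).

```latex
% Assumed notation (not from the paper): \pi_\theta target, \beta_\phi behavior,
% g_\theta(s,a) = \nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a).
% Requires \beta_\phi(a \mid s) > 0 wherever \pi_\theta(a \mid s)\, g_\theta(s,a) \neq 0.
\begin{align}
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\!\left[ g_\theta(s,a) \right]
  &= \int \beta_\phi(a \mid s)\,
     \frac{\pi_\theta(a \mid s)}{\beta_\phi(a \mid s)}\, g_\theta(s,a)\, \mathrm{d}a
   = \mathbb{E}_{a \sim \beta_\phi(\cdot \mid s)}\!\left[ w(s,a)\, g_\theta(s,a) \right],
  \qquad w(s,a) = \frac{\pi_\theta(a \mid s)}{\beta_\phi(a \mid s)}, \\
\operatorname{tr}\operatorname{Cov}_{a \sim \beta_\phi}\!\left[ w\, g_\theta \right]
  &= \mathbb{E}_{a \sim \beta_\phi}\!\left[ w(s,a)^2\, \lVert g_\theta(s,a) \rVert^2 \right]
   - \left\lVert \mathbb{E}_{a \sim \pi_\theta}\!\left[ g_\theta(s,a) \right] \right\rVert^2 .
\end{align}
```

The mean in the first line does not depend on phi, so only the first term of the second line changes with the behavior policy. How the paper treats the state distribution under off-policy data collection is not stated in this summary, so the sketch is restricted to the action-level identity.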

What carries the argument

The active importance sampling mechanism that optimizes a behavior policy to minimize variance in policy gradient estimates while keeping them unbiased.

Load-bearing premise

Cross-entropy minimization applied to a Gaussian behavior policy will produce a distribution that yields lower variance unbiased gradients in the tested continuous control tasks.
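To make this premise concrete, here is a minimal sketch of a cross-entropy fit for a one-dimensional Gaussian. It assumes the fit targets a density proportional to the target policy times the magnitude of the per-sample gradient term; whether AISAC uses exactly this objective is not stated in the summary. The toy problem (target policy N(theta, 1), reward a squared) and the function name `fit_gaussian_by_cross_entropy` are hypothetical.

```python
import math
import random


def fit_gaussian_by_cross_entropy(actions, scores):
    """Weighted maximum-likelihood fit of a Gaussian N(mean, std**2).

    Minimizing the cross-entropy from a density proportional to
    p(a) * score(a) to a Gaussian reduces to matching the score-weighted
    mean and variance of samples drawn from p.
    """
    total = sum(scores)
    mean = sum(s * a for a, s in zip(actions, scores)) / total
    var = sum(s * (a - mean) ** 2 for a, s in zip(actions, scores)) / total
    return mean, math.sqrt(var)


# Hypothetical toy setup: target policy N(theta, 1), reward r(a) = a**2.
rng = random.Random(0)
theta = 1.0
actions = [rng.gauss(theta, 1.0) for _ in range(20000)]

# Magnitude of the per-sample gradient term: |d/dtheta log N(a; theta, 1) * r(a)|.
scores = [abs((a - theta) * a * a) for a in actions]

beh_mean, beh_std = fit_gaussian_by_cross_entropy(actions, scores)
print("fitted behavior mean and std:", beh_mean, beh_std)
```

In this toy the fitted Gaussian moves away from the target mean and widens, because the weighted samples concentrate where the gradient term is large. If the actions were drawn from a behavior policy instead of the target, each score would also need to be multiplied by its importance weight.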

What would settle it

Running the algorithm on the Inverted Pendulum task and finding that gradient variance is higher with the optimized behavior policy than without it, or that final policy performance is worse than standard actor-critic.
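The variance half of this check can be sketched on a toy problem rather than on Inverted Pendulum. Everything specific here is an assumption for illustration: the one-dimensional target policy N(theta, 1), the reward a squared, and the hand-picked behavior Gaussian (mean 2.3, std 1.25) that stands in for whatever an optimized behavior policy would be.

```python
import math
import random


def normal_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def per_sample_gradients(theta, beh_mean, beh_std, n, rng):
    """Importance-weighted gradient samples for a hypothetical toy problem.

    Target policy: N(theta, 1). Reward: r(a) = a**2, so the true gradient
    of the expected reward with respect to theta is 2 * theta.
    """
    samples = []
    for _ in range(n):
        a = rng.gauss(beh_mean, beh_std)
        weight = normal_pdf(a, theta, 1.0) / normal_pdf(a, beh_mean, beh_std)
        score = a - theta  # d/dtheta log N(a; theta, 1)
        samples.append(weight * score * a * a)
    return samples


def mean_and_variance(xs):
    m = sum(xs) / len(xs)
    v = sum((x - m) ** 2 for x in xs) / (len(xs) - 1)
    return m, v


rng = random.Random(0)
theta = 1.0

# Baseline: behavior policy equals the target policy (all weights are 1).
on_policy = per_sample_gradients(theta, theta, 1.0, 20000, rng)

# Stand-in for an optimized behavior policy (parameters chosen by hand).
shifted = per_sample_gradients(theta, 2.3, 1.25, 20000, rng)

print("true gradient:", 2.0 * theta)
print("on-policy mean, variance:", mean_and_variance(on_policy))
print("shifted-behavior mean, variance:", mean_and_variance(shifted))
```

Both estimators should agree on the mean, and the claim is contradicted if the second variance is not smaller than the first. On a real task the same comparison would be made on per-batch gradient vectors, with the performance half of the check measured separately.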

Figures

Figures reproduced from arXiv: 2605.07094 by Alberto Maria Metelli, Gabor Paczolay, Majid Molaei, Marcello Restelli, Matteo Papini.

Original abstract

This paper introduces the Active-Importance-Sampling Actor-Critic (AISAC) algorithm, an extension of the Actor-Critic framework for reducing variance in policy gradient estimation. AISAC optimizes the behavior policy to minimize gradient variance while preserving unbiased gradient estimates. Using importance sampling principles, the algorithm adapts the behavior policy toward efficient data collection distributions aligned with target policy gradients. For continuous action spaces, AISAC employs Gaussian behavior policies optimized through cross-entropy minimization. We provide theoretical analysis demonstrating variance reduction and unbiasedness. Experiments on Inverted Pendulum and Half Cheetah tasks show improved learning speed, sample efficiency, and training stability compared to standard Actor-Critic methods. Results indicate that optimizing the behavior policy improves both target policy updates and critic estimation accuracy across different hyperparameter settings. AISAC accelerates convergence and stabilizes reinforcement learning training, making it promising for real-world applications. Future work includes integration with advanced algorithms such as Soft Actor-Critic and TD3 for more complex environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AISAC adds active cross-entropy optimization of a Gaussian behavior policy to cut policy-gradient variance in actor-critic while keeping estimates unbiased, but the experiments are thin on details and the heuristic may not deliver the promised reduction in practice.

The main takeaway is that this paper makes the behavior policy active in actor-critic by optimizing its parameters to minimize gradient variance, using cross-entropy minimization on Gaussian policies for continuous actions. That is the concrete extension beyond standard importance sampling setups.

They claim theoretical support for unbiasedness and variance reduction, and they report faster learning and better stability on Inverted Pendulum and Half Cheetah compared with plain actor-critic baselines. The idea of aligning the sampling distribution more tightly with the target gradient is a reasonable direction for sample efficiency in continuous control. The results also suggest the change helps critic estimates across hyperparameter choices, which is a practical plus.

On the downside, the evidence is light. No error bars, no ablation breakdowns, and no visible derivation steps are referenced in the available material, so it is hard to judge how tight the variance bounds actually are once the cross-entropy step is applied. The stress-test point lands: cross-entropy minimization is a surrogate heuristic for the Gaussian parameters, and in higher-dimensional action spaces like Half Cheetah the approximation gap could erase most of the theoretical variance benefit while unbiasedness still holds. That leaves open whether the observed speed-up comes from the intended mechanism or from incidental effects.

This work is aimed at people already working on policy gradients and importance sampling in continuous reinforcement learning. A reader in that niche would find the specific algorithmic tweak and the two-task comparison useful to see. It is not a broad advance, but the claims are concrete enough and the experiments are on standard benchmarks, so it deserves a serious referee rather than a desk reject. I would send it out for review with the expectation that the authors will need to add error bars, ablations, and clearer proof steps.

Referee Report

2 major / 2 minor

Summary. The paper introduces Active Importance Sampling Actor-Critic (AISAC), an extension of standard actor-critic methods that optimizes a behavior policy (via cross-entropy minimization for Gaussian policies in continuous action spaces) to reduce variance in policy-gradient estimates while preserving unbiasedness. It supplies a theoretical analysis of unbiasedness and variance reduction, then reports empirical gains in learning speed, sample efficiency, and stability on Inverted Pendulum and Half Cheetah relative to baseline actor-critic.

Significance. If the claimed variance reduction is realized in practice, the method could improve sample efficiency for policy-gradient algorithms without sacrificing correctness. The explicit use of importance-sampling identities and an external optimization objective for the behavior policy is a clear, falsifiable idea that, if validated, would be a modest but useful addition to the actor-critic literature.

major comments (2)

[§3] §3 (Theoretical Analysis): the variance-reduction claim is derived for the exact minimizer of the gradient-variance objective; the manuscript does not show that the cross-entropy-minimization procedure used in the algorithm converges to a distribution whose importance weights achieve the same reduction, leaving a gap between the proved statement and the implemented method.
[Experiments] Experiments section (Inverted Pendulum and Half Cheetah results): no error bars, standard deviations, or statistical tests are reported, and no ablation isolating the effect of the active behavior-policy update versus other algorithmic changes is provided; this makes it impossible to attribute the observed speed-up specifically to variance reduction.
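The gap raised in the first major comment can be written out in two lines. This is a sketch under assumed notation (target policy pi_theta, per-sample gradient term g_theta, Gaussian behavior family beta_phi); none of these symbols are confirmed by the paper. It states the classical variance-minimizing importance-sampling density and its cross-entropy projection onto the Gaussian family.

```latex
% Assumed notation, for illustration only.
\begin{align}
\beta^\star(a \mid s)
  &= \frac{\pi_\theta(a \mid s)\, \lVert g_\theta(s,a) \rVert}
          {\int \pi_\theta(a' \mid s)\, \lVert g_\theta(s,a') \rVert\, \mathrm{d}a'}, \\
\phi_{\mathrm{CE}}
  &\in \arg\min_{\phi}\; -\,\mathbb{E}_{a \sim \beta^\star(\cdot \mid s)}
       \left[ \log \beta_\phi(a \mid s) \right]
   = \arg\max_{\phi}\; \mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}
       \left[ \lVert g_\theta(s,a) \rVert\, \log \beta_\phi(a \mid s) \right].
\end{align}
```

The first density is generally not Gaussian, so the Gaussian selected by the second line differs from it, and a variance guarantee proved for the exact minimizer does not transfer automatically. That transfer is the step the referee asks the authors to supply.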

minor comments (2)

[Abstract] The abstract states that results hold 'across different hyperparameter settings' but does not specify which settings were varied or how many runs were performed.
[§2] Notation for the behavior-policy parameters and the cross-entropy objective should be introduced earlier and used consistently with the importance-sampling weight definitions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments identify a meaningful gap in the theoretical analysis and highlight the need for greater experimental rigor. We address each point below and describe the revisions we will incorporate.

Referee: [§3] §3 (Theoretical Analysis): the variance-reduction claim is derived for the exact minimizer of the gradient-variance objective; the manuscript does not show that the cross-entropy-minimization procedure used in the algorithm converges to a distribution whose importance weights achieve the same reduction, leaving a gap between the proved statement and the implemented method.

Authors: We agree that the theoretical analysis establishes variance reduction only for the exact minimizer of the gradient-variance objective, while the algorithm employs cross-entropy minimization as a practical surrogate for Gaussian policies. This constitutes a genuine gap between the proved result and the implemented procedure. In the revision we will add a dedicated subsection clarifying this distinction and deriving the conditions under which cross-entropy minimization for Gaussians approximates the variance-minimizing behavior policy. We will also include new empirical plots that directly measure the reduction in gradient variance achieved by the deployed cross-entropy update, thereby providing concrete evidence that the implemented method realizes the intended benefit even if it does not reach the exact theoretical optimum.

revision: partial

Referee: [Experiments] Experiments section (Inverted Pendulum and Half Cheetah results): no error bars, standard deviations, or statistical tests are reported, and no ablation isolating the effect of the active behavior-policy update versus other algorithmic changes is provided; this makes it impossible to attribute the observed speed-up specifically to variance reduction.

Authors: We accept that the current experimental presentation lacks error bars, standard deviations, statistical tests, and an ablation isolating the active behavior-policy update. These omissions limit the strength of the claims. We will rerun all experiments across at least ten independent random seeds, report mean curves with shaded standard-deviation bands, and add paired t-tests or Wilcoxon tests to assess statistical significance against the baselines. In addition, we will introduce an ablation that disables the cross-entropy behavior-policy optimization while keeping all other components fixed, allowing direct attribution of performance gains to the variance-reduction mechanism.

revision: yes
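The reporting promised here could look roughly like the sketch below, which uses only the Python standard library. The ten per-seed returns are synthetic placeholders generated in the code, not results from the paper, and the function names are hypothetical. The Wilcoxon p-value is computed exactly by enumerating all sign patterns, which is feasible for ten seeds and assumes no zero or tied differences.

```python
import itertools
import math
import random
import statistics


def paired_t_statistic(xs, ys):
    """t statistic for paired samples (degrees of freedom = len(xs) - 1)."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))


def wilcoxon_exact_p(xs, ys):
    """Two-sided exact Wilcoxon signed-rank p-value.

    Assumes no zero differences and no ties among absolute differences.
    """
    diffs = [x - y for x, y in zip(xs, ys)]
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0] * len(diffs)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    observed = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2.0
    extreme = 0
    # Under the null hypothesis every sign pattern is equally likely.
    for signs in itertools.product((0, 1), repeat=len(diffs)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= abs(observed - center):
            extreme += 1
    return extreme / 2 ** len(diffs)


# Synthetic placeholder returns for ten seeds (NOT results from the paper).
rng = random.Random(0)
baseline = [rng.gauss(100.0, 10.0) for _ in range(10)]
variant = [b + rng.gauss(5.0, 5.0) for b in baseline]

print("baseline mean, std:", statistics.mean(baseline), statistics.stdev(baseline))
print("variant mean, std:", statistics.mean(variant), statistics.stdev(variant))
print("paired t statistic:", paired_t_statistic(variant, baseline))
print("exact Wilcoxon p-value:", wilcoxon_exact_p(variant, baseline))
```

Pairing by seed only makes sense if both methods share the seed-dependent conditions, such as environment initialization. Turning the t statistic into a p-value needs a Student t distribution, which the standard library does not provide.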

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper's derivation relies on standard importance-sampling identities to establish unbiasedness of the policy gradient estimator and a separate cross-entropy minimization procedure to optimize the Gaussian behavior policy parameters. These steps use external optimization objectives and classical IS variance bounds rather than defining any quantity in terms of itself or renaming a fitted parameter as a prediction. No load-bearing self-citations, uniqueness theorems imported from prior author work, or ansatzes smuggled via citation are present in the provided abstract and reader summary. The central claim of variance reduction for the (approximate) optimal behavior policy is therefore independent of the target policy's own fitted values and remains self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on standard importance-sampling unbiasedness identities, the ability to represent continuous policies as Gaussians, and the assumption that cross-entropy minimization yields a useful behavior distribution; no new physical entities or ad-hoc constants are introduced.

free parameters (1)

Gaussian behavior-policy parameters
Optimized on-line via cross-entropy minimization; treated as learned rather than hand-tuned constants.

axioms (1)

domain assumption: Markov decision process and policy-gradient existence assumptions
Standard background for all actor-critic methods invoked implicitly throughout.

invented entities (1)

Active behavior policy (no independent evidence)
purpose: Data-collection distribution optimized to minimize target gradient variance
Core algorithmic component introduced by the paper; no independent falsifiable evidence supplied beyond the algorithm itself.

pith-pipeline@v0.9.0 · 5473 in / 1330 out tokens · 29108 ms · 2026-05-11T01:08:15.732835+00:00 · methodology
