Variance Reduction in Actor Critic Methods (ACM)

Eric Benhamou

arxiv: 1907.09765 · v1 · pith:6ZBM56AInew · submitted 2019-07-23 · 💻 cs.LG · stat.ML

Pith reviewed 2026-05-24 17:38 UTC · model grok-4.3

keywords actor-critic methods · variance reduction · control variates · projection theorem · policy gradients · A2C · reinforcement learning

The pith

Q-actor-critic and A2C are L2-optimal among control variates conditioned on state and action

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Actor-critic methods act as control variate estimators that subtract a baseline from policy gradient samples to cut variance. The paper applies the projection theorem to prove that both Q-actor-critic and advantage actor-critic methods achieve the smallest L2 error possible inside the subspace of all such estimators that depend only on the current state and action. The paper presents this direct use of the Pythagorean identity in function space as the theoretical justification for the strong empirical performance of these algorithms. The same optimality argument produces a revised A2C formulation that the paper reports to have lower variance than the standard version.
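The variance effect of subtracting a baseline can be checked exactly on a toy problem. The sketch below is an illustration only, not the paper's method or its new A2C formulation: it assumes a hypothetical three-armed bandit with a softmax policy and deterministic rewards, and enumerates the actions to compute the exact mean and total variance of the score-function gradient estimator with and without a state-value baseline. The function and variable names are invented for this example.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def exact_moments(theta, rewards, baseline):
    """Exact mean and total variance of grad log pi(a) * (r(a) - baseline)."""
    pi = softmax(theta)
    k = len(theta)
    # Row a of `score` is grad_theta log pi(a) = e_a - pi for a softmax policy.
    score = np.eye(k) - pi
    # Row a of `samples` is the estimator value when action a is drawn.
    samples = score * (rewards - baseline)[:, None]
    mean = pi @ samples
    second_moment = pi @ np.sum(samples ** 2, axis=1)
    return mean, second_moment - mean @ mean

# Hypothetical toy setup (assumption): uniform policy, rewards far from zero.
theta = np.zeros(3)
rewards = np.array([10.0, 11.0, 12.0])
value = softmax(theta) @ rewards  # V = E_pi[r], a baseline that ignores the action

mean_raw, var_raw = exact_moments(theta, rewards, baseline=0.0)
mean_adv, var_adv = exact_moments(theta, rewards, baseline=value)

print("mean without baseline:", mean_raw)
print("mean with baseline:   ", mean_adv)
print("total variance without / with baseline:", var_raw, var_adv)
```

In this toy case the two means coincide, because a baseline that does not depend on the action leaves the expected gradient unchanged, while the total variance drops sharply once the common reward offset is removed.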

Core claim

Actor Critic Methods are control variate estimators. Using the projection theorem, the Q and Advantage Actor Critic methods are optimal in the L2 norm for the control variate estimators spanned by functions conditioned on the current state and action. This straightforward application of the Pythagorean theorem provides a theoretical justification of the strong performance of QAC and AAC methods and enables derivation of a new formulation for Advantage Actor Critic methods that has lower variance.
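One way to state the projection step precisely is given below. The notation is assumed here for illustration and is not taken from the paper: $G$ is the sampled return, $\mathcal{M}$ is the subspace of square-integrable functions of the current state and action, and expectations are taken under the sampling distribution induced by the policy.

```latex
\[
\langle X, Y\rangle = \mathbb{E}[XY], \qquad
\mathcal{M} = \{\, f(s,a) : \mathbb{E}[f(s,a)^2] < \infty \,\}.
\]
\[
\Pi_{\mathcal{M}}\, G = \mathbb{E}[\, G \mid s, a \,], \qquad
\mathbb{E}\big[(G - f)^2\big]
  = \mathbb{E}\big[(G - \Pi_{\mathcal{M}} G)^2\big]
  + \mathbb{E}\big[(\Pi_{\mathcal{M}} G - f)^2\big]
  \quad \text{for all } f \in \mathcal{M}.
\]
```

Under these assumptions the second term on the right is non-negative and vanishes only at $f = \Pi_{\mathcal{M}} G$, so the conditional expectation (the quantity a Q critic estimates when $G$ is the return following $(s,a)$) is the unique L2 minimizer inside $\mathcal{M}$.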

What carries the argument

Control variate estimators lying in the subspace spanned by functions of the current state and action, shown optimal by the projection theorem

If this is right

QAC and A2C attain the minimal L2 error among all estimators conditioned only on state and action.
The derived A2C formulation has lower variance than the conventional version.
The result supplies a theoretical reason for the observed reliability of these two methods in policy gradient training.
The same projection argument applies to any control variate inside the given subspace.
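The projection claim in the list above has a simple empirical analogue that can be verified numerically. The sketch below assumes a hypothetical tabular setting with synthetic data (nothing here comes from the paper): the per-cell sample mean of the returns plays the role of the projection onto functions of (s, a), and the Pythagorean identity then holds exactly on the sample for any other table g(s, a).

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_samples = 3, 2, 5000

# Hypothetical toy data (assumption): random (s, a) pairs with noisy returns.
states = rng.integers(n_states, size=n_samples)
actions = rng.integers(n_actions, size=n_samples)
true_q = rng.normal(size=(n_states, n_actions))
returns = true_q[states, actions] + rng.normal(scale=2.0, size=n_samples)

# Empirical projection onto functions of (s, a): the sample mean in each cell.
sums = np.zeros((n_states, n_actions))
counts = np.zeros((n_states, n_actions))
np.add.at(sums, (states, actions), returns)
np.add.at(counts, (states, actions), 1)
q_hat = sums / counts

def mse(table):
    """Mean squared error between sampled returns and a table f(s, a)."""
    return np.mean((returns - table[states, actions]) ** 2)

# Any other table g(s, a) is another element of the same subspace.
g = q_hat + rng.normal(scale=0.5, size=q_hat.shape)

lhs = mse(g)
rhs = mse(q_hat) + np.mean((q_hat[states, actions] - g[states, actions]) ** 2)
print("||G - g||^2                   :", lhs)
print("||G - Q||^2 + ||Q - g||^2     :", rhs)
```

The two printed quantities agree up to floating-point error, and the error of `q_hat` is never larger than that of `g`, which is the finite-sample version of the optimality statement.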

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If major variance sources lie outside functions of the current state and action, estimators that condition on additional history could beat the paper's optimum in practice.
The projection technique could be reused to rank other policy-gradient baselines by their L2 distance to the true value function.
Direct comparison of the new A2C rule against standard A2C on benchmark tasks would test whether the variance reduction improves learning speed.

Load-bearing premise

The dominant sources of variance in actor-critic gradient estimates are captured by functions that depend only on the current state and action.

What would settle it

An empirical demonstration that a control variate using information beyond the current state and action produces lower variance than the optimal estimator inside the state-action subspace would show the claimed optimality does not cover practical cases.

Original abstract

After presenting Actor Critic Methods (ACM), we show ACM are control variate estimators. Using the projection theorem, we prove that the Q and Advantage Actor Critic (A2C) methods are optimal in the sense of the $L^2$ norm for the control variate estimators spanned by functions conditioned by the current state and action. This straightforward application of Pythagoras theorem provides a theoretical justification of the strong performance of QAC and AAC most often referred to as A2C methods in deep policy gradient methods. This enables us to derive a new formulation for Advantage Actor Critic methods that has lower variance and improves the traditional A2C method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper uses the projection theorem to call A2C L2-optimal among (s,a)-conditioned control variates and derives a new lower-variance version, but the unbiasedness of the resulting gradient estimator is left unaddressed.

The paper's main move is to treat actor-critic methods as control-variate estimators and then apply the projection theorem to argue that QAC and A2C are optimal in L2 norm inside the subspace of functions of the current state and action. It also produces a new A2C-style estimator that is claimed to have lower variance than the usual one. That unification is the clearest new piece; the rest is mostly a re-framing of existing methods through functional analysis and Pythagoras. The argument is clean on its own terms if the inner product really corresponds to the variance that matters for the policy gradient. The new formulation is presented as a direct consequence, which is the part that could be useful if it holds up.

The soft spot is the one the stress-test note flags. Projecting onto the (s,a) subspace gives the smallest L2 error inside that subspace, but the policy-gradient estimator stays unbiased only when the control variate b satisfies E[∇_θ log π ⋅ b(s,a)] = 0. Nothing in the abstract shows that the projected b automatically meets this condition, and the paper does not appear to add an extra orthogonality constraint. If the full derivation skips this step, the optimality claim applies to a different quantity than the actual variance of the gradient estimator.

The citation pattern is light and the work is mostly self-contained, which is fine for a short theoretical note. This is aimed at people who already work on variance reduction in policy gradients and want a functional-analysis angle. It is coherent enough on its own terms to go to a referee, even though the unbiasedness gap needs to be closed before the optimality result can be taken at face value.

Referee Report

2 major / 1 minor

Summary. The manuscript presents actor-critic methods as control-variate estimators for policy gradients and applies the projection theorem to prove that Q-Actor-Critic and Advantage Actor-Critic (A2C) are L²-optimal within the subspace of control variates that are functions of the current state and action only. It claims this optimality supplies a theoretical justification for their empirical performance and derives a new A2C formulation asserted to have lower variance.

Significance. A rigorous, unbiasedness-preserving derivation of L² optimality for A2C-style estimators would supply a clean functional-analytic explanation for the practical success of these methods and could guide further variance-reduction techniques. The explicit derivation of a new formulation is a concrete contribution if it is shown to be both unbiased and lower-variance.

major comments (2)

[derivation via projection theorem (main text)] The central optimality claim relies on the projection theorem applied to the subspace of (s,a)-measurable functions. However, the resulting control variate b*(s,a) must additionally satisfy E[∇_θ log π(a|s) ⋅ b*(s,a)] = 0 for the policy-gradient estimator to remain unbiased; the manuscript does not impose or verify this orthogonality constraint on the projected solution. This is load-bearing for the claim that QAC/A2C are optimal valid estimators.
[new A2C formulation] The new A2C formulation obtained from the optimality result is presented as having lower variance, yet no explicit variance expression, comparison to the standard A2C baseline, or proof that the new estimator remains unbiased is supplied in the derivation.
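The orthogonality condition raised in the first major comment can be checked exactly in a small case. The sketch below is an illustration under stated assumptions, not a reproduction of the paper's derivation: it takes a hypothetical softmax policy over three actions in a single state and evaluates E_a[∇_θ log π(a) ⋅ b(a)] by enumeration, once for a control variate that is constant across actions and once for one that varies with the action. The helper name `baseline_bias` and all numeric values are invented for this example.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def baseline_bias(theta, b):
    """Exact E_a[ grad_theta log pi(a) * b(a) ] for a softmax policy in one state."""
    pi = softmax(theta)
    # Row a of `score` is grad_theta log pi(a) = e_a - pi.
    score = np.eye(len(theta)) - pi
    return pi @ (score * b[:, None])

# Hypothetical logits and control variates (assumptions for illustration).
theta = np.array([0.2, -0.5, 1.0])
state_only = np.full(3, 4.0)                     # same value for every action
action_dependent = np.array([1.0, 0.0, -2.0])    # varies with the action

bias_state_only = baseline_bias(theta, state_only)
bias_action_dependent = baseline_bias(theta, action_dependent)

print("bias term, state-only b(s):      ", bias_state_only)
print("bias term, action-dependent b(s,a):", bias_action_dependent)
```

In this toy case the state-only control variate gives a zero vector, while the action-dependent one does not, so subtracting it without a correction term would shift the expected gradient. Whether the paper's projected solution avoids this is exactly what the comment asks the authors to verify.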

minor comments (1)

Notation for the control variate and the precise definition of the function space (e.g., the measure with respect to which the L² inner product is taken) should be stated explicitly before the projection argument.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and valuable feedback on our manuscript. We address the major comments below and will incorporate revisions to strengthen the theoretical claims.

Referee: [derivation via projection theorem (main text)] The central optimality claim relies on the projection theorem applied to the subspace of (s,a)-measurable functions. However, the resulting control variate b*(s,a) must additionally satisfy E[∇_θ log π(a|s) ⋅ b*(s,a)] = 0 for the policy-gradient estimator to remain unbiased; the manuscript does not impose or verify this orthogonality constraint on the projected solution. This is load-bearing for the claim that QAC/A2C are optimal valid estimators.

Authors: We appreciate this important observation. The projection theorem is applied within the subspace of (s,a)-measurable functions, but we recognize that the unbiasedness condition must be explicitly verified for the estimator to be valid. In the revised version, we will impose this constraint in the optimization and demonstrate that the resulting b*(s,a) satisfies E[∇_θ log π(a|s) ⋅ b*(s,a)] = 0, thereby confirming that QAC and A2C remain unbiased optimal estimators. revision: yes
Referee: [new A2C formulation] The new A2C formulation obtained from the optimality result is presented as having lower variance, yet no explicit variance expression, comparison to the standard A2C baseline, or proof that the new estimator remains unbiased is supplied in the derivation.

Authors: We agree that the presentation of the new A2C formulation would benefit from additional details. We will add an explicit derivation of the variance of the new estimator, provide a comparison showing it is lower than the standard A2C, and include a proof of unbiasedness by verifying the required orthogonality condition with the score function. revision: yes

Circularity Check

0 steps flagged

No circularity: optimality claim is direct application of external projection theorem to explicitly defined subspace.

The paper states it applies the projection theorem (Pythagoras) to prove QAC/A2C are L2-optimal control variates within the subspace of (s,a)-conditioned functions. This is a standard functional-analysis result applied to a modeling choice made in the paper; no parameter is fitted to data and then renamed a prediction, no self-citation chain is load-bearing for the central step, and no definition is circular (e.g., the subspace is defined independently of the claimed optimum). The derivation is therefore self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the projection theorem from functional analysis applied to the space of state-action conditioned estimators; this is a standard mathematical result whose applicability to the RL variance-reduction setting is taken as given.

axioms (1)

domain assumption: The projection theorem holds in the L2 space of control variate estimators conditioned on state and action.
Invoked to establish optimality of QAC and A2C via Pythagoras.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: Using the projection theorem, we prove that the Q and Advantage Actor Critic (A2C) methods are optimal in the sense of the L² norm for the control variate estimators spanned by functions conditioned by the current state and action.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · relation: unclear
Paper passage: This straightforward application of Pythagoras theorem provides a theoretical justification...

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.