arxiv: 1907.09765 · v1 · pith:6ZBM56AInew · submitted 2019-07-23 · 💻 cs.LG · stat.ML

Variance Reduction in Actor Critic Methods (ACM)

Pith reviewed 2026-05-24 17:38 UTC · model grok-4.3

keywords actor-critic methods · variance reduction · control variates · projection theorem · policy gradients · A2C · reinforcement learning

The pith

Q-actor-critic and A2C are L2-optimal among control variates conditioned on state and action

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Actor-critic methods act as control variate estimators: they subtract a baseline from policy gradient samples to cut variance. The paper applies the projection theorem to prove that both Q-actor-critic (QAC) and advantage actor-critic (A2C) achieve the smallest L2 error possible inside the subspace of all such estimators that depend only on the current state and action. The paper presents this direct use of the Pythagorean identity in function space as a theoretical justification for the strong empirical performance of these algorithms. The same optimality argument yields a revised A2C formulation that the paper states has lower variance than the standard version.
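The control variate idea can be illustrated with a minimal sketch. This is not the paper's experiment or its new formulation: it assumes a hypothetical single-state bandit with a softmax policy, made-up reward means, and the ordinary state-value baseline, and it only shows that subtracting a baseline leaves the mean of the score-function gradient estimate unchanged while shrinking its variance.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def score_samples(theta, rewards_mean, baseline, n, rng):
    """Draw n samples of grad_theta log pi(a) * (r - baseline) for a softmax bandit."""
    pi = softmax(theta)
    actions = rng.choice(len(theta), size=n, p=pi)
    rewards = rewards_mean[actions] + rng.normal(0.0, 1.0, size=n)
    # Gradient of log-softmax with respect to the logits: one_hot(a) - pi
    grad_log_pi = np.eye(len(theta))[actions] - pi
    return grad_log_pi * (rewards - baseline)[:, None]

rng = np.random.default_rng(0)
theta = np.array([0.2, -0.1, 0.4])          # hypothetical policy logits
rewards_mean = np.array([10.0, 11.0, 12.0])  # hypothetical mean reward per action
pi = softmax(theta)
value = float(pi @ rewards_mean)             # state value, used as the baseline

plain = score_samples(theta, rewards_mean, 0.0, 200_000, rng)
with_baseline = score_samples(theta, rewards_mean, value, 200_000, rng)

# Exact policy gradient of the expected reward for this toy problem
exact_grad = pi * (rewards_mean - value)
var_plain = plain.var(axis=0).sum()
var_baseline = with_baseline.var(axis=0).sum()
print("mean (no baseline):  ", plain.mean(axis=0))
print("mean (with baseline):", with_baseline.mean(axis=0))
print("exact gradient:      ", exact_grad)
print("total variance:", var_plain, "->", var_baseline)
```

Both sample means should land close to the exact gradient, while the total variance with the baseline should be much smaller. The large offset in the reward means is chosen deliberately to make the effect visible.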

Core claim

Actor-critic methods are control variate estimators. By the projection theorem, the Q actor-critic (QAC) and advantage actor-critic (AAC, usually called A2C) methods are optimal in the L2 norm among control variate estimators spanned by functions conditioned on the current state and action. This straightforward application of the Pythagorean theorem provides a theoretical justification for the strong performance of QAC and A2C, and it enables the derivation of a new A2C formulation that has lower variance.

What carries the argument

Control variate estimators lying in the subspace spanned by functions of the current state and action, shown optimal by the projection theorem

If this is right

  • QAC and A2C attain the minimal L2 error among all estimators conditioned only on state and action.
  • The derived A2C formulation has lower variance than the conventional version.
  • The result supplies a theoretical reason for the observed reliability of these two methods in policy gradient training.
  • The same projection argument applies to any control variate inside the given subspace.
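The projection argument behind these points can be written out in a few lines. The notation below is assumed for illustration and is not taken from the paper: G is a square-integrable sampled return, (s, a) is the current state-action pair, and S is the space of square-integrable functions of (s, a). It states only the standard fact that the conditional expectation is the L2 projection onto S. How the paper maps this onto the gradient estimator is not reproduced here.

```latex
% Assumed notation (not the paper's): G is the sampled return,
% \mathcal{S} = \{ f(s,a) : \mathbb{E}[f(s,a)^2] < \infty \}.
\[
  \mathbb{E}\big[(G - f(s,a))^2\big]
  = \mathbb{E}\big[(G - \mathbb{E}[G \mid s,a])^2\big]
  + \mathbb{E}\big[(\mathbb{E}[G \mid s,a] - f(s,a))^2\big]
  \qquad \text{for all } f \in \mathcal{S},
\]
\[
  \text{hence}\quad
  \operatorname*{arg\,min}_{f \in \mathcal{S}} \mathbb{E}\big[(G - f(s,a))^2\big]
  = \mathbb{E}[G \mid s,a].
\]
```

The first line is the Pythagorean identity: the cross term vanishes because G minus its conditional expectation is orthogonal to every element of S. When G is the return following (s, a) under the policy, the minimizer is the action-value function Q(s, a).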

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If major variance sources lie outside functions of the current state and action, estimators that condition on additional history could beat the paper's optimum in practice.
  • The projection technique could be reused to rank other policy-gradient baselines by their L2 distance to the true value function.
  • Direct comparison of the new A2C rule against standard A2C on benchmark tasks would test whether the variance reduction improves learning speed.

Load-bearing premise

The dominant sources of variance in actor-critic gradient estimates are captured by functions that depend only on the current state and action.

What would settle it

An empirical demonstration that a control variate using information beyond the current state and action produces lower variance than the optimal estimator inside the state-action subspace would show the claimed optimality does not cover practical cases.

Original abstract

After presenting Actor Critic Methods (ACM), we show ACM are control variate estimators. Using the projection theorem, we prove that the Q and Advantage Actor Critic (A2C) methods are optimal in the sense of the $L^2$ norm for the control variate estimators spanned by functions conditioned by the current state and action. This straightforward application of Pythagoras theorem provides a theoretical justification of the strong performance of QAC and AAC most often referred to as A2C methods in deep policy gradient methods. This enables us to derive a new formulation for Advantage Actor Critic methods that has lower variance and improves the traditional A2C method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents actor-critic methods as control-variate estimators for policy gradients and applies the projection theorem to prove that Q-Actor-Critic and Advantage Actor-Critic (A2C) are L²-optimal within the subspace of control variates that are functions of the current state and action only. It claims this optimality supplies a theoretical justification for their empirical performance and derives a new A2C formulation asserted to have lower variance.

Significance. A rigorous, unbiasedness-preserving derivation of L² optimality for A2C-style estimators would supply a clean functional-analytic explanation for the practical success of these methods and could guide further variance-reduction techniques. The explicit derivation of a new formulation is a concrete contribution if it is shown to be both unbiased and lower-variance.

major comments (2)
  1. [derivation via projection theorem (main text)] The central optimality claim relies on the projection theorem applied to the subspace of (s,a)-measurable functions. However, the resulting control variate b*(s,a) must additionally satisfy E[∇_θ log π(a|s) ⋅ b*(s,a)] = 0 for the policy-gradient estimator to remain unbiased; the manuscript does not impose or verify this orthogonality constraint on the projected solution. This is load-bearing for the claim that QAC/A2C are optimal valid estimators.
  2. [new A2C formulation] The new A2C formulation obtained from the optimality result is presented as having lower variance, yet no explicit variance expression, comparison to the standard A2C baseline, or proof that the new estimator remains unbiased is supplied in the derivation.
minor comments (1)
  1. Notation for the control variate and the precise definition of the function space (e.g., the measure with respect to which the L² inner product is taken) should be stated explicitly before the projection argument.
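The orthogonality condition in major comment 1 can be checked exactly on a toy case. The sketch below assumes a hypothetical single-state softmax policy with made-up logits and baseline values. It computes E[∇_θ log π(a|s) ⋅ b] in closed form for a baseline that is constant across actions and for one that varies with the action. It illustrates the referee's concern only and says nothing about which case the paper's b*(s,a) falls into.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def baseline_bias(theta, b):
    """Exact E_{a~pi}[grad_theta log pi(a) * b(a)] for a softmax policy in one state."""
    pi = softmax(theta)
    # Row a of this matrix is grad_theta log pi(a) = one_hot(a) - pi
    grad_log_pi = np.eye(len(theta)) - pi
    return (pi[:, None] * grad_log_pi * b[:, None]).sum(axis=0)

theta = np.array([0.2, -0.1, 0.4])   # hypothetical logits
pi = softmax(theta)

b_const = np.full(3, 5.0)            # baseline that ignores the action
b_action = np.array([1.0, 2.0, 3.0]) # baseline that depends on the action

bias_state_only = baseline_bias(theta, b_const)
bias_action_dep = baseline_bias(theta, b_action)
print("state-only baseline bias:      ", bias_state_only)
print("action-dependent baseline bias:", bias_action_dep)
```

The state-only baseline gives a zero vector, so subtracting it leaves the gradient estimate unbiased. The action-dependent baseline gives a nonzero vector, so an estimator that subtracts it needs either a correction term or a proof that the expectation vanishes.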

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and valuable feedback on our manuscript. We address the major comments below and will incorporate revisions to strengthen the theoretical claims.

Point-by-point responses
  1. Referee: [derivation via projection theorem (main text)] The central optimality claim relies on the projection theorem applied to the subspace of (s,a)-measurable functions. However, the resulting control variate b*(s,a) must additionally satisfy E[∇_θ log π(a|s) ⋅ b*(s,a)] = 0 for the policy-gradient estimator to remain unbiased; the manuscript does not impose or verify this orthogonality constraint on the projected solution. This is load-bearing for the claim that QAC/A2C are optimal valid estimators.

    Authors: We appreciate this important observation. The projection theorem is applied within the subspace of (s,a)-measurable functions, but we recognize that the unbiasedness condition must be explicitly verified for the estimator to be valid. In the revised version, we will impose this constraint in the optimization and demonstrate that the resulting b*(s,a) satisfies E[∇_θ log π(a|s) ⋅ b*(s,a)] = 0, thereby confirming that QAC and A2C remain unbiased optimal estimators. revision: yes

  2. Referee: [new A2C formulation] The new A2C formulation obtained from the optimality result is presented as having lower variance, yet no explicit variance expression, comparison to the standard A2C baseline, or proof that the new estimator remains unbiased is supplied in the derivation.

    Authors: We agree that the presentation of the new A2C formulation would benefit from additional details. We will add an explicit derivation of the variance of the new estimator, provide a comparison showing it is lower than the standard A2C, and include a proof of unbiasedness by verifying the required orthogonality condition with the score function. revision: yes

Circularity Check

0 steps flagged

No circularity: the optimality claim is a direct application of an external projection theorem to an explicitly defined subspace.

Full rationale

The paper states it applies the projection theorem (Pythagoras) to prove QAC/A2C are L2-optimal control variates within the subspace of (s,a)-conditioned functions. This is a standard functional-analysis result applied to a modeling choice made in the paper; no parameter is fitted to data and then renamed a prediction, no self-citation chain is load-bearing for the central step, and no definition is circular (e.g., the subspace is defined independently of the claimed optimum). The derivation is therefore self-contained against external benchmarks and receives the default non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the projection theorem from functional analysis applied to the space of state-action conditioned estimators; this is a standard mathematical result whose applicability to the RL variance-reduction setting is taken as given.

axioms (1)
  • Domain assumption: the projection theorem holds in the L2 space of control variate estimators conditioned on state and action.
    Invoked to establish the optimality of QAC and A2C via the Pythagorean theorem.


