Variance Reduction in Actor Critic Methods (ACM)
Pith reviewed 2026-05-24 17:38 UTC · model grok-4.3
The pith
Q-actor-critic and A2C are L2-optimal among control variates conditioned on state and action
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Actor Critic Methods are control variate estimators. By the projection theorem, the Q and Advantage Actor Critic methods are optimal in the L2 norm among the control variate estimators spanned by functions conditioned on the current state and action. This straightforward application of the Pythagorean theorem provides a theoretical justification for the strong performance of QAC and AAC methods and enables the derivation of a new formulation of Advantage Actor Critic methods that has lower variance.
What carries the argument
Control variate estimators lying in the subspace spanned by functions of the current state and action, shown optimal by the projection theorem
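A sketch of the projection argument, in notation assumed here rather than taken from the paper: $G$ is a square-integrable return-based target, $\mathcal{S}$ is the closed subspace of square-integrable functions of the current state and action $(s,a)$, and the inner product is $\langle f, g \rangle = \mathbb{E}[fg]$ under the on-policy trajectory distribution.

```latex
\begin{align*}
P_{\mathcal{S}} G &= \mathbb{E}[G \mid s, a], \\
\|G - h\|_{L^2}^2 &= \|G - P_{\mathcal{S}} G\|_{L^2}^2 + \|P_{\mathcal{S}} G - h\|_{L^2}^2
  \quad \text{for every } h \in \mathcal{S}, \\
\operatorname*{arg\,min}_{h \in \mathcal{S}} \|G - h\|_{L^2} &= \mathbb{E}[G \mid s, a].
\end{align*}
```

When $G$ is the discounted return collected from $(s,a)$ onward, the conditional expectation is $Q^\pi(s,a)$, which is one way to see why the state-action value function appears as the minimizer inside this subspace.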
If this is right
- QAC and A2C attain the minimal L2 error among all estimators conditioned only on state and action.
- The derived A2C formulation has lower variance than the conventional version.
- The result supplies a theoretical reason for the observed reliability of these two methods in policy gradient training.
- The same projection argument applies to any control variate inside the given subspace.
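The variance ordering suggested by these points can be illustrated on a toy problem. The following is a minimal sketch, not the paper's experiment: it assumes a single-state softmax bandit with Gaussian reward noise, and the helper `estimator_stats` and all numbers are hypothetical. It computes the exact mean and total variance (trace of the covariance) of three one-step gradient estimators: the raw noisy return, its conditional mean `q[a]` (QAC-style), and the advantage `q[a] - v` (A2C-style).

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def estimator_stats(theta, q, noise_std, kind):
    """Exact mean and total variance of score(a) * target for a softmax bandit.

    Assumed toy model: reward R = q[a] + eps, eps ~ N(0, noise_std**2),
    with eps independent of the action a.
    """
    pi = softmax(theta)
    scores = np.eye(len(pi)) - pi          # row a is grad_theta log pi(a) = e_a - pi
    if kind == "mc":                       # target is the noisy return R
        target, target_var = q, noise_std ** 2
    elif kind == "qac":                    # target is E[R | a] = q[a]
        target, target_var = q, 0.0
    elif kind == "a2c":                    # target is q[a] - v with v = sum_a pi_a q_a
        target, target_var = q - pi @ q, 0.0
    else:
        raise ValueError(f"unknown kind: {kind}")
    mean = (pi * target) @ scores          # sum_a pi_a * target_a * score_a
    sq_norm = (scores ** 2).sum(axis=1)    # ||score_a||^2 for each action
    second_moment = np.sum(pi * sq_norm * (target ** 2 + target_var))
    return mean, second_moment - mean @ mean

theta = np.array([0.2, -0.1, 0.4])
q = np.array([10.0, 11.0, 12.0])
for kind in ("mc", "qac", "a2c"):
    mean, var = estimator_stats(theta, q, noise_std=2.0, kind=kind)
    print(kind, np.round(mean, 4), round(float(var), 4))
```

All three estimators share the same mean. Replacing the noisy return by its conditional mean can only remove variance in this model, whereas the further reduction from subtracting `v` holds for these illustrative numbers and is not guaranteed for every choice of `q` and `theta`.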
Where Pith is reading between the lines
- If major variance sources lie outside functions of the current state and action, estimators that condition on additional history could beat the paper's optimum in practice.
- The projection technique could be reused to rank other policy-gradient baselines by their L2 distance to the true value function.
- Direct comparison of the new A2C rule against standard A2C on benchmark tasks would test whether the variance reduction improves learning speed.
Load-bearing premise
The dominant sources of variance in actor-critic gradient estimates are captured by functions that depend only on the current state and action.
What would settle it
An empirical demonstration that a control variate using information beyond the current state and action produces lower variance than the optimal estimator inside the state-action subspace would show the claimed optimality does not cover practical cases.
Original abstract
After presenting Actor Critic Methods (ACM), we show ACM are control variate estimators. Using the projection theorem, we prove that the Q and Advantage Actor Critic (A2C) methods are optimal in the sense of the $L^2$ norm for the control variate estimators spanned by functions conditioned by the current state and action. This straightforward application of Pythagoras theorem provides a theoretical justification of the strong performance of QAC and AAC most often referred to as A2C methods in deep policy gradient methods. This enables us to derive a new formulation for Advantage Actor Critic methods that has lower variance and improves the traditional A2C method.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents actor-critic methods as control-variate estimators for policy gradients and applies the projection theorem to prove that Q-Actor-Critic and Advantage Actor-Critic (A2C) are L²-optimal within the subspace of control variates that are functions of the current state and action only. It claims this optimality supplies a theoretical justification for their empirical performance and derives a new A2C formulation asserted to have lower variance.
Significance. A rigorous, unbiasedness-preserving derivation of L² optimality for A2C-style estimators would supply a clean functional-analytic explanation for the practical success of these methods and could guide further variance-reduction techniques. The explicit derivation of a new formulation is a concrete contribution if it is shown to be both unbiased and lower-variance.
major comments (2)
- [derivation via projection theorem (main text)] The central optimality claim relies on the projection theorem applied to the subspace of (s,a)-measurable functions. However, the resulting control variate b*(s,a) must additionally satisfy E[∇_θ log π(a|s) ⋅ b*(s,a)] = 0 for the policy-gradient estimator to remain unbiased; the manuscript does not impose or verify this orthogonality constraint on the projected solution. This is load-bearing for the claim that QAC/A2C are optimal valid estimators.
- [new A2C formulation] The new A2C formulation obtained from the optimality result is presented as having lower variance, yet no explicit variance expression, comparison to the standard A2C baseline, or proof that the new estimator remains unbiased is supplied in the derivation.
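The orthogonality condition raised in the first major comment can be checked numerically on a small example. This is a minimal sketch under stated assumptions: a single state, a softmax policy over three actions, and a hypothetical helper `baseline_bias` that evaluates E[∇_θ log π(a|s) ⋅ b(s,a)] exactly by summing over actions.

```python
import numpy as np

def softmax(theta):
    z = np.exp(theta - np.max(theta))
    return z / z.sum()

def baseline_bias(theta, b):
    """Exact E_{a ~ pi}[grad_theta log pi(a) * b(a)] for a one-state softmax policy."""
    pi = softmax(theta)
    scores = np.eye(len(pi)) - pi      # row a is grad_theta log pi(a) = e_a - pi
    return (pi * b) @ scores           # sum_a pi_a * b_a * score_a

theta = np.array([0.3, -0.2, 0.1])

# A baseline that does not depend on the action: the bias term vanishes.
print(baseline_bias(theta, np.full(3, 5.0)))

# An action-dependent control variate: the bias term is generally nonzero.
print(baseline_bias(theta, np.array([1.0, 2.0, 3.0])))
```

For this policy class the term equals `pi * (b - pi @ b)` componentwise, so it is zero exactly when `b` is constant across actions. That is consistent with the referee's point that the condition has to be verified for an action-dependent b*(s,a) rather than assumed.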
minor comments (1)
- Notation for the control variate and the precise definition of the function space (e.g., the measure with respect to which the L² inner product is taken) should be stated explicitly before the projection argument.
Simulated Author's Rebuttal
We thank the referee for their detailed review and valuable feedback on our manuscript. We address the major comments below and will incorporate revisions to strengthen the theoretical claims.
Point-by-point responses
- Referee: [derivation via projection theorem (main text)] The central optimality claim relies on the projection theorem applied to the subspace of (s,a)-measurable functions. However, the resulting control variate b*(s,a) must additionally satisfy E[∇_θ log π(a|s) ⋅ b*(s,a)] = 0 for the policy-gradient estimator to remain unbiased; the manuscript does not impose or verify this orthogonality constraint on the projected solution. This is load-bearing for the claim that QAC/A2C are optimal valid estimators.
  Authors: We appreciate this important observation. The projection theorem is applied within the subspace of (s,a)-measurable functions, but we recognize that the unbiasedness condition must be explicitly verified for the estimator to be valid. In the revised version, we will impose this constraint in the optimization and demonstrate that the resulting b*(s,a) satisfies E[∇_θ log π(a|s) ⋅ b*(s,a)] = 0, thereby confirming that QAC and A2C remain unbiased optimal estimators.
  Revision: yes
- Referee: [new A2C formulation] The new A2C formulation obtained from the optimality result is presented as having lower variance, yet no explicit variance expression, comparison to the standard A2C baseline, or proof that the new estimator remains unbiased is supplied in the derivation.
  Authors: We agree that the presentation of the new A2C formulation would benefit from additional details. We will add an explicit derivation of the variance of the new estimator, provide a comparison showing it is lower than the standard A2C, and include a proof of unbiasedness by verifying the required orthogonality condition with the score function.
  Revision: yes
Circularity Check
No circularity: optimality claim is direct application of external projection theorem to explicitly defined subspace.
Full rationale
The paper states it applies the projection theorem (Pythagoras) to prove QAC/A2C are L2-optimal control variates within the subspace of (s,a)-conditioned functions. This is a standard functional-analysis result applied to a modeling choice made in the paper; no parameter is fitted to data and then renamed a prediction, no self-citation chain is load-bearing for the central step, and no definition is circular (e.g., the subspace is defined independently of the claimed optimum). The derivation is therefore self-contained against external benchmarks and receives the default non-finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The projection theorem holds in the L2 space of control variate estimators conditioned on state and action.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  "Using the projection theorem, we prove that the Q and Advantage Actor Critic (A2C) methods are optimal in the sense of the L² norm for the control variate estimators spanned by functions conditioned by the current state and action."
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
  Relation between the paper passage and the cited Recognition theorem: unclear
  "This straightforward application of Pythagoras theorem provides a theoretical justification..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.