On the Global Convergence of Actor-Critic: A Case for Linear Quadratic Regulator with Ergodic Cost
Pith reviewed 2026-05-24 21:30 UTC · model grok-4.3
The pith
Actor-critic converges to a globally optimal policy and critic at linear rate for linear quadratic regulators with ergodic cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
In the linear quadratic regulator with ergodic cost, actor-critic finds a globally optimal pair of actor (policy) and critic (action-value function) at a linear rate of convergence.
What carries the argument
Actor-critic's alternating updates, a policy-gradient step for the actor and a least-squares step for the critic. The analysis exploits the quadratic structure and the ergodic cost to show that this alternation reaches the global optimum.
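The alternation can be written out in one common LQR notation. This is a sketch under stated assumptions, not the paper's own statement: the symbols $A$, $B$, $Q$, $R$, $P_K$, $\Theta_K$, and the step size $\eta$ are not defined on this page, and the policy is assumed to be linear, $u = -Kx$. The action-value function is written $\mathcal{Q}_K$ to avoid a clash with the state-cost matrix $Q$.

```latex
% Assumed notation: P_K is the cost matrix of the linear policy u = -Kx.
P_K = Q + K^{\top} R K + (A - BK)^{\top} P_K (A - BK)

% Critic target: the action-value function of K is quadratic in (x, u),
% up to an additive constant.
\mathcal{Q}_K(x, u) =
  \begin{pmatrix} x \\ u \end{pmatrix}^{\top}
  \Theta_K
  \begin{pmatrix} x \\ u \end{pmatrix} + \text{const},
\qquad
\Theta_K =
  \begin{pmatrix}
    Q + A^{\top} P_K A & A^{\top} P_K B \\
    B^{\top} P_K A     & R + B^{\top} P_K B
  \end{pmatrix}

% Actor step: move K along the direction read off the critic's blocks.
K \leftarrow K - \eta \left( \Theta_K^{22} K - \Theta_K^{21} \right)
```

In this notation the actor step has a fixed point exactly where $K = (\Theta_K^{22})^{-1} \Theta_K^{21}$, the familiar policy-improvement gain, which is why an accurate critic is enough to drive the actor.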
If this is right
- Actor-critic is globally convergent rather than merely locally stable under these dynamics.
- The convergence rate is linear and nonasymptotic, supplying explicit iteration bounds.
- The result offers a concrete instance in which bilevel optimization with nonconvex subproblems admits a global solution via simple alternation.
- The analysis technique may extend to other structured reinforcement learning problems that share the same quadratic and ergodic properties.
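To make "linear rate" and "explicit iteration bounds" concrete, the following worked step shows how a per-iteration contraction turns into an iteration count. The symbols $e_t$ (error after $t$ iterations), $\rho$ (contraction factor), and $\varepsilon$ (target accuracy) are illustrative; this page does not state the paper's actual constants.

```latex
% Linear (geometric) convergence with factor 0 < \rho < 1:
e_{t+1} \le \rho \, e_t
\quad\Longrightarrow\quad
e_T \le \rho^{T} e_0 .

% Solving \rho^{T} e_0 \le \varepsilon for T gives the iteration bound:
T \;\ge\; \frac{\log (e_0 / \varepsilon)}{\log (1 / \rho)} .
```

For example, with $\rho = 0.5$ each additional decimal digit of accuracy costs about $\log 10 / \log 2 \approx 3.3$ iterations, independent of how small the error already is.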
Where Pith is reading between the lines
- If the same linear rate holds when the cost is only approximately quadratic, the method could apply to nearby control problems.
- The global optimality proof may indicate why actor-critic succeeds on some non-LQR tasks despite the general NP-hardness of bilevel problems.
- Removing the ergodic-cost assumption while keeping linear dynamics would test whether the linear rate survives.
Load-bearing premise
The problem must satisfy the specific structural assumptions of the linear quadratic regulator with ergodic cost.
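For readers who want the premise spelled out, here is one standard way to state an LQR with ergodic (long-run average) cost. It is a sketch in assumed notation: the matrices $A$, $B$, $Q$, $R$, the noise covariance $\Psi$, and the linear policy form are not given on this page and may differ from the paper's exact setup.

```latex
% Linear dynamics with additive Gaussian noise
x_{t+1} = A x_t + B u_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \Psi)

% Quadratic per-step cost, with Q and R positive definite
c(x, u) = x^{\top} Q x + u^{\top} R u

% Ergodic cost of a linear policy u_t = -K x_t
J(K) = \lim_{T \to \infty} \frac{1}{T} \,
       \mathbb{E}\!\left[ \sum_{t=0}^{T-1} c(x_t, u_t) \right]

% J(K) is finite only for stabilizing K, i.e. spectral radius below one
\rho(A - BK) < 1
```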
What would settle it
A numerical run of actor-critic on an LQR instance with ergodic cost in which the policy or critic error fails to decrease linearly to the known optimum.
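A minimal sketch of such a run follows. It is not the paper's algorithm: the critic here is an exact model-based policy evaluation (a linear solve of the Lyapunov equation) swapped in for a sampled least-squares critic, the policy is deterministic ($u = -Kx$), and the instance (`A`, `B`, `Q`, `R`), the step-size rule, and all function names are hypothetical choices made for illustration. The known optimum is obtained separately by Riccati value iteration, and the script records how far the actor and critic are from it at each alternation.

```python
import numpy as np

def evaluate_policy(A, B, Q, R, K):
    """Critic stand-in: solve P = Q + K'RK + (A-BK)' P (A-BK) exactly."""
    Acl = A - B @ K
    C = Q + K.T @ R @ K
    n = A.shape[0]
    # Row-major vectorization: vec(Acl' P Acl) = kron(Acl', Acl') vec(P).
    M = np.eye(n * n) - np.kron(Acl.T, Acl.T)
    return np.linalg.solve(M, C.reshape(-1)).reshape(n, n)

def actor_gradient(A, B, R, K, P):
    """Update direction (R + B'PB) K - B'PA; zero at the optimal gain."""
    return (R + B.T @ P @ B) @ K - B.T @ P @ A

def riccati_solution(A, B, Q, R, iters=2000):
    """Reference optimum via Riccati value iteration (assumes convergence)."""
    P = Q.copy()
    for _ in range(iters):
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    return K, P

# Hypothetical instance: A is stable, so K = 0 is a valid starting policy.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)

K_star, P_star = riccati_solution(A, B, Q, R)

K = np.zeros((1, 2))
P = evaluate_policy(A, B, Q, R, K)
# Conservative fixed step size based on the initial critic (an assumption).
eta = 0.5 / np.linalg.norm(R + B.T @ P @ B, 2)

policy_errors, critic_errors = [], []
for _ in range(200):
    P = evaluate_policy(A, B, Q, R, K)            # critic step
    policy_errors.append(np.linalg.norm(K - K_star))
    critic_errors.append(np.linalg.norm(P - P_star))
    K = K - eta * actor_gradient(A, B, R, K, P)   # actor step

# Under a linear rate, successive error ratios settle near a constant < 1.
ratios = [policy_errors[t + 1] / policy_errors[t] for t in range(5)]
print("first error ratios:", ratios)
print("final policy error:", policy_errors[-1])
print("final critic error:", critic_errors[-1])
```

Plotting `policy_errors` on a logarithmic axis should give a roughly straight line until floating-point precision is reached. A run in which the line flattens well above that floor, or bends upward, would be the kind of counterexample described above, provided the critic and step size match the paper's assumptions.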
Original abstract
Despite the empirical success of the actor-critic algorithm, its theoretical understanding lags behind. In a broader context, actor-critic can be viewed as an online alternating update algorithm for bilevel optimization, whose convergence is known to be fragile. To understand the instability of actor-critic, we focus on its application to linear quadratic regulators, a simple yet fundamental setting of reinforcement learning. We establish a nonasymptotic convergence analysis of actor-critic in this setting. In particular, we prove that actor-critic finds a globally optimal pair of actor (policy) and critic (action-value function) at a linear rate of convergence. Our analysis may serve as a preliminary step towards a complete theoretical understanding of bilevel optimization with nonconvex subproblems, which is NP-hard in the worst case and is often solved using heuristics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper provides a nonasymptotic convergence analysis of the actor-critic algorithm applied to the linear quadratic regulator (LQR) with ergodic cost. It proves that the algorithm converges globally to the optimal pair of actor (policy) and critic (action-value function) at a linear rate, leveraging the structural assumptions of linear dynamics, quadratic costs, and ergodicity.
Significance. If the result holds, this offers the first rigorous nonasymptotic global convergence guarantee with linear rate for actor-critic in a canonical RL setting. The explicit use of LQR structure to obtain global (rather than local) optimality together with an explicit linear rate is a clear strength, and it provides a falsifiable case study for bilevel optimization in RL, where the general problem is NP-hard. The scope is appropriately limited to LQR, and the work positions itself as a preliminary step toward broader theory.
minor comments (2)
- The abstract and introduction could include a one-sentence statement of the key structural assumptions (linear dynamics, quadratic cost, ergodicity) that enable the global linear-rate result, to help readers quickly assess applicability.
- Notation for the ergodic cost functional and the associated Bellman operator should be cross-referenced explicitly between the problem formulation and the convergence theorem to improve readability.
Simulated Author's Rebuttal
We thank the referee for their positive review, accurate summary of our contributions, and recommendation to accept the paper. No major comments were provided, so we have no points to address.
Circularity Check
No significant circularity detected
full rationale
The paper presents a nonasymptotic convergence proof for actor-critic on the LQR with ergodic cost, using the linear dynamics and quadratic cost structure to establish global optimality and linear rate. No derivation step reduces by construction to a fitted parameter, self-citation, or renamed input; the analysis is self-contained within the explicitly stated structural assumptions and does not invoke load-bearing self-citations or ansatzes that collapse the claim.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The environment is a linear quadratic regulator with ergodic cost.
Forward citations
Cited by 1 Pith paper
- Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum
Single-timescale actor-critic with STORM momentum and a recent-sample buffer achieves optimal O(ε^{-2}) sample complexity for ε-optimal policies in finite discounted MDPs.