On the Global Convergence of Actor-Critic: A Case for Linear Quadratic Regulator with Ergodic Cost

Mingyi Hong; Yongxin Chen; Zhaoran Wang; Zhuoran Yang

arXiv: 1907.06246 · v1 · pith:4EFJ5B5Enew · submitted 2019-07-14 · cs.LG · math.OC · stat.ML

Pith reviewed 2026-05-24 21:30 UTC · model grok-4.3

classification cs.LG · math.OC · stat.ML

keywords actor-critic · linear quadratic regulator · global convergence · ergodic cost · reinforcement learning · bilevel optimization · linear rate

The pith

Actor-critic converges to a globally optimal policy and critic at linear rate for linear quadratic regulators with ergodic cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves a nonasymptotic convergence result for actor-critic applied to the linear quadratic regulator with ergodic cost. It shows that the alternating updates between policy and action-value function reach the globally optimal pair at a linear rate. This addresses the known fragility of actor-critic as an online method for bilevel optimization. A sympathetic reader cares because the result supplies the first concrete guarantee of global optimality and linear speed in a core reinforcement learning setting where theory had lagged empirical use.

Core claim

In the linear quadratic regulator with ergodic cost, actor-critic finds a globally optimal pair of actor (policy) and critic (action-value function) at a linear rate of convergence.
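
For orientation, the setting behind this claim can be written out in standard textbook form. The notation below ($A$, $B$, $Q$, $R$, $\Psi$, $K$, $P_K$) is an assumption chosen for illustration and may differ from the paper's; the expressions hold for a deterministic linear policy whose closed loop $A - BK$ is stable.

```latex
\begin{aligned}
x_{t+1} &= A x_t + B u_t + \epsilon_t, \qquad \epsilon_t \sim \mathcal{N}(0, \Psi), \\
c(x, u) &= x^\top Q x + u^\top R u, \qquad Q \succ 0,\ R \succ 0, \\
J(K) &= \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[\sum_{t=0}^{T-1} c(x_t, u_t)\Big], \qquad u_t = -K x_t, \\
P_K &= Q + K^\top R K + (A - BK)^\top P_K (A - BK), \qquad J(K) = \operatorname{tr}(P_K \Psi), \\
K^\star &= (R + B^\top P_{K^\star} B)^{-1} B^\top P_{K^\star} A .
\end{aligned}
```

Here $J(K)$ is the ergodic (long-run average) cost, $P_K$ solves a discrete Lyapunov equation for the policy $K$, and $K^\star$ is the gain obtained from the fixed point of the Riccati recursion.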

What carries the argument

The argument rests on actor-critic's alternating updates, a policy-gradient step for the actor and a least-squares step for the critic. The analysis exploits the quadratic structure and the ergodic cost to obtain global optimality.
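
A minimal sketch of the alternation, under assumptions that are not taken from the paper: the critic is computed exactly from a Lyapunov equation instead of being estimated from samples, the actor takes a natural-gradient-style step on the gain, the noise covariance is the identity, and the system matrices, step size, and function names (`critic`, `actor_critic`, `riccati_gain`) are illustrative.

```python
import numpy as np

def critic(A, B, Q, R, K):
    """Exact critic for the policy u = -K x (assumes A - B K is stable).

    Returns P_K and the blocks of the quadratic Q-function that the actor needs.
    """
    L = A - B @ K                                  # closed-loop dynamics
    n = A.shape[0]
    stage = Q + K.T @ R @ K                        # per-step cost matrix under K
    # Solve P = stage + L^T P L by vectorisation (row-major flattening).
    lhs = np.eye(n * n) - np.kron(L.T, L.T)
    P = np.linalg.solve(lhs, stage.reshape(-1)).reshape(n, n)
    theta_21 = B.T @ P @ A                         # state-action cross block
    theta_22 = R + B.T @ P @ B                     # action-action block
    return P, theta_21, theta_22

def actor_critic(A, B, Q, R, K0, step=0.1, iters=200):
    """Alternate an exact critic step with a gradient-style actor step."""
    K, costs = K0.copy(), []
    for _ in range(iters):
        P, theta_21, theta_22 = critic(A, B, Q, R, K)   # critic step
        costs.append(np.trace(P))                        # ergodic cost, unit noise covariance
        K = K - step * (theta_22 @ K - theta_21)         # actor step
    return K, costs

def riccati_gain(A, B, Q, R, iters=2000):
    """Reference optimal gain from the Riccati fixed-point recursion."""
    P = Q.copy()
    for _ in range(iters):
        G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A - A.T @ P @ B @ G
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# Hypothetical two-state, one-input instance; K0 = 0 is stabilizing because A is stable.
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
K, costs = actor_critic(A, B, Q, R, K0=np.zeros((1, 2)))
K_star = riccati_gain(A, B, Q, R)
print("learned gain:", K)
print("Riccati gain:", K_star)
```

The step size is chosen small enough for this instance that every iterate stays stabilizing; a sample-based critic would add an estimation error that this sketch does not model.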

If this is right

Actor-critic is globally convergent rather than merely locally stable under these dynamics.
The convergence rate is linear and nonasymptotic, supplying explicit iteration bounds.
The result offers a concrete instance where bilevel optimization with nonconvex subproblems admits global solution via simple alternation.
The analysis technique may extend to other structured reinforcement learning problems that share the same quadratic and ergodic properties.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the same linear rate holds when the cost is only approximately quadratic, the method could apply to nearby control problems.
The global optimality proof may indicate why actor-critic succeeds on some non-LQR tasks despite the general NP-hardness of bilevel problems.
Removing the ergodic-cost assumption while keeping linear dynamics would test whether the linear rate survives.

Load-bearing premise

The problem must satisfy the specific structural assumptions of the linear quadratic regulator with ergodic cost.

What would settle it

A numerical run of actor-critic on an LQR instance with ergodic cost in which the policy or critic error fails to decrease linearly to the known optimum.
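
One way such a check could look, as a hedged sketch on a scalar instance: the parameters `a`, `b`, `q`, `r`, the step size, and the helper names are hypothetical, and the critic is again exact, so this diagnoses the idealized recursion and not the paper's sample-based algorithm. A linear rate would show up as one-step error ratios that stay bounded below 1.

```python
def stationary_p(a, b, q, r, k):
    """Value coefficient P_k of the policy u = -k x (requires |a - b k| < 1)."""
    closed = a - b * k
    assert abs(closed) < 1.0, "policy must be stabilizing"
    return (q + r * k * k) / (1.0 - closed * closed)

def optimal_gain(a, b, q, r, iters=500):
    """Reference optimum from the scalar Riccati recursion."""
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    return a * b * p / (r + b * b * p)

def run(a=0.9, b=0.5, q=1.0, r=1.0, k0=0.0, step=0.2, iters=40):
    """Alternate exact critic and actor steps, recording the gain error."""
    k_star = optimal_gain(a, b, q, r)
    k, errors = k0, []
    for _ in range(iters):
        errors.append(abs(k - k_star))
        p = stationary_p(a, b, q, r, k)                  # critic step
        k -= step * ((r + b * b * p) * k - a * b * p)    # actor step
    errors.append(abs(k - k_star))
    return k, k_star, errors

k, k_star, errors = run()
# Ratio of successive errors; a linear rate keeps these bounded below 1.
ratios = [e1 / e0 for e0, e1 in zip(errors, errors[1:])]
# Geometric-mean contraction factor over the whole run.
rate = (errors[-1] / errors[0]) ** (1.0 / (len(errors) - 1))
print(f"gain error after {len(errors) - 1} steps: {errors[-1]:.2e}")
print(f"worst one-step ratio: {max(ratios):.3f}, average rate: {rate:.3f}")
```

A run in which the ratios drift toward 1, or the error stalls above the known optimum, would be the kind of counterexample described above.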

Original abstract

Despite the empirical success of the actor-critic algorithm, its theoretical understanding lags behind. In a broader context, actor-critic can be viewed as an online alternating update algorithm for bilevel optimization, whose convergence is known to be fragile. To understand the instability of actor-critic, we focus on its application to linear quadratic regulators, a simple yet fundamental setting of reinforcement learning. We establish a nonasymptotic convergence analysis of actor-critic in this setting. In particular, we prove that actor-critic finds a globally optimal pair of actor (policy) and critic (action-value function) at a linear rate of convergence. Our analysis may serve as a preliminary step towards a complete theoretical understanding of bilevel optimization with nonconvex subproblems, which is NP-hard in the worst case and is often solved using heuristics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper delivers a non-asymptotic linear-rate global convergence proof for actor-critic specifically on ergodic-cost LQR.

The main takeaway is that actor-critic reaches the globally optimal policy and value function at a linear rate in the LQR ergodic-cost case, with an explicit non-asymptotic bound. This is the first such guarantee I have seen for the algorithm in this model class, and it is presented cleanly as a step toward understanding bilevel optimization with nonconvex inner problems. The authors correctly restrict the claim to linear dynamics and quadratic costs, which lets them exploit the structure for global optimality instead of settling for local or asymptotic results. That choice is honest and useful; it turns a hard general problem into a tractable one where the contraction can be tracked explicitly. The writing is direct, the motivation from bilevel fragility is clear, and the result sits on top of standard LQR facts without obvious circularity.

The main limitation is scope. Everything rides on the quadratic cost and linear dynamics, so the technique does not immediately transfer to nonlinear or non-quadratic settings where actor-critic is actually used. Within the stated setting the argument appears tight, but a referee would still need to check the contraction mapping and the handling of the critic's estimation error. No invented assumptions or post-hoc restrictions show up in the abstract or stress-test note.

This paper is for people working on convergence theory for policy-gradient or bilevel methods who want a concrete, verifiable anchor point. It is not a general solution, but the claim is precise and the setting is canonical, so it deserves a serious referee rather than a desk reject.

Referee Report

0 major / 2 minor

Summary. The paper provides a nonasymptotic convergence analysis of the actor-critic algorithm applied to the linear quadratic regulator (LQR) with ergodic cost. It proves that the algorithm converges globally to the optimal pair of actor (policy) and critic (action-value function) at a linear rate, leveraging the structural assumptions of linear dynamics, quadratic costs, and ergodicity.

Significance. If the result holds, this offers the first rigorous nonasymptotic global convergence guarantee with linear rate for actor-critic in a canonical RL setting. The explicit use of LQR structure to obtain global optimality (rather than local) and a concrete linear rate is a clear strength, providing a concrete, falsifiable case study for bilevel optimization in RL where general cases are NP-hard. The scope is appropriately limited to LQR, and the work positions itself as a preliminary step toward broader theory.

minor comments (2)

The abstract and introduction could include a one-sentence statement of the key structural assumptions (linear dynamics, quadratic cost, ergodicity) that enable the global linear-rate result, to help readers quickly assess applicability.
Notation for the ergodic cost functional and the associated Bellman operator should be cross-referenced explicitly between the problem formulation and the convergence theorem to improve readability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review, accurate summary of our contributions, and recommendation to accept the paper. No major comments were provided, so we have no points to address.

Circularity Check

0 steps flagged

No significant circularity detected

The paper presents a nonasymptotic convergence proof for actor-critic on the LQR with ergodic cost, using the linear dynamics and quadratic cost structure to establish global optimality and linear rate. No derivation step reduces by construction to a fitted parameter, self-citation, or renamed input; the analysis is self-contained within the explicitly stated structural assumptions and does not invoke load-bearing self-citations or ansatzes that collapse the claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the structural properties of the LQR model (linear dynamics, quadratic ergodic cost) and on the alternating-update formulation of actor-critic; these are domain assumptions rather than free parameters or new entities.

axioms (1)

domain assumption: The environment is a linear quadratic regulator with ergodic cost.
The convergence proof is derived specifically for this model class.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Optimal Sample Complexity for Single Time-Scale Actor-Critic with Momentum
cs.LG · 2026-02 · unverdicted · novelty 7.0

Single-timescale actor-critic with STORM momentum and a recent-sample buffer achieves optimal O(ε^{-2}) sample complexity for ε-optimal policies in finite discounted MDPs.