Global attractors and fast-slow reduction for finite-state actor-critic mean dynamics

Vladyslav Prytula (zooplus SE)

arxiv: 2604.13259 · v1 · submitted 2026-04-14 · 🧮 math.DS


Pith reviewed 2026-05-10 13:44 UTC · model grok-4.3

classification 🧮 math.DS

keywords actor-critic · mean dynamics · global attractors · fast-slow systems · Markov chains · reinforcement learning · dynamical systems · Lean formalization


The pith

Finite-state actor-critic mean dynamics on policy, critic, and state-law variables admit compact global attractors for any positive separation parameter.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies the joint evolution of an actor parameter, a critic state, and the induced state distribution in finite-state actor-critic learning. It enlarges the phase space to include the fast distribution variable whose evolution is governed by a controlled Markov generator scaled by a small separation parameter δ. Under a softmax actor with box confinement, a coercive linear critic, and Lipschitz generators, the resulting autonomous semiflow is shown to possess a compact global attractor for every δ > 0. When the fast variable mixes exponentially uniformly, the invariant distribution map becomes Lipschitz, the reduced slow system on the actor and critic coordinates is well-posed, and the full flow tracks the reduced flow after a short initial layer while the attractors converge upper-semicontinuously to the lifted reduced attractor as δ tends to zero. All claims are stated and proved inside Lean 4 without additional axioms.

Core claim

For each δ > 0 the autonomous semiflow on the enlarged space (θ, w, μ) possesses a compact global attractor. Under a uniform exponential-mixing assumption the map θ ↦ μ_θ is Lipschitz and the reduced invariant-law system on (θ, w) is well-posed. Under an additional pathwise exponential-stability estimate for the non-autonomous fast equation, the exact flow tracks the reduced flow on every finite time interval up to the initial layer, and the exact attractors converge upper semicontinuously to the lifted reduced attractor as δ → 0. A concrete finite-state reference-state minorization condition is given that implies the pathwise hypothesis.
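The two systems named in the core claim can be written schematically as follows. This is a notational sketch only: $F$ and $G$ are placeholder names (assumptions, not taken from the paper) for the actor and critic mean vector fields, whose explicit form is not given in this summary.

```latex
% Full system on (\theta, w, \mu); F and G are placeholder names for the
% actor and critic mean vector fields, which this summary does not spell out.
\begin{aligned}
\dot\theta &= F(\theta, w, \mu), \\
\dot w &= G(\theta, w, \mu), \\
\delta\,\dot\mu &= Q_\theta^{*}\mu .
\end{aligned}

% Reduced invariant-law system on (\theta, w): \mu is replaced by \mu_\theta,
% the stationary law characterized by Q_\theta^{*}\mu_\theta = 0.
\begin{aligned}
\dot\theta &= F(\theta, w, \mu_\theta), \\
\dot w &= G(\theta, w, \mu_\theta).
\end{aligned}
```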

What carries the argument

The enlarged autonomous semiflow on the product space (θ, w, μ) whose fast coordinate obeys the exact controlled-Markov equation δ μ̇ = Q_θ^* μ, together with the compact global attractor of this semiflow and its reduction to the slow invariant-law system obtained by replacing μ with its θ-dependent stationary measure μ_θ.
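To make the fast equation concrete, here is a minimal sketch of δ μ̇ = Q_θ^* μ for a two-state chain with θ held fixed. The jump rates in `rates` are hypothetical and chosen only so that the stationary law has a closed form; they are not the paper's example. The run shows the state law relaxing to μ_θ on a time scale of order δ.

```python
import math

def rates(theta):
    """Hypothetical jump rates of a two-state chain; both stay in [1, 2]."""
    s = 1.0 / (1.0 + math.exp(-theta))
    return 1.0 + s, 2.0 - s  # a: rate 0 -> 1, b: rate 1 -> 0

def stationary(theta):
    """Stationary law mu_theta, the solution of Q_theta^* mu = 0."""
    a, b = rates(theta)
    return (b / (a + b), a / (a + b))

def fast_step(mu, theta, delta, dt):
    """One explicit Euler step of delta * mu' = Q_theta^* mu."""
    a, b = rates(theta)
    flow = (-a * mu[0] + b * mu[1]) / delta  # d(mu_0)/dt; d(mu_1)/dt is its negative
    return (mu[0] + dt * flow, mu[1] - dt * flow)

def relax(theta, delta, t_end, steps):
    """Integrate the fast equation from mu = (1, 0) with theta frozen."""
    mu, dt = (1.0, 0.0), t_end / steps
    for _ in range(steps):
        mu = fast_step(mu, theta, delta, dt)
    return mu

theta, delta = 0.3, 0.05
mu = relax(theta, delta, t_end=1.0, steps=20000)
pi = stationary(theta)
defect = abs(mu[0] - pi[0]) + abs(mu[1] - pi[1])
print(f"l1 defect after t=1 with delta={delta}: {defect:.2e}")
```

In this toy chain the two rates always sum to 3, so the defect decays like exp(-3t/δ); a smaller δ shortens the initial layer proportionally.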

If this is right

For every fixed positive separation δ the enlarged dynamics admit a nonempty compact global attractor.
The stationary distribution map θ ↦ μ_θ is Lipschitz continuous once uniform exponential mixing holds.
The reduced slow system on the actor and critic coordinates alone is well-posed and inherits existence of equilibria or limit sets from the full system.
After a short initial layer, solutions of the full system remain close to solutions of the reduced system on any finite time interval.
As the separation parameter tends to zero the attractors of the full system converge upper-semicontinuously onto the lift of the reduced attractor.
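For readers unfamiliar with the term, upper-semicontinuous convergence of attractors is usually stated with the Hausdorff semi-distance. The display below is a sketch in assumed notation: $\mathcal{A}_\delta$ for the attractor of the full system, $\mathcal{A}_0$ for the reduced attractor, and $\widehat{\mathcal{A}}_0$ for its lift. The form of the lift is the natural reading of "lifted reduced attractor" and is not quoted from the paper.

```latex
% Hausdorff semi-distance between subsets A, B of the phase space
\operatorname{dist}(A, B) = \sup_{a \in A} \inf_{b \in B} \lVert a - b \rVert ,

% Lift of the reduced attractor to the enlarged space (assumed form)
\widehat{\mathcal{A}}_0 = \{ (\theta, w, \mu_\theta) : (\theta, w) \in \mathcal{A}_0 \} ,

% Upper semicontinuity as the separation parameter vanishes
\lim_{\delta \to 0} \operatorname{dist}\bigl(\mathcal{A}_\delta, \widehat{\mathcal{A}}_0\bigr) = 0 .
```

The semi-distance is one-sided: it says the full attractors do not extend beyond a shrinking neighbourhood of the lift, not that every point of the lift is approximated.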

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reduction supplies a rigorous justification for replacing the fast distribution dynamics by its stationary measure in long-run analyses of adaptive reinforcement-learning algorithms.
The explicit minorization condition on a reference state gives a directly checkable criterion that practitioners can verify on a given finite-state MDP to guarantee the pathwise stability hypothesis.
Because the entire argument is machine-checked in Lean 4, the same formalization style could be reused to obtain verified global-attractor statements for other mean-field learning models.
The upper-semicontinuous convergence of attractors suggests that limit sets computed on the reduced system remain valid approximations of the long-run behavior even when δ is only moderately small.
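The checkable criterion mentioned above might look like the following sketch. It assumes one common form of reference-state minorization (every state jumps into a fixed reference state at rate at least α > 0, uniformly in θ); the paper's exact condition may differ. The generator family `generator` and the θ grid are hypothetical.

```python
import math

def generator(theta):
    """Hypothetical 3-state generator family Q_theta (rows sum to zero)."""
    s = 1.0 / (1.0 + math.exp(-theta))  # weight in (0, 1)
    off = [
        [0.0, 1.0 + s, 0.5],
        [0.2 + s, 0.0, 0.7],
        [0.4, 1.0 - 0.5 * s, 0.0],
    ]
    # Diagonal entries are minus the total exit rate of each state.
    return [[-sum(row) if i == j else row[j] for j in range(3)]
            for i, row in enumerate(off)]

def minorization_constant(q, ref):
    """Smallest jump rate into the reference state from any other state."""
    return min(q[i][ref] for i in range(len(q)) if i != ref)

def uniform_minorization(ref, thetas):
    """Worst-case minorization constant over a grid of actor parameters."""
    return min(minorization_constant(generator(t), ref) for t in thetas)

# Grid standing in for the compact box that confines theta (an assumption).
thetas = [-3.0 + 0.1 * k for k in range(61)]
alpha = {ref: uniform_minorization(ref, thetas) for ref in range(3)}
for ref, value in alpha.items():
    print(f"reference state {ref}: alpha = {value:.4f}")
```

Any reference state with a strictly positive constant passes this form of the check; a grid only samples the parameter box, so a rigorous verification would bound the rates analytically.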

Load-bearing premise

The uniform exponential-mixing assumption on the family of Markov chains that makes the stationary map Lipschitz and supplies the pathwise exponential stability needed for the tracking and convergence statements.
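As an illustration of what Lipschitz continuity of θ ↦ μ_θ means in practice, the sketch below estimates the Lipschitz constant of the stationary map by finite differences. The two-state chain and its rates are hypothetical, chosen so that the exact constant is known: here dμ_θ/dθ has l1 norm at most 1/6.

```python
import math

def stationary(theta):
    """Stationary law of a hypothetical two-state chain with rates a, b."""
    s = 1.0 / (1.0 + math.exp(-theta))
    a, b = 1.0 + s, 2.0 - s  # a: rate 0 -> 1, b: rate 1 -> 0
    return (b / (a + b), a / (a + b))

def l1(p, q):
    """l1 distance between two probability vectors."""
    return sum(abs(x - y) for x, y in zip(p, q))

# Finite-difference quotients of theta -> mu_theta on a uniform grid.
grid = [-4.0 + 0.01 * k for k in range(801)]
lip = max(l1(stationary(u), stationary(v)) / (v - u)
          for u, v in zip(grid, grid[1:]))
print(f"estimated Lipschitz constant: {lip:.4f}")
```

The estimate stays below the analytic bound 1/6 because the sigmoid has slope at most 1/4 and the total jump rate is bounded away from zero; chains whose mixing degenerates for some θ would not admit such a uniform bound.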

What would settle it

An explicit finite-state MDP and choice of actor-critic updates in which trajectories of the full three-variable system remain bounded away from every trajectory of the reduced two-variable system for arbitrarily small positive δ after the initial transient, or a concrete generator family where the global attractor ceases to exist once the uniform coercivity or Lipschitz condition on Q_θ is dropped.
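A numerical experiment of this kind can be sketched as follows. Everything here is a toy assumption: `slow_field` is a made-up contracting vector field, not the paper's actor-critic dynamics, and the two-state rates are hypothetical. The script integrates the full (θ, w, μ) system and the reduced (θ, w) system from the same slow initial data and records the largest gap between them.

```python
import math

def a_rate(theta):
    """Hypothetical rate 0 -> 1; the rate 1 -> 0 is 3 - a_rate, so they sum to 3."""
    return 1.0 + 1.0 / (1.0 + math.exp(-theta))

def slow_field(theta, w, mu1):
    """Toy slow vector field (an assumption, not the paper's actor-critic field)."""
    return -theta + 2.0 * mu1, -w + mu1

def simulate(delta, t_end=5.0, dt=1e-4):
    """Euler-integrate the full and reduced systems side by side."""
    th, w, mu1 = 0.0, 0.0, 0.0   # full system; mu1 starts off the slow manifold
    th_r, w_r = 0.0, 0.0         # reduced system, same slow initial data
    slow_gap = 0.0
    for _ in range(int(round(t_end / dt))):
        dth, dw = slow_field(th, w, mu1)
        dmu = (a_rate(th) - 3.0 * mu1) / delta  # delta * mu1' = a(1 - mu1) - b * mu1
        dth_r, dw_r = slow_field(th_r, w_r, a_rate(th_r) / 3.0)
        th, w, mu1 = th + dt * dth, w + dt * dw, mu1 + dt * dmu
        th_r, w_r = th_r + dt * dth_r, w_r + dt * dw_r
        slow_gap = max(slow_gap, abs(th - th_r) + abs(w - w_r))
    law_defect = abs(mu1 - a_rate(th) / 3.0)  # distance of mu1 from mu_theta at t_end
    return slow_gap, law_defect

results = {delta: simulate(delta) for delta in (0.1, 0.01)}
for delta, (gap, defect) in results.items():
    print(f"delta={delta}: sup slow gap={gap:.4f}, final state-law defect={defect:.2e}")
```

In this toy the gap shrinks roughly in proportion to δ, which is the behaviour the tracking theorem describes. A counterexample of the kind described above would instead show a gap that stays bounded away from zero as δ decreases.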

Figures

Figures reproduced from arXiv: 2604.13259 by Vladyslav Prytula (zooplus SE).

**Figure 2.** Finite-time reduction in the two-state example. (a) State-law defect

Original abstract

When a learning algorithm reshapes the data distribution it trains on, the long-run behavior depends on the joint evolution of the policy, the value estimate, and the data distribution. We study finite-state actor-critic mean dynamics on the enlarged phase space $(\theta,w,\mu)$, where $\theta$ is the actor parameter, $w$ is an auxiliary critic state, and $\mu$ is a state-law variable (the distribution over states induced by the current policy). The state-law coordinate follows the exact controlled-Markov equation $\delta \dot\mu = Q_\theta^*\mu$. Under a softmax actor with box confinement (a smooth proxy for parameter clipping), a uniformly coercive linear critic equation, and a Lipschitz generator family $\theta \mapsto Q_\theta$, we prove that for each $\delta > 0$ the resulting autonomous semiflow possesses a compact global attractor. Under a uniform exponential-mixing assumption, we prove that the invariant-law map $\theta \mapsto \mu_\theta$ is Lipschitz and that the reduced invariant-law system on $(\theta,w)$ is well posed. Under an additional pathwise exponential-stability estimate for the non-autonomous fast state equation, we show that the exact flow tracks the reduced flow on every finite time interval up to the initial layer, and that the exact attractors converge upper semicontinuously to the lifted reduced attractor as $\delta \to 0$. We also give a concrete finite-state reference-state minorization condition implying the pathwise hypothesis. All results are formalized in Lean 4 without custom axioms.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper proves global attractor existence and fast-slow reduction for enlarged actor-critic mean dynamics on finite MDPs, with all claims machine-checked in Lean 4.


The core contribution is a pair of theorems for the autonomous semiflow on the enlarged space (θ, w, μ). First, a compact global attractor exists for each fixed δ > 0 under the stated conditions on the softmax actor, box confinement, coercive critic, and Lipschitz Q_θ family. Second, the paper reduces the dynamics to an invariant-law system on (θ, w) and shows finite-time tracking plus upper-semicontinuous attractor convergence as δ → 0, once uniform exponential mixing and a pathwise stability estimate are added. A concrete reference-state minorization condition is supplied to realize the stability hypothesis in finite states.

Everything is formalized in Lean 4 with no custom axioms, which is the clearest strength here because it makes the estimates independently verifiable rather than dependent on unchecked algebra. The enlarged-phase-space setup and the singular-perturbation argument follow standard lines once the mixing is granted, so the novelty sits mainly in the specific application to actor-critic mean dynamics plus the formalization.

The assumptions are stated plainly and the minorization link is useful for finite MDPs, though uniform mixing itself remains an input rather than a derived property. That keeps the result applicable mainly where policies already mix reasonably fast.

This work is for researchers doing rigorous convergence analysis of adaptive RL algorithms or applying dynamical-systems methods to mean-field limits. A reader who needs reproducible theorems on attractor behavior or singular perturbation for these systems will find the reduction and the Lean artifact directly usable. I would send it to peer review; the formal verification and explicit conditions give it enough grounding to merit referee attention even if the mixing hypotheses need some sharpening in revision.

Referee Report

0 major / 2 minor

Summary. The manuscript studies the mean dynamics of finite-state actor-critic algorithms on the enlarged phase space (θ, w, μ), where μ satisfies the controlled Markov equation δ μ̇ = Q_θ^* μ. Under a softmax actor with box confinement, a uniformly coercive linear critic, and Lipschitz continuity of θ ↦ Q_θ, it proves that the autonomous semiflow possesses a compact global attractor for each fixed δ > 0. Under uniform exponential mixing, the invariant-law map θ ↦ μ_θ is shown to be Lipschitz and the reduced system on (θ, w) is well-posed. With an additional pathwise exponential-stability estimate for the fast non-autonomous equation (implied by a concrete finite-state reference-state minorization condition), the exact flow tracks the reduced flow on finite time intervals up to an initial layer, and the exact attractors converge upper semicontinuously to the lifted reduced attractor as δ → 0. All results are formalized in Lean 4 without custom axioms.

Significance. If the claims hold, the work supplies a rigorous justification for fast-slow reductions and long-term behavior in actor-critic mean dynamics, including explicit conditions under which reduced models accurately capture attractor structure. The machine-checked Lean 4 formalization without ad-hoc axioms is a notable strength that enhances verifiability, and the concrete minorization condition provides a practical bridge from abstract hypotheses to finite-state models. These contributions strengthen the dynamical-systems analysis of reinforcement-learning algorithms.

minor comments (2)

[Abstract] The abstract introduces 'box confinement (a smooth proxy for parameter clipping)' without stating its explicit functional form; adding the precise expression (e.g., the smoothing function and its derivative bounds) would improve immediate readability.
The statement that the minorization condition implies the pathwise exponential-stability estimate is central; a short remark on the quantitative constants obtained from the minorization (e.g., the resulting mixing rate) would help readers assess the strength of the hypothesis.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment and recommendation to accept the manuscript. The referee's summary correctly captures the main theorems on compact global attractors for each fixed δ > 0, the Lipschitz continuity of the invariant-law map under uniform exponential mixing, well-posedness of the reduced system, finite-time tracking up to the initial layer, upper semicontinuous convergence of attractors as δ → 0, the concrete reference-state minorization condition, and the Lean 4 formalization without custom axioms.

Circularity Check

0 steps flagged

No significant circularity; derivations are self-contained mathematical proofs


The paper establishes existence of global attractors for fixed δ>0, well-posedness of the reduced slow system, finite-time tracking, and upper-semicontinuous attractor convergence as δ→0. These rest on explicitly listed hypotheses (softmax actor with box confinement, uniformly coercive linear critic, Lipschitz θ↦Q_θ, uniform exponential mixing, pathwise exponential stability) plus a concrete finite-state minorization condition that implies the pathwise hypothesis. All steps are formalized in Lean 4 with no custom axioms, so the derivation chain is machine-verified and does not reduce any claimed result to a fitted parameter, self-definition, or load-bearing self-citation. No equations equate a prediction to its own input by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 5 axioms · 0 invented entities

The central claims rest on several domain assumptions about mixing rates, stability, and Lipschitz continuity that are standard in dynamical-systems treatments of RL but must be verified for each concrete MDP.

axioms (5)

domain assumption Uniform exponential-mixing assumption on the controlled Markov chain
Invoked to obtain Lipschitz continuity of the invariant-law map θ ↦ μ_θ.
domain assumption Pathwise exponential-stability estimate for the non-autonomous fast state equation
Required to prove tracking of the reduced flow on finite time intervals.
domain assumption Lipschitz continuity of the generator family θ ↦ Q_θ
Used to guarantee existence of the autonomous semiflow.
domain assumption Uniform coercivity of the linear critic equation
Ensures well-posedness of the critic dynamics.
domain assumption Softmax actor with box confinement
Smooth proxy for parameter clipping that keeps the semiflow inside a compact set.

pith-pipeline@v0.9.0 · 5584 in / 1811 out tokens · 25773 ms · 2026-05-10T13:44:33.312460+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages

[1] M. Benaim, A dynamical system approach to stochastic approximations, SIAM J. Control Optim. 34 (1996), 437–472.

[2] V. S. Borkar and S. P. Meyn, The ODE method for convergence of stochastic approximation and reinforcement learning, SIAM J. Control Optim. 38 (2000), 447–469.

[3] V. R. Konda and V. S. Borkar, Actor-critic–type learning algorithms for Markov decision processes, SIAM J. Control Optim. 38 (1999), 94–123.

[4] S. D. Liu, S. Chen, and S. Zhang, The ODE Method for Stochastic Approximation and Reinforcement Learning with Markovian Noise, J. Mach. Learn. Res. 26 (2025), Paper 24, 1–76.

[5] V. R. Konda and J. N. Tsitsiklis, On the convergence of actor-critic algorithms, SIAM J. Control Optim. 42 (2003), 1143–1166.

[6] A. Y. Mitrophanov, Sensitivity and convergence of uniformly ergodic Markov chains, J. Appl. Probab. 42 (2005), 1003–1014.

[7] S. P. Meyn and R. L. Tweedie, Markov Chains and Stochastic Stability, 2nd ed., Cambridge University Press, Cambridge, 2009.

[8] J. C. Perdomo, T. Zrnic, C. Mendler-Dünner, and M. Hardt, Performative prediction, in Proc. 37th International Conference on Machine Learning, PMLR 119, 2020, pp. 7599–7609.

[9] J. R. Norris, Markov Chains, Cambridge University Press, Cambridge, 1997.

[10] J. C. Robinson, Infinite-Dimensional Dynamical Systems, Cambridge University Press, Cambridge, 2001.

[11] P. J. Schweitzer, Perturbation theory and finite Markov chains, J. Appl. Probab. 5 (1968), 401–413.

[12] L. Truquet, A Perturbation Analysis of Markov Chains Models with Time-Varying Parameters, Bernoulli 26 (2020), 2876–2906.

[13] S. Yüksel, On Borkar and Young Relaxed Control Topologies and Continuous Dependence of Invariant Measures on Control Policy, SIAM J. Control Optim. 62 (2024), 2367–2386.

[14] J. K. Hale, X.-B. Lin, and G. Raugel, Upper semicontinuity of attractors for approximations of semigroups and partial differential equations, Math. Comp. 50 (1988), 89–123.

[15] R. Temam, Infinite-Dimensional Dynamical Systems in Mechanics and Physics, 2nd ed., Springer, New York, 1997.

[16] H. K. Khalil, Nonlinear Systems, 3rd ed., Prentice Hall, Upper Saddle River, NJ, 2002.

[17] F. Verhulst, Methods and Applications of Singular Perturbations, Springer, New York, 2005.

Acknowledgments. The Lean 4 formalization accompanying this paper was developed with assistance from Claude (Anthropic).
