
arxiv: 2605.31310 · v1 · pith:EB5TMQWVnew · submitted 2026-05-29 · 📡 eess.SY · cs.SY

Model-free LQG Control with Chance Constraints

Pith reviewed 2026-06-28 21:17 UTC · model grok-4.3

classification 📡 eess.SY cs.SY
keywords model-free control · chance constraints · LQG regulator · actor-critic · natural policy gradient · Lagrangian method · convergence analysis

The pith

A two-timescale NPG actor-critic algorithm solves chance-constrained LQG control without a system model, and the paper proves its linear convergence from properties of the Lagrangian.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a model-free natural policy gradient actor-critic method that uses a Lagrangian primal-dual framework to enforce chance constraints on linear time-invariant systems. It proves that the Lagrangian function satisfies coercivity and gradient dominance, which deliver linear convergence of the actor and closed-loop stability during training, while the critic converges by stochastic approximation and the problem has no duality gap. Readers would care because the approach removes the need for an accurate system model or repeated online optimization yet still limits the probability that a state-related function exceeds a given threshold.

Core claim

For linear time-invariant systems, the risk is defined as the probability that a function of the one-step-ahead state exceeds a user-specified threshold. For the two-timescale NPG-based actor-critic algorithm with Lagrangian primal-dual updates, the paper establishes coercivity and gradient dominance of the Lagrangian, which yield linear convergence and closed-loop stability for the actor. It also shows convergence of the TD(0) critic via stochastic approximation theory and zero duality gap in the constrained problem.
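
The setup behind this claim can be written out as a short worked formulation. This is a sketch under stated assumptions, not the paper's exact notation: the symbols A, B, W, Q, R, K, ε, δ, the state-feedback form of the policy, and the average-cost form of J are all assumed here for illustration. Only the structure of the Lagrangian (quadratic cost plus λ times risk minus allowed level) is taken from the referee summary on this page.

```latex
\begin{align*}
x_{t+1} &= A x_t + B u_t + w_t, \qquad w_t \sim \mathcal{N}(0, W), \qquad u_t = -K x_t, \\
J(K) &= \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[ \sum_{t=0}^{T-1} \big( x_t^\top Q x_t + u_t^\top R u_t \big) \Big], \\
\rho(K) &= \mathbb{P}\big( g(x_{t+1}) > \epsilon \big), \qquad \text{constraint: } \rho(K) \le \delta, \\
\mathcal{L}(K, \lambda) &= J(K) + \lambda \big( \rho(K) - \delta \big), \qquad \lambda \ge 0, \\
\min_{K} \max_{\lambda \ge 0} \mathcal{L}(K, \lambda) &= \max_{\lambda \ge 0} \min_{K} \mathcal{L}(K, \lambda) \qquad \text{(zero duality gap)}.
\end{align*}
```

In this reading, the actor updates K, the dual variable λ is adjusted according to the constraint violation ρ(K) − δ, and the minimisation is understood over stabilising gains.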

What carries the argument

Lagrangian primal-dual framework applied to the two-timescale NPG actor-critic updates, which preserves the coercivity and gradient dominance properties needed for convergence analysis.

If this is right

  • The actor updates converge linearly to the optimal policy under the chance constraint.
  • Closed-loop stability is maintained throughout the training process.
  • The constrained optimization problem has no duality gap, so primal and dual solutions coincide.
  • The TD(0) critic converges in the mean-square sense by stochastic approximation.
  • Numerical tests show risk is limited while performance stays close to model-based LQR.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same Lagrangian construction could be tested on systems with partial model information to see whether convergence rates degrade gracefully.
  • Because the method avoids real-time optimization, it may suit embedded platforms where scenario-based MPC is too slow.
  • Extending the risk definition to multi-step or output-based constraints would require checking whether gradient dominance still holds.

Load-bearing premise

The chosen risk definition as a probability on the one-step-ahead state allows the Lagrangian to keep the coercivity and gradient dominance properties required for the linear convergence proofs.
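
To make the premise concrete, here is a worked special case. It assumes a linear risk function g(x) = cᵀx, a stabilising state-feedback gain K, Gaussian process noise with covariance W, and a closed loop that has reached its stationary distribution. None of these choices is confirmed by this page as the paper's own; the symbols c, Σ_K, and Φ (the standard normal CDF) are introduced only for illustration.

```latex
\begin{align*}
x_t &\sim \mathcal{N}(0, \Sigma_K), \qquad \Sigma_K = (A - BK)\, \Sigma_K\, (A - BK)^\top + W, \\
\rho(K) &= \mathbb{P}\big( c^\top x_{t+1} > \epsilon \big)
        = 1 - \Phi\!\left( \frac{\epsilon}{\sqrt{c^\top \Sigma_K\, c}} \right).
\end{align*}
```

Under these assumptions the risk depends on K only through the stationary covariance Σ_K, the same object that determines the quadratic cost. Whether coercivity and gradient dominance survive for the paper's general g is exactly what the premise above asserts and what this example does not prove.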

What would settle it

An explicit instance of a chance-constrained LQG problem in which the Lagrangian exhibits a duality gap or the actor parameters fail to converge linearly under the stated risk definition would disprove the main claims.

Figures

Figures reproduced from arXiv: 2605.31310 by Arunava Naha, Subhrakanti Dey.

Figure 1. Training convergence vs. number of iterations (mean …
Figure 2. Control cost
Figure 3. Primal-dual NPG-based AC vs. CLQR for the …
Original abstract

This paper studies model-free optimal control design and its convergence properties for linear time-invariant systems subject to probabilistic risk or chance constraints. In particular, we study a natural policy gradient (NPG)-based actor-critic (AC) algorithm with two timescales, using a Lagrangian primal-dual framework to enforce the constraint. Furthermore, the risk is defined as the probability that a function of the one-step-ahead state exceeds a user-specified threshold. To our knowledge, this is the first work to study the analytical convergence properties for NPG-based AC in a chance-constrained linear-quadratic Gaussian (LQG) regulator setting without model knowledge. We establish the coercivity and gradient dominance properties of the Lagrangian function, which ensure linear convergence and closed-loop stability during training for the actor. On the other hand, we analyse the convergence properties of the temporal difference (TD(0)) learning for the critic, applying stochastic approximation theory. Also, we demonstrate no duality gap in the constrained optimisation problem. Additionally, we have performed numerical analysis of the convergence properties and accuracy of the proposed method, comparing it with model-based chance-constrained LQR and scenario-based MPC. Results show that our approach effectively limits risk while maintaining near-optimal performance, without requiring full model knowledge or real-time optimisation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops a model-free natural policy gradient (NPG) actor-critic algorithm for chance-constrained LQG regulation. It employs a Lagrangian primal-dual framework to enforce a one-step probabilistic risk constraint P(g(x_{t+1}) > threshold) ≤ allowed level, claims to prove that the resulting Lagrangian retains coercivity and gradient dominance (yielding linear convergence and closed-loop stability of the actor), analyzes TD(0) critic convergence via stochastic approximation, establishes zero duality gap, and reports numerical comparisons against model-based chance-constrained LQR and scenario MPC.

Significance. If the preservation of coercivity/gradient dominance under the probabilistic risk term holds and the derivations are complete, the work would provide the first analytical convergence guarantees for model-free NPG-AC in this constrained LQG setting, strengthening the case for safe RL methods on linear systems without requiring real-time optimization or full model knowledge.

major comments (2)
  1. [Lagrangian analysis / convergence theorems] The linear-convergence and stability claims rest on the Lagrangian (quadratic cost + λ·(risk − allowed)) inheriting coercivity and gradient dominance from the unconstrained LQG case. The risk term is defined as the probability that a function of the one-step-ahead state exceeds a threshold, which is generally non-convex and non-smooth in the gain matrix. The manuscript must supply the explicit argument (lemma or theorem) showing that this term does not destroy the dominance constant or coercivity; without it the NPG linear-convergence and zero-duality-gap results are not yet established.
  2. [Numerical experiments] Numerical section: the comparisons with model-based chance-constrained LQR and scenario MPC lack reported details on the exact risk metric used for evaluation, number of Monte-Carlo trials, data-exclusion criteria, and error bars. These omissions prevent verification that the model-free method truly achieves comparable risk control and near-optimal cost.
minor comments (2)
  1. [Problem formulation] Clarify the precise functional form of g(·) and the threshold in the risk definition; the current description is too terse for reproducibility.
  2. [Algorithm description] The two-time-scale AC update rules should be written with explicit step-size schedules and the precise critic parameterization (linear or otherwise).
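
As an illustration of what "explicit step-size schedules" could look like for a two-timescale scheme, the sketch below uses polynomially decaying steps, a conventional choice in the stochastic-approximation literature. The function names, constants, and exponents are hypothetical and are not taken from the paper, whose own schedules are not given on this page.

```python
def critic_step(n, a0=0.5, power=0.6):
    """Fast-timescale (critic) step size a0 / (n + 1)**power (assumed form)."""
    return a0 / (n + 1) ** power


def actor_step(n, b0=0.1, power=0.9):
    """Slow-timescale (actor) step size b0 / (n + 1)**power (assumed form)."""
    return b0 / (n + 1) ** power


def timescale_ratio(n):
    """Actor step divided by critic step; it should shrink towards zero."""
    return actor_step(n) / critic_step(n)


if __name__ == "__main__":
    # Both exponents lie in (0.5, 1], so each schedule is non-summable but
    # square-summable; the actor exponent is larger, so the ratio decays.
    for n in (0, 10, 1_000, 100_000):
        print(n, critic_step(n), actor_step(n), timescale_ratio(n))
```

With these exponents the ratio behaves like (n + 1)^(-0.3), so the critic effectively sees a slowly varying policy while the actor sees a nearly converged critic.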

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comments highlight important points for strengthening the presentation of our theoretical results and experimental details. We address each major comment below and indicate the revisions planned for the next version.

Point-by-point responses
  1. Referee: [Lagrangian analysis / convergence theorems] The linear-convergence and stability claims rest on the Lagrangian (quadratic cost + λ·(risk − allowed)) inheriting coercivity and gradient dominance from the unconstrained LQG case. The risk term is defined as the probability that a function of the one-step-ahead state exceeds a threshold, which is generally non-convex and non-smooth in the gain matrix. The manuscript must supply the explicit argument (lemma or theorem) showing that this term does not destroy the dominance constant or coercivity; without it the NPG linear-convergence and zero-duality-gap results are not yet established.

    Authors: We agree that an explicit lemma is needed to rigorously demonstrate preservation of coercivity and gradient dominance when the probabilistic risk term is added to the Lagrangian. Although the manuscript states that these properties hold, the current presentation does not isolate the argument in a dedicated lemma. In the revised version we will insert a new lemma (placed after the definition of the Lagrangian) that shows, under the Gaussian noise assumption, the risk term is Lipschitz continuous in the gain and its contribution is dominated by the quadratic cost term, thereby leaving the dominance constant and coercivity unchanged. This will directly underpin the linear convergence and zero-duality-gap claims. revision: yes

  2. Referee: [Numerical experiments] Numerical section: the comparisons with model-based chance-constrained LQR and scenario MPC lack reported details on the exact risk metric used for evaluation, number of Monte-Carlo trials, data-exclusion criteria, and error bars. These omissions prevent verification that the model-free method truly achieves comparable risk control and near-optimal cost.

    Authors: We accept this criticism. The revised manuscript will expand Section 5 to report: (i) the risk metric is the empirical probability computed from Monte-Carlo rollouts, (ii) 5000 independent trajectories per evaluation point, (iii) no data-exclusion criteria were applied, and (iv) error bars show one standard deviation over 20 independent algorithm runs. These additions will enable direct verification of the reported risk control and cost performance. revision: yes
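
The evaluation protocol described in this response can be sketched as follows. This is a minimal stand-in, assuming a scalar closed loop x' = a_cl·x + w with Gaussian noise, a fixed gain, and the risk function g(x) = x²; all names and numerical values are hypothetical. In particular, the 20 "runs" here differ only by sampling noise, whereas the response refers to 20 independent algorithm runs.

```python
import math
import random
import statistics


def empirical_risk(a_cl, sigma, eps, n_traj, rng):
    """Fraction of rollouts whose one-step-ahead state satisfies x**2 > eps."""
    s = sigma**2 / (1.0 - a_cl**2)  # stationary variance of the closed loop
    violations = 0
    for _ in range(n_traj):
        x0 = rng.gauss(0.0, math.sqrt(s))       # draw from the stationary law
        x1 = a_cl * x0 + rng.gauss(0.0, sigma)  # one closed-loop step
        if x1 * x1 > eps:
            violations += 1
    return violations / n_traj


def risk_with_error_bar(a_cl=0.5, sigma=1.0, eps=4.0,
                        n_traj=5000, n_runs=20, seed=0):
    """Mean and one standard deviation of the risk estimate over n_runs."""
    rng = random.Random(seed)
    estimates = [empirical_risk(a_cl, sigma, eps, n_traj, rng)
                 for _ in range(n_runs)]
    return statistics.mean(estimates), statistics.stdev(estimates)


def analytic_risk(a_cl, sigma, eps):
    """Closed-form P(x**2 > eps) for x ~ N(0, s), used as a sanity check."""
    s = sigma**2 / (1.0 - a_cl**2)
    return math.erfc(math.sqrt(eps / (2.0 * s)))


if __name__ == "__main__":
    mean, std = risk_with_error_bar()
    print("empirical:", mean, "+/-", std)
    print("analytic: ", analytic_risk(0.5, 1.0, 4.0))
```

Comparing the empirical estimate against a closed-form value, where one exists, is one way a reader could check that the reported risk metric is computed as described.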

Circularity Check

0 steps flagged

No significant circularity; claims rest on explicit derivation of Lagrangian properties

Full rationale

The paper asserts it establishes coercivity and gradient dominance of the Lagrangian (cost + λ·risk) directly from the problem setup and risk definition P(g(x_{t+1}) > threshold), then uses these to prove linear convergence of NPG actor and zero duality gap. No equations or steps are shown reducing the claimed properties to a fitted parameter, self-citation chain, or renamed input; the TD(0) critic analysis invokes standard stochastic approximation. The derivation chain is presented as self-contained against the LQG structure and Lagrangian framework, with numerical comparisons to model-based baselines serving as external checks rather than internal fits.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the standard domain assumption that the plant is linear time-invariant and that the chosen one-step risk definition admits a Lagrangian formulation with the required coercivity and dominance properties. No free parameters or invented entities are mentioned in the abstract.

axioms (2)
  • Domain assumption: the underlying plant is linear time-invariant.
    Invoked by the title, LQG setting, and one-step-ahead state risk definition throughout the abstract.
  • Domain assumption: the Lagrangian primal-dual formulation of the chance constraint admits zero duality gap and yields a coercive, gradient-dominant objective.
    Directly invoked when the abstract states that coercivity and gradient dominance are established and that no duality gap exists.

pith-pipeline@v0.9.1-grok · 5752 in / 1546 out tokens · 31389 ms · 2026-06-28T21:17:23.386589+00:00 · methodology

