Multitask LQG Control: Performance and Generalization Bounds

Charis Stamouli; George J. Pappas; James Anderson; Kasra Fallah; Leonardo F. Toso

arxiv: 2604.16730 · v1 · submitted 2026-04-17 · 🧮 math.OC

Multitask LQG Control: Performance and Generalization Bounds

Leonardo F. Toso , Kasra Fallah , Charis Stamouli , George J. Pappas , James Anderson This is my paper

Pith reviewed 2026-05-10 07:30 UTC · model grok-4.3

classification 🧮 math.OC

keywords multitask LQG controlpolicy gradient methodsgeneralization boundsbisimulation functionheterogeneity biashistory-dependent liftingLQR equivalence

0 comments

The pith

Learning a common lifted controller across LQG tasks induces heterogeneity bias bounded via a bisimulation function, with performance and generalization guarantees that depend on this measure.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies learning one stabilizing controller that generalizes across many different linear quadratic Gaussian control problems drawn from a distribution. It introduces a history-dependent lifting that converts each partially observed LQG system into an equivalent fully observed LQR system of higher dimension. This equivalence allows the authors to characterize the bias incurred by forcing a single controller to serve all tasks through a bisimulation function that quantifies task heterogeneity. Performance and generalization bounds are then stated explicitly in terms of this bisimulation measure. In the model-free setting the same lifting shows that sharing data across tasks reduces the variance of policy-gradient estimates in direct proportion to the number of tasks.

Core claim

By means of a history-dependent lifting, the multitask LQG problem is recast as an equivalent high-dimensional multitask LQR problem. Learning a common lifted controller for this lifted problem induces a heterogeneity bias that is characterized by a bisimulation function. Performance and generalization guarantees are established that depend explicitly on bisimulation-based heterogeneity measures. In the model-free setting, multitask learning reduces the variance of policy gradient estimates proportionally to the number of tasks.

What carries the argument

The history-dependent lifting that transforms the multitask LQG problem into a high-dimensional multitask LQR problem, together with the bisimulation function used to measure task heterogeneity.

If this is right

Performance and generalization guarantees depend explicitly on the bisimulation-based heterogeneity measure.
Model-free multitask learning reduces policy gradient estimation variance proportionally to the number of tasks in the training set.
A single common lifted controller stabilizes all systems in the distribution with bounded cost.
The approach applies to stochastic and partially observed linear systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

When the bisimulation measure is small, multitask learning is expected to outperform separate single-task controllers.
The variance-reduction result implies that collecting additional tasks can improve sample efficiency without changing the per-task data budget.
The bounds may be used to decide which subset of available tasks should be grouped together for joint training.

Load-bearing premise

The history-dependent lifting recasts the multitask LQG problem into an equivalent high-dimensional multitask LQR problem to which policy-gradient analysis applies directly.

What would settle it

An experiment on LQG systems in which policy-gradient variance fails to decrease proportionally with the number of tasks, or in which realized cost exceeds the derived bound for a known bisimulation distance.

Figures

Figures reproduced from arXiv: 2604.16730 by Charis Stamouli, George J. Pappas, James Anderson, Kasra Fallah, Leonardo F. Toso.

**Figure 1.** Figure 1: Multitask LQG on partially observed cart-pole systems. (left) Task-specific optimality gaps (first six training tasks) over iterations. (middle) Train (N = 100) and test (50) optimality gaps with ±1-std, showing strong generalization. (right) Relative RMSE of the onepoint ZO gradient estimator with respect to number of tasks N. Task generation. Each task T (i) is generated by sampling the physical paramet… view at source ↗

**Figure 2.** Figure 2: Additional numerical results for the partially observed inverted-pendulum task. Top-left: task-specific optimality gaps. [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗

read the original abstract

We study multitask learning for stochastic and partially observed control systems, focusing on the linear quadratic Gaussian (LQG) problem. Our goal is to learn a common stabilizing controller that generalizes across a distribution of systems and objectives. To this end, we leverage a history-dependent lifting that recasts the multitask LQG problem into an equivalent high-dimensional multitask LQR problem, allowing for the analysis of policy gradient methods. We show that learning a common lifted controller induces a heterogeneity bias which we characterize via a "bisimulation function". We establish performance and generalization guarantees that explicitly depend on such bisimulation-based heterogeneity measures. For model-free, we demonstrate that multitask learning reduces policy gradient estimation variance proportionally to the number of tasks in the training set.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives explicit performance and generalization bounds for a shared lifted controller in multitask LQG by using history stacking to reach an LQR problem and a bisimulation function to measure heterogeneity bias, plus a variance reduction for policy gradients.

read the letter

The main thing to know is that the authors lift the multitask partially observed LQG problem to an equivalent high-dimensional LQR via history-dependent state augmentation, then characterize the bias from using one common controller with a bisimulation function and derive performance and generalization bounds that depend explicitly on that measure. They also show that averaging policy gradients across tasks reduces estimation variance linearly with the number of tasks in the training set. This is the core contribution. The lifting step lets them reuse standard LQR analysis tools on the stochastic partially observed case, and the bisimulation characterization turns the heterogeneity into a quantifiable term in the bounds. The variance claim follows directly from basic properties of sample averaging and does not appear to rely on extra assumptions beyond the task distribution. The approach is consistent with prior single-task LQG lifting results and multitask RL variance arguments, without obvious circularity in the definitions. The bisimulation function is introduced externally to the learning procedure, which keeps the math clean. One soft spot is that the bounds will only be useful if the bisimulation measure can be bounded or estimated without excessive conservatism, and the state augmentation from history lifting will raise the effective dimension, which could loosen the guarantees in practice. The abstract does not mention numerical checks or examples, so it is unclear how tight the bounds turn out to be on concrete systems. This paper is for researchers working on theoretical guarantees in linear stochastic control and multitask RL. A reader who needs explicit dependence of generalization error on task heterogeneity would get direct value from the derivations. I would send it for peer review because the claims are specific, the construction is internally consistent, and the topic sits at the intersection of control and learning where such bounds are still scarce.

Referee Report

0 major / 3 minor

Summary. The paper studies multitask learning for stochastic partially observed LQG control systems. It employs a history-dependent lifting to recast the multitask LQG problem as an equivalent high-dimensional multitask LQR problem. The authors characterize the heterogeneity bias induced by a common lifted controller via a bisimulation function, derive performance and generalization guarantees that depend explicitly on bisimulation-based heterogeneity measures, and show that multitask learning reduces policy gradient estimation variance proportionally to the number of tasks in the model-free setting.

Significance. If the derivations hold, the work provides a useful theoretical bridge between multitask learning and classical LQG control, with explicit bounds that quantify the impact of system heterogeneity through bisimulation. The variance-reduction result for policy gradients is a concrete, practically relevant contribution for model-free multitask control. The lifting step is standard, but its combination with bisimulation measures for generalization bounds adds a clear incremental value to the literature on robust and multitask control.

minor comments (3)

[Abstract and §2] The abstract and introduction would benefit from a short, self-contained statement of the key assumptions required for the lifting equivalence to hold (e.g., stabilizability, detectability, and noise statistics).
[§4] Clarify whether the bisimulation function is assumed known or must be estimated from data; if the latter, discuss how estimation error propagates into the performance and generalization bounds.
[§5] In the variance-reduction argument, explicitly state the independence assumptions across tasks and whether the proportionality holds only in expectation or almost surely.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary of our work on multitask LQG control and for recommending minor revision. The referee's description accurately reflects the paper's use of history-dependent lifting to recast the problem as multitask LQR, the characterization of heterogeneity bias via bisimulation functions, the resulting performance and generalization bounds, and the policy-gradient variance reduction proportional to the number of tasks. We are pleased that these elements are viewed as providing a useful theoretical bridge and a concrete practical contribution.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper recasts multitask LQG into an equivalent high-dimensional LQR via standard history-dependent lifting, then characterizes heterogeneity bias with a bisimulation function drawn from external control theory. Performance and generalization bounds are stated to depend explicitly on this bisimulation measure, while the model-free variance reduction follows directly from statistical averaging over tasks. None of these steps reduce by construction to fitted parameters, self-definitions, or load-bearing self-citations; the lifting equivalence, bisimulation characterization, and variance scaling are presented as consequences of the construction without circular reduction to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Review performed from abstract only; full technical assumptions, free parameters, and any invented constructs are not visible.

axioms (2)

domain assumption Each task is a linear quadratic Gaussian system
Standard modeling assumption for LQG problems stated in the abstract.
domain assumption History-dependent lifting produces an equivalent high-dimensional multitask LQR problem
Central technical step invoked to enable policy-gradient analysis.

invented entities (1)

Bisimulation function for heterogeneity bias no independent evidence
purpose: Characterize the heterogeneity bias induced by a common lifted controller
Introduced in the abstract to quantify task differences for the performance bounds.

pith-pipeline@v0.9.0 · 5434 in / 1335 out tokens · 61450 ms · 2026-05-10T07:30:00.203128+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

[1]

arXiv preprint arXiv:2310.01362 , year=

L. Wang, K. Zhang, A. Zhou, M. Simchowitz, and R. Tedrake, “Fleet Policy Learning via Weight Merging and An Application to Robotic Tool-Use,”arXiv preprint arXiv:2310.01362, 2023

work page arXiv 2023
[2]

Deep reinforcement learning for autonomous driving: A survey,

B. R. Kiran, I. Sobh, V . Talpaert, P. Mannion, A. A. Al Sallab, S. Yo- gamani, and P. P ´erez, “Deep reinforcement learning for autonomous driving: A survey,”IEEE transactions on intelligent transportation systems, vol. 23, no. 6, pp. 4909–4926, 2021

work page 2021
[3]

Distributed control applications within sensor networks,

B. Sinopoli, C. Sharp, L. Schenato, S. Schaffert, and S. S. Sastry, “Distributed control applications within sensor networks,”Proceedings of the IEEE, vol. 91, no. 8, pp. 1235–1246, 2003

work page 2003
[4]

K. Zhou, J. C. Doyle, and K. Glover,Robust and Optimal Control. Englewood Cliffs, NJ, USA: Prentice Hall, 1996

work page 1996
[5]

Model-free Learning with Heterogeneous Dynamical Systems: A Federated LQR Approach,

H. Wang, L. F. Toso, A. Mitra, and J. Anderson, “Model-free Learning with Heterogeneous Dynamical Systems: A Federated LQR Approach,”arXiv preprint arXiv:2308.11743, 2023

work page arXiv 2023
[6]

Policy gradient bounds in multitask LQR,

C. Stamouli, L. F. Toso, A. Tsiamis, G. J. Pappas, and J. Anderson, “Policy gradient bounds in multitask LQR,”IEEE Control Systems Letters, 2025

work page 2025
[7]

Policy gradient for LQR with domain randomization,

T. Fujinami, B. D. Lee, N. Matni, and G. J. Pappas, “Policy gradient for LQR with domain randomization,” in2025 IEEE 64th Conference on Decision and Control (CDC). IEEE, 2025, pp. 4174–4181

work page 2025
[8]

Meta-learning linear quadratic regulators: a policy gradient maml approach for model-free LQR,

L. F. Toso, D. Zhan, J. Anderson, and H. Wang, “Meta-learning linear quadratic regulators: a policy gradient maml approach for model-free LQR,” in6th Annual Learning for Dynamics & Control Conference. PMLR, 2024, pp. 902–915

work page 2024
[9]

On the Convergence of Policy Gradient for Designing a Linear Quadratic Regulator by Leveraging a Proxy System,

L. Ye, A. Mitra, and V . Gupta, “On the Convergence of Policy Gradient for Designing a Linear Quadratic Regulator by Leveraging a Proxy System,” in2024 IEEE 63rd Conference on Decision and Control (CDC). IEEE, 2024, pp. 6016–6021

work page 2024
[10]

Global convergence of policy gradient methods for the linear quadratic regulator,

M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” in International conference on machine learning. PMLR, 2018, pp. 1467–1476

work page 2018
[11]

Convergence and sample complexity of gradient methods for the model-free linear–quadratic regulator problem,

H. Mohammadi, A. Zare, M. Soltanolkotabi, and M. R. Jovanovi ´c, “Convergence and sample complexity of gradient methods for the model-free linear–quadratic regulator problem,”IEEE Transactions on Automatic Control, vol. 67, no. 5, pp. 2435–2450, 2021

work page 2021
[12]

Learning optimal controllers for linear systems with multiplicative noise via policy gradient,

B. Gravell, P. M. Esfahani, and T. Summers, “Learning optimal controllers for linear systems with multiplicative noise via policy gradient,”IEEE Transactions on Automatic Control, vol. 66, no. 11, pp. 5283–5298, 2020

work page 2020
[13]

Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies,

B. Hu, K. Zhang, N. Li, M. Mesbahi, M. Fazel, and T. Bas ¸ar, “Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies,”Annual Review of Control, Robotics, and Autonomous Sys- tems, vol. 6, pp. 123–158, 2023

work page 2023
[14]

On the lack of gradient domination for linear quadratic Gaussian problems with incomplete state information,

H. Mohammadi, M. Soltanolkotabi, and M. R. Jovanovi ´c, “On the lack of gradient domination for linear quadratic Gaussian problems with incomplete state information,” in2021 60th IEEE Conference on Decision and Control (CDC). IEEE, 2021, pp. 1120–1124

work page 2021
[15]

Analysis of the optimization landscape of linear quadratic gaussian (LQG) control,

Y . Tang, Y . Zheng, and N. Li, “Analysis of the optimization landscape of linear quadratic gaussian (LQG) control,” inLearning for dynamics and control. PMLR, 2021, pp. 599–610

work page 2021
[16]

Globally convergent policy gradient methods for linear quadratic control of partially observed systems,

F. Zhao, X. Fu, and K. You, “Globally convergent policy gradient methods for linear quadratic control of partially observed systems,” IFAC-PapersOnLine, vol. 56, no. 2, pp. 5506–5511, Jan. 2023

work page 2023
[17]

On the Gradient Domination of the LQG Problem,

K. Fallah, L. F. Toso, and J. Anderson, “On the Gradient Domination of the LQG Problem,”arXiv preprint arXiv:2507.09026, 2025

work page arXiv 2025
[18]

Asynchronous heterogeneous linear quadratic regulator design,

L. F. Toso, H. Wang, and J. Anderson, “Asynchronous heterogeneous linear quadratic regulator design,” in2024 IEEE 63rd Conference on Decision and Control (CDC). IEEE, 2024, pp. 801–808

work page 2024
[19]

Derivative-free methods for policy optimization: Guarantees for linear quadratic systems,

D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. Bartlett, and M. Wainwright, “Derivative-free methods for policy optimization: Guarantees for linear quadratic systems,” inThe 22nd international conference on artificial intelligence and statistics. PMLR, 2019, pp. 2916–2925

work page 2019
[20]

Coreset-Based Task Selection for Sample-Efficient Meta-Reinforcement Learning,

D. Zhan, L. F. Toso, and J. Anderson, “Coreset-Based Task Selection for Sample-Efficient Meta-Reinforcement Learning,”arXiv preprint arXiv:2502.02332, 2025

work page arXiv 2025
[21]

Adversarially Robust Multi- task Adaptive Control,

K. Fallah, L. F. Toso, and J. Anderson, “Adversarially Robust Multi- task Adaptive Control,”arXiv preprint arXiv:2511.05444, 2025

work page arXiv 2025
[22]

Approximate Bisimulation: A Bridge Between Computer Science and Control Theory,

A. Girard and G. J. Pappas, “Approximate Bisimulation: A Bridge Between Computer Science and Control Theory,”European Journal of Control, vol. 17, no. 5-6, pp. 568–578, 2011

work page 2011
[23]

Theoretical convergence of multi- step model-agnostic meta-learning,

K. Ji, J. Yang, and Y . Liang, “Theoretical convergence of multi- step model-agnostic meta-learning,”The Journal of Machine Learning Research, vol. 23, no. 1, pp. 1317–1357, 2022

work page 2022
[24]

A theoretical understanding of gradient bias in meta- reinforcement learning,

B. Liu, X. Feng, J. Ren, L. Mai, R. Zhu, H. Zhang, J. Wang, and Y . Yang, “A theoretical understanding of gradient bias in meta- reinforcement learning,”Advances in Neural Information Processing Systems, vol. 35, pp. 31 059–31 072, 2022

work page 2022
[25]

Probabilistic Performance Guarantees for Multi-Task Reinforcement Learning,

Y . Schnitzer, M. Jackermeier, A. Abate, and D. Parker, “Probabilistic Performance Guarantees for Multi-Task Reinforcement Learning,” arXiv preprint arXiv:2602.02098, 2026

work page arXiv 2026
[26]

Generalization bounds for meta-learning via pac-bayes and uniform stability,

A. Farid and A. Majumdar, “Generalization bounds for meta-learning via pac-bayes and uniform stability,”Advances in neural information processing systems, vol. 34, pp. 2173–2186, 2021

work page 2021
[27]

Transformers As Generalizable Optimal Controllers,

T. B. Mohaya, M. F. AL-Sunni, J. M. Dolan, and P. Seiler, “Transformers As Generalizable Optimal Controllers,”arXiv preprint arXiv:2603.14910, 2026

work page arXiv 2026
[28]

Output-feedback synthesis orbit geom- etry: Quotient manifolds and LQG direct policy optimization,

S. Kraisler and M. Mesbahi, “Output-feedback synthesis orbit geom- etry: Quotient manifolds and LQG direct policy optimization,”IEEE Control Systems Letters, vol. 8, pp. 1577–1582, 2024

work page 2024
[29]

G. H. Hardy,Divergent series. American Mathematical Society, 2024, vol. 334

work page 2024
[30]

Approximation metrics based on probabilistic bisimulations for general state-space markov processes: a survey,

A. Abate, “Approximation metrics based on probabilistic bisimulations for general state-space markov processes: a survey,”Electronic Notes in Theoretical Computer Science, vol. 297, pp. 3–25, 2013

work page 2013
[31]

Layered multirate control of constrained linear systems,

C. Stamouli, A. Tsiamis, M. Morari, and G. J. Pappas, “Layered multirate control of constrained linear systems,” in2025 IEEE 64th Conference on Decision and Control (CDC). IEEE, 2025, pp. 3027– 3034

work page 2025
[32]

Compo- sitional abstractions of interconnected discrete-time stochastic control systems,

A. Lavaei, S. E. Z. Soudjani, R. Majumdar, and M. Zamani, “Compo- sitional abstractions of interconnected discrete-time stochastic control systems,” in2017 IEEE 56th Annual Conference on Decision and Control (CDC). IEEE, 2017, pp. 3551–3556

work page 2017
[33]

Vershynin,High-dimensional probability: An introduction with applications in data science

R. Vershynin,High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018, vol. 47

work page 2018
[34]

Convergence and sample complexity of policy gradient methods for stabilizing linear systems,

F. Zhao, X. Fu, and K. You, “Convergence and sample complexity of policy gradient methods for stabilizing linear systems,”IEEE Transactions on Automatic Control, 2024

work page 2024
[35]

Learning over all stabilizing nonlinear controllers for a partially-observed linear system,

R. Wang, N. H. Barbara, M. Revay, and I. R. Manchester, “Learning over all stabilizing nonlinear controllers for a partially-observed linear system,”IEEE Control Systems Letters, vol. 7, pp. 91–96, 2022

work page 2022
[36]

CVXPY: A Python-embedded modeling language for convex optimization,

S. Diamond and S. Boyd, “CVXPY: A Python-embedded modeling language for convex optimization,”Journal of Machine Learning Research, vol. 17, no. 83, pp. 1–5, 2016

work page 2016
[37]

Rate-optimal non- asymptotics for the quadratic prediction error method,

C. Stamouli, I. Ziemann, and G. J. Pappas, “Rate-optimal non- asymptotics for the quadratic prediction error method,” in2024 IEEE 63rd Conference on Decision and Control (CDC). IEEE, 2024, pp. 5723–5730

work page 2024
[38]

User-friendly tail bounds for sums of random matrices,

J. A. Tropp, “User-friendly tail bounds for sums of random matrices,” Foundations of computational mathematics, vol. 12, pp. 389–434, 2012. XI. APPENDIXROADMAP This appendix provides detailed proofs, technical derivations, and additional experimental results supporting the main text. Section XII includes additional experimental details, such as system dyn...

work page 2012
[39]

ComputeF (i) eK =A (i) eK ⊗A (i) eK ,C (i) eK =S (i)†⊤ ⋆ ⊗E (i) eK , andν (i) = vec(Σ(i) ν )for each taski, and form the joint quantities F (ij) eK = diag F (i) eK , F (j) eK , C (ij) eK = h C(i) eK −C(j) eK i ,andν (ij) = ν(i) ν(j)

work page
[40]

Setλ (ij) eK andη (ij) eK via (35)-(36), and compute the derived constantsζ= 1 + (η (ij) eK )−1 andλ ′ =λ (ij) eK −η (ij) eK (1−λ (ij) eK )

work page
[41]

Solve the SDP (37) to obtainM (ij) eK

work page
[42]

Remark XIV .1.It is important to emphasize the main difference between problem(37)and the one in multitask LQR setting [6]

Evaluate the bisimulation-based heterogeneity measure via bij(eK) := ζν (ij)⊤M(ij) eK ν(ij) λ′ . Remark XIV .1.It is important to emphasize the main difference between problem(37)and the one in multitask LQR setting [6]. In that setting, the bisimulation measure involves the term p λmin(M)in the denominator, which requires an epigraph reformulation and a ...

work page

[1] [1]

arXiv preprint arXiv:2310.01362 , year=

L. Wang, K. Zhang, A. Zhou, M. Simchowitz, and R. Tedrake, “Fleet Policy Learning via Weight Merging and An Application to Robotic Tool-Use,”arXiv preprint arXiv:2310.01362, 2023

work page arXiv 2023

[2] [2]

Deep reinforcement learning for autonomous driving: A survey,

B. R. Kiran, I. Sobh, V . Talpaert, P. Mannion, A. A. Al Sallab, S. Yo- gamani, and P. P ´erez, “Deep reinforcement learning for autonomous driving: A survey,”IEEE transactions on intelligent transportation systems, vol. 23, no. 6, pp. 4909–4926, 2021

work page 2021

[3] [3]

Distributed control applications within sensor networks,

B. Sinopoli, C. Sharp, L. Schenato, S. Schaffert, and S. S. Sastry, “Distributed control applications within sensor networks,”Proceedings of the IEEE, vol. 91, no. 8, pp. 1235–1246, 2003

work page 2003

[4] [4]

K. Zhou, J. C. Doyle, and K. Glover,Robust and Optimal Control. Englewood Cliffs, NJ, USA: Prentice Hall, 1996

work page 1996

[5] [5]

Model-free Learning with Heterogeneous Dynamical Systems: A Federated LQR Approach,

H. Wang, L. F. Toso, A. Mitra, and J. Anderson, “Model-free Learning with Heterogeneous Dynamical Systems: A Federated LQR Approach,”arXiv preprint arXiv:2308.11743, 2023

work page arXiv 2023

[6] [6]

Policy gradient bounds in multitask LQR,

C. Stamouli, L. F. Toso, A. Tsiamis, G. J. Pappas, and J. Anderson, “Policy gradient bounds in multitask LQR,”IEEE Control Systems Letters, 2025

work page 2025

[7] [7]

Policy gradient for LQR with domain randomization,

T. Fujinami, B. D. Lee, N. Matni, and G. J. Pappas, “Policy gradient for LQR with domain randomization,” in2025 IEEE 64th Conference on Decision and Control (CDC). IEEE, 2025, pp. 4174–4181

work page 2025

[8] [8]

Meta-learning linear quadratic regulators: a policy gradient maml approach for model-free LQR,

L. F. Toso, D. Zhan, J. Anderson, and H. Wang, “Meta-learning linear quadratic regulators: a policy gradient maml approach for model-free LQR,” in6th Annual Learning for Dynamics & Control Conference. PMLR, 2024, pp. 902–915

work page 2024

[9] [9]

On the Convergence of Policy Gradient for Designing a Linear Quadratic Regulator by Leveraging a Proxy System,

L. Ye, A. Mitra, and V . Gupta, “On the Convergence of Policy Gradient for Designing a Linear Quadratic Regulator by Leveraging a Proxy System,” in2024 IEEE 63rd Conference on Decision and Control (CDC). IEEE, 2024, pp. 6016–6021

work page 2024

[10] [10]

Global convergence of policy gradient methods for the linear quadratic regulator,

M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” in International conference on machine learning. PMLR, 2018, pp. 1467–1476

work page 2018

[11] [11]

Convergence and sample complexity of gradient methods for the model-free linear–quadratic regulator problem,

H. Mohammadi, A. Zare, M. Soltanolkotabi, and M. R. Jovanovi ´c, “Convergence and sample complexity of gradient methods for the model-free linear–quadratic regulator problem,”IEEE Transactions on Automatic Control, vol. 67, no. 5, pp. 2435–2450, 2021

work page 2021

[12] [12]

Learning optimal controllers for linear systems with multiplicative noise via policy gradient,

B. Gravell, P. M. Esfahani, and T. Summers, “Learning optimal controllers for linear systems with multiplicative noise via policy gradient,”IEEE Transactions on Automatic Control, vol. 66, no. 11, pp. 5283–5298, 2020

work page 2020

[13] [13]

Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies,

B. Hu, K. Zhang, N. Li, M. Mesbahi, M. Fazel, and T. Bas ¸ar, “Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies,”Annual Review of Control, Robotics, and Autonomous Sys- tems, vol. 6, pp. 123–158, 2023

work page 2023

[14] [14]

On the lack of gradient domination for linear quadratic Gaussian problems with incomplete state information,

H. Mohammadi, M. Soltanolkotabi, and M. R. Jovanovi ´c, “On the lack of gradient domination for linear quadratic Gaussian problems with incomplete state information,” in2021 60th IEEE Conference on Decision and Control (CDC). IEEE, 2021, pp. 1120–1124

work page 2021

[15] [15]

Analysis of the optimization landscape of linear quadratic gaussian (LQG) control,

Y . Tang, Y . Zheng, and N. Li, “Analysis of the optimization landscape of linear quadratic gaussian (LQG) control,” inLearning for dynamics and control. PMLR, 2021, pp. 599–610

work page 2021

[16] [16]

Globally convergent policy gradient methods for linear quadratic control of partially observed systems,

F. Zhao, X. Fu, and K. You, “Globally convergent policy gradient methods for linear quadratic control of partially observed systems,” IFAC-PapersOnLine, vol. 56, no. 2, pp. 5506–5511, Jan. 2023

work page 2023

[17] [17]

On the Gradient Domination of the LQG Problem,

K. Fallah, L. F. Toso, and J. Anderson, “On the Gradient Domination of the LQG Problem,”arXiv preprint arXiv:2507.09026, 2025

work page arXiv 2025

[18] [18]

Asynchronous heterogeneous linear quadratic regulator design,

L. F. Toso, H. Wang, and J. Anderson, “Asynchronous heterogeneous linear quadratic regulator design,” in2024 IEEE 63rd Conference on Decision and Control (CDC). IEEE, 2024, pp. 801–808

work page 2024

[19] [19]

Derivative-free methods for policy optimization: Guarantees for linear quadratic systems,

D. Malik, A. Pananjady, K. Bhatia, K. Khamaru, P. Bartlett, and M. Wainwright, “Derivative-free methods for policy optimization: Guarantees for linear quadratic systems,” inThe 22nd international conference on artificial intelligence and statistics. PMLR, 2019, pp. 2916–2925

work page 2019

[20] [20]

Coreset-Based Task Selection for Sample-Efficient Meta-Reinforcement Learning,

D. Zhan, L. F. Toso, and J. Anderson, “Coreset-Based Task Selection for Sample-Efficient Meta-Reinforcement Learning,”arXiv preprint arXiv:2502.02332, 2025

work page arXiv 2025

[21] [21]

Adversarially Robust Multi- task Adaptive Control,

K. Fallah, L. F. Toso, and J. Anderson, “Adversarially Robust Multi- task Adaptive Control,”arXiv preprint arXiv:2511.05444, 2025

work page arXiv 2025

[22] [22]

Approximate Bisimulation: A Bridge Between Computer Science and Control Theory,

A. Girard and G. J. Pappas, “Approximate Bisimulation: A Bridge Between Computer Science and Control Theory,”European Journal of Control, vol. 17, no. 5-6, pp. 568–578, 2011

work page 2011

[23] [23]

Theoretical convergence of multi- step model-agnostic meta-learning,

K. Ji, J. Yang, and Y . Liang, “Theoretical convergence of multi- step model-agnostic meta-learning,”The Journal of Machine Learning Research, vol. 23, no. 1, pp. 1317–1357, 2022

work page 2022

[24] [24]

A theoretical understanding of gradient bias in meta- reinforcement learning,

B. Liu, X. Feng, J. Ren, L. Mai, R. Zhu, H. Zhang, J. Wang, and Y . Yang, “A theoretical understanding of gradient bias in meta- reinforcement learning,”Advances in Neural Information Processing Systems, vol. 35, pp. 31 059–31 072, 2022

work page 2022

[25] [25]

Probabilistic Performance Guarantees for Multi-Task Reinforcement Learning,

Y . Schnitzer, M. Jackermeier, A. Abate, and D. Parker, “Probabilistic Performance Guarantees for Multi-Task Reinforcement Learning,” arXiv preprint arXiv:2602.02098, 2026

work page arXiv 2026

[26] [26]

Generalization bounds for meta-learning via pac-bayes and uniform stability,

A. Farid and A. Majumdar, “Generalization bounds for meta-learning via pac-bayes and uniform stability,”Advances in neural information processing systems, vol. 34, pp. 2173–2186, 2021

work page 2021

[27] [27]

Transformers As Generalizable Optimal Controllers,

T. B. Mohaya, M. F. AL-Sunni, J. M. Dolan, and P. Seiler, “Transformers As Generalizable Optimal Controllers,”arXiv preprint arXiv:2603.14910, 2026

work page arXiv 2026

[28] [28]

Output-feedback synthesis orbit geom- etry: Quotient manifolds and LQG direct policy optimization,

S. Kraisler and M. Mesbahi, “Output-feedback synthesis orbit geom- etry: Quotient manifolds and LQG direct policy optimization,”IEEE Control Systems Letters, vol. 8, pp. 1577–1582, 2024

work page 2024

[29] [29]

G. H. Hardy,Divergent series. American Mathematical Society, 2024, vol. 334

work page 2024

[30] [30]

Approximation metrics based on probabilistic bisimulations for general state-space markov processes: a survey,

A. Abate, “Approximation metrics based on probabilistic bisimulations for general state-space markov processes: a survey,”Electronic Notes in Theoretical Computer Science, vol. 297, pp. 3–25, 2013

work page 2013

[31] [31]

Layered multirate control of constrained linear systems,

C. Stamouli, A. Tsiamis, M. Morari, and G. J. Pappas, “Layered multirate control of constrained linear systems,” in2025 IEEE 64th Conference on Decision and Control (CDC). IEEE, 2025, pp. 3027– 3034

work page 2025

[32] [32]

Compo- sitional abstractions of interconnected discrete-time stochastic control systems,

A. Lavaei, S. E. Z. Soudjani, R. Majumdar, and M. Zamani, “Compo- sitional abstractions of interconnected discrete-time stochastic control systems,” in2017 IEEE 56th Annual Conference on Decision and Control (CDC). IEEE, 2017, pp. 3551–3556

work page 2017

[33] [33]

Vershynin,High-dimensional probability: An introduction with applications in data science

R. Vershynin,High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018, vol. 47

work page 2018

[34] [34]

Convergence and sample complexity of policy gradient methods for stabilizing linear systems,

F. Zhao, X. Fu, and K. You, “Convergence and sample complexity of policy gradient methods for stabilizing linear systems,”IEEE Transactions on Automatic Control, 2024

work page 2024

[35] [35]

Learning over all stabilizing nonlinear controllers for a partially-observed linear system,

R. Wang, N. H. Barbara, M. Revay, and I. R. Manchester, “Learning over all stabilizing nonlinear controllers for a partially-observed linear system,”IEEE Control Systems Letters, vol. 7, pp. 91–96, 2022

work page 2022

[36] [36]

CVXPY: A Python-embedded modeling language for convex optimization,

S. Diamond and S. Boyd, “CVXPY: A Python-embedded modeling language for convex optimization,”Journal of Machine Learning Research, vol. 17, no. 83, pp. 1–5, 2016

work page 2016

[37] [37]

Rate-optimal non- asymptotics for the quadratic prediction error method,

C. Stamouli, I. Ziemann, and G. J. Pappas, “Rate-optimal non- asymptotics for the quadratic prediction error method,” in2024 IEEE 63rd Conference on Decision and Control (CDC). IEEE, 2024, pp. 5723–5730

work page 2024

[38] [38]

User-friendly tail bounds for sums of random matrices,

J. A. Tropp, “User-friendly tail bounds for sums of random matrices,” Foundations of computational mathematics, vol. 12, pp. 389–434, 2012. XI. APPENDIXROADMAP This appendix provides detailed proofs, technical derivations, and additional experimental results supporting the main text. Section XII includes additional experimental details, such as system dyn...

work page 2012

[39] [39]

ComputeF (i) eK =A (i) eK ⊗A (i) eK ,C (i) eK =S (i)†⊤ ⋆ ⊗E (i) eK , andν (i) = vec(Σ(i) ν )for each taski, and form the joint quantities F (ij) eK = diag F (i) eK , F (j) eK , C (ij) eK = h C(i) eK −C(j) eK i ,andν (ij) = ν(i) ν(j)

work page

[40] [40]

Setλ (ij) eK andη (ij) eK via (35)-(36), and compute the derived constantsζ= 1 + (η (ij) eK )−1 andλ ′ =λ (ij) eK −η (ij) eK (1−λ (ij) eK )

work page

[41] [41]

Solve the SDP (37) to obtainM (ij) eK

work page

[42] [42]

Remark XIV .1.It is important to emphasize the main difference between problem(37)and the one in multitask LQR setting [6]

Evaluate the bisimulation-based heterogeneity measure via bij(eK) := ζν (ij)⊤M(ij) eK ν(ij) λ′ . Remark XIV .1.It is important to emphasize the main difference between problem(37)and the one in multitask LQR setting [6]. In that setting, the bisimulation measure involves the term p λmin(M)in the denominator, which requires an epigraph reformulation and a ...

work page