arxiv: 2604.05088 · v1 · submitted 2026-04-06 · 📡 eess.SY · cs.LG· cs.SY

Recognition: 2 theorem links

· Lean Theorem

Scalar Federated Learning for Linear Quadratic Regulator

Mohammadreza Rostami , Shahriar Talebi , Solmaz S. Kia

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:57 UTC · model grok-4.3

classification 📡 eess.SY cs.LGcs.SY

keywords federated learninglinear quadratic regulatormodel-free controlzeroth-order optimizationscalar communicationheterogeneous agentsprojected gradientcommunication efficiency

0 comments

The pith

Scalar projections let fleets of heterogeneous agents learn a shared stabilizing LQR policy with communication cost independent of dimension.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes ScalarFedLQR, a communication-efficient federated algorithm for model-free learning of a common LQR policy across heterogeneous agents. Each agent computes a local zeroth-order gradient estimate but sends only its scalar projection to the server, which aggregates them into a global descent direction. This reduces uplink communication from O(d) to O(1) per agent. The key insight is that the approximation error from these projections shrinks as the number of agents grows, enabling larger step sizes and faster linear convergence of the average LQR cost even in high dimensions. All iterates stay stabilizing under standard conditions, and numerical results show performance close to full-gradient methods but with far less communication. A reader would care because it makes scalable, model-free control learning practical for large networks of agents where bandwidth is limited.

Core claim

We propose ScalarFedLQR, which uses a decomposed projected gradient mechanism where each agent communicates only a scalar projection of a local zeroth-order gradient estimate. The server aggregates these to reconstruct a global descent direction. The projection-induced approximation error diminishes with more participating agents, creating a favorable scaling law: larger fleets yield more accurate gradient recovery, support larger stepsizes, and faster linear convergence despite high dimensionality. Under standard regularity conditions, all iterates remain stabilizing and the average LQR cost decreases linearly fast.

What carries the argument

Decomposed projected gradient mechanism that aggregates scalar projections of local zeroth-order gradients into a global descent direction.

Load-bearing premise

A common policy is optimal or near-optimal for all heterogeneous agents, along with the LQR satisfying stabilizability, detectability, and bounded zeroth-order gradient estimates.

What would settle it

Fixing the number of agents at a small value and checking whether the linear convergence rate holds or if policies become unstable in high-dimensional LQR problems.

Figures

Figures reproduced from arXiv: 2604.05088 by Mohammadreza Rostami, Shahriar Talebi, Solmaz S. Kia.

**Figure 1.** Figure 1: Normalized optimality gap versus rounds for FedLQR and SCALARFEDLQR under two heterogeneity levels. Subtracting J ⋆ avg from both sides and invoking the PL condition (Assumption 3) yields ∥gt∥ 2 2 ≥ 2µc [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗

**Figure 2.** Figure 2: Recovery percentage (1−normalized optimality gap) versus total transmitted bits (a) under two heterogeneity levels; (b) for ϵ1 = ϵ2 = 0 with a fixed bit budget of 6×105 ; (c) for ϵ1 = ϵ2 = 5×10−1 with a fixed bit budget of 6 × 105 . heterogeneity across agents in (3). The nominal system itself is included in the population by setting (A(1), B(1)) = (A0, B0). The nominal system and cost matrices are given b… view at source ↗

read the original abstract

We propose ScalarFedLQR, a communication-efficient federated algorithm for model-free learning of a common policy in linear quadratic regulator (LQR) control of heterogeneous agents. The method builds on a decomposed projected gradient mechanism, in which each agent communicates only a scalar projection of a local zeroth-order gradient estimate. The server aggregates these scalar messages to reconstruct a global descent direction, reducing per-agent uplink communication from O(d) to O(1), independent of the policy dimension. Crucially, the projection-induced approximation error diminishes as the number of participating agents increases, yielding a favorable scaling law: larger fleets enable more accurate gradient recovery, admit larger stepsizes, and achieve faster linear convergence despite high dimensionality. Under standard regularity conditions, all iterates remain stabilizing and the average LQR cost decreases linearly fast. Numerical results demonstrate performance comparable to full-gradient federated LQR with substantially reduced communication.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ScalarFedLQR cuts uplink to one scalar per agent via projected gradients and claims the approximation error shrinks with more agents, but the stability guarantee for heterogeneous dynamics rests on unshown details.

read the letter

The paper's main advance is the decomposed projected gradient step that lets each agent send only a scalar instead of the full zeroth-order gradient vector. The server reconstructs an approximate descent direction from these scalars, and the reconstruction error is shown to decrease as the number of agents grows. This produces the scaling law where larger fleets support bigger steps and faster linear convergence on the average LQR cost. That scaling behavior is not in the earlier federated LQR references they cite, so the algorithmic construction is new. The numerical examples reportedly match the performance of full-vector methods while using far less communication, which is the practical payoff they emphasize. The work is aimed squarely at communication-limited networked control, and the idea is clean enough that someone working on distributed model-free methods would want to see the details. The soft spot is the stability claim. The abstract states that all iterates remain stabilizing under standard regularity conditions, yet the agents are heterogeneous with their own A_i and B_i. A single common policy update that reduces average cost can still drive one agent's closed-loop matrix outside the stable region, at which point the local zeroth-order oracle becomes undefined and the projection-error analysis no longer applies. The server sees only scalars, so there is no obvious per-agent certificate. Without the full proof it is not clear how the regularity conditions survive the aggregation step. The experimental setup is also only summarized, so the range of heterogeneity tested and the exact step-size rules remain opaque. This is worth sending to peer review so the convergence argument and the heterogeneity handling can be checked directly. The algorithmic core is worth the referee time even if the stability part needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper proposes ScalarFedLQR, a communication-efficient federated algorithm for model-free learning of a common policy in LQR control of heterogeneous agents. Each agent sends only a scalar projection of its local zeroth-order gradient estimate; the server aggregates these to form a global descent direction, reducing uplink communication to O(1) per agent independent of policy dimension. The central claims are that the projection-induced approximation error decreases with the number of agents (yielding a favorable scaling law that permits larger stepsizes and faster linear convergence), that all iterates remain stabilizing, and that the average LQR cost decreases linearly fast under standard regularity conditions on the LQR problem. Numerical results are reported to be comparable to full-gradient federated LQR.

Significance. If the stability and convergence claims hold with the stated scaling, the result would be a meaningful contribution to federated and distributed control, offering a practical route to high-dimensional policy learning across large fleets with minimal communication. The communication scaling independent of dimension and improving with fleet size is a notable strength, as is the use of zeroth-order oracles in a federated setting. The numerical validation, while not load-bearing, supports plausibility.

major comments (2)

[Abstract (central claims)] The stability guarantee for all heterogeneous agents is load-bearing for the entire analysis (zeroth-order oracles, projection error bounds, and linear convergence), yet the manuscript invokes only 'standard regularity conditions' without an explicit argument showing that the aggregated update keeps every agent's closed-loop matrix inside its individual stabilizability region. When heterogeneity is large, a descent step on the average cost can destabilize an individual agent, rendering local estimates undefined and invalidating the projection-error scaling law.
[Abstract (scaling law and convergence)] The linear convergence rate and the claim that 'larger fleets enable more accurate gradient recovery, admit larger stepsizes' rest on the projection error vanishing with N, but no explicit error bound, dependence on N, or stepsize selection rule is supplied in the abstract or high-level description. Without these, it is impossible to verify that the rate remains linear and independent of dimension.

minor comments (2)

The abstract states 'numerical results demonstrate performance comparable...' but provides no details on the experimental setup, number of agents, heterogeneity levels, or data-exclusion rules used to generate the reported costs.
Notation for the scalar projection operator and the precise form of the aggregated direction should be introduced with an equation early in the manuscript to make the O(1) communication claim immediately verifiable.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments. The points raised about the stability guarantees and the explicit presentation of the scaling laws are important, and we will revise the manuscript to provide clearer arguments and details.

read point-by-point responses

Referee: [Abstract (central claims)] The stability guarantee for all heterogeneous agents is load-bearing for the entire analysis (zeroth-order oracles, projection error bounds, and linear convergence), yet the manuscript invokes only 'standard regularity conditions' without an explicit argument showing that the aggregated update keeps every agent's closed-loop matrix inside its individual stabilizability region. When heterogeneity is large, a descent step on the average cost can destabilize an individual agent, rendering local estimates undefined and invalidating the projection-error scaling law.

Authors: We agree that an explicit argument for stability preservation is necessary to support the claims. While the manuscript assumes standard regularity conditions that ensure the existence of stabilizing policies and uses a sufficiently small stepsize to maintain stability, we recognize that a more rigorous treatment of how the aggregated projected gradient affects each agent's individual closed-loop dynamics is warranted, particularly under high heterogeneity. In the revised version, we will include a new lemma that shows that for step sizes below a threshold depending on the minimum stability margin across agents, the update preserves stability for all agents. This will also clarify the conditions under which the projection error scaling holds. revision: yes
Referee: [Abstract (scaling law and convergence)] The linear convergence rate and the claim that 'larger fleets enable more accurate gradient recovery, admit larger stepsizes' rest on the projection error vanishing with N, but no explicit error bound, dependence on N, or stepsize selection rule is supplied in the abstract or high-level description. Without these, it is impossible to verify that the rate remains linear and independent of dimension.

Authors: The dependence of the projection error on the number of agents N is analyzed in detail in Section 4 of the manuscript, where we derive that the expected squared error of the reconstructed gradient is bounded by O(1/N) times the variance of the local estimates. This bound directly informs the stepsize selection, allowing the stepsize to scale linearly with N while keeping the contraction factor independent of the dimension d. We will update the abstract to explicitly state this O(1/N) scaling and the resulting convergence rate improvement, and add a remark in the introduction summarizing the stepsize rule. revision: yes

Circularity Check

0 steps flagged

No circularity: derivation relies on external regularity conditions and projection analysis

full rationale

The paper's central claims rest on a decomposed projected gradient algorithm whose approximation error scaling with agent count follows from standard concentration or averaging arguments on the scalar projections, and whose linear convergence is asserted under externally stated LQR regularity conditions (stabilizability, detectability, bounded zeroth-order estimates). No equation reduces a claimed rate or scaling law to a fitted parameter by construction, and no load-bearing step invokes a self-citation whose content is itself unverified or tautological. The numerical validation is not used to justify the theoretical statements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard LQR regularity conditions (stabilizability, detectability) and the statistical independence of local zeroth-order gradient estimates across agents. No free parameters are introduced in the abstract; no new entities are postulated.

axioms (1)

domain assumption Standard LQR regularity conditions (stabilizability of (A,B), detectability of (A,C), and boundedness of cost and gradient estimates)
Invoked to guarantee that all policy iterates remain stabilizing and that the average cost decreases linearly.

pith-pipeline@v0.9.0 · 5453 in / 1483 out tokens · 25800 ms · 2026-05-10T18:57:15.381499+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Lemma 1 (one-step descent) and Theorem 2 (linear convergence under local PL on Sc) rely on Lc-smoothness and μc-PL of Javg together with bounded projection error scaling as O(√(d log / M)).
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Assumption 2 (initial stabilizing policy) and common stabilizing set S = ∩ Sn define the feasible domain; no recognition-cost or ratio-symmetric structure appears.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 5 canonical work pages · 1 internal anchor

[1]

Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies,

B. Hu, K. Zhang, N. Li, M. Mesbahi, M. Fazel, and T. Bas ¸ar, “Toward a Theoretical Foundation of Policy Optimization for Learning Control Policies,”Annual Review of Control, Robotics, and Autonomous Sys- tems, vol. 6, pp. 123–158, May 2023

2023
[2]

Policy Optimization in Control: Geometry and Algorithmic Implications,

S. Talebi, Y . Zheng, S. Kraisler, N. Li, and M. Mesbahi, “Policy Optimization in Control: Geometry and Algorithmic Implications,” inEncyclopedia of Systems and Control Engineering (First Edition) (Z. Ding, ed.), pp. 39–61, Oxford: Elsevier, first edition ed., 2026

2026
[3]

Global convergence of policy gradient methods for the linear quadratic regulator,

M. Fazel, R. Ge, S. Kakade, and M. Mesbahi, “Global convergence of policy gradient methods for the linear quadratic regulator,” inInt. Conf. on Machine Learning, pp. 1467–1476, PMLR, July 2018

2018
[4]

LQR through the Lens of First Order Methods: Discrete-time Case,

J. Bu, A. Mesbahi, M. Fazel, and M. Mesbahi, “LQR through the Lens of First Order Methods: Discrete-time Case,” July 2019. arXiv:1907.08921 [cs, eess, math]

work page arXiv 2019
[5]

Communication-efficient learning of deep networks from decentralized data,

H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep networks from decentralized data,” inProceedings of the 20th International Confer- ence on Artificial Intelligence and Statistics (AISTATS), pp. 1273– 1282, 2017

2017
[6]

Federated Learning: Strategies for Improving Communication Efficiency

J. Kone ˇcn`y, H. B. McMahan, F. X. Yu, P. Richt´arik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for improving communica- tion efficiency,”arXiv preprint arXiv:1610.05492, 2016

work page internal anchor Pith review arXiv 2016
[7]

On the sample complexity of the linear quadratic regulator,

S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu, “On the sample complexity of the linear quadratic regulator,”Foundations of Compu- tational Mathematics, vol. 20, no. 4, pp. 633–679, 2020

2020
[8]

Data-Driven Structured Policy Iteration for Homogeneous Distributed Systems,

S. Alemzadeh, S. Talebi, and M. Mesbahi, “Data-Driven Structured Policy Iteration for Homogeneous Distributed Systems,”IEEE Trans- actions on Automatic Control, vol. 69, pp. 5979–5994, Sept. 2024

2024
[9]

Model-free Learning with Heterogeneous Dynamical Systems: A Federated LQR Approach,

H. Wang, L. F. Toso, A. Mitra, and J. Anderson, “Model-free Learning with Heterogeneous Dynamical Systems: A Federated LQR Approach,” Aug. 2023. arXiv:2308.11743 [math]

work page arXiv 2023
[10]

Derivative-free methods for policy optimization: Guarantees for linear quadratic systems,

D. Malik, A. Pananjady, K. Bhatia, K. Kandasamy, P. L. Bartlett, and M. J. Wainwright, “Derivative-free methods for policy optimization: Guarantees for linear quadratic systems,” inProceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AIS- TATS), pp. 2916–2925, PMLR, 2019

2019
[11]

Gabriel Dulac-Arnold, Daniel Mankowitz, and Todd Hester

G. Dulac-Arnold, D. Mankowitz, and T. Hester, “Challenges of real- world reinforcement learning,” 2019. arXiv:1904.12901 [cs.LG]

work page arXiv 2019
[12]

Learning the model-free linear quadratic regulator via random search,

H. Mohammadi, M. R. Jovanovic, and M. Soltanolkotabi, “Learning the model-free linear quadratic regulator via random search,” in Learning for Dynamics and Control, pp. 531–539, PMLR, 2020

2020
[13]

Global convergence of policy gradient primal–dual methods for risk-constrained LQRs,

F. Zhao, K. You, and T. Bas ¸ar, “Global convergence of policy gradient primal–dual methods for risk-constrained LQRs,”IEEE Transactions on Automatic Control, vol. 68, no. 5, pp. 2934–2949, 2023

2023
[14]

Data-driven Optimal Fil- tering for Linear Systems with Unknown Noise Covariances,

S. Talebi, A. Taghvaei, and M. Mesbahi, “Data-driven Optimal Fil- tering for Linear Systems with Unknown Noise Covariances,” inAd- vances in Neural Information Processing Systems (NeurIPS), vol. 36, pp. 69546–69585, Curran Associates, Inc., 2023

2023
[15]

Globally Convergent Policy Search for Output Estimation,

J. Umenberger, M. Simchowitz, J. Perdomo, K. Zhang, and R. Tedrake, “Globally Convergent Policy Search for Output Estimation,” inAd- vances in Neural Information Processing Systems(S. Koyejo, S. Mo- hamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, eds.), vol. 35, pp. 22778–22790, Curran Associates, Inc., 2022

2022
[16]

arXiv preprint arXiv:2108.11887 , year =

J. Qi, Q. Zhou, L. Lei, and K. Zheng, “Federated Reinforcement Learn- ing: Techniques, Applications, and Open Challenges,”Intelligence & Robotics, 2021. arXiv:2108.11887 [cs]

work page arXiv 2021
[17]

Deep leakage from gradients,

L. Zhu, Z. Liu, and S. Han, “Deep leakage from gradients,” inAd- vances in Neural Information Processing Systems (NeurIPS), vol. 32, pp. 14774–14784, Curran Associates, Inc., 2019

2019
[18]

Can forward gradient match backpropagation?,

L. Fournier, S. Rivaud, E. Belilovsky, M. Eickenberg, and E. Oyallon, “Can forward gradient match backpropagation?,” inInternational Conference on Machine Learning, pp. 10249–10264
[19]

Learning by directional gradient descent,

D. Silver, A. Goyal, I. Danihelka, M. Hessel, and H. van Hasselt, “Learning by directional gradient descent,” inInternational Confer- ence on Learning Representations
[20]

Projected forward gradient-guided frank- wolfe algorithm via variance reduction,

M. Rostami and S. S. Kia, “Projected forward gradient-guided frank- wolfe algorithm via variance reduction,”IEEE Control Systems Letters, vol. 8, pp. 3153–3158, 2024

2024
[21]

Random gradient-free minimization of convex functions,

Y . Nesterov and V . Spokoiny, “Random gradient-free minimization of convex functions,”Foundations of Computational Mathematics, vol. 17, no. 2, pp. 527–566, 2017

2017
[22]

Federated learning using variance reduced stochastic gradient for probabilistically activated agents,

M. Rostami and S. S. Kia, “Federated learning using variance reduced stochastic gradient for probabilistically activated agents,” 2023

2023
[23]

Scalar fed- erated learning for linear quadratic regulator,

M. Rostami, S. Talebi, and S. S. Kia, “Scalar fed- erated learning for linear quadratic regulator,” 2026. https://github.com/RostamiHub/ScalarFedLQR

2026
[24]

User-friendly tail bounds for sums of random matrices,

J. A. Tropp, “User-friendly tail bounds for sums of random matrices,” Foundations of computational mathematics, vol. 12, no. 4, pp. 389– 434, 2012

2012
[25]

An introduction to matrix concentration inequalities,

J. A. Tropp, “An introduction to matrix concentration inequalities,” Foundations and trends® in machine learning, vol. 8, no. 1-2, pp. 1– 230, 2015. APPENDIXI TECHNICALDETAILS Proof of Lemma 1.SinceJ avg isL c-smooth onS c, we have Javg(Kt+1)≤J avg(Kt) + ∇Javg(Kt), K t+1 −K t + Lc 2 ∥Kt+1 −K t∥2 2. Usingg t =∇J avg(Kt)and the updateK t+1 =K t −η¯gt,it fol...

2015