Recognition: no theorem link
Collaborative Yet Personalized Policy Training: Single-Timescale Federated Actor-Critic
Pith reviewed 2026-05-15 02:03 UTC · model grok-4.3
The pith
Agents share a linear subspace for collaboration while keeping personalized policies, yielding finite-time convergence guarantees with linear speedup in the number of agents under single-timescale Markovian updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under canonical single-timescale updates with Markovian sampling, the federated actor-critic framework with a shared linear subspace and personalized local heads achieves finite-time convergence: the critic error converges to zero at rate Õ(1/((1-γ)^4 √(TK))) and the policy gradient norm at Õ(1/((1-γ)^6 √(TK))), delivering linear speedup in K despite heterogeneous transition kernels, distinct Markovian trajectories, and coupled policy-critic dynamics. The proof relies on a new joint linear approximation framework, perturbation analysis for projected subspace updates and QR steps, and conditional mixing arguments for heterogeneous noise, together with fine-grained bounds on the discrepancies between function evaluations under Markovian sampling and under temporally frozen policies.
What carries the argument
The joint linear approximation framework that tracks the coupled evolution of the shared subspace projection, local critic heads, and local actors, together with the perturbation analysis of the projected subspace updates and the conditional mixing bounds for heterogeneous Markovian noise.
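To make the moving parts concrete, here is a minimal sketch of one communication round of such a scheme, assuming a shared orthonormal subspace B, per-agent critic heads w_k, and per-agent actor parameters theta_k; the update rules, step sizes, and the synthetic sampling routine are illustrative placeholders, not the paper's algorithm.

```python
# Minimal sketch (assumed, not the paper's algorithm) of one round of federated
# actor-critic with a shared linear subspace B and personalized heads/actors.
import numpy as np

rng = np.random.default_rng(0)
K, d, r = 4, 16, 3                          # agents, feature dim, subspace rank
gamma = 0.99                                # discount factor
eta_B, eta_w, eta_th = 0.05, 0.1, 0.01      # single-timescale step sizes

B = np.linalg.qr(rng.normal(size=(d, r)))[0]     # shared subspace (orthonormal columns)
w = [rng.normal(size=r) for _ in range(K)]       # personalized critic heads
theta = [rng.normal(size=d) for _ in range(K)]   # personalized actor parameters


def sample_transition(k):
    """Hypothetical stand-in for agent k's Markovian sample: (phi(s), reward, phi(s'))."""
    phi, phi_next = rng.normal(size=d), rng.normal(size=d)
    return phi, 1.0 + 0.1 * k, phi_next          # per-agent reward shift is illustrative


def federated_round():
    global B
    B_updates = []
    for k in range(K):
        phi, reward, phi_next = sample_transition(k)
        # TD error of the subspace critic V_k(s) ~ phi(s)^T B w_k
        delta = reward + gamma * phi_next @ B @ w[k] - phi @ B @ w[k]
        w[k] += eta_w * delta * (B.T @ phi)                      # local head update
        B_updates.append(eta_B * delta * np.outer(phi, w[k]))    # local subspace direction
        theta[k] += eta_th * delta * phi                         # local actor step
    # Server: average subspace directions, then re-orthonormalize with a QR step
    B = np.linalg.qr(B + np.mean(B_updates, axis=0))[0]


for _ in range(100):
    federated_round()
print("shared subspace:", B.shape, "head norm (agent 0):", np.linalg.norm(w[0]))
```

The point of the sketch is only the shape of the coupling the analysis must handle: head, subspace, and actor all move at the same rate, and the server's QR step perturbs the subspace on which every other update depends.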
If this is right
- The critic error vanishes at the stated rate even under Markovian sampling and coupled updates.
- The policy gradient norm vanishes at its stated rate, implying convergence to stationary policies.
- Linear speedup in K is obtained without requiring identical environments or multiple timescales.
- The learned shared trunk supports downstream transfer to new tasks.
- Empirical gains appear over both single-agent PPO and standard FedAvg PPO on Hopper-v5 with action-map heterogeneity.
Where Pith is reading between the lines
- The subspace-sharing idea could be tested with nonlinear function approximators if the perturbation analysis can be extended beyond linear projections.
- Similar rates may hold for other single-timescale methods such as natural actor-critic once the same mixing and perturbation tools are applied.
- In practice the approach suggests that federated robotic fleets can share a low-dimensional feature trunk while each robot retains its own policy head, reducing communication while preserving adaptation.
Load-bearing premise
A single common linear subspace suffices to capture the shared structure across heterogeneous environments, with the remaining differences handled by local heads, and that the perturbation and mixing arguments continue to hold when policy updates are coupled to the subspace estimates.
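A worked statement of this premise, in notation consistent with the φ(s)ᵀBω form visible in the paper's excerpts; the rank r, the subspace B★, and the residual ε are symbols introduced here for illustration rather than the paper's definitions.

```latex
% Assumed shared-subspace critic parameterization and the expressiveness premise.
% Only the product form \phi(s)^\top B \omega^k is attested in the excerpts; the
% residual \varepsilon and its definition below are illustrative.
V^{\pi_k}(s) \;\approx\; \phi(s)^\top B\,\omega^{k},
\qquad B \in \mathbb{R}^{d\times r},\ \omega^{k}\in\mathbb{R}^{r},\ r \ll d,
\qquad
\varepsilon \;:=\; \max_{k}\,\inf_{\omega\in\mathbb{R}^{r}}\,\sup_{s}
\bigl|V^{\pi_k}(s)-\phi(s)^\top B^{\star}\omega\bigr| .
```

The premise is that ε is small for a single B★ across all agents, so the remaining heterogeneity is absorbed by the local heads ω^k rather than requiring per-agent subspaces.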
What would settle it
A controlled experiment, run with clearly heterogeneous transition kernels, in which the observed critic error or policy-gradient norm fails to shrink proportionally to 1/√K as the number of agents K grows with the per-agent round count T held fixed (equivalently, in which the error worsens as K grows with the total sample budget TK held fixed).
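A minimal sketch of such a test, assuming a training routine `run_federated_ac` (hypothetical name) that returns the final critic error; the synthetic error model inside it merely mimics the claimed rate and would be replaced by actual runs on heterogeneous MDPs.

```python
# Sketch of the speedup test: fix the per-agent round count T, grow K, and check
# whether the error shrinks roughly like 1/sqrt(K). All names here are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
T = 50_000  # per-agent rounds, held fixed across runs


def run_federated_ac(num_agents: int, rounds: int) -> float:
    """Hypothetical placeholder for a full training run returning the final critic error."""
    return (1.0 + 0.05 * rng.standard_normal()) / np.sqrt(rounds * num_agents)


agent_counts = np.array([1, 2, 4, 8, 16])
errors = np.array([run_federated_ac(k, T) for k in agent_counts])

# Linear speedup predicts log(error) ~ -0.5 * log(K) + const at fixed T;
# a fitted slope far from -0.5 under clear heterogeneity would challenge the claim.
slope = np.polyfit(np.log(agent_counts), np.log(errors), 1)[0]
print(f"fitted slope of log(error) vs log(K): {slope:.2f} (prediction: about -0.5)")
```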
Original abstract
Despite the popularity of the actor-critic method and the practical needs of collaborative policy training, existing works typically either overlook environmental heterogeneity or give up personalization altogether by training a single shared policy across all agents. We consider a federated actor-critic framework in which agents share a common linear subspace representation while maintaining personalized local policy components, and agents iteratively estimate the common subspace, local critic heads, and local policies (i.e., actors). Under canonical single-timescale updates with Markovian sampling, we establish finite-time convergence via a novel joint linear approximation framework. Specifically, we show that the critic error converges to zero at the rate of $\tilde{\mathcal{O}}(1/((1-\gamma)^4\sqrt{TK}))$, and the policy gradient norm converges to zero at the rate of $\tilde{\mathcal{O}}(1/((1-\gamma)^6\sqrt{TK}))$, where $T$ is the number of rounds, $K$ is the number of agents, and $\gamma\in (0,1)$ is the discount factor. These results demonstrate linear speedup with respect to the number of agents $K$, despite heterogeneous Markovian trajectories under distinct transition kernels and coupled learning dynamics. To address these challenges, we develop a new perturbation analysis for the projected subspace updates and QR decomposition steps, together with conditional mixing arguments for heterogeneous Markovian noise. Furthermore, to handle the additional complications induced by policy updates and temporal dependence, we establish fine-grained characterizations of the discrepancies between function evaluations under Markovian sampling and under temporally frozen policies. Experiments instantiate the framework within PPO on federated \texttt{Hopper-v5} action-map heterogeneity, showing gains over Single PPO and FedAvg PPO and downstream transfer from the learned shared trunk.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a federated actor-critic framework in which agents collaboratively estimate a shared linear subspace representation while maintaining personalized local critic heads and policy (actor) components. Under single-timescale updates with Markovian sampling from heterogeneous transition kernels, the authors claim finite-time convergence via a novel joint linear approximation framework, with critic error converging at rate Õ(1/((1-γ)^4 √(TK))) and policy-gradient norm at Õ(1/((1-γ)^6 √(TK))), establishing linear speedup in K despite coupled dynamics and temporal dependence. The analysis relies on new perturbation bounds for projected subspace updates and QR steps together with conditional mixing arguments; experiments on federated Hopper-v5 with action-map heterogeneity show gains over Single PPO and FedAvg PPO.
Significance. If the claimed rates hold, the result would be significant for supplying the first finite-time guarantees for single-timescale federated actor-critic with partial sharing, directly addressing the tension between collaboration and personalization in heterogeneous RL. The joint linear approximation framework and the perturbation analysis for subspace projection under Markovian noise constitute a technical contribution that could be reused in other multi-agent settings. The linear speedup in K is a strong, practically relevant claim.
major comments (3)
- [§4] §4 (Finite-time analysis) and the joint linear approximation framework: the perturbation analysis for projected subspace updates and QR decomposition steps does not explicitly bound the additional drift term arising from concurrent single-timescale policy updates over the mixing horizon of each heterogeneous Markov chain. Because the actor evolves at the same rate as the critic and subspace estimate, this drift is not obviously absorbed into the stated Õ(1/((1-γ)^4 √(TK))) critic bound without an extra factor that would eliminate the claimed linear speedup in K. A sketch of the drift term in question appears after this list.
- [§3.2] §3.2 (Joint linear approximation) and conditional mixing arguments: the fine-grained characterizations of discrepancies between function evaluations under Markovian sampling and under temporally frozen policies assume that the policy remains sufficiently stable over the mixing window, yet the single-timescale coupled dynamics make this stability dependent on the very rates being proved; a circularity or missing induction step appears in the argument.
- [Theorem 1] Theorem 1 (critic convergence) and Theorem 2 (policy-gradient norm): the final bounds are stated to hold under the assumption that a single common linear subspace is expressive enough for the shared structure across heterogeneous environments, but no quantitative condition on the approximation error of this subspace (e.g., a uniform bound on the residual after projection) is provided that would guarantee the claimed rates remain valid when the personalization heads cannot fully compensate.
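For orientation, the drift term at issue in the first comment has, in the simplest accounting, the following shape; the step size α, gradient bound G, and mixing-window length τ are illustrative symbols, not the paper's notation, and the open question is whether the resulting perturbation enters the critic recursion with or without an extra K-dependent factor.

```latex
% Illustrative accounting of actor drift over one mixing window under single-timescale
% updates; \alpha, G, \tau are assumed symbols, not taken from the paper.
\|\theta^{k}_{t}-\theta^{k}_{t-\tau}\|
\;\le\; \sum_{i=t-\tau}^{t-1}\alpha\,\|g^{k}_{i}\|
\;\le\; \tau\,\alpha\,G
\;=\; \tilde{\mathcal{O}}\!\left(\tfrac{1}{\sqrt{TK}}\right)
\quad\text{when}\quad
\alpha \asymp \tfrac{1}{\sqrt{TK}},\qquad \tau \asymp \log(TK).
```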
minor comments (2)
- Notation for the common subspace dimension and the local head dimensions is introduced without a clear table or diagram; adding a schematic of the parameter decomposition would improve readability.
- The experimental section reports gains on Hopper-v5 but does not include ablation on the subspace dimension or on the number of local heads; these controls would strengthen the empirical support for the shared-representation hypothesis.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript to strengthen the analysis and clarify the technical arguments.
Point-by-point responses
- Referee: [§4] §4 (Finite-time analysis) and the joint linear approximation framework: the perturbation analysis for projected subspace updates and QR decomposition steps does not explicitly bound the additional drift term arising from concurrent single-timescale policy updates over the mixing horizon of each heterogeneous Markov chain. Because the actor evolves at the same rate as the critic and subspace estimate, this drift is not obviously absorbed into the stated Õ(1/((1-γ)^4 √(TK))) critic bound without an extra factor that would eliminate the claimed linear speedup in K.
Authors: We thank the referee for this observation. The joint linear approximation framework controls the policy-induced drift over the mixing horizon by using the fact that policy changes are O(1/√(TK)) per step (from the policy-gradient bound) and showing via a new perturbation lemma that this drift contributes only lower-order terms absorbed into the Õ notation. The linear speedup in K is preserved because the drift bound scales with the per-agent sample size. We have added an explicit drift lemma (Lemma 4.3 in the revision) that quantifies this term and confirms it does not introduce an extra factor destroying the 1/√K speedup. revision: yes
- Referee: [§3.2] §3.2 (Joint linear approximation) and conditional mixing arguments: the fine-grained characterizations of discrepancies between function evaluations under Markovian sampling and under temporally frozen policies assume that the policy remains sufficiently stable over the mixing window, yet the single-timescale coupled dynamics make this stability dependent on the very rates being proved; a circularity or missing induction step appears in the argument.
Authors: We agree that the original presentation left the stability argument implicit. The proof proceeds by a two-stage induction: first a coarse O(1) bound on policy variation over any fixed-length mixing window is established using only boundedness of the updates, and this coarse bound is then used to close the conditional mixing argument and obtain the fine-grained rate. The induction is made explicit in the revised §3.2 and the proof of the conditional mixing lemma, removing any circularity. revision: yes
- Referee: [Theorem 1] Theorem 1 (critic convergence) and Theorem 2 (policy-gradient norm): the final bounds are stated to hold under the assumption that a single common linear subspace is expressive enough for the shared structure across heterogeneous environments, but no quantitative condition on the approximation error of this subspace (e.g., a uniform bound on the residual after projection) is provided that would guarantee the claimed rates remain valid when the personalization heads cannot fully compensate.
Authors: The referee correctly identifies that a quantitative condition is required. We have added Assumption 3.4 stating that the uniform projection residual of the shared subspace is bounded by ε (with ε = o(1/((1-γ)^2 √(TK))) for the leading terms to dominate). Under this assumption the critic error bound becomes Õ(1/((1-γ)^4 √(TK)) + ε) and the policy-gradient bound is likewise adjusted; the linear speedup in K is retained when ε is sufficiently small relative to the personalization capacity. A brief discussion of how the assumption can be verified in practice has also been included. revision: yes
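Read literally, the adjusted guarantee described in this response has the following form; constants are suppressed and the exact statement of the added Assumption 3.4 is not reproduced here.

```latex
% Illustrative form of the epsilon-adjusted critic bound from the rebuttal; the
% policy-gradient bound is adjusted analogously.
\text{critic error}
\;=\;\tilde{\mathcal{O}}\!\left(\frac{1}{(1-\gamma)^{4}\sqrt{TK}}+\varepsilon\right),
\qquad
\varepsilon \;=\; o\!\left(\frac{1}{(1-\gamma)^{2}\sqrt{TK}}\right)
\ \Rightarrow\ \varepsilon \text{ is dominated by the leading } \tilde{\mathcal{O}}(1/\sqrt{TK}) \text{ term.}
```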
Circularity Check
Finite-time rates derived via perturbation analysis and mixing arguments; no reduction to fitted inputs or self-definitional steps
full rationale
The claimed convergence rates follow from a joint linear approximation framework whose core steps are perturbation bounds on subspace projections/QR steps plus conditional mixing for heterogeneous Markovian noise under single-timescale coupled dynamics. These are standard analytic techniques applied to the algorithm's update rules; they do not define the target quantities in terms of themselves, fit parameters to the final bounds, or rely on load-bearing self-citations whose validity is internal to the paper. The linear speedup in K emerges from the analysis rather than being presupposed by the inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: A common linear subspace exists that captures the shared structure of optimal policies across all agents despite distinct transition kernels.
- domain assumption: The heterogeneous Markov chains satisfy conditional mixing bounds that allow the perturbation analysis to control the discrepancy between Markovian and frozen-policy function evaluations.
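One standard way such a mixing condition is formalized is uniform ergodicity of each agent's chain under every admissible policy; this particular total-variation form, with constants m and ρ, is an assumption chosen here for illustration rather than necessarily the paper's condition.

```latex
% Illustrative uniform-ergodicity (mixing) condition per agent; m and \rho are
% assumed constants, uniform over agents k and admissible policies \pi.
\sup_{s}\ \bigl\|\,(P_{k}^{\pi})^{\tau}(s,\cdot)-\mu_{k}^{\pi}\,\bigr\|_{\mathrm{TV}}
\;\le\; m\,\rho^{\tau},
\qquad 0<\rho<1,\ \ \tau\ge 0 .
```

Under such a condition a window of length τ ≍ log(1/α) already makes the Markovian noise nearly conditionally unbiased, which is what the perturbation analysis needs.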
Reference graph
Works this paper leans on
- [1] R. K. Ando, T. Zhang, and P. Bartlett. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(11), 2005.
- [2]
- [3] J. Bhandari, D. Russo, and R. Singal. A finite time analysis of temporal difference learning with linear function approximation. In Conference on Learning Theory, pages 1691–1692. PMLR, 2018.
- [4] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.
- [5] T. Chen, Y. Sun, and W. Yin. Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. Advances in Neural Information Processing Systems, 34:25294–25307, 2021.
- [6] X. Chen and L. Zhao. Finite-time analysis of single-timescale actor-critic. Advances in Neural Information Processing Systems, 36, 2024.
- [7] X. Chen and L. Zhao. On the convergence of continuous single-timescale actor-critic. In Forty-second International Conference on Machine Learning, 2025.
- [8] L. Collins, H. Hassani, A. Mokhtari, and S. Shakkottai. Exploiting shared representations for personalized federated learning. In International Conference on Machine Learning, pages 2089–2099. PMLR, 2021.
- [9] C. Daskalakis, N. Golowich, and K. Zhang. The complexity of Markov equilibrium in stochastic games. In The Thirty Sixth Annual Conference on Learning Theory, pages 4180–4234. PMLR, 2023.
- [10]
- [11] S. S. Du, W. Hu, S. M. Kakade, J. D. Lee, and Q. Lei. Few-shot learning via learning the representation, provably. In International Conference on Learning Representations, 2021.
- [12] J. C. Duchi, V. Feldman, L. Hu, and K. Talwar. Subspace recovery from heterogeneous data with non-isotropic noise. Advances in Neural Information Processing Systems, 35:5854–5866, 2022.
- [13]
- [14] J. Filar and K. Vrieze. Competitive Markov Decision Processes. Springer Science & Business Media, 2012.
- [15]
- [16]
- [17] C. Jin, Z. Yang, Z. Wang, and M. I. Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
- [18] H. Jin, Y. Peng, W. Yang, S. Wang, and Z. Zhang. Federated reinforcement learning with environment heterogeneity. In International Conference on Artificial Intelligence and Statistics, pages 18–37. PMLR, 2022.
- [19] S. Khodadadian, P. Sharma, G. Joshi, and S. T. Maguluri. Federated reinforcement learning: Linear speedup under Markovian sampling. In International Conference on Machine Learning, pages 10997–11057. PMLR, 2022.
- [20] V. Konda and J. Tsitsiklis. Actor-critic algorithms. Advances in Neural Information Processing Systems, 12, 1999.
- [21]
- [22]
- [23] T. Li, S. Hu, A. Beirami, and V. Smith. Ditto: Fair and robust federated learning through personalization. In International Conference on Machine Learning, pages 6357–6368. PMLR, 2021.
- [24] C. McLaughlin and L. Su. Personalized federated learning via feature distribution adaptation. Advances in Neural Information Processing Systems, 37:77038–77059, 2024.
- [25] A. Mitra. A simple finite-time analysis of TD learning with linear function approximation. IEEE Transactions on Automatic Control, 70(2):1388–1394, 2024.
- [26] A. Y. Mitrophanov. Sensitivity and convergence of uniformly ergodic Markov chains. Journal of Applied Probability, 42(4):1003–1014, 2005.
- [27]
- [28] A. Olshevsky and B. Gharesifard. A small gain analysis of single timescale actor critic. SIAM Journal on Control and Optimization, 61(2):980–1007, 2023.
- [29] S. Qiu, Z. Yang, J. Ye, and Z. Wang. On finite-time convergence of actor-critic algorithm. IEEE Journal on Selected Areas in Information Theory, 2(2):652–664, 2021.
- [30] G. Qu, A. Wierman, and N. Li. Scalable reinforcement learning for multiagent networked systems. Operations Research, 70(6):3601–3628, 2022.
- [31] S. Salgia and Y. Chi. The sample-communication complexity trade-off in federated Q-learning. Advances in Neural Information Processing Systems, 37:39694–39747, 2025.
- [32] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [33] R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation and TD learning. In Conference on Learning Theory, pages 2803–2830. PMLR, 2019.
- [34] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
- [35] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.
- [36]
- [37] K. K. Thekumparampil, P. Jain, P. Netrapalli, and S. Oh. Statistically and computationally efficient linear meta-representation learning. Advances in Neural Information Processing Systems, 34:18487–18500, 2021.
- [38] Y. Tian, Y. Gu, and Y. Feng. Learning from similar linear representations: Adaptivity, minimaxity, and robustness. Journal of Machine Learning Research, 26(187):1–125, 2025.
- [39] N. Tripuraneni, C. Jin, and M. Jordan. Provable meta-learning of linear representations. In International Conference on Machine Learning, pages 10434–10443. PMLR, 2021.
- [40] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica, 35(11):1799–1808, 1999.
- [41] H. Wang, S. He, Z. Zhang, F. M. Miao, and J. Anderson. Momentum for the win: Collaborative federated reinforcement learning across heterogeneous environments. In Proceedings of the 41st International Conference on Machine Learning, pages 50530–50560, 2024.
- [42]
- [43]
- [44] M. Wang, P. Yang, and L. Su. On the convergence rates of federated Q-learning across heterogeneous environments. Transactions on Machine Learning Research, 2025.
- [45] J. Woo, G. Joshi, and Y. Chi. The blessing of heterogeneity in federated Q-learning: Linear speedup and beyond. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 37157–37216. PMLR, 2023.
- [46] Y. F. Wu, W. Zhang, P. Xu, and Q. Gu. A finite-time analysis of two time-scale actor-critic methods. Advances in Neural Information Processing Systems, 33:17617–17628, 2020.
- [47]
- [48]
- [49]
- [50] J. Xu, X. Tong, and S.-L. Huang. Personalized federated learning with feature alignment and classifier collaboration. In The Eleventh International Conference on Learning Representations, 2023.
- [51] T. Yang, S. Cen, Y. Wei, Y. Chen, and Y. Chi. Federated natural policy gradient and actor critic methods for multi-task reinforcement learning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
- [52] T. Yang, S. Cen, Y. Wei, Y. Chen, and Y. Chi. Federated natural policy gradient and actor critic methods for multi-task reinforcement learning. In Advances in Neural Information Processing Systems, volume 37, pages 121304–121375. Curran Associates, Inc., 2024.
- [53]
- [54]
- [55]
- [56]
- [57]
- [58] S. Zou, T. Xu, and Y. Liang. Finite-sample analysis for SARSA with linear function approximation. Advances in Neural Information Processing Systems, 32, 2019.
- [59] Work that studied federated natural policy-gradient and actor-critic methods for multi-task reinforcement learning, where agents collaborate to learn a shared policy under task heterogeneity.
- [60] More recent work that analyzed a single-loop federated actor-critic method for learning a shared policy across heterogeneous environments. Their "single-loop" terminology refers to preserving the critic across policy updates, while the actor is still updated after multiple critic communication rounds. [48] studied, in the setting of FRL, Proximal Policy Optimization (PPO) ...
- [61]