Recognition: no theorem link
Collaborative Yet Personalized Policy Training: Single-Timescale Federated Actor-Critic
Pith reviewed 2026-05-15 02:03 UTC · model grok-4.3
The pith
Agents share a linear subspace for collaboration while keeping personalized policies, yielding finite-time convergence guarantees with linear speedup in the number of agents under single-timescale Markovian updates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under canonical single-timescale updates with Markovian sampling, the federated actor-critic framework with a shared linear subspace and personalized local heads achieves finite-time convergence: the critic error converges to zero at rate Õ(1/((1-γ)^4 √(TK))) and the policy gradient norm at Õ(1/((1-γ)^6 √(TK))), delivering linear speedup in K despite heterogeneous transition kernels, distinct Markovian trajectories, and coupled policy-critic dynamics. The proof relies on a new joint linear approximation framework, perturbation analysis for projected subspace updates and QR steps, and conditional mixing arguments for heterogeneous noise, together with fine-grained bounds on the discrepancies between function evaluations under Markovian sampling and under temporally frozen policies.
What carries the argument
The joint linear approximation framework that tracks the coupled evolution of the shared subspace projection, local critic heads, and local actors, together with the perturbation analysis of the projected subspace updates and the conditional mixing bounds for heterogeneous Markovian noise.
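To make the moving parts concrete, here is a minimal sketch of one communication round of such a scheme, assuming a shared orthonormal subspace B, per-agent critic heads w_k, and per-agent actor parameters theta_k; the update rules, step sizes, and the synthetic sampling routine are illustrative placeholders, not the paper's algorithm.

```python
# Minimal sketch (assumed, not the paper's algorithm) of one round of federated
# actor-critic with a shared linear subspace B and personalized heads/actors.
import numpy as np

rng = np.random.default_rng(0)
K, d, r = 4, 16, 3                          # agents, feature dim, subspace rank
gamma = 0.99                                # discount factor
eta_B, eta_w, eta_th = 0.05, 0.1, 0.01      # single-timescale step sizes

B = np.linalg.qr(rng.normal(size=(d, r)))[0]     # shared subspace (orthonormal columns)
w = [rng.normal(size=r) for _ in range(K)]       # personalized critic heads
theta = [rng.normal(size=d) for _ in range(K)]   # personalized actor parameters


def sample_transition(k):
    """Hypothetical stand-in for agent k's Markovian sample: (phi(s), reward, phi(s'))."""
    phi, phi_next = rng.normal(size=d), rng.normal(size=d)
    return phi, 1.0 + 0.1 * k, phi_next          # per-agent reward shift is illustrative


def federated_round():
    global B
    B_updates = []
    for k in range(K):
        phi, reward, phi_next = sample_transition(k)
        # TD error of the subspace critic V_k(s) ~ phi(s)^T B w_k
        delta = reward + gamma * phi_next @ B @ w[k] - phi @ B @ w[k]
        w[k] += eta_w * delta * (B.T @ phi)                      # local head update
        B_updates.append(eta_B * delta * np.outer(phi, w[k]))    # local subspace direction
        theta[k] += eta_th * delta * phi                         # local actor step
    # Server: average subspace directions, then re-orthonormalize with a QR step
    B = np.linalg.qr(B + np.mean(B_updates, axis=0))[0]


for _ in range(100):
    federated_round()
print("shared subspace:", B.shape, "head norm (agent 0):", np.linalg.norm(w[0]))
```

The point of the sketch is only the shape of the coupling the analysis must handle: head, subspace, and actor all move at the same rate, and the server's QR step perturbs the subspace on which every other update depends.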
If this is right
- The critic error vanishes at the stated rate even under Markovian sampling and coupled updates.
- The policy gradient norm vanishes at its stated rate, implying convergence to stationary policies.
- Linear speedup in K is obtained without requiring identical environments or multiple timescales.
- The learned shared trunk supports downstream transfer to new tasks.
- Empirical gains appear over both single-agent PPO and standard FedAvg PPO on Hopper-v5 with action-map heterogeneity.
Where Pith is reading between the lines
- The subspace-sharing idea could be tested with nonlinear function approximators if the perturbation analysis can be extended beyond linear projections.
- Similar rates may hold for other single-timescale methods such as natural actor-critic once the same mixing and perturbation tools are applied.
- In practice the approach suggests that federated robotic fleets can share a low-dimensional feature trunk while each robot retains its own policy head, reducing communication while preserving adaptation.
Load-bearing premise
A single common linear subspace suffices to capture the shared structure across heterogeneous environments, with the remaining differences handled by local heads, and that the perturbation and mixing arguments continue to hold when policy updates are coupled to the subspace estimates.
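A worked statement of this premise, in notation consistent with the φ(s)ᵀBω form visible in the paper's excerpts; the rank r, the subspace B★, and the residual ε are symbols introduced here for illustration rather than the paper's definitions.

```latex
% Assumed shared-subspace critic parameterization and the expressiveness premise.
% Only the product form \phi(s)^\top B \omega^k is attested in the excerpts; the
% residual \varepsilon and its definition below are illustrative.
V^{\pi_k}(s) \;\approx\; \phi(s)^\top B\,\omega^{k},
\qquad B \in \mathbb{R}^{d\times r},\ \omega^{k}\in\mathbb{R}^{r},\ r \ll d,
\qquad
\varepsilon \;:=\; \max_{k}\,\inf_{\omega\in\mathbb{R}^{r}}\,\sup_{s}
\bigl|V^{\pi_k}(s)-\phi(s)^\top B^{\star}\omega\bigr| .
```

The premise is that ε is small for a single B★ across all agents, so the remaining heterogeneity is absorbed by the local heads ω^k rather than requiring per-agent subspaces.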
What would settle it
A controlled experiment, run with clearly heterogeneous transition kernels, in which the observed critic error or policy-gradient norm fails to shrink proportionally to 1/√K as the number of agents K grows with the per-agent round count T held fixed (equivalently, in which the error worsens as K grows with the total sample budget TK held fixed).
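A minimal sketch of such a test, assuming a training routine `run_federated_ac` (hypothetical name) that returns the final critic error; the synthetic error model inside it merely mimics the claimed rate and would be replaced by actual runs on heterogeneous MDPs.

```python
# Sketch of the speedup test: fix the per-agent round count T, grow K, and check
# whether the error shrinks roughly like 1/sqrt(K). All names here are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
T = 50_000  # per-agent rounds, held fixed across runs


def run_federated_ac(num_agents: int, rounds: int) -> float:
    """Hypothetical placeholder for a full training run returning the final critic error."""
    return (1.0 + 0.05 * rng.standard_normal()) / np.sqrt(rounds * num_agents)


agent_counts = np.array([1, 2, 4, 8, 16])
errors = np.array([run_federated_ac(k, T) for k in agent_counts])

# Linear speedup predicts log(error) ~ -0.5 * log(K) + const at fixed T;
# a fitted slope far from -0.5 under clear heterogeneity would challenge the claim.
slope = np.polyfit(np.log(agent_counts), np.log(errors), 1)[0]
print(f"fitted slope of log(error) vs log(K): {slope:.2f} (prediction: about -0.5)")
```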
Original abstract
Despite the popularity of the actor-critic method and the practical needs of collaborative policy training, existing works typically either overlook environmental heterogeneity or give up personalization altogether by training a single shared policy across all agents. We consider a federated actor-critic framework in which agents share a common linear subspace representation while maintaining personalized local policy components, and agents iteratively estimate the common subspace, local critic heads, and local policies (i.e., actors). Under canonical single-timescale updates with Markovian sampling, we establish finite-time convergence via a novel joint linear approximation framework. Specifically, we show that the critic error converges to zero at the rate of $\tilde{\mathcal{O}}(1/((1-\gamma)^4\sqrt{TK}))$, and the policy gradient norm converges to zero at the rate of $\tilde{\mathcal{O}}(1/((1-\gamma)^6\sqrt{TK}))$, where $T$ is the number of rounds, $K$ is the number of agents, and $\gamma\in (0,1)$ is the discount factor. These results demonstrate linear speedup with respect to the number of agents $K$, despite heterogeneous Markovian trajectories under distinct transition kernels and coupled learning dynamics. To address these challenges, we develop a new perturbation analysis for the projected subspace updates and QR decomposition steps, together with conditional mixing arguments for heterogeneous Markovian noise. Furthermore, to handle the additional complications induced by policy updates and temporal dependence, we establish fine-grained characterizations of the discrepancies between function evaluations under Markovian sampling and under temporally frozen policies. Experiments instantiate the framework within PPO on federated \texttt{Hopper-v5} action-map heterogeneity, showing gains over Single PPO and FedAvg PPO and downstream transfer from the learned shared trunk.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a federated actor-critic framework in which agents collaboratively estimate a shared linear subspace representation while maintaining personalized local critic heads and policy (actor) components. Under single-timescale updates with Markovian sampling from heterogeneous transition kernels, the authors claim finite-time convergence via a novel joint linear approximation framework, with critic error converging at rate Õ(1/((1-γ)^4 √(TK))) and policy-gradient norm at Õ(1/((1-γ)^6 √(TK))), establishing linear speedup in K despite coupled dynamics and temporal dependence. The analysis relies on new perturbation bounds for projected subspace updates and QR steps together with conditional mixing arguments; experiments on federated Hopper-v5 with action-map heterogeneity show gains over Single PPO and FedAvg PPO.
Significance. If the claimed rates hold, the result would be significant for supplying the first finite-time guarantees for single-timescale federated actor-critic with partial sharing, directly addressing the tension between collaboration and personalization in heterogeneous RL. The joint linear approximation framework and the perturbation analysis for subspace projection under Markovian noise constitute a technical contribution that could be reused in other multi-agent settings. The linear speedup in K is a strong, practically relevant claim.
major comments (3)
- [§4] §4 (Finite-time analysis) and the joint linear approximation framework: the perturbation analysis for projected subspace updates and QR decomposition steps does not explicitly bound the additional drift term arising from concurrent single-timescale policy updates over the mixing horizon of each heterogeneous Markov chain. Because the actor evolves at the same rate as the critic and subspace estimate, this drift is not obviously absorbed into the stated Õ(1/((1-γ)^4 √(TK))) critic bound without an extra factor that would eliminate the claimed linear speedup in K. A sketch of the drift term in question appears after this list.
- [§3.2] §3.2 (Joint linear approximation) and conditional mixing arguments: the fine-grained characterizations of discrepancies between function evaluations under Markovian sampling and under temporally frozen policies assume that the policy remains sufficiently stable over the mixing window, yet the single-timescale coupled dynamics make this stability dependent on the very rates being proved; a circularity or missing induction step appears in the argument.
- [Theorem 1] Theorem 1 (critic convergence) and Theorem 2 (policy-gradient norm): the final bounds are stated to hold under the assumption that a single common linear subspace is expressive enough for the shared structure across heterogeneous environments, but no quantitative condition on the approximation error of this subspace (e.g., a uniform bound on the residual after projection) is provided that would guarantee the claimed rates remain valid when the personalization heads cannot fully compensate.
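For orientation, the drift term at issue in the first comment has, in the simplest accounting, the following shape; the step size α, gradient bound G, and mixing-window length τ are illustrative symbols, not the paper's notation, and the open question is whether the resulting perturbation enters the critic recursion with or without an extra K-dependent factor.

```latex
% Illustrative accounting of actor drift over one mixing window under single-timescale
% updates; \alpha, G, \tau are assumed symbols, not taken from the paper.
\|\theta^{k}_{t}-\theta^{k}_{t-\tau}\|
\;\le\; \sum_{i=t-\tau}^{t-1}\alpha\,\|g^{k}_{i}\|
\;\le\; \tau\,\alpha\,G
\;=\; \tilde{\mathcal{O}}\!\left(\tfrac{1}{\sqrt{TK}}\right)
\quad\text{when}\quad
\alpha \asymp \tfrac{1}{\sqrt{TK}},\qquad \tau \asymp \log(TK).
```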
minor comments (2)
- Notation for the common subspace dimension and the local head dimensions is introduced without a clear table or diagram; adding a schematic of the parameter decomposition would improve readability.
- The experimental section reports gains on Hopper-v5 but does not include ablation on the subspace dimension or on the number of local heads; these controls would strengthen the empirical support for the shared-representation hypothesis.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment below and have revised the manuscript to strengthen the analysis and clarify the technical arguments.
Point-by-point responses
- Referee: [§4] §4 (Finite-time analysis) and the joint linear approximation framework: the perturbation analysis for projected subspace updates and QR decomposition steps does not explicitly bound the additional drift term arising from concurrent single-timescale policy updates over the mixing horizon of each heterogeneous Markov chain. Because the actor evolves at the same rate as the critic and subspace estimate, this drift is not obviously absorbed into the stated Õ(1/((1-γ)^4 √(TK))) critic bound without an extra factor that would eliminate the claimed linear speedup in K.
Authors: We thank the referee for this observation. The joint linear approximation framework controls the policy-induced drift over the mixing horizon by using the fact that policy changes are O(1/√(TK)) per step (from the policy-gradient bound) and showing via a new perturbation lemma that this drift contributes only lower-order terms absorbed into the Õ notation. The linear speedup in K is preserved because the drift bound scales with the per-agent sample size. We have added an explicit drift lemma (Lemma 4.3 in the revision) that quantifies this term and confirms it does not introduce an extra factor destroying the 1/√K speedup. revision: yes
- Referee: [§3.2] §3.2 (Joint linear approximation) and conditional mixing arguments: the fine-grained characterizations of discrepancies between function evaluations under Markovian sampling and under temporally frozen policies assume that the policy remains sufficiently stable over the mixing window, yet the single-timescale coupled dynamics make this stability dependent on the very rates being proved; a circularity or missing induction step appears in the argument.
Authors: We agree that the original presentation left the stability argument implicit. The proof proceeds by a two-stage induction: first a coarse O(1) bound on policy variation over any fixed-length mixing window is established using only boundedness of the updates, and this coarse bound is then used to close the conditional mixing argument and obtain the fine-grained rate. The induction is made explicit in the revised §3.2 and the proof of the conditional mixing lemma, removing any circularity. revision: yes
- Referee: [Theorem 1] Theorem 1 (critic convergence) and Theorem 2 (policy-gradient norm): the final bounds are stated to hold under the assumption that a single common linear subspace is expressive enough for the shared structure across heterogeneous environments, but no quantitative condition on the approximation error of this subspace (e.g., a uniform bound on the residual after projection) is provided that would guarantee the claimed rates remain valid when the personalization heads cannot fully compensate.
Authors: The referee correctly identifies that a quantitative condition is required. We have added Assumption 3.4 stating that the uniform projection residual of the shared subspace is bounded by ε (with ε = o(1/((1-γ)^2 √(TK))) for the leading terms to dominate). Under this assumption the critic error bound becomes Õ(1/((1-γ)^4 √(TK)) + ε) and the policy-gradient bound is likewise adjusted; the linear speedup in K is retained when ε is sufficiently small relative to the personalization capacity. A brief discussion of how the assumption can be verified in practice has also been included. revision: yes
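Read literally, the adjusted guarantee described in this response has the following form; constants are suppressed and the exact statement of the added Assumption 3.4 is not reproduced here.

```latex
% Illustrative form of the epsilon-adjusted critic bound from the rebuttal; the
% policy-gradient bound is adjusted analogously.
\text{critic error}
\;=\;\tilde{\mathcal{O}}\!\left(\frac{1}{(1-\gamma)^{4}\sqrt{TK}}+\varepsilon\right),
\qquad
\varepsilon \;=\; o\!\left(\frac{1}{(1-\gamma)^{2}\sqrt{TK}}\right)
\ \Rightarrow\ \varepsilon \text{ is dominated by the leading } \tilde{\mathcal{O}}(1/\sqrt{TK}) \text{ term.}
```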
Circularity Check
Finite-time rates derived via perturbation analysis and mixing arguments; no reduction to fitted inputs or self-definitional steps
full rationale
The claimed convergence rates follow from a joint linear approximation framework whose core steps are perturbation bounds on subspace projections/QR steps plus conditional mixing for heterogeneous Markovian noise under single-timescale coupled dynamics. These are standard analytic techniques applied to the algorithm's update rules; they do not define the target quantities in terms of themselves, fit parameters to the final bounds, or rely on load-bearing self-citations whose validity is internal to the paper. The linear speedup in K emerges from the analysis rather than being presupposed by the inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: A common linear subspace exists that captures the shared structure of optimal policies across all agents despite distinct transition kernels.
- domain assumption: The heterogeneous Markov chains satisfy conditional mixing bounds that allow the perturbation analysis to control the discrepancy between Markovian and frozen-policy function evaluations.
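One standard way such a mixing condition is formalized is uniform ergodicity of each agent's chain under every admissible policy; this particular total-variation form, with constants m and ρ, is an assumption chosen here for illustration rather than necessarily the paper's condition.

```latex
% Illustrative uniform-ergodicity (mixing) condition per agent; m and \rho are
% assumed constants, uniform over agents k and admissible policies \pi.
\sup_{s}\ \bigl\|\,(P_{k}^{\pi})^{\tau}(s,\cdot)-\mu_{k}^{\pi}\,\bigr\|_{\mathrm{TV}}
\;\le\; m\,\rho^{\tau},
\qquad 0<\rho<1,\ \ \tau\ge 0 .
```

Under such a condition a window of length τ ≍ log(1/α) already makes the Markovian noise nearly conditionally unbiased, which is what the perturbation analysis needs.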
Reference graph
Works this paper leans on
- [1] R. K. Ando, T. Zhang, and P. Bartlett. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(11), 2005.
- [2]
- [3] J. Bhandari, D. Russo, and R. Singal. A finite time analysis of temporal difference learning with linear function approximation. In Conference on Learning Theory, pages 1691–1692. PMLR, 2018.
- [4] R. Caruana. Multitask learning. Machine Learning, 28:41–75, 1997.
- [5] T. Chen, Y. Sun, and W. Yin. Closing the gap: Tighter analysis of alternating stochastic gradient methods for bilevel problems. Advances in Neural Information Processing Systems, 34:25294–25307, 2021.
- [6] X. Chen and L. Zhao. Finite-time analysis of single-timescale actor-critic. Advances in Neural Information Processing Systems, 36, 2024.
- [7] X. Chen and L. Zhao. On the convergence of continuous single-timescale actor-critic. In Forty-second International Conference on Machine Learning, 2025.
- [8] L. Collins, H. Hassani, A. Mokhtari, and S. Shakkottai. Exploiting shared representations for personalized federated learning. In International Conference on Machine Learning, pages 2089–2099. PMLR, 2021.
- [9] C. Daskalakis, N. Golowich, and K. Zhang. The complexity of Markov equilibrium in stochastic games. In The Thirty Sixth Annual Conference on Learning Theory, pages 4180–4234. PMLR, 2023.
- [10]
- [11] S. S. Du, W. Hu, S. M. Kakade, J. D. Lee, and Q. Lei. Few-shot learning via learning the representation, provably. In International Conference on Learning Representations, 2021.
- [12] J. C. Duchi, V. Feldman, L. Hu, and K. Talwar. Subspace recovery from heterogeneous data with non-isotropic noise. Advances in Neural Information Processing Systems, 35:5854–5866, 2022.
- [13]
- [14] J. Filar and K. Vrieze. Competitive Markov Decision Processes. Springer Science & Business Media, 2012.
- [15]
- [16]
- [17] C. Jin, Z. Yang, Z. Wang, and M. I. Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
- [18] H. Jin, Y. Peng, W. Yang, S. Wang, and Z. Zhang. Federated reinforcement learning with environment heterogeneity. In International Conference on Artificial Intelligence and Statistics, pages 18–37. PMLR, 2022.
- [19] S. Khodadadian, P. Sharma, G. Joshi, and S. T. Maguluri. Federated reinforcement learning: Linear speedup under Markovian sampling. In International Conference on Machine Learning, pages 10997–11057. PMLR, 2022.
- [20] V. Konda and J. Tsitsiklis. Actor-critic algorithms. Advances in Neural Information Processing Systems, 12, 1999.
- [21]
- [22]
- [23] T. Li, S. Hu, A. Beirami, and V. Smith. Ditto: Fair and robust federated learning through personalization. In International Conference on Machine Learning, pages 6357–6368. PMLR, 2021.
- [24] C. McLaughlin and L. Su. Personalized federated learning via feature distribution adaptation. Advances in Neural Information Processing Systems, 37:77038–77059, 2024.
- [25] A. Mitra. A simple finite-time analysis of TD learning with linear function approximation. IEEE Transactions on Automatic Control, 70(2):1388–1394, 2024.
- [26] A. Y. Mitrophanov. Sensitivity and convergence of uniformly ergodic Markov chains. Journal of Applied Probability, 42(4):1003–1014, 2005.
- [27]
- [28] A. Olshevsky and B. Gharesifard. A small gain analysis of single timescale actor critic. SIAM Journal on Control and Optimization, 61(2):980–1007, 2023.
- [29] S. Qiu, Z. Yang, J. Ye, and Z. Wang. On finite-time convergence of actor-critic algorithm. IEEE Journal on Selected Areas in Information Theory, 2(2):652–664, 2021.
- [30] G. Qu, A. Wierman, and N. Li. Scalable reinforcement learning for multiagent networked systems. Operations Research, 70(6):3601–3628, 2022.
- [31] S. Salgia and Y. Chi. The sample-communication complexity trade-off in federated Q-learning. Advances in Neural Information Processing Systems, 37:39694–39747, 2025.
- [32] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [33] R. Srikant and L. Ying. Finite-time error bounds for linear stochastic approximation and TD learning. In Conference on Learning Theory, pages 2803–2830. PMLR, 2019.
- [34] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.
- [35] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Advances in Neural Information Processing Systems, 12, 1999.
- [36]
- [37] K. K. Thekumparampil, P. Jain, P. Netrapalli, and S. Oh. Statistically and computationally efficient linear meta-representation learning. Advances in Neural Information Processing Systems, 34:18487–18500, 2021.
- [38] Y. Tian, Y. Gu, and Y. Feng. Learning from similar linear representations: Adaptivity, minimaxity, and robustness. Journal of Machine Learning Research, 26(187):1–125, 2025.
- [39] N. Tripuraneni, C. Jin, and M. Jordan. Provable meta-learning of linear representations. In International Conference on Machine Learning, pages 10434–10443. PMLR, 2021.
- [40] J. N. Tsitsiklis and B. Van Roy. Average cost temporal-difference learning. Automatica, 35(11):1799–1808, 1999.
- [41] H. Wang, S. He, Z. Zhang, F. M. Miao, and J. Anderson. Momentum for the win: Collaborative federated reinforcement learning across heterogeneous environments. In Proceedings of the 41st International Conference on Machine Learning, pages 50530–50560, 2024.
- [42]
- [43]
- [44] M. Wang, P. Yang, and L. Su. On the convergence rates of federated Q-learning across heterogeneous environments. Transactions on Machine Learning Research, 2025.
- [45] J. Woo, G. Joshi, and Y. Chi. The blessing of heterogeneity in federated Q-learning: Linear speedup and beyond. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 37157–37216. PMLR, 2023.
- [46] Y. F. Wu, W. Zhang, P. Xu, and Q. Gu. A finite-time analysis of two time-scale actor-critic methods. Advances in Neural Information Processing Systems, 33:17617–17628, 2020.
- [47]
- [48]
- [49]
- [50] J. Xu, X. Tong, and S.-L. Huang. Personalized federated learning with feature alignment and classifier collaboration. In The Eleventh International Conference on Learning Representations, 2023.
- [51] T. Yang, S. Cen, Y. Wei, Y. Chen, and Y. Chi. Federated natural policy gradient and actor critic methods for multi-task reinforcement learning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
- [52] T. Yang, S. Cen, Y. Wei, Y. Chen, and Y. Chi. Federated natural policy gradient and actor critic methods for multi-task reinforcement learning. In Advances in Neural Information Processing Systems, volume 37, pages 121304–121375. Curran Associates, Inc., 2024.
- [53]
- [54]
- [55]
- [56]
- [57]
- [58] S. Zou, T. Xu, and Y. Liang. Finite-sample analysis for SARSA with linear function approximation. Advances in Neural Information Processing Systems, 32, 2019.
- [59] Work that studied federated natural policy-gradient and actor-critic methods for multi-task reinforcement learning, where agents collaborate to learn a shared policy under task heterogeneity.
- [60] More recent work that analyzed a single-loop federated actor-critic method for learning a shared policy across heterogeneous environments. Their "single-loop" terminology refers to preserving the critic across policy updates, while the actor is still updated after multiple critic communication rounds. [48] studied, in the setting of FRL, Proximal Policy Optimization (PPO) ...
- [61]