Efficient Federated RLHF via Zeroth-Order Policy Optimization
Pith reviewed 2026-05-10 04:44 UTC · model grok-4.3
The pith
Par-S²ZPO achieves federated RLHF with sample complexity matching centralized methods while requiring fewer policy-update iterations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes an upper bound on the convergence rate of Partitioned Sign-based Stochastic Zeroth-order Policy Optimization (Par-S²ZPO) showing that it matches its centralized counterpart in sample complexity yet is strictly faster in the number of policy-update iterations required, while outperforming FedAvg-based RLHF on four MuJoCo reinforcement learning tasks.
What carries the argument
Par-S²ZPO partitions the policy optimization across agents and applies sign-based stochastic zeroth-order updates driven by binary perturbations, producing low-complexity federated updates.
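To make the mechanism concrete, here is a minimal sketch of a single sign-based zeroth-order step with binary (Rademacher) perturbations. This is an illustrative reconstruction, not the paper's exact algorithm: the function name `sign_zo_update` and the parameters `mu`, `lr`, and `num_perturb` are hypothetical, and the reward is treated as a black-box oracle.

```python
import numpy as np

def sign_zo_update(theta, reward_fn, mu=0.05, lr=0.01, num_perturb=8, rng=None):
    """One illustrative sign-based zeroth-order step (hypothetical sketch,
    not the paper's exact Par-S^2ZPO update).

    Each perturbation v is a binary (+1/-1) vector; the finite difference
    (R(theta + mu*v) - R(theta - mu*v)) / (2*mu) estimates the directional
    derivative of the reward along v, and only the sign of the aggregated
    estimate is applied, so the step costs one bit per coordinate.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    grad_est = np.zeros_like(theta)
    for _ in range(num_perturb):
        v = rng.choice([-1.0, 1.0], size=theta.shape)  # binary perturbation
        fd = (reward_fn(theta + mu * v) - reward_fn(theta - mu * v)) / (2 * mu)
        grad_est += fd * v
    grad_est /= num_perturb
    return theta + lr * np.sign(grad_est)  # reward ascent via sign step
```

Because only the sign of the aggregated estimate is used, each coordinate moves by at most the learning rate per step, which is what keeps the per-round communication and memory footprint low.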
If this is right
- Federated RLHF training becomes feasible on edge devices without large communication overhead.
- The same zeroth-order approach can be applied to other policy optimization problems that face communication constraints.
- Theoretical guarantees allow practitioners to predict iteration counts needed for convergence in distributed settings.
- Resource-constrained agents can participate in human-feedback loops while preserving privacy through local updates.
Where Pith is reading between the lines
- The method may generalize to other federated reinforcement learning settings beyond human feedback if the reward variance remains comparable.
- Testing the algorithm on tasks with greater data heterogeneity across agents would reveal whether the partitioning step scales robustly.
- Combining Par-S²ZPO with existing first-order federated methods could yield hybrid algorithms that trade off communication for faster early progress.
Load-bearing premise
Binary perturbation zeroth-order estimates must stay stable and unbiased under the non-stationary high-variance rewards of RLHF, and federated partitioning must not introduce heterogeneity that violates the derived convergence bounds.
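The federated partitioning in this premise can be sketched as follows. In this hypothetical sketch (names `federated_round`, `partitions`, and `agent_reward_fns` are illustrative, not from the paper), each agent perturbs only its own coordinate block with a binary vector, estimates a local finite difference, and uploads one sign bit per owned coordinate; the server concatenates the blocks.

```python
import numpy as np

def federated_round(theta, partitions, agent_reward_fns, mu=0.05, lr=0.01, rng=None):
    """One hypothetical federated round of partitioned sign-based ZO updates.

    Agent k owns the coordinate index block partitions[k]: it perturbs only
    that block with a binary vector, queries its local reward twice, and
    contributes sign bits for its own coordinates only.  Structure and names
    are illustrative assumptions, not the paper's implementation.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    update = np.zeros_like(theta)
    for idx, reward_fn in zip(partitions, agent_reward_fns):
        v = np.zeros_like(theta)
        v[idx] = rng.choice([-1.0, 1.0], size=len(idx))  # perturb own block only
        fd = (reward_fn(theta + mu * v) - reward_fn(theta - mu * v)) / (2 * mu)
        update[idx] = np.sign(fd * v[idx])  # one bit per owned coordinate
    return theta + lr * update  # server concatenates the partitioned blocks
```

Under this structure, heterogeneity enters through the per-agent reward oracles: if their local estimates disagree systematically across blocks, the concatenated sign update can drift from the centralized direction, which is exactly the failure mode the premise rules out.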
What would settle it
A direct comparison on the MuJoCo tasks showing either higher total samples needed than the centralized bound or no reduction in required policy updates relative to FedAvg would falsify the efficiency claims.
Original abstract
This paper considers reinforcement learning from human feedback in a federated learning setting with resource-constrained agents, such as edge devices. We propose an efficient federated RLHF algorithm, named Partitioned, Sign-based Stochastic Zeroth-order Policy Optimization (Par-S$^2$ZPO). The algorithm is built on zeroth-order optimization with binary perturbation, resulting in low communication, computation, and memory complexity by design. Our theoretical analysis establishes an upper bound on the convergence rate of Par-S$^2$ZPO, revealing that it is as efficient as its centralized counterpart in terms of sample complexity but converges faster in terms of policy update iterations. Our experimental results show that it outperforms a FedAvg-based RLHF on four MuJoCo RL tasks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Partitioned, Sign-based Stochastic Zeroth-order Policy Optimization (Par-S²ZPO) for federated RLHF with resource-constrained agents. Built on binary-perturbation zeroth-order optimization, it claims low communication/computation/memory costs. Theoretical analysis gives an upper bound on convergence rate showing sample-complexity equivalence to the centralized case but faster policy-update iterations. Experiments report outperformance versus FedAvg-based RLHF on four MuJoCo tasks.
Significance. If the convergence bound holds under RLHF reward dynamics and the empirical gains are reproducible, the work could enable practical federated RLHF on edge devices by avoiding first-order gradients while preserving sample efficiency.
major comments (2)
- [Theoretical Analysis] Theoretical Analysis section: the stated upper bound on convergence rate (matching centralized sample complexity) rests on the binary-perturbation zeroth-order estimator remaining sufficiently unbiased with controlled variance; no explicit bias term, variance bound, or proof sketch is supplied to confirm this under the non-stationary, high-variance human-feedback reward model, which is load-bearing for the sample-complexity claim.
- [Experiments] Experiments section: the claim of outperformance on four MuJoCo tasks lacks reported details on number of agents, hyper-parameters, number of runs, variance estimates, or statistical tests, preventing verification of the iteration-speed advantage over FedAvg.
minor comments (1)
- [Abstract] Abstract and notation: ensure the definition of Par-S²ZPO and the partitioning scheme are introduced before the theoretical claims.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly.
Point-by-point responses
Referee: [Theoretical Analysis] Theoretical Analysis section: the stated upper bound on convergence rate (matching centralized sample complexity) rests on the binary-perturbation zeroth-order estimator remaining sufficiently unbiased with controlled variance; no explicit bias term, variance bound, or proof sketch is supplied to confirm this under the non-stationary, high-variance human-feedback reward model, which is load-bearing for the sample-complexity claim.
Authors: We appreciate the referee highlighting the need for explicit justification. Our bound relies on standard unbiasedness and variance properties of binary-perturbation ZO estimators under Lipschitz smoothness, treating the reward as fixed within each policy update (standard in RLHF). To address the non-stationarity concern directly, we will add an explicit bias-variance decomposition and a concise proof sketch in the appendix of the revised manuscript. revision: yes
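The unbiasedness property the authors invoke can be checked numerically on a stand-in smooth reward. This demo is an illustrative assumption, not taken from the paper: for a quadratic objective, the two-point binary-perturbation estimator is exactly unbiased (no smoothing bias survives), so its Monte Carlo average converges to the true gradient because E[vvᵀ] = I for Rademacher v.

```python
import numpy as np

# Illustrative check (not from the paper): for a smooth quadratic stand-in
# reward f, the binary-perturbation two-point estimator
#   (f(x + mu*v) - f(x - mu*v)) / (2*mu) * v,  v ~ Rademacher,
# averages to the true gradient, since E[v v^T] = I.
rng = np.random.default_rng(1)
A = np.diag([1.0, 2.0, 3.0])
b = np.array([0.5, -1.0, 0.25])
f = lambda x: 0.5 * x @ A @ x + b @ x          # hypothetical smooth reward
theta = np.array([1.0, -0.5, 2.0])
true_grad = A @ theta + b                       # analytic gradient at theta
mu, n = 0.1, 20000
est = np.zeros(3)
for _ in range(n):
    v = rng.choice([-1.0, 1.0], size=3)         # binary perturbation
    est += (f(theta + mu * v) - f(theta - mu * v)) / (2 * mu) * v
est /= n                                        # Monte Carlo average
# est matches true_grad up to O(1/sqrt(n)) sampling noise; for a quadratic
# there is no higher-order smoothing bias in the two-point difference.
```

The non-stationarity concern is precisely that RLHF rewards are not a fixed smooth f within a round, so this exact-unbiasedness picture is the best case the revised bias-variance decomposition would need to relax.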
Referee: [Experiments] Experiments section: the claim of outperformance on four MuJoCo tasks lacks reported details on number of agents, hyper-parameters, number of runs, variance estimates, or statistical tests, preventing verification of the iteration-speed advantage over FedAvg.
Authors: We agree that additional details are required for reproducibility. The revised manuscript will report the number of agents, complete hyper-parameter settings, number of independent runs with variance estimates, and statistical tests supporting the performance differences versus FedAvg. revision: yes
Circularity Check
No circularity: derivation chain is self-contained and independent of its inputs
Full rationale
The paper proposes the new Par-S²ZPO algorithm via binary-perturbation zeroth-order optimization and states a convergence-rate upper bound that matches centralized sample complexity while improving iteration count. No equations, fitted parameters, or self-citations are shown that reduce the bound or performance claims to a tautology or re-labeling of the inputs. The analysis rests on standard ZO smoothness/variance assumptions applied to the federated RLHF setting rather than on any self-definitional or load-bearing self-citation step. Experimental comparison to FedAvg is empirical and external to the derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard assumptions for convergence analysis in stochastic zeroth-order optimization and RL policy gradients.
Reference graph
Works this paper leans on
- [1] S. Arora, N. Cohen, N. Golowich, and W. Hu. A convergence analysis of gradient descent for deep linear neural networks. In International Conference on Learning Representations, 2019.
- [2] V. Bengs, R. Busa-Fekete, A. E. Mesaoudi-Paul, and E. Hüllermeier. Preference-based online learning with dueling bandits: A survey. Journal of Machine Learning Research, 22(7):1–108, 2021.
- [3] L. Bottou, F. E. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
- [4] J. Chen, D. Zhou, Y. Tang, Z. Yang, Y. Cao, and Q. Gu. Closing the generalization gap of adaptive gradient methods in training deep neural networks. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence (IJCAI-20), pages 3267–3275, 2020.
- [5] P. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems, volume 30, 2017.
- [6]
- [7] S. Kim, H.-B. Shin, and S.-W. Lee. Aligning humans and robots via reinforcement learning from implicit human feedback, 2025.
- [8] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [9] S. Liu, P.-Y. Chen, X. Chen, and M. Hong. signSGD via zeroth-order oracle. In International Conference on Learning Representations, 2019.
- [10] S. Liu, B. Kailkhura, P.-Y. Chen, P. Ting, S. Chang, and L. Amini. Zeroth-order stochastic variance reduction for nonconvex optimization. In Advances in Neural Information Processing Systems, volume 31, 2018.
- [11] Y. Liu, X. Zhang, Y. Kang, L. Li, T. Chen, M. Hong, and Q. Yang. FedBCD: A communication-efficient collaborative learning framework for distributed features. IEEE Transactions on Signal Processing, 70:4277–4290, 2022.
- [12] B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas. Communication-efficient learning of deep networks from decentralized data. In Int. Conf. Artificial Intelligence and Statistics (AISTATS), pages 1273–1282, 2017.
- [13] Y. Nesterov and V. Spokoiny. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, 17(2):527–566, 2017.
- [14] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. F. Christiano, J. Leike, and R. Lowe. Training language models to follow instructions with human feedback. In Advances in Neural Information Processing Systems, 2022.
- [15] M. Pirotta, M. Restelli, and L. Bascetta. Policy gradient in Lipschitz Markov decision processes. Machine Learning, 100(2–3):255–283, 2015.
- [16] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. In Advances in Neural Information Processing Systems, volume 36, pages 53728–53741, 2023.
- [17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms, 2017.
- [18] M. Towers, A. Kwiatkowski, J. Terry, J. U. Balis, G. De Cola, T. Deleu, M. Goulão, A. Kallinteris, M. Krimmel, A. KG, et al. Gymnasium: A standard interface for reinforcement learning environments. arXiv preprint arXiv:2407.17032, 2024.
- [19] P. Vepakomma, O. Gupta, T. Swedish, and R. Raskar. Split learning for health: Distributed deep learning without sharing raw patient data. arXiv preprint arXiv:1812.00564, 2018.
- [20] L. Wang, Q. Cai, Z. Yang, and Z. Wang. Neural policy gradient methods: Global optimality and rates of convergence. In International Conference on Learning Representations, 2020.
- [21] Q. Zhang and L. Ying. Provable reinforcement learning from human feedback with an unknown link function. arXiv preprint arXiv:2506.03066, 2025.
- [22] Q. Zhang and L. Ying. Zeroth-order policy gradient for reinforcement learning from human feedback without reward inference. In The Thirteenth International Conference on Learning Representations, 2025.