A Communication-Efficient Multi-Agent Actor-Critic Algorithm for Distributed Reinforcement Learning

Ji Liu; Kaiqing Zhang; Romeil Sandhu; Tamer Ba\c{s}ar; Yixuan Lin; Zhaoran Wang; Zhuoran Yang

arxiv: 1907.03053 · v1 · pith:MRQ3VIWNnew · submitted 2019-07-06 · 💻 cs.LG · cs.MA· stat.ML

A Communication-Efficient Multi-Agent Actor-Critic Algorithm for Distributed Reinforcement Learning

Yixuan Lin , Kaiqing Zhang , Zhuoran Yang , Zhaoran Wang , Tamer Ba\c{s}ar , Romeil Sandhu , Ji Liu This is my paper

Pith reviewed 2026-05-25 01:48 UTC · model grok-4.3

classification 💻 cs.LG cs.MAstat.ML

keywords distributed reinforcement learningmulti-agent actor-criticcommunication efficiencydirected graphscooperative policy learningstrongly connected networks

0 comments

The pith

A multi-agent actor-critic algorithm solves distributed reinforcement learning on directed graphs by transmitting only two scalars per step.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Agents connected over a directed graph aim to cooperatively maximize the average return of a shared policy by exchanging information only with immediate neighbors. The paper introduces a randomized multi-agent actor-critic method that performs local policy and value updates while restricting each transmission to two scalar values. The central result states that this procedure reaches the globally optimal solution whenever the graph is strongly connected. This matters for settings where agents must learn together yet cannot afford to share full trajectories or parameter vectors at every round.

Core claim

The authors present a randomized communication-efficient multi-agent actor-critic algorithm that enables a network of agents to maximize the globally averaged return through local communications on a directed graph. They demonstrate that the algorithm solves the distributed reinforcement learning problem for strongly connected graphs with each agent transmitting only two scalar-valued variables at each time step.

What carries the argument

randomized communication-efficient multi-agent actor-critic algorithm that performs local actor and critic updates while exchanging only two scalar values per transmission to maintain consensus across the directed graph.

If this is right

Each agent needs to send only two scalar values at each communication round while still achieving global optimality.
The method works for unidirectional communication links represented by a directed graph.
Convergence holds as long as the graph remains strongly connected.
The approach applies to any finite set of agents that can exchange messages with their out-neighbors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The two-scalar restriction may allow the same structure to be used in other distributed policy optimization settings that require consensus.
Randomization in the communication schedule could be examined for its effect on convergence speed when message delays occur.
The minimal communication load suggests the algorithm could run on hardware with strict bandwidth limits, such as embedded multi-robot teams.

Load-bearing premise

The underlying communication network forms a strongly connected directed graph.

What would settle it

Execute the algorithm on a directed graph that is not strongly connected and check whether the agents still converge to the policy that maximizes the globally averaged return.

read the original abstract

This paper considers a distributed reinforcement learning problem in which a network of multiple agents aim to cooperatively maximize the globally averaged return through communication with only local neighbors. A randomized communication-efficient multi-agent actor-critic algorithm is proposed for possibly unidirectional communication relationships depicted by a directed graph. It is shown that the algorithm can solve the problem for strongly connected graphs by allowing each agent to transmit only two scalar-valued variables at one time.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a multi-agent actor-critic variant that runs on directed graphs while sending only two scalars per step, using standard consensus tools.

read the letter

The core claim is a randomized actor-critic method for cooperative RL where agents on a strongly connected directed graph exchange just two scalar values at each step to reach consensus. This targets real bandwidth limits in sensor nets or robot teams, where full gradients or value functions would be too expensive to share. The work applies push-sum style updates inside the actor-critic loop in a randomized fashion, which is a direct but useful extension of earlier multi-agent RL papers that mostly assumed bidirectional links or heavier messages. If the analysis goes through, it shows the global average return can still be maximized without central coordination. The practical angle is the explicit two-scalar bound, which is tighter than most prior distributed RL schemes. The limitation is that the communication trick itself rests on well-known consensus results rather than new theory, so the advance is mainly in the integration and the directed-graph setting. The abstract gives no proof sketch or rate, which leaves open questions about sample complexity and how the randomization affects variance. The strong-connectivity assumption is standard and not hidden. This is worth a referee for groups working on distributed RL under communication constraints; the problem is concrete and the proposal is grounded enough to merit checking the full analysis. I would send it to review rather than desk-reject.

Referee Report

0 major / 2 minor

Summary. The manuscript proposes a randomized multi-agent actor-critic algorithm for distributed reinforcement learning in which agents connected by a directed graph cooperatively maximize the globally averaged return. The central claim is that the algorithm solves this problem on strongly connected graphs while requiring each agent to transmit only two scalar-valued variables per time step.

Significance. If the convergence result holds under the stated graph condition, the work would contribute a communication-efficient protocol for distributed RL by adapting standard push-sum consensus techniques to the actor-critic setting. This addresses a practical bottleneck in multi-agent systems with unidirectional links and provides a concrete performance guarantee tied to strong connectivity.

minor comments (2)

The abstract asserts that the algorithm 'solves the problem' but does not list the precise assumptions (e.g., on step-size schedules, reward boundedness, or function approximation) required for the claimed convergence; these should be stated explicitly in the introduction or theorem statements.
Notation for the two transmitted scalars and the push-sum style updates should be introduced with a clear table or diagram in the algorithm section to improve readability for readers unfamiliar with consensus methods.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading and positive evaluation of our manuscript. The referee's summary correctly identifies the key contribution: a randomized multi-agent actor-critic method that achieves convergence on strongly connected directed graphs while transmitting only two scalars per agent per step. We are pleased that the work is viewed as addressing a practical bottleneck in multi-agent RL with unidirectional communication.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces a new randomized multi-agent actor-critic algorithm for distributed RL and proves it solves the cooperative maximization problem on strongly connected directed graphs while transmitting only two scalars per step. This is a constructive algorithmic contribution whose central result follows from the algorithm definition plus standard consensus analysis under the explicit graph assumption; no step reduces a claimed prediction or uniqueness result to a fitted input, self-citation chain, or definitional tautology. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract provides no explicit free parameters, invented entities, or non-standard axioms; standard MDP assumptions are implicitly required but not detailed.

axioms (1)

domain assumption The underlying environment is a Markov decision process.
Standard background assumption for any actor-critic RL paper; invoked implicitly by the problem statement.

pith-pipeline@v0.9.0 · 5614 in / 1035 out tokens · 21366 ms · 2026-05-25T01:48:46.230545+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

32 extracted references · 5 canonical work pages · 1 internal anchor

[1]

Mateos, J.A

G. Mateos, J.A. Bazerque, and G.B. Giannakis. Distribut ed sparse linear regression. IEEE Transactions on Signal Processing , 52(10):5262–5276, 2010. 12

2010
[2]

Kolla, K

R.K. Kolla, K. Jagannathan, and A. Gopalan. Collaborati ve learning of stochastic bandits over a social network. IEEE Transactions on Networking , 26(4):1782–1795, 2018

2018
[3]

Zhang, Z

K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Ba¸ sar. Fully de centralized multi-agent reinforcement learning with networked agents. In International Conference on Machine Learning, pages 5872–5881, 2018

2018
[4]

Jiang, A

Z. Jiang, A. Balu, C. Hegde, and S. Sarkar. Collaborative deep learning in ﬁxed topology networks. In Advances in Neural Information Processing Systems , pages 5904–5914, 2017

2017
[5]

Boutilier

C. Boutilier. Planning, learning and coordination in mu lti-agent decision processes. In Conference on Theoretical Aspects of Rationality and Knowled ge, pages 195–210, 1996

1996
[6]

Lauer and M

M. Lauer and M. Riedmiller. An algorithm for distributed reinforcement learning in co- operative multi-agent systems. In International Conference on Machine Learning , pages 535–542, 2000

2000
[7]

M.L. Littman. Value-function reinforcement learning i n Markov games. Cognitive Systems Research, 2(1):55–66, 2001

2001
[8]

Wang and T

X. Wang and T. Sandholm. Reinforcement learning to play a n optimal Nash equilibrium in team Markov games. In Advances in Neural Information Processing Systems , pages 1603–1610, 2003

2003
[9]

Kar, J.M

S. Kar, J.M. Moura, and H.V. Poor. QD-learning: A collabo rative distributed strategy for multi-agent reinforcement learning through consensus + in novations. IEEE Transactions on Signal Processing , 61(7):1848–1862, 2013

2013
[10]

Zhang, Z

K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Ba¸ sar. Finite -sample analyses for fully decentralized multi-agent reinforcement learning. arXiv preprint arXiv:1812.02783 , 2018

work page arXiv 2018
[11]

D. Lee, H. Yoon, V. Cichella, and N. Hovakimyan. Stochas tic primal-dual algorithm for distributed gradient temporal diﬀerence learning. arXiv preprint arXiv:1805.07918 , 2018

work page arXiv 2018
[12]

Finite-Time Analysis of Distributed TD(0) with Linear Function Approximation for Multi-Agent Reinforcement Learning

T.T. Doan, S.T. Maguluri, and J. Romberg. Convergence r ates of distributed TD (0) with linear function approximation for multi-agent reinfo rcement learning. arXiv preprint arXiv:1902.07393, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902
[13]

Hu and M.P

J. Hu and M.P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(11):1039–1069, 2003

2003
[14]

Foerster, Y.M

J. Foerster, Y.M. Assael, N. Freitas, and S. Whiteson. L earning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016

2016
[15]

R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordat ch. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275 , 2017

work page arXiv 2017
[16]

Omidshaﬁei, J

S. Omidshaﬁei, J. Pazis, C. Amato, J. P. How, and J. Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observa bility. In International Conference on Machine Learning , pages 2681–2690, 2017

2017
[17]

Zhang, Z

K. Zhang, Z. Yang, and T. Ba¸ sar. Networked multi-agent reinforcement learning in contin- uous spaces. In Proceedings of IEEE Conference on Decision and Control, pages 2771–2776. IEEE, 2018

2018
[18]

Shoham, R

Y. Shoham, R. Powers, and T. Grenager. Multi-agent rein forcement learning: A critical survey. Technical Report, 2003. 13

2003
[19]

Kempe, A

D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computa tion of aggregate information. In Proceedings of the 44th IEEE Symposium on Foundations of Compu ter Science , pages 482–491, 2003

2003
[20]

T. Chen, K. Zhang, G. B. Giannakis, and T. Ba¸ sar. Commun ication-eﬃcient distributed reinforcement learning. arXiv preprint arXiv:1812.03239 , 2018

work page arXiv 2018
[21]

Konda and J.N

V.R. Konda and J.N. Tsitsiklis. Actor-critic algorith ms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000

2000
[22]

Bhatnagar, R

S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Nat ural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009

2009
[23]

X. Gao, J. Liu, and T. Ba¸ sar. Stochastic communication -eﬃcient distributed algorithms for solving linear algebraic equations. In Proceedings of the IEEE Multi-Conference on Systems and Control , pages 380–385, 2016

2016
[24]

Fullmer, L

D. Fullmer, L. Wang, and A.S. Morse. On the distributed c omputation of a common ﬁxed point of a family of paracontractions. In Proceedings of IEEE Conference on Decision and Control, pages 2289–2293, 2017

2017
[25]

L. Xiao, S. Boyd, and S. Lall. A scheme for robust distrib uted sensor fusion based on average consensus. In International Symposium on Information Processing in Sensor Networks, pages 63–70, 2005

2005
[26]

S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE/ACM Transactions on Networking , 14(SI):2508–2530, 2006

2006
[27]

Aysal, M.E

T.C. Aysal, M.E. Yildiz, A.D. Sarwate, and A. Scaglione . Broadcast gossip algorithms for consensus. IEEE Transactions on Signal Processing , 57(7):2748–2761, 2009

2009
[28]

Kempe, A

D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computa tion of aggregate information. In Proceedings of IEEE Symposium on Foundations of Computer Scie nce, pages 482–491, 2003

2003
[29]

Nedi´ c and A

A. Nedi´ c and A. Olshevsky. Distributed optimization o ver time-varying directed graphs. IEEE Transactions on Automatic Control , 60(3):601–615, 2015

2015
[30]

Rezaienia, B

P. Rezaienia, B. Gharesifard, T. Linder, and B. Touri. C onvergence rate of push-sum algorithms on random graphs. In Proceedings of IEEE Conference on Decision and Control, pages 4218–4223, 2018

2018
[31]

Liu and A.S

J. Liu and A.S. Morse. Asynchronous distributed averag ing using double linear iterations. In Proceedings of the American Control Conference , pages 6620–6625, 2012

2012
[32]

Hadjicostis and T

C.N. Hadjicostis and T. Charalambous. Average consens us in the presence of delays in directed graph topologies. IEEE Transactions on Automatic Control , 59(3):763–768, 2014. 14

2014

[1] [1]

Mateos, J.A

G. Mateos, J.A. Bazerque, and G.B. Giannakis. Distribut ed sparse linear regression. IEEE Transactions on Signal Processing , 52(10):5262–5276, 2010. 12

2010

[2] [2]

Kolla, K

R.K. Kolla, K. Jagannathan, and A. Gopalan. Collaborati ve learning of stochastic bandits over a social network. IEEE Transactions on Networking , 26(4):1782–1795, 2018

2018

[3] [3]

Zhang, Z

K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Ba¸ sar. Fully de centralized multi-agent reinforcement learning with networked agents. In International Conference on Machine Learning, pages 5872–5881, 2018

2018

[4] [4]

Jiang, A

Z. Jiang, A. Balu, C. Hegde, and S. Sarkar. Collaborative deep learning in ﬁxed topology networks. In Advances in Neural Information Processing Systems , pages 5904–5914, 2017

2017

[5] [5]

Boutilier

C. Boutilier. Planning, learning and coordination in mu lti-agent decision processes. In Conference on Theoretical Aspects of Rationality and Knowled ge, pages 195–210, 1996

1996

[6] [6]

Lauer and M

M. Lauer and M. Riedmiller. An algorithm for distributed reinforcement learning in co- operative multi-agent systems. In International Conference on Machine Learning , pages 535–542, 2000

2000

[7] [7]

M.L. Littman. Value-function reinforcement learning i n Markov games. Cognitive Systems Research, 2(1):55–66, 2001

2001

[8] [8]

Wang and T

X. Wang and T. Sandholm. Reinforcement learning to play a n optimal Nash equilibrium in team Markov games. In Advances in Neural Information Processing Systems , pages 1603–1610, 2003

2003

[9] [9]

Kar, J.M

S. Kar, J.M. Moura, and H.V. Poor. QD-learning: A collabo rative distributed strategy for multi-agent reinforcement learning through consensus + in novations. IEEE Transactions on Signal Processing , 61(7):1848–1862, 2013

2013

[10] [10]

Zhang, Z

K. Zhang, Z. Yang, H. Liu, T. Zhang, and T. Ba¸ sar. Finite -sample analyses for fully decentralized multi-agent reinforcement learning. arXiv preprint arXiv:1812.02783 , 2018

work page arXiv 2018

[11] [11]

D. Lee, H. Yoon, V. Cichella, and N. Hovakimyan. Stochas tic primal-dual algorithm for distributed gradient temporal diﬀerence learning. arXiv preprint arXiv:1805.07918 , 2018

work page arXiv 2018

[12] [12]

Finite-Time Analysis of Distributed TD(0) with Linear Function Approximation for Multi-Agent Reinforcement Learning

T.T. Doan, S.T. Maguluri, and J. Romberg. Convergence r ates of distributed TD (0) with linear function approximation for multi-agent reinfo rcement learning. arXiv preprint arXiv:1902.07393, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1902

[13] [13]

Hu and M.P

J. Hu and M.P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research, 4(11):1039–1069, 2003

2003

[14] [14]

Foerster, Y.M

J. Foerster, Y.M. Assael, N. Freitas, and S. Whiteson. L earning to communicate with deep multi-agent reinforcement learning. In Advances in Neural Information Processing Systems, pages 2137–2145, 2016

2016

[15] [15]

R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordat ch. Multi-agent actor-critic for mixed cooperative-competitive environments. arXiv preprint arXiv:1706.02275 , 2017

work page arXiv 2017

[16] [16]

Omidshaﬁei, J

S. Omidshaﬁei, J. Pazis, C. Amato, J. P. How, and J. Vian. Deep decentralized multi-task multi-agent reinforcement learning under partial observa bility. In International Conference on Machine Learning , pages 2681–2690, 2017

2017

[17] [17]

Zhang, Z

K. Zhang, Z. Yang, and T. Ba¸ sar. Networked multi-agent reinforcement learning in contin- uous spaces. In Proceedings of IEEE Conference on Decision and Control, pages 2771–2776. IEEE, 2018

2018

[18] [18]

Shoham, R

Y. Shoham, R. Powers, and T. Grenager. Multi-agent rein forcement learning: A critical survey. Technical Report, 2003. 13

2003

[19] [19]

Kempe, A

D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computa tion of aggregate information. In Proceedings of the 44th IEEE Symposium on Foundations of Compu ter Science , pages 482–491, 2003

2003

[20] [20]

T. Chen, K. Zhang, G. B. Giannakis, and T. Ba¸ sar. Commun ication-eﬃcient distributed reinforcement learning. arXiv preprint arXiv:1812.03239 , 2018

work page arXiv 2018

[21] [21]

Konda and J.N

V.R. Konda and J.N. Tsitsiklis. Actor-critic algorith ms. In Advances in Neural Information Processing Systems, pages 1008–1014, 2000

2000

[22] [22]

Bhatnagar, R

S. Bhatnagar, R. Sutton, M. Ghavamzadeh, and M. Lee. Nat ural actor-critic algorithms. Automatica, 45(11):2471–2482, 2009

2009

[23] [23]

X. Gao, J. Liu, and T. Ba¸ sar. Stochastic communication -eﬃcient distributed algorithms for solving linear algebraic equations. In Proceedings of the IEEE Multi-Conference on Systems and Control , pages 380–385, 2016

2016

[24] [24]

Fullmer, L

D. Fullmer, L. Wang, and A.S. Morse. On the distributed c omputation of a common ﬁxed point of a family of paracontractions. In Proceedings of IEEE Conference on Decision and Control, pages 2289–2293, 2017

2017

[25] [25]

L. Xiao, S. Boyd, and S. Lall. A scheme for robust distrib uted sensor fusion based on average consensus. In International Symposium on Information Processing in Sensor Networks, pages 63–70, 2005

2005

[26] [26]

S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE/ACM Transactions on Networking , 14(SI):2508–2530, 2006

2006

[27] [27]

Aysal, M.E

T.C. Aysal, M.E. Yildiz, A.D. Sarwate, and A. Scaglione . Broadcast gossip algorithms for consensus. IEEE Transactions on Signal Processing , 57(7):2748–2761, 2009

2009

[28] [28]

Kempe, A

D. Kempe, A. Dobra, and J. Gehrke. Gossip-based computa tion of aggregate information. In Proceedings of IEEE Symposium on Foundations of Computer Scie nce, pages 482–491, 2003

2003

[29] [29]

Nedi´ c and A

A. Nedi´ c and A. Olshevsky. Distributed optimization o ver time-varying directed graphs. IEEE Transactions on Automatic Control , 60(3):601–615, 2015

2015

[30] [30]

Rezaienia, B

P. Rezaienia, B. Gharesifard, T. Linder, and B. Touri. C onvergence rate of push-sum algorithms on random graphs. In Proceedings of IEEE Conference on Decision and Control, pages 4218–4223, 2018

2018

[31] [31]

Liu and A.S

J. Liu and A.S. Morse. Asynchronous distributed averag ing using double linear iterations. In Proceedings of the American Control Conference , pages 6620–6625, 2012

2012

[32] [32]

Hadjicostis and T

C.N. Hadjicostis and T. Charalambous. Average consens us in the presence of delays in directed graph topologies. IEEE Transactions on Automatic Control , 59(3):763–768, 2014. 14

2014