End-to-end Decentralized Multi-robot Navigation in Unknown Complex Environments via Deep Reinforcement Learning

Hui Cheng; Juntong Lin; Peiwei Zheng; Xuyun Yang

arxiv: 1907.01713 · v1 · pith:3TSVQCIVnew · submitted 2019-07-03 · 💻 cs.RO

Pith reviewed 2026-05-25 10:45 UTC · model grok-4.3

classification 💻 cs.RO

keywords multi-robot navigation · deep reinforcement learning · decentralized execution · unknown environments · lidar sensing · connectivity maintenance · collision avoidance · end-to-end control

The pith

A deep reinforcement learning method trains decentralized policies for robot teams to reach goals in unknown environments by mapping raw lidar directly to velocity commands.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that a single deep reinforcement learning framework can produce navigation behavior for multiple robots operating without maps or a central controller. The geometric centroid of the team is driven toward a target position while each robot avoids collisions with obstacles and with teammates and stays within communication range. Training occurs with full access to all robots' states in simulation, but each robot later runs its own policy using only its local lidar readings. If successful, this removes the need to build and maintain obstacle maps during deployment. Whether it also removes the need for external localization is less clear: the real-world setup described with Figure 12 uses an OptiTrack motion capture system to provide robot positions.

Core claim

The paper claims that end-to-end policies obtained through centralized training and decentralized execution allow a team of robots to drive their geometric centroid to a goal location in previously unseen complex environments, while each robot uses only its own raw lidar measurements to produce velocity commands that avoid collisions and preserve connectivity.
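The goal condition in this claim is stated on the team's geometric centroid rather than on any single robot. A minimal sketch of that check follows, assuming planar (x, y) positions and using the 0.15 m goal-zone radius quoted in the Figure 7 text; the function names are hypothetical and are not taken from the paper.

```python
from math import hypot
from typing import Sequence, Tuple

Point = Tuple[float, float]  # planar (x, y) position in metres (assumed)


def centroid(positions: Sequence[Point]) -> Point:
    """Geometric centroid (mean position) of the robot team."""
    if not positions:
        raise ValueError("need at least one robot position")
    n = len(positions)
    return (sum(p[0] for p in positions) / n,
            sum(p[1] for p in positions) / n)


def centroid_in_goal_zone(positions: Sequence[Point],
                          goal: Point,
                          goal_radius: float = 0.15) -> bool:
    """True when the team centroid lies within goal_radius of the goal."""
    cx, cy = centroid(positions)
    return hypot(cx - goal[0], cy - goal[1]) <= goal_radius
```

Note that no individual robot has to be inside the goal zone for the check to pass; only the mean position matters.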

What carries the argument

A centralized-training, decentralized-execution mechanism that learns policies mapping raw lidar scans to velocity commands without an intermediate obstacle-mapping step.
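The data flow behind this mechanism can be sketched in a few lines. This is an illustration under assumptions, not the paper's implementation: the type aliases, `decentralized_step`, `centralized_value`, and `toy_policy` are hypothetical names, the policy input is reduced to a bare lidar scan, and the critic is assumed to consume the concatenation of all robots' observations.

```python
from typing import Callable, List, Sequence, Tuple

Scan = Sequence[float]           # raw lidar ranges of one robot
Velocity = Tuple[float, float]   # (vx, vy) command, holonomic robot assumed
Policy = Callable[[Scan], Velocity]


def decentralized_step(policies: Sequence[Policy],
                       scans: Sequence[Scan]) -> List[Velocity]:
    """Execution: robot i's command depends only on robot i's own scan."""
    if len(policies) != len(scans):
        raise ValueError("one scan per policy is required")
    return [policy(scan) for policy, scan in zip(policies, scans)]


def centralized_value(value_fn: Callable[[List[float]], float],
                      scans: Sequence[Scan]) -> float:
    """Training only: the critic sees every robot's observation at once."""
    joint_observation = [r for scan in scans for r in scan]
    return value_fn(joint_observation)


def toy_policy(scan: Scan) -> Velocity:
    """Hypothetical stand-in for a trained network: halt if anything is near."""
    return (0.0, 0.0) if min(scan) < 0.5 else (0.3, 0.0)
```

The point of the split is that `centralized_value` is only ever called in simulation, while `decentralized_step` is the only part that has to run on the robots.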

If this is right

Robot teams can operate without building or updating explicit obstacle maps.
Control commands are generated directly from raw sensor readings on each robot.
Connectivity and collision avoidance emerge from the learned policies rather than from separate planning layers.
The same trained policy set can be executed on any number of robots provided each has its own lidar.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may reduce the computational load at deployment time because no central fusion or map server is required after training.
Similar training pipelines could be tested with other range sensors or with cameras if the input representation is adjusted accordingly.
Scaling the number of robots would require checking whether the centralized training phase remains tractable as team size grows.

Load-bearing premise

Policies learned from simulated multi-robot interactions will produce safe, connected, goal-reaching motion when each robot runs its policy independently on real lidar data in new environments.

What would settle it

Deploy the learned policies on physical UGVs in an indoor test environment whose layout was never seen during training and record whether the team centroid reaches the goal without any inter-robot collisions or loss of connectivity.

Figures

Figures reproduced from arXiv: 1907.01713 by Hui Cheng, Juntong Lin, Peiwei Zheng, Xuyun Yang.

**Figure 1.** […] where the geometric centroid of the formation is required to reach the goal position efficiently while avoiding collision and maintaining connectivity of the robot team. A deep reinforcement learning (DRL)-based approach is proposed to accomplish the multi-robot navigation task. By means of a centralized learning and decentralized executing mechanism as shown in […]

**Figure 2.** An overview of our method. During centralized learning period, the […]

**Figure 3.** Setting description. Information about teammates […]

**Figure 4.** Policy network architecture.

**Figure 5.** Value function network architecture. […] used as the non-linear function approximator for the value function Vφ. As shown in […]

**Figure 6.** Avoid obstacles of different shapes. Robots are pink, brown and blue […]

**Figure 7.** Navigation in narrow scenarios. […] ∆t = 0.5 s. In each run, the positions of the robot team and the goal are randomly initialized.
• Reward settings: efficiency term r_e = −0.5; weights of reward functions w_g = 10, w_c = 1, w_f = 10, w_p = 5; goal reward r_goal = 10; collision penalty r_collision = −100; radius of goal zone g = 0.15 m; communication range threshold d = 3.5 m.
• PPO settings: number of iteration …
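The reward constants quoted in the Figure 7 text can be collected into a small sketch. Only the values themselves come from the source. The rest is assumed for illustration: that the efficiency term is charged once per step, that reaching the goal zone or colliding ends the episode, and the hypothetical name `sparse_reward`. The weighted terms (w_g, w_c, w_f, w_p) are left out because their functional form is not given in this text.

```python
from typing import Tuple

# Values quoted in the Figure 7 text.
R_GOAL = 10.0         # goal reward
R_COLLISION = -100.0  # collision penalty
R_EFFICIENCY = -0.5   # efficiency term (assumed to apply every step)
GOAL_RADIUS = 0.15    # radius of the goal zone, metres


def sparse_reward(centroid_to_goal: float, collided: bool) -> Tuple[float, bool]:
    """Return (reward, episode_done) for one step.

    Assumption: collision and goal arrival both terminate the episode.
    The weighted shaping terms are deliberately omitted.
    """
    if collided:
        return R_COLLISION, True
    if centroid_to_goal <= GOAL_RADIUS:
        return R_GOAL, True
    return R_EFFICIENCY, False
```

With these magnitudes a single collision outweighs the goal reward tenfold, which suggests collision avoidance is weighted far above speed, although the full reward would be needed to confirm that.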

**Figure 8.** Navigation in random scenarios. […] 2) Narrow scenarios: Narrow scenes are common in the real world and are challenging in the sense that they impose strict spatial constraints on the robot team. To verify the effectiveness of the derived policy in narrow scenes, we set up scenarios including a narrow passageway and a corridor with corners. It can be seen in the […]

**Figure 10.** A quantitative comparison among independent PPO, MAPPO and […]

**Figure 11.** A demonstration of typical failure cases. The robot team cannot […]

**Figure 12.** Real-world experimental platform. […] real-time end-to-end decentralized execution, each UGV is equipped with a 2D laser scanner (RPLIDAR-A2) and an onboard computer (Nvidia Jetson TX2). We wrap the UGVs with boxes and set laser scanners at different heights to make laser scanners detect teammate UGVs. The OptiTrack motion capture system is used to provide the positions of the robots. The snapshots of an expe…

**Figure 13.** Snapshots of a team of 3 holonomic UGVs navigating through obstacles.

Original abstract

In this paper, a novel deep reinforcement learning (DRL)-based method is proposed to navigate the robot team through unknown complex environments, where the geometric centroid of the robot team aims to reach the goal position while avoiding collisions and maintaining connectivity. Decentralized robot-level policies are derived using a mechanism of centralized learning and decentralized executing. The proposed method can derive end-to-end policies, which map raw lidar measurements into velocity control commands of robots without the necessity of constructing obstacle maps. Simulation and indoor real-world unmanned ground vehicles (UGVs) experimental results verify the effectiveness of the proposed method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper applies standard CTDE DRL to multi-robot lidar navigation with a centroid goal, but the abstract-level claims rest on unquantified sim and real tests that leave the transfer story unconvincing.

The paper trains a team of robots so their centroid heads toward a goal in unmapped space. Each robot outputs velocity from its own lidar scan, with training done centrally and execution kept local. Connectivity and collision avoidance are part of the reward. The end-to-end lidar-to-action route avoids explicit mapping, which is the main practical angle they highlight. They mention both simulation runs and indoor UGV hardware tests as confirmation.

That setup is straightforward and matches what many groups are already trying with multi-agent RL. The application to team centroid control plus connectivity is a reasonable extension of single-robot DRL navigation work, though not a conceptual leap.

The soft spot is the evidence. The abstract states that the experiments verify effectiveness, yet supplies no success rates, no comparison methods, no breakdown of how often connectivity held under decentralized execution, and no discussion of sensor noise or dynamics mismatch between sim and real. Without those numbers it is difficult to judge whether the policies generalize or simply succeeded on the particular test layouts. The sim-to-real step therefore stays the least supported part of the story.

This is the kind of paper that would interest a robotics lab already running DRL navigation experiments and looking for team extensions. A reader who wants concrete performance data or ablation results will come away wanting more. It is worth sending to peer review so the authors can add the missing metrics and failure cases; the underlying idea is workable enough that referees could usefully press on the validation details rather than reject outright.

Referee Report

1 major / 1 minor

Summary. The paper proposes a DRL-based method for multi-robot navigation in unknown complex environments. The geometric centroid of the robot team is driven toward a goal position while the robots avoid collisions with each other and obstacles and maintain connectivity. Decentralized execution policies are obtained via centralized training; each robot maps raw LiDAR scans directly to velocity commands without constructing an explicit obstacle map. Effectiveness is asserted on the basis of simulation results and indoor real-world UGV experiments.

Significance. If the empirical claims are substantiated with quantitative metrics and proper controls, the work would contribute a concrete demonstration of map-free, end-to-end decentralized multi-robot control learned via DRL. The centralized-training/decentralized-execution paradigm is applied to a connectivity-constrained centroid-navigation task, which could be useful for scalable team behaviors in unstructured settings.

major comments (1)

[Abstract and experimental-results section] The claim that 'simulation and indoor real-world UGV experimental results verify the effectiveness' is load-bearing for the central contribution, yet the manuscript supplies no quantitative metrics (success rate, time-to-goal, connectivity-maintenance fraction, collision counts), no baseline comparisons, and no description of how connectivity or inter-robot distances were measured or enforced under fully decentralized execution on real hardware. This absence prevents assessment of the sim-to-real gap and of out-of-distribution generalization, which is the weakest link in the argument.

minor comments (1)

[Method section] Notation for the reward function and connectivity constraint should be introduced with explicit equations rather than prose descriptions only.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. The major comment correctly identifies that the current manuscript version does not provide the quantitative metrics, baseline comparisons, or hardware measurement details needed to fully substantiate the experimental claims. We will revise the paper to address this.

Referee: [Abstract and experimental-results section] The claim that 'simulation and indoor real-world UGV experimental results verify the effectiveness' is load-bearing for the central contribution, yet the manuscript supplies no quantitative metrics (success rate, time-to-goal, connectivity-maintenance fraction, collision counts), no baseline comparisons, and no description of how connectivity or inter-robot distances were measured or enforced under fully decentralized execution on real hardware. This absence prevents assessment of the sim-to-real gap and of out-of-distribution generalization, which is the weakest link in the argument.

Authors: We agree that the absence of quantitative metrics and baselines limits the strength of the claims. In the revised manuscript we will add a dedicated experimental results subsection containing: (1) tabulated success rates, mean time-to-goal, collision counts, and connectivity-maintenance fractions (defined as the percentage of time steps in which all pairwise distances remain below the communication threshold) for both simulation and real-robot trials; (2) direct comparisons against two standard decentralized baselines (reciprocal velocity obstacles and artificial potential fields) using identical environment and team configurations; and (3) an explicit description of the real-hardware measurement protocol, including use of external motion-capture for ground-truth distances during data collection while confirming that policy execution itself remained fully decentralized. These additions will allow readers to evaluate the sim-to-real gap directly.

Revision: yes
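The connectivity-maintenance fraction defined in the response above is simple to compute from logged positions. The sketch below follows that definition literally (every pairwise distance strictly below the threshold at a given time step) and assumes planar (x, y) positions; the function name is hypothetical, and the default of 3.5 m is the communication range threshold quoted in the Figure 7 text.

```python
from itertools import combinations
from math import hypot
from typing import Sequence, Tuple

Point = Tuple[float, float]  # planar (x, y) position in metres (assumed)


def connectivity_fraction(trajectory: Sequence[Sequence[Point]],
                          comm_range: float = 3.5) -> float:
    """Percentage of time steps in which all pairwise distances are below comm_range.

    trajectory[t] holds the positions of every robot at time step t.
    """
    if not trajectory:
        raise ValueError("trajectory must contain at least one time step")
    connected_steps = 0
    for positions in trajectory:
        if all(hypot(a[0] - b[0], a[1] - b[1]) < comm_range
               for a, b in combinations(positions, 2)):
            connected_steps += 1
    return 100.0 * connected_steps / len(trajectory)
```

This all-pairs criterion is stricter than graph connectivity: a team strung out in a chain could still relay messages while failing this check, so which notion the revised paper reports would need to be stated explicitly.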

Circularity Check

0 steps flagged

No circularity: empirical DRL method with external validation

The paper presents a DRL approach for multi-robot navigation using centralized training and decentralized execution to produce end-to-end lidar-to-velocity policies. No derivation chain reduces a claimed result to its own inputs by construction, no fitted parameters are relabeled as predictions, and no load-bearing self-citations or uniqueness theorems are invoked. Claims rest on simulation and real UGV experiments rather than self-referential mathematics, making the work self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no specific free parameters, axioms, or invented entities can be extracted from the provided text.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 6 internal anchors

[1] H. G. Tanner and A. Kumar, “Towards decentralization of multi-robot navigation functions,” in IEEE Int. Conf. on Robotics and Automation, 2005, pp. 4132–4137.

[2] T. J. Koo and S. M. Shahruz, “Formation of a group of unmanned aerial vehicles (UAVs),” in American Control Conf., 2001, pp. 69–74.

[3] T. Balch and M. Hybinette, “Social potentials for scalable multi-robot formations,” in IEEE Int. Conf. on Robotics and Automation, 2000, pp. 73–80.

[4] A. Wasik, J. N. Pereira, R. Ventura, P. U. Lima, and A. Martinoli, “Graph-based distributed control for adaptive multi-robot patrolling through local formation transformation,” in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2016, pp. 1721–1728.

[5] R. Olfati-Saber, “Flocking for multi-agent dynamic systems: algorithms and theory,” IEEE Trans. on Automatic Control, vol. 51, no. 3, pp. 401–420, March 2006.

[6] M. Saska, V. Vonásek, T. Krajník, and L. Přeučil, “Coordination and navigation of heterogeneous MAV-UGV formations localized by a hawk-eye-like approach under a model predictive control scheme,” The Int. Journal of Robotics Research, vol. 33, no. 10, pp. 1393–1412, 2014.

[7] J. Alonso-Mora, S. Baker, and D. Rus, “Multi-robot navigation in formation via sequential convex programming,” in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2015, pp. 4634–4641.

[8] J. Alonso-Mora, E. Montijano, M. Schwager, and D. Rus, “Distributed multi-robot formation control among obstacles: A geometric and optimization approach with consensus,” in IEEE Int. Conf. on Robotics and Automation, 2016, pp. 5356–5363.

[9] M. Duguleana and G. Mogan, “Neural networks based reinforcement learning for mobile robots obstacle avoidance,” Expert Systems with Applications, vol. 62, no. C, pp. 104–115, 2016.

[10] L. Xie, S. Wang, A. Markham, and N. Trigoni, “Towards monocular vision based obstacle avoidance through deep reinforcement learning,” arXiv preprint arXiv:1706.09829, 2017.

[11] G. Kahn, A. Villaflor, B. Ding, P. Abbeel, and S. Levine, “Self-supervised deep reinforcement learning with generalized computation graphs for robot navigation,” arXiv preprint arXiv:1709.10489, 2017.

[12] P. Long, W. Liu, and J. Pan, “Deep-learned collision avoidance policy for distributed multiagent navigation,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 656–663, 2017.

[13] P. Long, T. Fan, X. Liao, W. Liu, H. Zhang, and J. Pan, “Towards optimally decentralized multi-robot collision avoidance via deep reinforcement learning,” arXiv preprint arXiv:1709.10082, 2017.

[14] Y. F. Chen, M. Everett, M. Liu, and J. P. How, “Socially aware motion planning with deep reinforcement learning,” in IEEE/RSJ Int. Conf. on Intelligent Robots and Systems, 2017, pp. 1343–1350.

[15] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, 1998.

[16] M. Tan, “Multi-agent reinforcement learning: Independent vs. cooperative agents,” in Proceedings of the Tenth Int. Conf. on Machine Learning, 1993, pp. 330–337.

[17] R. Lowe, Y. Wu, A. Tamar, J. Harb, P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environments,” in Advances in Neural Information Processing Systems 30, 2017, pp. 6379–6390.

[18] Q. Zhang, D. Zhao, and F. L. Lewis, “Model-free reinforcement learning for fully cooperative multi-agent graphical games,” 2018 Int. Joint Conf. on Neural Networks, pp. 1–6, 2018.

[19] K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological Cybernetics, vol. 36, no. 4, pp. 193–202, 1980.

[20] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, 1989.

[21] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.

[22] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “OpenAI Gym,” arXiv preprint arXiv:1606.01540, 2016.

[23] I. Kostrikov, “PyTorch implementations of reinforcement learning algorithms,” https://github.com/ikostrikov/pytorch-a2c-ppo-acktr, 2018.

[24] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer, “Automatic differentiation in PyTorch,” 2017.

[25] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
