Learning Safe Unlabeled Multi-Robot Planning with Motion Constraints

Alejandro Ribeiro; Arbaaz Khan; Brent Schlotfeldt; Chi Zhang; Jiayue Wu; Osbert Bastani; Sarah Y. Tang; Shuo Li; Vijay Kumar

arxiv: 1907.05300 · v1 · pith:CLX44FZQnew · submitted 2019-07-11 · 💻 cs.RO

Learning Safe Unlabeled Multi-Robot Planning with Motion Constraints

Arbaaz Khan , Chi Zhang , Shuo Li , Jiayue Wu , Brent Schlotfeldt , Sarah Y. Tang , Alejandro Ribeiro , Osbert Bastani

show 1 more author

Vijay Kumar

This is my paper

Pith reviewed 2026-05-24 23:12 UTC · model grok-4.3

classification 💻 cs.RO

keywords multi-robot motion planningunlabeled robotsreinforcement learningvelocity obstaclescollision avoidancetrajectory planningmotion constraints

0 comments

The pith

A multi-agent reinforcement learning method uses velocity-obstacle projection to produce collision-free trajectories for unlabeled robots under arbitrary dynamics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper frames unlabeled multi-robot goal assignment and trajectory planning in obstacle-filled 2D spaces as a multi-agent reinforcement learning task that receives only a sparse global reward. It introduces a smooth projection step derived from velocity obstacles that modifies actions to keep all trajectories collision-free with respect to neighbors and static obstacles. The same learned policy works across different robot dynamic models without requiring new hand-crafted costs or generators for each model. A reader would care if this removes the need to redesign planning algorithms when robot hardware or kinematics change.

Core claim

By casting the unlabeled multi-robot motion planning problem as multi-agent reinforcement learning with a sparse global reward and then applying a velocity-obstacle-derived smooth projection to the learned actions, the resulting policy produces trajectories that are collision-free for every robot relative to its neighbors and to obstacles while remaining applicable to arbitrary robot models.

What carries the argument

The smooth projection step that uses velocity obstacles to adjust actions and enforce collision-free motion.

If this is right

The same training procedure works for any robot kinematic or dynamic model without rewriting the optimization cost.
Collision-free motion is guaranteed for all robots with respect to neighbors and obstacles during execution.
Goal assignment and trajectory generation are solved jointly inside the learned policy.
The approach applies directly to workspaces containing static obstacles.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the projection preserves feasibility under the learned policy, the same structure could be tested on teams larger than those shown in simulation to check scaling behavior.
The method separates safety enforcement from goal-reaching, so replacing the sparse reward with a denser one might improve convergence without losing the collision guarantee.
Because the projection acts after the policy output, the framework could be combined with any future reinforcement-learning algorithm that outputs velocities or controls.

Load-bearing premise

The learned policy, once filtered by the projection, will still reach the assigned goals rather than getting stuck or cycling.

What would settle it

A set of simulation trials in which the learned policy plus projection produces at least one run where one or more robots never reach their assigned goals.

Figures

Figures reproduced from arXiv: 1907.05300 by Alejandro Ribeiro, Arbaaz Khan, Brent Schlotfeldt, Chi Zhang, Jiayue Wu, Osbert Bastani, Sarah Y. Tang, Shuo Li, Vijay Kumar.

**Figure 1.** Figure 1: Learning Unlabeled Motion Planning Robots observe their own state and velocity, relative positions of all goals and other entities within a sensing region. For robot n, this information is compiled into a vector sn. Each robot uses its own policy network to compute an action an. During training, policy networks also exchange information i with a centralized Q-network which uses information from all robots… view at source ↗

**Figure 2.** Figure 2: Velocity Obstacle Velocity obstacle VOn b (vb(t)) of obstacle b to robot n. When there exist multiple obstacles, the VO is defined as the union of all velocity obstacles. where z is some normally distributed noise z ∼ N (0, σ2 ), κ is mass of the robot and is some fixed constant and ∆ is the fixed time interval. Similarly, the rotational acceleration is derived from the torque and integrated over twice to … view at source ↗

**Figure 3.** Figure 3: Min Projection vs Sigmoid Projection of velocity When using the minimum projection given in Eq 18 the safe velocity has a discontinuity. However, if we assume that the event vn(t) = vj ± ∆v occurs with very low probability, we get p(v safe n (t)) a continuous function. is given as : v safe n (t) = vn(t)1V O0 (vn(t))+P sig V O0 (vn(t))(1−1V O0 (vn(t)) (21) and from this we conclude that p(ot+1|ot, at) is a … view at source ↗

**Figure 4.** Figure 4: Training curves for holonomic robots (with 2,3 and 4 Obstacles (Obs)). We observe that the proposed MARL+RVO algorithm is able to converge and perform better than a centralized RL (C-PPO) policy. The global reward scale is different for each plot since it is a function of the space the robots operate in. Each curve is produced by running three independent runs of the algorithm. Darker line represents mean … view at source ↗

**Figure 5.** Figure 5: Training curves for non holonomic robots (with 2,3 and 4 Obstacles (Obs) When the robot dynamics are changed, MARL+RVO is still able to converge without making any changes to the loss function or the training parameters. B. Simulation Results We first observe from [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Trajectory executed by 3 robots Trajectories executed by robots in three randomly generated episodes after training is complete. In addition to robots reaching their goals in collision free manners, the proposed approach also aligns robots to desired final goal orientations. satisfied. These assumptions restrict the class of problems that can be solved by traditional algorithms. By utilizing RL, we remove … view at source ↗

**Figure 7.** Figure 7: Time taken to reach goal (with 2,3 and 4 Obstacles (Obs)) over 500 runs. We compare with Multi-Agent Reinforcement Learning (MARL), Reciprocal Velocity Obstacles (RVO), GAP [9] and our algorithm which combines the RL and safety(MARL+RVO). For the RVO method, we assign each robot its nearest goal (in terms of euclidean distance). GAP uses discrete nodes to search through space and hence its performance is c… view at source ↗

**Figure 8.** Figure 8: Training time to convergence. Training time for different configurations of robots and agents when trained on a NVIDIA DGX-1 (Tesla V100, 32GB × 8) parallel methods [23], [24] and software libraries [25] to scale up for reinforcement learning, scaling up the number of robots still poses a significant computing challenge. Methods attempting to learn hierarchical policies for agents such as those in [26] mig… view at source ↗

read the original abstract

In this paper, we present a learning approach to goal assignment and trajectory planning for unlabeled robots operating in 2D, obstacle-filled workspaces. More specifically, we tackle the unlabeled multi-robot motion planning problem with motion constraints as a multi-agent reinforcement learning problem with some sparse global reward. In contrast with previous works, which formulate an entirely new hand-crafted optimization cost or trajectory generation algorithm for a different robot dynamic model, our framework is a general approach that is applicable to arbitrary robot models. Further, by using the velocity obstacle, we devise a smooth projection that guarantees collision free trajectories for all robots with respect to their neighbors and obstacles. The efficacy of our algorithm is demonstrated through varied simulations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper pairs multi-agent RL with a velocity-obstacle projection for unlabeled multi-robot planning under arbitrary dynamics, but the safety layer does not address whether the system reaches goals instead of deadlocking.

read the letter

The paper combines multi-agent reinforcement learning with a velocity-obstacle based projection to handle goal assignment and trajectory planning for unlabeled robots in 2D workspaces with obstacles. The key idea is to use a sparse global reward for the RL part and then apply a smooth projection that is claimed to guarantee collision-free motion for arbitrary robot dynamics. What the work does well is avoid the need to redesign optimization problems or trajectory generators when the robot model changes. By making the safety layer model-agnostic through the projection, it offers a reusable pipeline. The unlabeled setting with global reward is a sensible way to approach the assignment problem without explicit matching. The soft spots are in the evidence and the completeness of the guarantees. The paper mentions varied simulations but gives no quantitative results, baselines, or failure cases, so it is difficult to assess how effective the method actually is. More importantly, while the projection ensures no collisions, there is no argument or data showing that the closed-loop system will reach the goals rather than enter deadlocks or infeasible velocity sets, especially with non-holonomic constraints or crowded environments. The sparse reward only encourages eventual success without preventing trapping behaviors. This paper is for robotics researchers focused on multi-agent planning and learning-based methods. A reader interested in practical implementations for different robot types might find the framework useful to build on. It deserves a serious referee because the approach is clearly stated and the safety mechanism is explicit, even if the empirical and theoretical support needs more development. I would recommend sending it to peer review to get feedback on the experiments and any additional proofs or analyses.

Referee Report

3 major / 1 minor

Summary. The paper proposes a multi-agent reinforcement learning framework to solve the unlabeled multi-robot motion planning problem with motion constraints in 2D obstacle-filled workspaces. It formulates the task with a sparse global reward and introduces a velocity-obstacle-based smooth projection operator asserted to guarantee collision-free trajectories for arbitrary robot models, avoiding the need for model-specific hand-crafted costs or trajectory generators. Efficacy is claimed via varied simulations.

Significance. If the projection ensures both collision avoidance and goal reachability while the RL component reliably produces feasible policies, the approach could supply a general, learning-based alternative to per-model optimization methods in multi-robot planning.

major comments (3)

[Abstract] Abstract: the central claim that the velocity-obstacle projection 'guarantees collision free trajectories for all robots with respect to their neighbors and obstacles' is load-bearing yet unsupported by any formal argument, invariance proof, or analysis of cases where the feasible set is empty or induces deadlocks, especially under non-holonomic constraints.
[Abstract] Abstract: the efficacy demonstration rests entirely on 'varied simulations' with no quantitative metrics, baseline comparisons, success rates, statistical significance, or failure-mode analysis supplied, leaving the performance claims for arbitrary models unassessable.
[Abstract] Abstract: the framework is presented as guaranteeing both safety and goal-reaching under the sparse global reward, but no argument or evidence addresses whether the closed-loop system (learned policy + projection) converges to assigned goals rather than trapping equilibria, which is required for the planning claim to hold.

minor comments (1)

[Abstract] The abstract would benefit from explicit mention of the specific robot dynamics tested and the number of robots/obstacles in the simulations to allow readers to gauge the scope of the generality claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for these constructive comments on the abstract claims. We address each point below and will revise the manuscript accordingly to provide the requested support and evaluation details.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that the velocity-obstacle projection 'guarantees collision free trajectories for all robots with respect to their neighbors and obstacles' is load-bearing yet unsupported by any formal argument, invariance proof, or analysis of cases where the feasible set is empty or induces deadlocks, especially under non-holonomic constraints.

Authors: The projection is defined by mapping commanded velocities onto the collision-free velocity obstacle set for each robot-obstacle and robot-robot pair. By construction this mapping ensures that selected velocities lie outside all pairwise collision cones when the set is nonempty. We agree, however, that the manuscript lacks an explicit invariance argument and does not analyze the empty-set or deadlock cases, particularly for non-holonomic kinematics. We will add a dedicated subsection with the formal argument and a discussion of these edge cases. revision: yes
Referee: [Abstract] Abstract: the efficacy demonstration rests entirely on 'varied simulations' with no quantitative metrics, baseline comparisons, success rates, statistical significance, or failure-mode analysis supplied, leaving the performance claims for arbitrary models unassessable.

Authors: The current simulations illustrate applicability across several robot models, but we acknowledge the absence of aggregated quantitative metrics, baselines, success rates, and statistical reporting. We will expand the evaluation section with these elements, including tabulated success rates over repeated trials, comparisons against model-specific planners where feasible, and failure-case analysis. revision: yes
Referee: [Abstract] Abstract: the framework is presented as guaranteeing both safety and goal-reaching under the sparse global reward, but no argument or evidence addresses whether the closed-loop system (learned policy + projection) converges to assigned goals rather than trapping equilibria, which is required for the planning claim to hold.

Authors: The sparse reward is designed to incentivize goal assignment and reaching while the projection layer enforces safety at every step. We recognize that the manuscript provides no convergence argument or analysis of possible trapping equilibria for the combined policy-plus-projection system. We will add a discussion of this issue together with additional experiments that quantify goal-reaching success and identify any observed equilibria. revision: yes

Circularity Check

0 steps flagged

No circularity; framework combines standard RL and external velocity-obstacle projection

full rationale

The provided abstract and description contain no equations, fitted parameters, or self-citations that reduce any claim to a tautology or construction. The approach is described as using multi-agent RL with sparse reward plus a velocity-obstacle-based smooth projection for collision avoidance; both components are presented as external standard tools rather than derived from the paper's own outputs. No self-definitional steps, fitted-input predictions, or load-bearing self-citation chains appear. Reachability under sparse reward is a separate correctness question, not a circularity issue. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, invented entities, or detailed axioms; the velocity-obstacle projection is treated as a domain-standard tool whose correctness is assumed.

axioms (1)

domain assumption Velocity obstacle projection yields a smooth, always-feasible action that preserves the learned policy's intent while guaranteeing collision avoidance.
Invoked to convert the RL output into collision-free trajectories for arbitrary dynamics.

pith-pipeline@v0.9.0 · 5664 in / 1272 out tokens · 42049 ms · 2026-05-24T23:12:38.703796+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 7 internal anchors

[1]

Modeling and control of formations of nonholonomic mobile robots,

J. P. Desai, J. P. Ostrowski, and V . Kumar, “Modeling and control of formations of nonholonomic mobile robots,” IEEE transactions on Robotics and Automation , vol. 17, no. 6, pp. 905–908, 2001

work page 2001
[2]

Multi-robot navigation in formation via sequential convex programming,

J. Alonso-Mora, S. Baker, and D. Rus, “Multi-robot navigation in formation via sequential convex programming,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on . IEEE, 2015, pp. 4634–4641

work page 2015
[3]

Dynamic perimeter surveillance with a team of robots,

D. Saldana, R. J. Alitappeh, L. C. Pimenta, R. Assunçao, and M. F. Campos, “Dynamic perimeter surveillance with a team of robots,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE, 2016, pp. 5289–5294

work page 2016
[4]

Efﬁcient multi- robot motion planning for unlabeled discs in simple polygons,

A. Adler, M. De Berg, D. Halperin, and K. Solovey, “Efﬁcient multi- robot motion planning for unlabeled discs in simple polygons,” in Algorithmic Foundations of Robotics XI . Springer, 2015, pp. 1–17

work page 2015
[5]

Scram: Scalable collision- avoiding role assignment with minimal-makespan for formational po- sitioning,

P. MacAlpine, E. Price, and P. Stone, “Scram: Scalable collision- avoiding role assignment with minimal-makespan for formational po- sitioning,” in Twenty-Ninth AAAI Conference on Artiﬁcial Intelligence, 2015

work page 2015
[6]

Distance optimal formation control on graphs with a tight convergence time guarantee,

J. Yu and M. LaValle, “Distance optimal formation control on graphs with a tight convergence time guarantee,” in Decision and Control (CDC), 2012 IEEE 51st Annual Conference on . IEEE, 2012, pp. 4023–4028

work page 2012
[7]

Capt: Concurrent assignment and planning of trajectories for multiple robots,

M. Turpin, N. Michael, and V . Kumar, “Capt: Concurrent assignment and planning of trajectories for multiple robots,” The International Journal of Robotics Research , vol. 33, no. 1, pp. 98–112, 2014

work page 2014
[8]

Multi-agent path planning and network ﬂow,

J. Yu and S. M. LaValle, “Multi-agent path planning and network ﬂow,” in Algorithmic foundations of robotics X . Springer, 2013, pp. 157–173

work page 2013
[9]

Goal assignment and trajectory planning for large teams of interchangeable robots,

M. Turpin, K. Mohta, N. Michael, and V . Kumar, “Goal assignment and trajectory planning for large teams of interchangeable robots,” Autonomous Robots, vol. 37, no. 4, pp. 401–415, 2014

work page 2014
[10]

Multi-agent actor-critic for mixed cooperative-competitive environ- ments,

R. Lowe, Y . Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environ- ments,” in Advances in Neural Information Processing Systems , 2017, pp. 6379–6390

work page 2017
[11]

Counterfactual multi-agent policy gradients,

J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. White- son, “Counterfactual multi-agent policy gradients,” arXiv preprint arXiv:1705.08926, 2017

work page arXiv 2017
[12]

Markov games as a framework for multi-agent reinforcement learning,

M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Machine Learning Proceedings 1994 . Elsevier, 1994, pp. 157–163

work page 1994
[13]

Continuous control with deep reinforcement learning

T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y . Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforce- ment learning,” arXiv preprint arXiv:1509.02971 , 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[14]

R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1

work page 1998
[15]

Playing Atari with Deep Reinforcement Learning

V . Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602 , 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[16]

Multi-agent reinforcement learning: Independent vs. cooper- ative agents

M. Tan, “Multi-agent reinforcement learning: Independent vs. cooper- ative agents.”

work page
[17]

Motion planning in dynamic environments using velocity obstacles,

P. Fiorini and Z. Shiller, “Motion planning in dynamic environments using velocity obstacles,” The International Journal of Robotics Re- search, vol. 17, no. 7, pp. 760–772, 1998

work page 1998
[18]

Reciprocal n-body collision avoidance,

J. Van Den Berg, S. J. Guy, M. Lin, and D. Manocha, “Reciprocal n-body collision avoidance,” in Robotics research. Springer, 2011, pp. 3–19

work page 2011
[19]

Optimal reciprocal collision avoidance for multiple non-holonomic robots,

J. Alonso-Mora, A. Breitenmoser, M. Ruﬂi, P. Beardsley, and R. Siegwart, “Optimal reciprocal collision avoidance for multiple non-holonomic robots,” in Distributed Autonomous Robotic Systems . Springer, 2013, pp. 203–216

work page 2013
[20]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[21]

Reciprocal velocity obsta- cles for real-time multi-agent navigation,

J. Van den Berg, M. Lin, and D. Manocha, “Reciprocal velocity obsta- cles for real-time multi-agent navigation,” in 2008 IEEE International Conference on Robotics and Automation. IEEE, 2008, pp. 1928–1935

work page 2008
[22]

Semi-Supervised Classification with Graph Convolutional Networks

T. N. Kipf and M. Welling, “Semi-supervised classiﬁcation with graph convolutional networks,” arXiv preprint arXiv:1609.02907 , 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[23]

Distributed Prioritized Experience Replay

D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. Van Hasselt, and D. Silver, “Distributed prioritized experience replay,” arXiv preprint arXiv:1803.00933 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[24]

IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V . Mnih, T. Ward, Y . Doron, V . Firoiu, T. Harley, I. Dunninget al., “Impala: Scalable dis- tributed deep-rl with importance weighted actor-learner architectures,” arXiv preprint arXiv:1802.01561 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[25]

RLlib: Abstractions for distributed reinforcement learning,

E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. E. Gonzalez, M. I. Jordan, and I. Stoica, “RLlib: Abstractions for distributed reinforcement learning,” in International Conference on Machine Learning (ICML) , 2018

work page 2018
[26]

Scalable Centralized Deep Multi-Agent Reinforcement Learning via Policy Gradients

A. Khan, C. Zhang, D. D. Lee, V . Kumar, and A. Ribeiro, “Scalable centralized deep multi-agent reinforcement learning via policy gradi- ents,” arXiv preprint arXiv:1805.08776 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[27]

Cooperative collision avoidance for nonholonomic robots,

J. Alonso-Mora, P. Beardsley, and R. Siegwart, “Cooperative collision avoidance for nonholonomic robots,” IEEE Transactions on Robotics , vol. 34, no. 2, pp. 404–420, 2018

work page 2018

[1] [1]

Modeling and control of formations of nonholonomic mobile robots,

J. P. Desai, J. P. Ostrowski, and V . Kumar, “Modeling and control of formations of nonholonomic mobile robots,” IEEE transactions on Robotics and Automation , vol. 17, no. 6, pp. 905–908, 2001

work page 2001

[2] [2]

Multi-robot navigation in formation via sequential convex programming,

J. Alonso-Mora, S. Baker, and D. Rus, “Multi-robot navigation in formation via sequential convex programming,” in Intelligent Robots and Systems (IROS), 2015 IEEE/RSJ International Conference on . IEEE, 2015, pp. 4634–4641

work page 2015

[3] [3]

Dynamic perimeter surveillance with a team of robots,

D. Saldana, R. J. Alitappeh, L. C. Pimenta, R. Assunçao, and M. F. Campos, “Dynamic perimeter surveillance with a team of robots,” in Robotics and Automation (ICRA), 2016 IEEE International Conference on. IEEE, 2016, pp. 5289–5294

work page 2016

[4] [4]

Efﬁcient multi- robot motion planning for unlabeled discs in simple polygons,

A. Adler, M. De Berg, D. Halperin, and K. Solovey, “Efﬁcient multi- robot motion planning for unlabeled discs in simple polygons,” in Algorithmic Foundations of Robotics XI . Springer, 2015, pp. 1–17

work page 2015

[5] [5]

Scram: Scalable collision- avoiding role assignment with minimal-makespan for formational po- sitioning,

P. MacAlpine, E. Price, and P. Stone, “Scram: Scalable collision- avoiding role assignment with minimal-makespan for formational po- sitioning,” in Twenty-Ninth AAAI Conference on Artiﬁcial Intelligence, 2015

work page 2015

[6] [6]

Distance optimal formation control on graphs with a tight convergence time guarantee,

J. Yu and M. LaValle, “Distance optimal formation control on graphs with a tight convergence time guarantee,” in Decision and Control (CDC), 2012 IEEE 51st Annual Conference on . IEEE, 2012, pp. 4023–4028

work page 2012

[7] [7]

Capt: Concurrent assignment and planning of trajectories for multiple robots,

M. Turpin, N. Michael, and V . Kumar, “Capt: Concurrent assignment and planning of trajectories for multiple robots,” The International Journal of Robotics Research , vol. 33, no. 1, pp. 98–112, 2014

work page 2014

[8] [8]

Multi-agent path planning and network ﬂow,

J. Yu and S. M. LaValle, “Multi-agent path planning and network ﬂow,” in Algorithmic foundations of robotics X . Springer, 2013, pp. 157–173

work page 2013

[9] [9]

Goal assignment and trajectory planning for large teams of interchangeable robots,

M. Turpin, K. Mohta, N. Michael, and V . Kumar, “Goal assignment and trajectory planning for large teams of interchangeable robots,” Autonomous Robots, vol. 37, no. 4, pp. 401–415, 2014

work page 2014

[10] [10]

Multi-agent actor-critic for mixed cooperative-competitive environ- ments,

R. Lowe, Y . Wu, A. Tamar, J. Harb, O. P. Abbeel, and I. Mordatch, “Multi-agent actor-critic for mixed cooperative-competitive environ- ments,” in Advances in Neural Information Processing Systems , 2017, pp. 6379–6390

work page 2017

[11] [11]

Counterfactual multi-agent policy gradients,

J. Foerster, G. Farquhar, T. Afouras, N. Nardelli, and S. White- son, “Counterfactual multi-agent policy gradients,” arXiv preprint arXiv:1705.08926, 2017

work page arXiv 2017

[12] [12]

Markov games as a framework for multi-agent reinforcement learning,

M. L. Littman, “Markov games as a framework for multi-agent reinforcement learning,” in Machine Learning Proceedings 1994 . Elsevier, 1994, pp. 157–163

work page 1994

[13] [13]

Continuous control with deep reinforcement learning

T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y . Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforce- ment learning,” arXiv preprint arXiv:1509.02971 , 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[14] [14]

R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press Cambridge, 1998, vol. 1, no. 1

work page 1998

[15] [15]

Playing Atari with Deep Reinforcement Learning

V . Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602 , 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[16] [16]

Multi-agent reinforcement learning: Independent vs. cooper- ative agents

M. Tan, “Multi-agent reinforcement learning: Independent vs. cooper- ative agents.”

work page

[17] [17]

Motion planning in dynamic environments using velocity obstacles,

P. Fiorini and Z. Shiller, “Motion planning in dynamic environments using velocity obstacles,” The International Journal of Robotics Re- search, vol. 17, no. 7, pp. 760–772, 1998

work page 1998

[18] [18]

Reciprocal n-body collision avoidance,

J. Van Den Berg, S. J. Guy, M. Lin, and D. Manocha, “Reciprocal n-body collision avoidance,” in Robotics research. Springer, 2011, pp. 3–19

work page 2011

[19] [19]

Optimal reciprocal collision avoidance for multiple non-holonomic robots,

J. Alonso-Mora, A. Breitenmoser, M. Ruﬂi, P. Beardsley, and R. Siegwart, “Optimal reciprocal collision avoidance for multiple non-holonomic robots,” in Distributed Autonomous Robotic Systems . Springer, 2013, pp. 203–216

work page 2013

[20] [20]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[21] [21]

Reciprocal velocity obsta- cles for real-time multi-agent navigation,

J. Van den Berg, M. Lin, and D. Manocha, “Reciprocal velocity obsta- cles for real-time multi-agent navigation,” in 2008 IEEE International Conference on Robotics and Automation. IEEE, 2008, pp. 1928–1935

work page 2008

[22] [22]

Semi-Supervised Classification with Graph Convolutional Networks

T. N. Kipf and M. Welling, “Semi-supervised classiﬁcation with graph convolutional networks,” arXiv preprint arXiv:1609.02907 , 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[23] [23]

Distributed Prioritized Experience Replay

D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. Van Hasselt, and D. Silver, “Distributed prioritized experience replay,” arXiv preprint arXiv:1803.00933 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[24] [24]

IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures

L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V . Mnih, T. Ward, Y . Doron, V . Firoiu, T. Harley, I. Dunninget al., “Impala: Scalable dis- tributed deep-rl with importance weighted actor-learner architectures,” arXiv preprint arXiv:1802.01561 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[25] [25]

RLlib: Abstractions for distributed reinforcement learning,

E. Liang, R. Liaw, R. Nishihara, P. Moritz, R. Fox, K. Goldberg, J. E. Gonzalez, M. I. Jordan, and I. Stoica, “RLlib: Abstractions for distributed reinforcement learning,” in International Conference on Machine Learning (ICML) , 2018

work page 2018

[26] [26]

Scalable Centralized Deep Multi-Agent Reinforcement Learning via Policy Gradients

A. Khan, C. Zhang, D. D. Lee, V . Kumar, and A. Ribeiro, “Scalable centralized deep multi-agent reinforcement learning via policy gradi- ents,” arXiv preprint arXiv:1805.08776 , 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[27] [27]

Cooperative collision avoidance for nonholonomic robots,

J. Alonso-Mora, P. Beardsley, and R. Siegwart, “Cooperative collision avoidance for nonholonomic robots,” IEEE Transactions on Robotics , vol. 34, no. 2, pp. 404–420, 2018

work page 2018