DeepFleet: Multi-Agent Foundation Models for Mobile Robots
Pith reviewed 2026-05-19 00:09 UTC · model grok-4.3
The pith
Models that capture local robot interactions and update robot states asynchronously show the most promise on prediction tasks for large mobile robot fleets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central finding is that two of the four designs outperform the other two on prediction tasks involving robot positions, goals, and interactions: the robot-centric model, an autoregressive decision transformer operating on neighborhoods of individual robots, and the graph-floor model, which combines temporal attention with graph neural networks for spatial relationships. Both models also make effective use of larger datasets as they are scaled up.
What carries the argument
The inductive biases embodied in the four architectures, particularly the use of asynchronous robot state updates and the incorporation of localized structures of robot interactions in the robot-centric and graph-floor models.
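To make the notion of a localized interaction structure concrete, the following minimal sketch selects, for one robot, the nearest other robots within a fixed radius. It is an illustration only: the function name `local_neighborhood`, the grid coordinates, the Manhattan-distance metric, and the `radius` and `k` parameters are assumptions, not details taken from the paper.

```python
from typing import Dict, List, Tuple

Position = Tuple[int, int]  # (x, y) grid cell; hypothetical representation


def local_neighborhood(
    ego_id: str,
    positions: Dict[str, Position],
    radius: int = 3,
    k: int = 4,
) -> List[str]:
    """Return up to k robot ids within `radius` of the ego robot, nearest first.

    Distance is Manhattan distance on the grid (an assumption for this sketch).
    """
    ex, ey = positions[ego_id]
    candidates = []
    for robot_id, (x, y) in positions.items():
        if robot_id == ego_id:
            continue  # the ego robot is not its own neighbor
        dist = abs(x - ex) + abs(y - ey)
        if dist <= radius:
            candidates.append((dist, robot_id))
    candidates.sort()  # by distance, ties broken by id for determinism
    return [robot_id for _, robot_id in candidates[:k]]


# Hypothetical five-robot floor
fleet = {"r0": (0, 0), "r1": (1, 0), "r2": (0, 2), "r3": (5, 5), "r4": (2, 1)}
neighbors_of_r0 = local_neighborhood("r0", fleet)
```

A model restricted to such a neighborhood sees a bounded number of robots regardless of fleet size, which is one plausible reason a local inductive bias would scale to large floors.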
Load-bearing premise
Improvements in accuracy on historical prediction tasks will lead to better outcomes in live, real-time robot coordination and planning without needing additional fine-tuning or safety constraints.
What would settle it
A direct test would be to integrate one of the promising models into a warehouse simulator or live system and measure changes in overall fleet efficiency, such as average task completion time or number of near-misses, compared to traditional planning methods.
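The fleet-level metrics named here could be computed from logged or simulated trajectories roughly as follows. This is a minimal sketch under stated assumptions: the trajectory format (one grid position per robot per time step), the helper names `mean_completion_time` and `near_miss_count`, and the distance threshold are hypothetical and do not come from the paper.

```python
from itertools import combinations
from typing import Dict, List, Tuple

Position = Tuple[int, int]  # (x, y) grid cell; hypothetical representation


def mean_completion_time(tasks: List[Tuple[float, float]]) -> float:
    """Average of (end - start) over tasks given as (start, end) times."""
    if not tasks:
        raise ValueError("no tasks")
    return sum(end - start for start, end in tasks) / len(tasks)


def near_miss_count(
    trajectories: Dict[str, List[Position]], threshold: int = 1
) -> int:
    """Count (time step, robot pair) events within `threshold` Manhattan distance.

    Only the time steps shared by every trajectory are compared.
    """
    steps = min(len(path) for path in trajectories.values())
    count = 0
    for t in range(steps):
        for a, b in combinations(sorted(trajectories), 2):
            (ax, ay), (bx, by) = trajectories[a][t], trajectories[b][t]
            if abs(ax - bx) + abs(ay - by) <= threshold:
                count += 1
    return count


# Hypothetical log: two tasks and two robots approaching each other
tasks = [(0.0, 10.0), (5.0, 25.0)]
paths = {
    "r0": [(0, 0), (1, 0), (2, 0)],
    "r1": [(4, 0), (3, 0), (3, 0)],
}
avg_time = mean_completion_time(tasks)
near_misses = near_miss_count(paths)
```

Running the same two functions on a baseline planner and on a model-assisted planner over identical task lists would give the comparison this section asks for.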
Original abstract
We introduce DeepFleet, a suite of foundation models designed to support coordination and planning for large-scale mobile robot fleets. These models are trained on fleet movement data, including robot positions, goals, and interactions, from hundreds of thousands of robots in Amazon warehouses worldwide. DeepFleet consists of four architectures that each embody a distinct inductive bias and collectively explore key points in the design space for multi-agent foundation models: the robot-centric (RC) model is an autoregressive decision transformer operating on neighborhoods of individual robots; the robot-floor (RF) model uses a transformer with cross-attention between robots and the warehouse floor; the image-floor (IF) model applies convolutional encoding to a multi-channel image representation of the full fleet; and the graph-floor (GF) model combines temporal attention with graph neural networks for spatial relationships. In this paper, we describe these models and present our evaluation of the impact of these design choices on prediction task performance. We find that the robot-centric and graph-floor models, which both use asynchronous robot state updates and incorporate the localized structure of robot interactions, show the most promise. We also present experiments that show that these two models can make effective use of larger warehouse operation datasets as the models are scaled up.
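One possible reading of "asynchronous robot state updates" at the data level is that each robot reports state changes at its own times, and the fleet state is advanced one event at a time rather than on a shared clock tick. The sketch below illustrates that reading only; the event format and the names `merge_event_streams` and `replay` are hypothetical, and the paper's actual update scheme may differ.

```python
import heapq
from typing import Dict, Iterator, List, Tuple

Position = Tuple[int, int]
Event = Tuple[float, str, Position]  # (timestamp, robot_id, new position)


def merge_event_streams(
    streams: Dict[str, List[Tuple[float, Position]]]
) -> Iterator[Event]:
    """Yield all robots' events in global time order.

    Each per-robot list must already be sorted by timestamp.
    """
    tagged = (
        [(ts, robot_id, pos) for ts, pos in events]
        for robot_id, events in streams.items()
    )
    return heapq.merge(*tagged)


def replay(
    streams: Dict[str, List[Tuple[float, Position]]]
) -> List[Tuple[float, Dict[str, Position]]]:
    """Apply events one at a time and record the fleet state after each."""
    state: Dict[str, Position] = {}
    snapshots = []
    for ts, robot_id, pos in merge_event_streams(streams):
        state[robot_id] = pos  # only the robot that emitted the event changes
        snapshots.append((ts, dict(state)))
    return snapshots


# Hypothetical event logs: r0 moves at t=0.0 and t=2.5, r1 at t=1.0
streams = {
    "r0": [(0.0, (0, 0)), (2.5, (1, 0))],
    "r1": [(1.0, (3, 3))],
}
history = replay(streams)
```

In contrast, a synchronous scheme would resample every robot onto a common time grid before the model sees the data.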
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DeepFleet, a suite of four multi-agent foundation models (robot-centric RC, robot-floor RF, image-floor IF, and graph-floor GF) for coordination and planning in large-scale mobile robot fleets. The models are trained on fleet movement data including positions, goals, and interactions from hundreds of thousands of robots across Amazon warehouses. Each architecture embodies a distinct inductive bias: RC uses an autoregressive decision transformer on robot neighborhoods with asynchronous updates; RF employs transformer cross-attention between robots and the floor; IF applies convolutional encoding to multi-channel fleet images; and GF combines temporal attention with graph neural networks for spatial relationships. The paper evaluates the impact of these design choices on prediction task performance and concludes that the RC and GF models, which incorporate asynchronous state updates and localized interaction structures, show the most promise and scale effectively with larger warehouse datasets.
Significance. If the superior prediction performance of the RC and GF models translates to improved real-time planning and coordination, the work could meaningfully advance scalable multi-agent foundation models for robotics and warehouse logistics by leveraging large-scale real-world data. The systematic exploration of inductive biases provides useful design insights. However, the current focus on isolated prediction metrics without downstream operational validation limits the strength of claims regarding practical coordination benefits.
major comments (1)
- [Evaluation section (as described in abstract and main text)] The central claim that the RC and GF models 'show the most promise' for coordination and planning rests on their outperformance in prediction tasks (state forecasting and interaction modeling) and scaling behavior with larger datasets. The evaluation section reports only these prediction metrics under the four inductive biases and does not include any closed-loop planning experiments, MPC-style rollouts, or operational metrics such as throughput, collision rate, or makespan. This gap means the results do not directly test whether better next-state prediction yields improved real-time fleet decisions, which is load-bearing for the paper's motivation and conclusions.
minor comments (1)
- [Abstract] The abstract states that comparative evaluation and scaling results are presented but supplies no quantitative metrics, error bars, dataset sizes, or ablation details, making it harder for readers to immediately gauge the magnitude of the reported improvements.
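The distinction the major comment draws between next-state prediction and closed-loop rollouts can be illustrated with a toy example. The sketch below is not a DeepFleet model: it assumes a one-dimensional system, integer positions, and a hypothetical predictor with a small constant bias, and it only shows how a fixed one-step error can compound once predictions are fed back as inputs.

```python
from typing import Callable, List


def one_step_errors(model: Callable[[int], int], truth: List[int]) -> List[int]:
    """Open-loop: predict each next state from the true previous state."""
    return [abs(model(truth[t]) - truth[t + 1]) for t in range(len(truth) - 1)]


def rollout_errors(model: Callable[[int], int], truth: List[int]) -> List[int]:
    """Closed-loop: feed the model's own prediction back in as the next input."""
    errors = []
    state = truth[0]
    for t in range(1, len(truth)):
        state = model(state)
        errors.append(abs(state - truth[t]))
    return errors


# Toy ground truth: position advances by 10 units per step
truth = [0, 10, 20, 30, 40, 50]


def biased_model(x: int) -> int:
    """Hypothetical predictor that undershoots each step by 1 unit."""
    return x + 9


open_loop = one_step_errors(biased_model, truth)
closed_loop = rollout_errors(biased_model, truth)
```

Here the open-loop error stays at 1 unit per step while the closed-loop error grows linearly with the horizon, which is why a model can look accurate on prediction metrics yet still need separate validation in rollouts.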
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review of our manuscript. We address the major comment below, providing an honest assessment of the evaluation scope while defending the paper's contributions on their own terms.
- Referee: [Evaluation section (as described in abstract and main text)] The central claim that the RC and GF models 'show the most promise' for coordination and planning rests on their outperformance in prediction tasks (state forecasting and interaction modeling) and scaling behavior with larger datasets. The evaluation section reports only these prediction metrics under the four inductive biases and does not include any closed-loop planning experiments, MPC-style rollouts, or operational metrics such as throughput, collision rate, or makespan. This gap means the results do not directly test whether better next-state prediction yields improved real-time fleet decisions, which is load-bearing for the paper's motivation and conclusions.
Authors: We thank the referee for this observation. The manuscript explicitly frames its contribution as an exploration of inductive biases in multi-agent foundation models, with evaluation focused on prediction tasks (state forecasting and interaction modeling) because these directly assess how well each architecture captures fleet dynamics from hundreds of thousands of real-world robot trajectories. The RC and GF models' superior performance and scaling behavior on these tasks provide evidence that architectures incorporating asynchronous updates and localized interaction structures are the most promising starting points for models intended to support coordination and planning. We do not claim or demonstrate direct improvements in closed-loop planning, MPC rollouts, throughput, collision rates, or makespan, as such operational validation would require coupling the models to specific planners and simulators, an integration that lies beyond the current scope of comparing design choices via prediction metrics. We agree that this represents a limitation for stronger claims about real-time fleet decisions. In revision we will add a dedicated paragraph in the discussion and conclusion sections that explicitly acknowledges this scope limitation, clarifies that prediction performance is presented as a necessary (but not sufficient) indicator of promise for downstream coordination, and outlines future work on closed-loop evaluation.
Revision: partial
Circularity Check
No significant circularity in empirical model evaluation
The paper introduces four model architectures embodying distinct inductive biases and evaluates them empirically on prediction tasks using external warehouse fleet movement data from hundreds of thousands of robots. No mathematical derivation chain, first-principles results, or equations are presented that reduce to fitted parameters or self-referential inputs by construction. Claims rest on comparative performance metrics from independent datasets rather than any self-definitional, fitted-input-as-prediction, or self-citation load-bearing steps. The work is self-contained against external benchmarks with no circularity signals.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Warehouse robot movement data from hundreds of thousands of units captures the relevant interaction patterns needed to train and evaluate coordination models.
List of Contributors
Please address correspondence to deepfleet@amazon.com.
Scaling experiments and writing lead: Ameya Agaskar, Sriram Siva
DeepFleet model design and development: William Pickering (Robot-Centric model), Kyle O’Brien (Robot-Floor Cross Attention model), Charles Kekeh (Image-Based Floor-Centric model), Sriram Siva (Graph-Based Floor-Centric mod...