DeepFleet: Multi-Agent Foundation Models for Mobile Robots

Alexandre Ormiga Galvao Barbosa; Alicia Chua; Ameya Agaskar; Ang Li; Brianna Gallo Sarker; Charles Kekeh; Charun Thattai; Dino Kirouani; Federico Pecora; Isaac Iyengar

arxiv: 2508.08574 · v3 · submitted 2025-08-12 · 💻 cs.RO · cs.MA

Ameya Agaskar, Sriram Siva, William Pickering, Kyle O'Brien, Charles Kekeh, Alexandre Ormiga Galvao Barbosa, Ang Li, Brianna Gallo Sarker, Alicia Chua, Mayur Nemade, Charun Thattai, Jiaming Di, Isaac Iyengar, Ramya Dharoor, Dino Kirouani, Jimmy Erskine, Tamir Hegazy, Scott Niekum, Usman A. Khan, Federico Pecora, Joseph W. Durham

Pith reviewed 2026-05-19 00:09 UTC · model grok-4.3

classification 💻 cs.RO cs.MA

keywords foundation models · multi-agent systems · mobile robots · warehouse automation · graph neural networks · decision transformers · inductive biases · fleet coordination

The pith

Models that focus on local robot interactions and asynchronous updates perform best on prediction tasks for large mobile robot fleets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents DeepFleet, a set of foundation models trained on movement data from hundreds of thousands of robots in Amazon warehouses. The authors compare four architectures, each with a different way of representing robot interactions and the warehouse environment. They find that two designs yield the strongest results on forecasting tasks: one that applies autoregressive prediction to neighborhoods of individual robots, and one that combines graph-based spatial modeling with temporal attention. These two models also make effective use of larger warehouse operation datasets as they are scaled up. Readers might care because such models could support more reliable automation in settings where many robots must navigate shared spaces without constant human oversight.

Core claim

The central finding is that two of the four designs outperform the others on prediction tasks involving robot positions, goals, and interactions: the robot-centric model, an autoregressive decision transformer operating on neighborhoods of individual robots, and the graph-floor model, which combines temporal attention with graph neural networks for spatial relationships. Both also benefit from scaling up with larger datasets.
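As a rough illustration of "graph neural networks for spatial relationships", the following Python sketch builds a graph over warehouse floor cells and runs one round of neighbor averaging. It is not the paper's graph-floor model: the 4-connected grid, the function names `floor_graph` and `mean_aggregate`, and the scalar occupancy feature are all assumptions made for this example, and plain mean aggregation stands in for whatever learned message passing the authors use.

```python
def floor_graph(cells):
    """Build a 4-connected adjacency list over traversable floor cells.

    `cells` is an iterable of (row, col) tuples (hypothetical layout).
    """
    cell_set = set(cells)
    steps = ((-1, 0), (1, 0), (0, -1), (0, 1))
    return {
        (r, c): [(r + dr, c + dc) for dr, dc in steps if (r + dr, c + dc) in cell_set]
        for (r, c) in sorted(cell_set)
    }


def mean_aggregate(adjacency, features):
    """One round of message passing: each node takes the mean of its
    neighbors' features (an unlearned stand-in for a GNN layer)."""
    out = {}
    for node, neighbors in adjacency.items():
        if neighbors:
            out[node] = sum(features[n] for n in neighbors) / len(neighbors)
        else:
            out[node] = 0.0
    return out


# Three traversable cells in an L shape; a robot occupies cell (0, 0).
adjacency = floor_graph([(0, 0), (0, 1), (1, 1)])
occupancy = {(0, 0): 1.0, (0, 1): 0.0, (1, 1): 0.0}
smoothed = mean_aggregate(adjacency, occupancy)
print(smoothed)
```

After one round, only the cell adjacent to the robot carries a nonzero signal, which is the sense in which such a model encodes locality: information spreads one hop per layer.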

What carries the argument

The argument rests on the inductive biases embodied in the four architectures, particularly the asynchronous robot state updates and the localized structure of robot interactions that the robot-centric and graph-floor models share.
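To make "asynchronous robot state updates" concrete, here is a minimal Python sketch under stated assumptions. It supposes that each robot emits its own time-stamped state events and that a model consumes them as one time-ordered stream, rather than resampling every robot onto a shared clock. The `Event` schema, the `merge_async` helper, and the sample data are hypothetical and do not come from the paper.

```python
import heapq
from typing import Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    """One state update from one robot (hypothetical schema)."""
    t: float        # timestamp in seconds
    robot_id: int
    cell: tuple     # (row, col) grid cell the robot occupies


def merge_async(streams: Iterable[Iterable[Event]]) -> Iterator[Event]:
    """Merge per-robot event streams, each already sorted by time,
    into a single stream ordered by timestamp."""
    return heapq.merge(*streams, key=lambda e: e.t)


# Two robots that report at unrelated times.
robot_a = [Event(0.0, 1, (0, 0)), Event(2.5, 1, (0, 1))]
robot_b = [Event(1.0, 2, (3, 3)), Event(1.8, 2, (3, 2)), Event(4.0, 2, (3, 1))]

merged = list(merge_async([robot_a, robot_b]))
order = [(e.t, e.robot_id) for e in merged]
print(order)
```

The merged stream keeps every update at its true timestamp, so no robot's state has to be interpolated or held stale to fit a global tick.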

Load-bearing premise

Improvements in accuracy on historical prediction tasks will lead to better outcomes in live, real-time robot coordination and planning without needing additional fine-tuning or safety constraints.

What would settle it

A direct test would be to integrate one of the promising models into a warehouse simulator or live system and measure changes in overall fleet efficiency, such as average task completion time or number of near-misses, compared to traditional planning methods.
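The comparison described above could be scored along these lines. This is a sketch only: the trajectories, completion times, the Chebyshev-distance definition of a near-miss, and the names `near_misses` and `fleet_summary` are assumptions for illustration, not metrics defined in the paper.

```python
from itertools import combinations
from statistics import mean


def near_misses(trajectories, min_gap=1):
    """Count (timestep, robot pair) cases where two robots are within
    `min_gap` cells of each other, using Chebyshev distance."""
    count = 0
    steps = min(len(path) for path in trajectories.values())
    for t in range(steps):
        for a, b in combinations(sorted(trajectories), 2):
            (r1, c1), (r2, c2) = trajectories[a][t], trajectories[b][t]
            if max(abs(r1 - r2), abs(c1 - c2)) <= min_gap:
                count += 1
    return count


def fleet_summary(trajectories, completion_times):
    """Summarize one run with the two metrics named in the text."""
    return {
        "mean_completion_s": mean(completion_times),
        "near_misses": near_misses(trajectories),
    }


# Made-up runs: a baseline planner and a hypothetical model-assisted planner.
baseline_summary = fleet_summary(
    {1: [(0, 0), (0, 1), (0, 2)], 2: [(0, 3), (0, 2), (1, 2)]},
    [12.0, 15.0],
)
candidate_summary = fleet_summary(
    {1: [(0, 0), (0, 1), (0, 2)], 2: [(2, 3), (2, 2), (2, 1)]},
    [11.0, 13.0],
)
print(baseline_summary, candidate_summary)
```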

Original abstract

We introduce DeepFleet, a suite of foundation models designed to support coordination and planning for large-scale mobile robot fleets. These models are trained on fleet movement data, including robot positions, goals, and interactions, from hundreds of thousands of robots in Amazon warehouses worldwide. DeepFleet consists of four architectures that each embody a distinct inductive bias and collectively explore key points in the design space for multi-agent foundation models: the robot-centric (RC) model is an autoregressive decision transformer operating on neighborhoods of individual robots; the robot-floor (RF) model uses a transformer with cross-attention between robots and the warehouse floor; the image-floor (IF) model applies convolutional encoding to a multi-channel image representation of the full fleet; and the graph-floor (GF) model combines temporal attention with graph neural networks for spatial relationships. In this paper, we describe these models and present our evaluation of the impact of these design choices on prediction task performance. We find that the robot-centric and graph-floor models, which both use asynchronous robot state updates and incorporate the localized structure of robot interactions, show the most promise. We also present experiments that show that these two models can make effective use of larger warehouse operation datasets as the models are scaled up.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

DeepFleet compares four inductive biases for multi-agent robot models on real Amazon warehouse data and finds the robot-centric and graph-floor versions scale best at prediction, but the work stays at forecasting without testing planning or coordination outcomes.

Hi, the main point is that this paper trains four foundation model architectures on movement data from hundreds of thousands of real warehouse robots and reports that the robot-centric and graph-floor versions handle scaling better than the cross-attention or image-based ones on prediction tasks. The structured comparison of those inductive biases on actual fleet positions, goals, and interactions is the clearest new element. Using large-scale operational data instead of simulations gives the results more weight than typical small-scale multi-agent studies, and the scaling experiments for the top two models are a straightforward positive finding. The architectures themselves are described clearly enough to see how each encodes locality and asynchrony differently.

The evaluation stays limited to prediction metrics like state forecasting and interaction modeling. No closed-loop rollouts, MPC comparisons, or operational numbers such as throughput or collision rates appear, so the claim that these models show promise for coordination rests on the untested step that better next-state prediction will improve live fleet decisions. That gap is real and worth flagging, though the paper does not overclaim the downstream results.

Readers working on multi-robot systems or foundation models for physical agents will get the most from the architecture trade-offs and the real-data scaling curves. It is solid enough empirical work on a practical problem to go to referees rather than get desk-rejected, even if revisions will need to address how prediction gains connect to planning. I would send it out for review.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces DeepFleet, a suite of four multi-agent foundation models (robot-centric RC, robot-floor RF, image-floor IF, and graph-floor GF) for coordination and planning in large-scale mobile robot fleets. The models are trained on fleet movement data including positions, goals, and interactions from hundreds of thousands of robots across Amazon warehouses. Each architecture embodies a distinct inductive bias: RC uses an autoregressive decision transformer on robot neighborhoods with asynchronous updates; RF employs transformer cross-attention between robots and the floor; IF applies convolutional encoding to multi-channel fleet images; and GF combines temporal attention with graph neural networks for spatial relationships. The paper evaluates the impact of these design choices on prediction task performance and concludes that the RC and GF models, which incorporate asynchronous state updates and localized interaction structures, show the most promise and scale effectively with larger warehouse datasets.
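For readers unfamiliar with the image-floor idea, the Python sketch below shows what a "multi-channel image representation of the full fleet" might look like in its simplest form. The two channels chosen here (robot occupancy and goal cells), the nested-list layout, and the function name `fleet_to_image` are assumptions for illustration; the summary does not list the channels the paper actually uses.

```python
def fleet_to_image(height, width, robots):
    """Rasterize a fleet snapshot into a 2-channel grid.

    Channel 0 marks cells occupied by a robot; channel 1 marks goal cells.
    `robots` is a list of ((row, col), (goal_row, goal_col)) pairs.
    """
    image = [[[0] * width for _ in range(height)] for _ in range(2)]
    for (row, col), (goal_row, goal_col) in robots:
        image[0][row][col] = 1
        image[1][goal_row][goal_col] = 1
    return image


# Two robots on a 3x3 floor (made-up data).
robots = [((0, 0), (2, 2)), ((1, 2), (0, 1))]
image = fleet_to_image(3, 3, robots)
for channel in image:
    print(channel)
```

A convolutional encoder would then slide over this grid. Note that the grid grows with floor area rather than with the number of robots, which is one trade-off between this representation and the robot-centric one.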

Significance. If the superior prediction performance of the RC and GF models translates to improved real-time planning and coordination, the work could meaningfully advance scalable multi-agent foundation models for robotics and warehouse logistics by leveraging large-scale real-world data. The systematic exploration of inductive biases provides useful design insights. However, the current focus on isolated prediction metrics without downstream operational validation limits the strength of claims regarding practical coordination benefits.

major comments (1)

[Evaluation section (as described in abstract and main text)] The central claim that the RC and GF models 'show the most promise' for coordination and planning rests on their outperformance in prediction tasks (state forecasting and interaction modeling) and scaling behavior with larger datasets. The evaluation section reports only these prediction metrics under the four inductive biases and does not include any closed-loop planning experiments, MPC-style rollouts, or operational metrics such as throughput, collision rate, or makespan. This gap means the results do not directly test whether better next-state prediction yields improved real-time fleet decisions, which is load-bearing for the paper's motivation and conclusions.
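A toy Python example may help show why next-state accuracy and multi-step behavior can diverge. It is not drawn from the paper: the one-dimensional trajectory, the "always move forward" model, and the function names are assumptions. It also covers only autoregressive rollout, which is still weaker than the closed-loop planning evaluation the referee asks for.

```python
def one_step_accuracy(model, trajectory):
    """Teacher-forced score: predict state t+1 from the TRUE state t."""
    hits = sum(model(s) == nxt for s, nxt in zip(trajectory, trajectory[1:]))
    return hits / (len(trajectory) - 1)


def rollout_accuracy(model, trajectory):
    """Autoregressive score: feed the model its own prediction back in."""
    state, hits = trajectory[0], 0
    for target in trajectory[1:]:
        state = model(state)
        hits += state == target
    return hits / (len(trajectory) - 1)


# True path along an aisle: the robot waits for one step at cell 2.
true_path = [0, 1, 2, 2, 3, 4]


def always_forward(cell):
    """Toy model that never predicts a wait."""
    return cell + 1


step_score = one_step_accuracy(always_forward, true_path)
rollout_score = rollout_accuracy(always_forward, true_path)
print(step_score, rollout_score)
```

The single missed wait costs one step under teacher forcing but shifts every later prediction in the rollout, so the same model scores 0.8 and 0.4 on the two measures.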

minor comments (1)

[Abstract] The abstract states that comparative evaluation and scaling results are presented but supplies no quantitative metrics, error bars, dataset sizes, or ablation details, making it harder for readers to immediately gauge the magnitude of the reported improvements.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review of our manuscript. We address the major comment below, providing an honest assessment of the evaluation scope while defending the paper's contributions on their own terms.

Referee: [Evaluation section (as described in abstract and main text)] The central claim that the RC and GF models 'show the most promise' for coordination and planning rests on their outperformance in prediction tasks (state forecasting and interaction modeling) and scaling behavior with larger datasets. The evaluation section reports only these prediction metrics under the four inductive biases and does not include any closed-loop planning experiments, MPC-style rollouts, or operational metrics such as throughput, collision rate, or makespan. This gap means the results do not directly test whether better next-state prediction yields improved real-time fleet decisions, which is load-bearing for the paper's motivation and conclusions.

Authors: We thank the referee for this observation. The manuscript explicitly frames its contribution as an exploration of inductive biases in multi-agent foundation models, with evaluation focused on prediction tasks (state forecasting and interaction modeling) because these directly assess how well each architecture captures fleet dynamics from hundreds of thousands of real-world robot trajectories. The RC and GF models' superior performance and scaling behavior on these tasks provide evidence that architectures incorporating asynchronous updates and localized interaction structures are the most promising starting points for models intended to support coordination and planning. We do not claim or demonstrate direct improvements in closed-loop planning, MPC rollouts, throughput, collision rates, or makespan, as such operational validation would require coupling the models to specific planners and simulators, an integration that lies beyond the current scope of comparing design choices via prediction metrics. We agree that this represents a limitation for stronger claims about real-time fleet decisions. In revision we will add a dedicated paragraph in the discussion and conclusion sections that explicitly acknowledges this scope limitation, clarifies that prediction performance is presented as a necessary (but not sufficient) indicator of promise for downstream coordination, and outlines future work on closed-loop evaluation.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical model evaluation

The paper introduces four model architectures embodying distinct inductive biases and evaluates them empirically on prediction tasks using external warehouse fleet movement data from hundreds of thousands of robots. No mathematical derivation chain, first-principles results, or equations are presented that reduce to fitted parameters or self-referential inputs by construction. Claims rest on comparative performance metrics from independent datasets rather than any self-definitional, fitted-input-as-prediction, or self-citation load-bearing steps. The work is self-contained against external benchmarks with no circularity signals.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the representativeness of Amazon warehouse movement data and the assumption that the chosen architectural inductive biases meaningfully capture multi-agent coordination structure.

axioms (1)

domain assumption: Warehouse robot movement data from hundreds of thousands of units captures the relevant interaction patterns needed to train and evaluate coordination models.
All training and scaling experiments depend on this external dataset being representative of the target deployment setting.

List of Contributors

Please address correspondence to deepfleet@amazon.com.

Scaling experiments and writing lead: Ameya Agaskar, Sriram Siva

DeepFleet model design and development: William Pickering (Robot-Centric model), Kyle O’Brien (Robot-Floor Cross Attention model), Charles Kekeh (Image-Based Floor-Centric model), Sriram Siva (Graph-Based Floor-Centric mod...

[1] [1]

Language Models Are Few-Shot Learners

Tom Brown, Benjamin Mann, Nick Ryder, et al. Language Models Are Few-Shot Learners. In Advances in Neural Information Processing Systems, 2020

work page 2020

[2] [2]

Park, Wei Han, et al

Yu Zhang, Daniel S. Park, Wei Han, et al. BigSSL: Exploring the Frontier of Large-Scale Semi- Supervised Learning for Automatic Speech Recognition. IEEE Journal of Selected Topics in Signal Processing, 2022

work page 2022

[3] [3]

Learning Transferable Visual Models From Natural Language Supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, et al. Learning Transferable Visual Models From Natural Language Supervision. In Proceedings of the International Conference on Machine Learning (ICML), July 2021

work page 2021

[4] [4]

OpenAI o1 System Card

OpenAI. OpenAI o1 System Card. arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Scaling laws of motion forecasting and planning–a technical report,

Mustafa Baniodeh, Kratarth Goel, Scott Ettinger, et al. Scaling Laws of Motion Forecasting and Planning – A Technical Report. arXiv:2506.08228, 2025

work page arXiv 2025

[6] [6]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, et al. π0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[7] [7]

Open X-Embodiment: Robotic Learning Datasets and RT-X Models

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, et al. Open X-Embodiment: Robotic Learning Datasets and RT-X Models. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2024

work page 2024

[8] [8]

Octo: An Open-Source Generalist Robot Policy

Dibya Ghosh, Homer Walke, Karl Pertsch, et al. Octo: An Open-Source Generalist Robot Policy. In Proceedings of Robotics Science and Systems, July 2024

work page 2024

[9] [9]

Social LSTM: Human Trajectory Prediction in Crowded Spaces

Alexandre Alahi, Kratarth Goel, Vignesh Ramanathan, Alexandre Robicquet, Fei-fei Li, and Silvio Savarese. Social LSTM: Human Trajectory Prediction in Crowded Spaces. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016

work page 2016

[10] [10]

Social GAN: Socially Acceptable Trajecto- ries with Generative Adversarial Networks

Agrim Gupta, Justin Johnson, Fei-fei Li, et al. Social GAN: Socially Acceptable Trajecto- ries with Generative Adversarial Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2018. 21 DEEP FLEET : Multi-Agent Foundation Models for Mobile Robots

work page 2018

[11] [11]

Transformer Networks for Trajectory Forecasting

Francesco Giuliari, Irtiza Hasan, Marco Cristani, and Fabio Galasso. Transformer Networks for Trajectory Forecasting. In Proceedings of the International Conference on Pattern Recog- nition (ICPR), 2021

work page 2021

[12] [12]

Trajectron++: Dynamically-Feasible Trajectory Forecasting with Heterogeneous Data

Tim Salzmann, Boris Ivanovic, Punarjay Chakravarty, and Marco Pavone. Trajectron++: Dynamically-Feasible Trajectory Forecasting with Heterogeneous Data. InProceedings of the European Conference on Computer Vision (ECCV), 2020

work page 2020

[13] [13]

From goals, way- points & paths to long term human trajectory forecasting

Karttikeya Mangalam, Yang An, Harshayu Girase, and Jitendra Malik. From goals, way- points & paths to long term human trajectory forecasting. In Proceedings of the IEEE/CVF International Conference on Computer Vision (CVF), 2021

work page 2021

[14] [14]

Collab- orative uncertainty in multi-agent trajectory forecasting

Bohan Tang, Yiqi Zhong, Ulrich Neumann, Gang Wang, Siheng Chen, and Ya Zhang. Collab- orative uncertainty in multi-agent trajectory forecasting. In Advances in Neural Information Processing Systems (NeurIPS), 2021

work page 2021

[15] Tim Salzmann, Boris Ivanovic, Punarjay Chakravarty, and Marco Pavone. Trajectron++: Dynamically-feasible trajectory forecasting with heterogeneous data. In Proceedings of the European Conference on Computer Vision (ECCV), 2020.

[16] Mingyi Wang, Hongqun Zou, Yifan Liu, You Wang, and Guang Li. A Joint Prediction Method of Multi-Agent to Reduce Collision Rate. arXiv:2411.07612, 2024.

[17] Eli Boyarski, Ariel Felner, Roni Stern, Guni Sharon, Oded Betzalel, David Tolpin, and Eyal Shimony. ICBS: The improved conflict-based search algorithm for multi-agent pathfinding. In Proceedings of the International Symposium on Combinatorial Search, 2015.

[18] Roni Stern, Nathan Sturtevant, Ariel Felner, Sven Koenig, Hang Ma, Thayne Walker, Jiaoyang Li, Dor Atzmon, Liron Cohen, TK Kumar, et al. Multi-agent pathfinding: Definitions, variants, and benchmarks. In Proceedings of the International Symposium on Combinatorial Search, 2019.

[19] Nicholas Fung, John Rogers, Carlos Nieto, Henrik I. Christensen, Stephanie Kemna, and Gaurav Sukhatme. Coordinating multi-robot systems through environment partitioning for adaptive informative sampling. In Proceedings of the International Conference on Robotics and Automation (ICRA), 2019.

[20] Shushman Choudhury, Jayesh K Gupta, Mykel J Kochenderfer, Dorsa Sadigh, and Jeannette Bohg. Dynamic multi-robot task allocation under uncertainty and temporal constraints. Autonomous Robots, 2022.

[21] He Jiang, Yutong Wang, Rishi Veerapaneni, Tanishq Duhan, Guillaume Sartoretti, and Jiaoyang Li. Deploying Ten Thousand Robots: Scalable Imitation Learning for Lifelong Multi-Agent Path Finding. arXiv:2410.21415, 2024.

[22] Anton Andreychuk, Konstantin Yakovlev, Aleksandr Panov, and Alexey Skrynnik. MAPF-GPT: Imitation learning for multi-agent pathfinding at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025.

[23] Xiaoli Liu, Jianqin Yin, Jin Liu, Pengxiang Ding, Jun Liu, and Huaping Liu. TrajectoryCNN: A New Spatio-Temporal Feature Learning Network for Human Motion Prediction. IEEE Transactions on Circuits and Systems for Video Technology, 2020.

[24] Jiquan Ngiam, Benjamin Caine, Vijay Vasudevan, Zhengdong Zhang, Hao-Tien Lewis Chiang, Jeffrey Ling, Rebecca Roelofs, Alex Bewley, Chenxi Liu, Ashish Venugopal, et al. Scene transformer: A unified multi-task model for behavior prediction and planning. arXiv:2106.08417, 2021.

[25] Khac-Hoai Nam Bui, Jiho Cho, and Hongsuk Yi. Spatial-temporal graph neural network for traffic forecasting: An overview and open research issues. Applied Intelligence, 2022.

[26] Albert Gu and Tri Dao. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752, 2023.

[27] Raunaq Bhirangi, Chenyu Wang, Venkatesh Pattabiraman, et al. Hierarchical state space models for continuous sequence-to-sequence modeling. In Proceedings of the International Conference on Machine Learning (ICML), 2024.

[28] Casper Dik, Christos Emmanouilidis, and Bertrand Duqueroie. Graph Network-Based Human Movement Prediction for Socially-Aware Robot Navigation in Shared Workspaces. Neural Computing and Applications, 2024.

[29] Remi Lam, Alvaro Sanchez-Gonzalez, Matthew Willson, Peter Wirnsberger, Meire Fortunato, Ferran Alet, Suman Ravuri, Timo Ewalds, Zach Eaton-Rosen, Weihua Hu, et al. Learning skillful medium-range global weather forecasting. Science, 2023.

[30] Ilan Price, Alvaro Sanchez-Gonzalez, Ferran Alet, et al. GenCast: Diffusion-based ensemble forecasting for medium-range weather. arXiv:2312.15796, 2023.

[31] Jared Kaplan, Sam McCandlish, Tom Henighan, et al. Scaling Laws for Neural Language Models. arXiv:2001.08361, 2020.

[32] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, et al. Training Compute-Optimal Large Language Models. arXiv:2203.15556, 2022.

[33] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. PaLM: Scaling Language Modeling with Pathways. arXiv:2204.02311, 2022.

[34] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 Herd of Models. arXiv:2407.21783, 2024.

[35] Anthony Brohan, Noah Brown, Justice Carbajal, et al. RT-1: Robotics transformer for real-world control at scale. arXiv:2212.06817, 2022.

[36] Brianna Zitkovich, Tianhe Yu, Sichun Xu, et al. RT-2: Vision-language-action models transfer web knowledge to robotic control. In Proceedings of the Conference on Robot Learning (CoRL), 2023.

[37] Yunfan Jiang, Agrim Gupta, Zichen Zhang, et al. VIMA: General robot manipulation with multimodal prompts. arXiv:2210.03094, 2022.

[38] Mohit Shridhar, Lucas Manuelli, and Dieter Fox. Perceiver-Actor: A multi-task transformer for robotic manipulation. In Proceedings of the Conference on Robot Learning (CoRL), 2023.

[39] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, et al. Flamingo: a Visual Language Model for Few-Shot Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

[40] Figure. Helix: A Vision-Language-Action Model for Generalist Humanoid Control, February 2025.

[41] Scott Reed, Konrad Zolna, Emilio Parisotto, et al. A Generalist Agent. Transactions on Machine Learning Research, 2022.

[42] Meinard Müller. Dynamic Time Warping. Information Retrieval for Music and Motion, pages 69–84, 2007.

[43] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision Transformer: Reinforcement Learning via Sequence Modeling. Advances in Neural Information Processing Systems (NeurIPS), 2021.

[44] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-Attention with Relative Position Representations. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2018.

[45] Ofir Press, Noah A. Smith, and Mike Lewis. Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. In Proceedings of the International Conference on Learning Representations (ICLR), 2022.

[46] Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, and Joao Carreira. Perceiver: General Perception with Iterative Attention. In Proceedings of the International Conference on Machine Learning (ICML), 2021.

[47] Zhen Xing, Qijun Feng, Haoran Chen, Qi Dai, Han Hu, Hang Xu, Zuxuan Wu, and Yu-Gang Jiang. A survey on video diffusion models. ACM Computing Surveys, 2024.

[48] Pengyuan Zhou, Lin Wang, Zhi Liu, Yanbin Hao, Pan Hui, Sasu Tarkoma, and Jussi Kangasharju. A survey on generative AI and LLM for video generation, understanding, and streaming. arXiv:2404.16038, 2024.

List of Contributors

Please address correspondence to deepfleet@amazon.com.

Scaling experiments and writing lead
Ameya Agaskar, Sriram Siva

DeepFleet model design and development
William Pickering (Robot-Centric model), Kyle O’Brien (Robot-Floor Cross Attention model), Charles Kekeh (Image-Based Floor-Centric model), Sriram Siva (Graph-Based Floor-Centric mod...