Foundation Models in Robotics: A Comprehensive Review of Methods, Models, Datasets, Challenges and Future Research Directions
Pith reviewed 2026-05-10 11:07 UTC · model grok-4.3
The pith
Foundation models are shifting robotics from single-task systems to adaptable agents for open-world environments.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper establishes that the emergence of foundation models is driving a paradigm shift in robotics, from fixed, single-task, domain-specific solutions towards adaptive, multi-function, general-purpose agents capable of operating in complex, open-world, and dynamic environments. It supports this by delineating five research phases; building a granular taxonomy across model types (LLMs, VFMs, VLMs, and VLAs), architectures, learning paradigms, stages of knowledge incorporation, robotic tasks, and application domains; reporting on datasets; and discussing open challenges.
What carries the argument
Foundation Models, defined as large-scale neural-network architectures trained on massive, heterogeneous datasets that provide multi-modal understanding, reasoning, long-horizon planning, and cross-embodiment generalization.
If this is right
- A single foundation model can support multiple robotic tasks across different physical embodiments and environments.
- Cross-modal and multi-sensory capabilities allow robots to operate effectively in unpredictable real-world conditions.
- Public datasets for training and evaluation become essential benchmarks for progress on robotic applications.
- Practical deployment requires solving challenges in safety, efficiency, and generalization beyond lab settings.
Where Pith is reading between the lines
- The same phase-based and taxonomic approach could be used to organize foundation-model research in related areas such as autonomous vehicles or manipulation systems.
- If the shift holds, fewer task-specific engineering efforts will be needed when adapting robots to new jobs or hardware.
- Hybrid combinations of foundation models with classical control techniques may address current gaps in real-time reliability and physical safety.
Load-bearing premise
The literature on foundation models in robotics can be accurately divided into five distinct phases and the chosen taxonomic categories cover all relevant work without major omissions or selection bias.
What would settle it
Discovery of a large body of robotics work using foundation models that does not fit the five-phase timeline or the listed categories of models, architectures, tasks, and domains would show the review's structure is incomplete.
Original abstract
Over the recent years, the field of robotics has been undergoing a transformative paradigm shift from fixed, single-task, domain-specific solutions towards adaptive, multi-function, general-purpose agents, capable of operating in complex, open-world, and dynamic environments. This tremendous advancement is primarily driven by the emergence of Foundation Models (FMs), i.e., large-scale neural-network architectures trained on massive, heterogeneous datasets that provide unprecedented capabilities in multi-modal understanding and reasoning, long-horizon planning, and cross-embodiment generalization. In this context, the current study provides a holistic, systematic, and in-depth review of the research landscape of FMs in robotics. In particular, the evolution of the field is initially delineated through five distinct research phases, spanning from the early incorporation of Natural Language Processing (NLP) and Computer Vision (CV) models to the current frontier of multi-sensory generalization and real-world deployment. Subsequently, a highly-granular taxonomic investigation of the literature is performed, examining the following key aspects: a) the employed FM types, including LLMs, VFMs, VLMs, and VLAs, b) the underlying neural-network architectures, c) the adopted learning paradigms, d) the different learning stages of knowledge incorporation, e) the major robotic tasks, and f) the main real-world application domains. For each aspect, comparative analysis and critical insights are provided. Moreover, a report on the publicly available datasets used for model training and evaluation across the considered robotic tasks is included. Furthermore, a hierarchical discussion on the current open challenges and promising future research directions in the field is incorporated.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that foundation models (FMs) are driving a paradigm shift in robotics from fixed, single-task, domain-specific solutions to adaptive, multi-function, general-purpose agents capable of open-world operation. It divides the field's evolution into five distinct research phases (from early NLP/CV model incorporation to multi-sensory generalization and real-world deployment), then provides a granular taxonomy covering FM types (LLMs, VFMs, VLMs, VLAs), neural architectures, learning paradigms, knowledge incorporation stages, robotic tasks, and application domains. It also reports on public datasets and gives a hierarchical discussion of challenges and future directions.
Significance. If the literature mapping is comprehensive and free of selection bias, the review would be significant as a structured synthesis that highlights how FMs enable multi-modal reasoning, long-horizon planning, and cross-embodiment generalization, while cataloging datasets and open problems to guide future robotics research.
major comments (1)
- [Abstract and Introduction] The repeated claim of a 'holistic, systematic' review is not supported by any description of the literature search protocol (databases queried, search terms, date range, inclusion/exclusion criteria, or number of papers screened). This directly affects the defensibility of the five-phase taxonomy and the FM-type/architecture/task/domain categories, which are load-bearing for the central narrative of a paradigm shift; without this information, it is impossible to rule out post-hoc curation or omission of relevant counter-examples.
minor comments (2)
- [Datasets section] When reporting publicly available datasets, include explicit details on scale (number of trajectories or samples), sensor modalities, and licensing/access links to improve utility for readers seeking to reproduce or extend the surveyed work.
- [Future directions] The hierarchical challenges discussion would benefit from clearer linkage back to specific gaps identified in the taxonomy (e.g., which FM types lack coverage in certain domains) rather than remaining at a high level.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. The comment raises a valid point regarding transparency in our review methodology, which we address directly below. We believe incorporating this information will strengthen the paper without altering its core contributions.
- Referee: The repeated claim of a 'holistic, systematic' review is not supported by any description of the literature search protocol (databases queried, search terms, date range, inclusion/exclusion criteria, or number of papers screened). This directly affects the defensibility of the five-phase taxonomy and the FM-type/architecture/task/domain categories, which are load-bearing for the central narrative of a paradigm shift; without this information, it is impossible to rule out post-hoc curation or omission of relevant counter-examples.
Authors: We agree that explicitly documenting the literature search protocol would improve the defensibility of the taxonomy and overall narrative. Although the review was conducted through extensive manual curation of recent literature on arXiv, Google Scholar, and major robotics conferences (ICRA, IROS, CoRL, RSS) using terms such as 'foundation models robotics', 'LLM robotics', 'VLM robotics', and 'vision-language-action models', with a focus on works from 2018 onward, this process was not detailed in the original submission. In the revised manuscript, we will add a new subsection (e.g., 'Literature Search Methodology') in the Introduction that specifies the databases, search terms, date range (2018–2024), inclusion criteria (peer-reviewed papers and high-impact preprints directly addressing FMs in robotics tasks), exclusion criteria (purely theoretical NLP/CV works without robotic application), and approximate numbers of papers screened (~450) and included (~180). This addition will support the five-phase evolution and categorical taxonomy while preserving the manuscript's structure and claims.
revision: yes
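The screening protocol described in the rebuttal can be made concrete as a small filter. The following is a minimal sketch, not the authors' actual procedure: the search terms and date range come from the rebuttal, while the `Paper` record, its boolean fields, the `include` and `screen` functions, and the three candidate entries are hypothetical. In particular, the rebuttal does not say how "high-impact" is judged, so it is modelled here as a hand-assigned flag.

```python
from dataclasses import dataclass

# Search terms and date range as stated in the rebuttal.
SEARCH_TERMS = (
    "foundation models robotics",
    "LLM robotics",
    "VLM robotics",
    "vision-language-action models",
)
YEAR_RANGE = (2018, 2024)


@dataclass(frozen=True)
class Paper:
    """Hypothetical screening record; field names are assumptions."""
    title: str
    year: int
    peer_reviewed: bool
    high_impact_preprint: bool      # assumed to be assigned by hand
    has_robotic_application: bool


def include(paper: Paper) -> bool:
    """Apply the stated inclusion and exclusion criteria to one record."""
    in_range = YEAR_RANGE[0] <= paper.year <= YEAR_RANGE[1]
    eligible_source = paper.peer_reviewed or paper.high_impact_preprint
    # Exclusion: purely theoretical NLP/CV work without robotic application.
    return in_range and eligible_source and paper.has_robotic_application


def screen(papers):
    """Return the included papers and the number excluded."""
    kept = [p for p in papers if include(p)]
    return kept, len(papers) - len(kept)


# Placeholder candidates, not real publications.
candidates = [
    Paper("placeholder A", 2023, True, False, True),
    Paper("placeholder B", 2017, True, False, True),    # before date range
    Paper("placeholder C", 2022, False, True, False),   # no robotic application
]
kept, excluded = screen(candidates)
print(len(kept), excluded)
```

Recording the criteria in this form would let a reader rerun the filter and compare the counts against the approximate figures the authors report.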
Circularity Check
No circularity: pure literature synthesis with external citations only
This is a review paper with no derivations, equations, fitted parameters, or first-principles claims. All content consists of summaries and taxonomies drawn from externally cited prior work. The five-phase delineation and taxonomic categories are presented as organizational tools based on the surveyed literature rather than self-derived results that reduce to the paper's own inputs. No self-citation chains or ansatzes are load-bearing for any central claim, satisfying the criteria for a self-contained external synthesis.