Foundation Models in Robotics: A Comprehensive Review of Methods, Models, Datasets, Challenges and Future Research Directions

Aggelos Psiris; Arash Ajoudani and Georgios Th. Papadopoulos; Efstratios Gavves; Evangelos K. Markakis; Kostas Bekris; Panagiotis Sarigiannidis; Vasileios Argyriou

arxiv: 2604.15395 · v1 · submitted 2026-04-16 · 💻 cs.RO

Pith reviewed 2026-05-10 11:07 UTC · model grok-4.3

classification 💻 cs.RO

keywords: foundation models, robotics, large language models, vision-language models, robotic tasks, datasets, challenges, paradigm shift

The pith

Foundation models are shifting robotics from single-task systems to adaptable agents for open-world environments.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews how foundation models, large neural networks trained on vast heterogeneous data, are enabling robotics to move from fixed, specialized solutions to flexible, multi-function agents that understand multiple inputs, plan over long horizons, and generalize across different robot bodies. It traces the field's development across five phases starting from early use of language and vision models and maps the current work through detailed categories of model types, network designs, learning methods, tasks, and real-world domains. The authors also compile lists of public datasets and lay out open challenges plus suggested research paths. A sympathetic reader cares because this change could produce robots that reliably handle varied activities in homes, factories, or outdoor settings without custom reprogramming for each new job. The review serves as a map to help researchers see what has been done and where gaps remain.

Core claim

The paper establishes that the emergence of foundation models is driving a transformative paradigm shift in robotics from fixed, single-task, domain-specific solutions towards adaptive, multi-function, general-purpose agents capable of operating in complex, open-world, and dynamic environments. This is shown by delineating five research phases, performing a granular taxonomy across model types such as LLMs, VFMs, VLMs, and VLAs, architectures, learning paradigms, stages of knowledge incorporation, robotic tasks, and application domains, plus reporting on datasets and discussing challenges.

What carries the argument

Foundation Models, defined as large-scale neural-network architectures trained on massive, heterogeneous datasets that provide multi-modal understanding, reasoning, long-horizon planning, and cross-embodiment generalization.

If this is right

A single foundation model can support multiple robotic tasks across different physical embodiments and environments.
Cross-modal and multi-sensory capabilities allow robots to operate effectively in unpredictable real-world conditions.
Public datasets for training and evaluation become essential benchmarks for progress on robotic applications.
Practical deployment requires solving challenges in safety, efficiency, and generalization beyond lab settings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same phase-based and taxonomic approach could be used to organize foundation-model research in related areas such as autonomous vehicles or manipulation systems.
If the shift holds, fewer task-specific engineering efforts will be needed when adapting robots to new jobs or hardware.
Hybrid combinations of foundation models with classical control techniques may address current gaps in real-time reliability and physical safety.

Load-bearing premise

The literature on foundation models in robotics can be accurately divided into five distinct phases and the chosen taxonomic categories cover all relevant work without major omissions or selection bias.

What would settle it

Discovery of a large body of robotics work using foundation models that does not fit the five-phase timeline or the listed categories of models, architectures, tasks, and domains would show the review's structure is incomplete.
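
One way to make this test concrete is to tag each surveyed work along the taxonomy's axes and flag anything that fits no listed category. The sketch below is illustrative only: the `Work` record, the `TAXONOMY` dictionary, the helper names, and the sample entries are all hypothetical. The four FM type labels come from the abstract; the task labels are placeholders rather than the paper's full task list.

```python
from dataclasses import dataclass

# FM types as listed in the abstract. The task labels are placeholder
# assumptions, not the paper's complete set of task categories.
TAXONOMY = {
    "fm_type": {"LLM", "VFM", "VLM", "VLA"},
    "task": {"perception", "planning", "control"},
}


@dataclass(frozen=True)
class Work:
    """Hypothetical record for one surveyed paper."""
    title: str
    year: int
    fm_type: str
    task: str


def uncovered(works, taxonomy=TAXONOMY):
    """Return works whose tag on any axis falls outside the taxonomy."""
    return [
        w for w in works
        if any(getattr(w, axis) not in allowed
               for axis, allowed in taxonomy.items())
    ]


def coverage_ratio(works, taxonomy=TAXONOMY):
    """Fraction of works that fit every axis (1.0 for an empty list)."""
    if not works:
        return 1.0
    return 1 - len(uncovered(works, taxonomy)) / len(works)


# Made-up entries, for illustration only.
sample = [
    Work("Example A", 2023, "LLM", "planning"),
    Work("Example B", 2024, "VLA", "control"),
    Work("Example C", 2024, "world-model", "planning"),
]

for w in uncovered(sample):
    print(f"does not fit the taxonomy: {w.title} ({w.fm_type}, {w.task})")
```

A low coverage ratio on an independently collected set of works would be the kind of evidence described above; a high ratio on the authors' own selection would say little, since the categories were derived from that selection.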

Figures

Figures reproduced from arXiv: 2604.15395 by Aggelos Psiris, Arash Ajoudani and Georgios Th. Papadopoulos, Efstratios Gavves, Evangelos K. Markakis, Kostas Bekris, Panagiotis Sarigiannidis, Vasileios Argyriou.

Figure 2: Main phases in robotic FM research and key/milestone works.

Figure 3: Key criteria and main resulting categories of robotic FM methods.

Figure 4: Representative literature methods per FM type: (a) LLMs (SayCan (Brohan et al., 2023b)), which ...

Figure 5: Representative literature methods incorporating different NN architecture types: (a) Transformers ...

Figure 6: Representative literature methods per robotic task: (a) Perception: Open-vocabulary ...

Figure 7: Representative literature methods per application domain: (a) Agentic mobility (GNM (Shah et al., ...

Original abstract

Over the recent years, the field of robotics has been undergoing a transformative paradigm shift from fixed, single-task, domain-specific solutions towards adaptive, multi-function, general-purpose agents, capable of operating in complex, open-world, and dynamic environments. This tremendous advancement is primarily driven by the emergence of Foundation Models (FMs), i.e., large-scale neural-network architectures trained on massive, heterogeneous datasets that provide unprecedented capabilities in multi-modal understanding and reasoning, long-horizon planning, and cross-embodiment generalization. In this context, the current study provides a holistic, systematic, and in-depth review of the research landscape of FMs in robotics. In particular, the evolution of the field is initially delineated through five distinct research phases, spanning from the early incorporation of Natural Language Processing (NLP) and Computer Vision (CV) models to the current frontier of multi-sensory generalization and real-world deployment. Subsequently, a highly-granular taxonomic investigation of the literature is performed, examining the following key aspects: a) the employed FM types, including LLMs, VFMs, VLMs, and VLAs, b) the underlying neural-network architectures, c) the adopted learning paradigms, d) the different learning stages of knowledge incorporation, e) the major robotic tasks, and f) the main real-world application domains. For each aspect, comparative analysis and critical insights are provided. Moreover, a report on the publicly available datasets used for model training and evaluation across the considered robotic tasks is included. Furthermore, a hierarchical discussion on the current open challenges and promising future research directions in the field is incorporated.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This review organizes foundation models in robotics with a five-phase timeline and granular taxonomy but lacks any documented search protocol, leaving the coverage and paradigm-shift claim open to selection bias.

This review paper tries to organize the fast-moving area of foundation models applied to robotics. The central takeaway is that it provides a new five-phase timeline for the field's development and a fine-grained taxonomy, but the lack of a documented search strategy makes it hard to trust that the coverage is complete or unbiased.

What the paper does well is break down the literature into manageable parts. It looks at different foundation model types such as large language models, vision foundation models, vision-language models, and vision-language-action models. It then examines the neural architectures behind them, the learning paradigms used, the stages where knowledge is incorporated, the robotic tasks like planning or control, and the application domains. Adding a report on public datasets is practical, and the section on open challenges and future directions points to real issues like cross-embodiment generalization and safety in dynamic environments.

The soft spots come in the execution of the review itself. The abstract describes the work as holistic and systematic, with distinct phases spanning from early NLP and CV models to multi-sensory generalization. However, there is no information on how the authors searched for papers, what databases they used, what the inclusion criteria were, or how they avoided bias in selecting which works to include. In a survey this broad, that omission means the phases and categories could reflect the authors' curated selection rather than the full state of the field. This directly affects the strength of the argument that foundation models are driving a shift to general-purpose robotic agents, since that claim depends on the reviewed body of work being representative.

For readers, this paper is best suited for newcomers to the embodied AI space who need a map of the literature or for researchers looking for a quick synthesis before diving into specific papers. It won't be the last word on the topic, but it can save time. The thinking behind the taxonomy shows honest engagement with how the field is evolving.

I would recommend sending this to peer review. The topic is important and timely, and with a clearer methods section on literature collection, it could become a helpful reference for the community.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that foundation models (FMs) are driving a paradigm shift in robotics from fixed, single-task, domain-specific solutions to adaptive, multi-function, general-purpose agents capable of open-world operation. It delineates the field's evolution into five distinct research phases (from early NLP/CV model incorporation to multi-sensory generalization and real-world deployment), then provides a granular taxonomy covering FM types (LLMs, VFMs, VLMs, VLAs), neural architectures, learning paradigms, knowledge incorporation stages, robotic tasks, and application domains, along with a report on public datasets and a hierarchical discussion of challenges and future directions.

Significance. If the literature mapping is comprehensive and free of selection bias, the review would be significant as a structured synthesis that highlights how FMs enable multi-modal reasoning, long-horizon planning, and cross-embodiment generalization, while cataloging datasets and open problems to guide future robotics research.

major comments (1)

[Abstract and Introduction] The repeated claim of a 'holistic, systematic' review is not supported by any description of the literature search protocol (databases queried, search terms, date range, inclusion/exclusion criteria, or number of papers screened). This directly affects the defensibility of the five-phase taxonomy and the FM-type/architecture/task/domain categories, which are load-bearing for the central narrative of a paradigm shift; without this information, it is impossible to rule out post-hoc curation or omission of relevant counter-examples.

minor comments (2)

[Datasets section] When reporting publicly available datasets, include explicit details on scale (number of trajectories or samples), sensor modalities, and licensing/access links to improve utility for readers seeking to reproduce or extend the surveyed work.
[Future directions] The hierarchical challenges discussion would benefit from clearer linkage back to specific gaps identified in the taxonomy (e.g., which FM types lack coverage in certain domains) rather than remaining at a high level.
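
The dataset details requested in the first minor comment could be captured in a small per-dataset record. The following is a minimal sketch, assuming hypothetical field names (`num_trajectories`, `modalities`, `license`, `url`) and a made-up entry; it is not the paper's schema and describes no real dataset.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DatasetCard:
    """Hypothetical record holding the details the referee asks for."""
    name: str
    num_trajectories: Optional[int] = None      # scale
    modalities: List[str] = field(default_factory=list)  # sensor modalities
    license: Optional[str] = None               # licensing terms
    url: Optional[str] = None                   # access link

    def missing_fields(self) -> List[str]:
        """List the requested details that are still unfilled."""
        missing = []
        if self.num_trajectories is None:
            missing.append("num_trajectories")
        if not self.modalities:
            missing.append("modalities")
        if self.license is None:
            missing.append("license")
        if self.url is None:
            missing.append("url")
        return missing


# Made-up entry, for illustration only.
card = DatasetCard("demo-dataset", num_trajectories=1000, modalities=["rgb"])
print(card.missing_fields())
```

Reporting which fields are missing, rather than silently leaving them blank, makes gaps in the dataset table visible to readers who want to reproduce the surveyed work.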

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The comment raises a valid point regarding transparency in our review methodology, which we address directly below. We believe incorporating this information will strengthen the paper without altering its core contributions.

Referee: The repeated claim of a 'holistic, systematic' review is not supported by any description of the literature search protocol (databases queried, search terms, date range, inclusion/exclusion criteria, or number of papers screened). This directly affects the defensibility of the five-phase taxonomy and the FM-type/architecture/task/domain categories, which are load-bearing for the central narrative of a paradigm shift; without this information, it is impossible to rule out post-hoc curation or omission of relevant counter-examples.

Authors: We agree that explicitly documenting the literature search protocol would improve the defensibility of the taxonomy and overall narrative. Although the review was conducted through extensive manual curation of recent literature on arXiv, Google Scholar, and major robotics conferences (ICRA, IROS, CoRL, RSS) using terms such as 'foundation models robotics', 'LLM robotics', 'VLM robotics', and 'vision-language-action models', with a focus on works from 2018 onward, this process was not detailed in the original submission. In the revised manuscript, we will add a new subsection (e.g., 'Literature Search Methodology') in the Introduction that specifies the databases, search terms, date range (2018–2024), inclusion criteria (peer-reviewed papers and high-impact preprints directly addressing FMs in robotics tasks), exclusion criteria (purely theoretical NLP/CV works without robotic application), and approximate numbers of papers screened (~450) and included (~180). This addition will support the five-phase evolution and categorical taxonomy while preserving the manuscript's structure and claims.

revision: yes
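
The protocol described in the simulated rebuttal (a date range, inclusion criteria, exclusion criteria, and screened versus included counts) can be written down as an explicit, auditable filter. The sketch below is an illustration under stated assumptions: the `Candidate` fields, the `screen` function, and the sample records are hypothetical; only the 2018 to 2024 range and the wording of the criteria come from the rebuttal text.

```python
from dataclasses import dataclass

DATE_RANGE = (2018, 2024)  # date range stated in the rebuttal


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one paper returned by the search."""
    title: str
    year: int
    peer_reviewed_or_high_impact: bool  # inclusion criterion
    has_robotic_application: bool       # False triggers the exclusion criterion


def screen(candidates, date_range=DATE_RANGE):
    """Split candidates into included titles and (title, reason) exclusions."""
    first_year, last_year = date_range
    included, excluded = [], []
    for c in candidates:
        if not first_year <= c.year <= last_year:
            excluded.append((c.title, "outside date range"))
        elif not c.has_robotic_application:
            excluded.append((c.title, "no robotic application"))
        elif not c.peer_reviewed_or_high_impact:
            excluded.append((c.title, "not peer-reviewed or high-impact preprint"))
        else:
            included.append(c.title)
    return included, excluded


# Made-up records, for illustration only.
pool = [
    Candidate("old", 2015, True, True),
    Candidate("nlp-only", 2022, True, False),
    Candidate("ok", 2023, True, True),
]
included, excluded = screen(pool)
print(f"screened {len(pool)}, included {len(included)}")
for title, reason in excluded:
    print(f"excluded {title}: {reason}")
```

Recording a reason for every exclusion is what lets a reader check the screened and included totals against the criteria, which is the transparency the referee's major comment asks for.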

Circularity Check

0 steps flagged

No circularity: pure literature synthesis with external citations only

This is a review paper with no derivations, equations, fitted parameters, or first-principles claims. All content consists of summaries and taxonomies drawn from externally cited prior work. The five-phase delineation and taxonomic categories are presented as organizational tools based on the surveyed literature rather than self-derived results that reduce to the paper's own inputs. No self-citation chains or ansatzes are load-bearing for any central claim, satisfying the criteria for a self-contained external synthesis.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a review paper that organizes existing research rather than introducing new mathematical structures. No free parameters, domain-specific axioms, or invented entities are added by the authors.

pith-pipeline@v0.9.0 · 5641 in / 1240 out tokens · 58101 ms · 2026-05-10T11:07:18.232231+00:00 · methodology
