A Survey on Vision-Language-Action Models for Embodied AI
Pith reviewed 2026-05-24 01:23 UTC · model grok-4.3
The pith
The first survey on vision-language-action models organizes them into three research lines for embodied AI.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper presents the first survey on VLAs for embodied AI and supplies a taxonomy that divides the field into three major lines: research on individual components of VLAs, development of VLA-based control policies that predict low-level actions, and high-level task planners that break long-horizon tasks into subtasks to follow general user instructions. It further summarizes relevant datasets, simulators, and benchmarks and discusses challenges and future directions.
What carries the argument
A three-line taxonomy of VLAs that separates work on individual components, low-level control policies, and high-level task planners.
If this is right
- The taxonomy lets new VLA papers be placed relative to existing work.
- The listed datasets and simulators give concrete starting points for training and testing VLAs.
- The identified challenges indicate concrete problems that next VLA designs should address.
- High-level planners can guide low-level policies on longer tasks, suggesting a path to more general instruction following.
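The last point, a high-level planner guiding a low-level policy, can be pictured as a two-level loop. The sketch below is a minimal illustration under stated assumptions, not a method taken from the survey: `plan_subtasks`, `low_level_policy`, `run_hierarchy`, and the string-valued observations and actions are all hypothetical placeholders. In practice the planner would be an LLM or VLM and the policy a VLA that outputs motor commands.

```python
from typing import Callable, List

# Hypothetical type aliases: planner maps an instruction to subtasks,
# policy maps (observation, subtask) to an action. Strings stand in for
# images and motor commands.
Planner = Callable[[str], List[str]]
Policy = Callable[[str, str], str]


def plan_subtasks(instruction: str) -> List[str]:
    """Toy planner: split a long-horizon instruction on ' then '."""
    return [step.strip() for step in instruction.split(" then ") if step.strip()]


def low_level_policy(observation: str, subtask: str) -> str:
    """Toy policy: return a placeholder action string for one subtask."""
    return f"act({subtask!r} | obs={observation!r})"


def run_hierarchy(instruction: str, observation: str,
                  planner: Planner = plan_subtasks,
                  policy: Policy = low_level_policy) -> List[str]:
    """Decompose the instruction, then call the policy once per subtask."""
    return [policy(observation, subtask) for subtask in planner(instruction)]


actions = run_hierarchy("open the drawer then pick up the cup", "frame_0")
```

Keeping the planner and the policy behind separate callables mirrors the survey's split between the second and third lines: either side can be swapped without touching the other.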
Where Pith is reading between the lines
- The taxonomy could be used to design evaluation suites that separately test each line.
- Hybrid models that combine elements from more than one line may become a natural next step once the categories are established.
- Making the survey's repository the standard reference list would reduce duplication in future VLA papers.
- The three-line split may influence how funding and conference tracks organize embodied-AI research.
Load-bearing premise
Existing VLA literature can be partitioned into these three lines without major omissions or overlaps that would require a different structure.
What would settle it
Publication of a substantial VLA paper whose method falls outside all three lines or requires splitting or merging the categories to accommodate it.
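One way to make this test concrete is to record, for each paper, the set of lines its method fits and then flag papers that fit none (an omission) or more than one (an overlap). The sketch below assumes hand-assigned labels and uses hypothetical paper names; it is not data from the survey, and `Line` and `audit_partition` are illustrative names.

```python
from enum import Enum
from typing import Dict, List, Set


class Line(Enum):
    """The three research lines named in the survey's taxonomy."""
    COMPONENTS = "individual components"
    CONTROL_POLICY = "low-level control policy"
    TASK_PLANNER = "high-level task planner"


def audit_partition(assignments: Dict[str, Set[Line]]) -> Dict[str, List[str]]:
    """Return the papers that break a clean partition into the three lines."""
    uncovered = sorted(p for p, lines in assignments.items() if not lines)
    overlapping = sorted(p for p, lines in assignments.items() if len(lines) > 1)
    return {"uncovered": uncovered, "overlapping": overlapping}


# Hypothetical papers with hand-assigned labels, for illustration only.
example = {
    "paper_a": {Line.COMPONENTS},
    "paper_b": {Line.CONTROL_POLICY, Line.TASK_PLANNER},
    "paper_c": set(),
}
report = audit_partition(example)
```

A non-empty `uncovered` list corresponds to a method outside all three lines; a long `overlapping` list suggests the categories would need to be split or merged.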
Original abstract
Embodied AI is widely recognized as a cornerstone of artificial general intelligence (AGI) because it involves controlling embodied agents to perform tasks in the physical world. Building on the success of large language models (LLMs) and vision-language models (VLMs), a new category of multimodal models -- referred to as vision-language-action (VLA) models -- has emerged to address language-conditioned robotic tasks in embodied AI by leveraging their distinct ability to generate actions. The recent proliferation of VLAs necessitates a comprehensive survey to capture the rapidly evolving landscape. To this end, we present the first survey on VLAs for embodied AI. This work provides a detailed taxonomy of VLAs, organized into three major lines of research. The first line focuses on individual components of VLAs. The second line is dedicated to developing VLA-based control policies adept at predicting low-level actions. The third line comprises high-level task planners capable of decomposing long-horizon tasks into a sequence of subtasks, thereby guiding VLAs to follow more general user instructions. Furthermore, we provide an extensive summary of relevant resources, including datasets, simulators, and benchmarks. Finally, we discuss the challenges facing VLAs and outline promising future directions in embodied AI. A curated repository associated with this survey is available at: https://github.com/yueen-ma/Awesome-VLA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents the first survey on vision-language-action (VLA) models for embodied AI. It proposes a taxonomy dividing the literature into three major lines: individual components of VLAs, VLA-based low-level control policies, and high-level task planners for long-horizon tasks. The paper also compiles resources such as datasets, simulators, and benchmarks, and discusses challenges and future directions.
Significance. If the taxonomy provides a clean partition of the VLA literature, the survey would be a significant organizational contribution to the field by structuring the rapidly growing body of work and highlighting key resources. The inclusion of a curated repository adds to its utility for researchers.
major comments (1)
- [Taxonomy (abstract and main taxonomy section)] Taxonomy (as outlined in the abstract and detailed in the body): The three-line taxonomy (individual components; low-level control policies; high-level task planners) risks non-disjoint categories. Many cited works modify a shared VLM backbone for action output and apply it to both short-horizon control and long-horizon decomposition, making single-line assignment arbitrary and potentially leading to overlaps or forced binning. This directly affects the central claim that the taxonomy comprehensively organizes the field without major omissions or overlaps.
minor comments (2)
- [Abstract and Introduction] The claim to present the 'first survey' would benefit from a short explicit comparison to prior related surveys on VLMs or embodied AI in the introduction to substantiate novelty.
- [Resources section] In the resources summary, include explicit inclusion criteria and note any deliberate omissions for datasets, simulators, and benchmarks to improve transparency.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our survey as the first on VLAs and for the constructive feedback on the taxonomy. We address the major comment point by point below.
Referee: [Taxonomy (abstract and main taxonomy section)] Taxonomy (as outlined in the abstract and detailed in the body): The three-line taxonomy (individual components; low-level control policies; high-level task planners) risks non-disjoint categories. Many cited works modify a shared VLM backbone for action output and apply it to both short-horizon control and long-horizon decomposition, making single-line assignment arbitrary and potentially leading to overlaps or forced binning. This directly affects the central claim that the taxonomy comprehensively organizes the field without major omissions or overlaps.
Authors: We acknowledge the validity of this observation. While the taxonomy is structured around the primary research focus of each work (component-level innovations, low-level action generation, or high-level task decomposition), it is true that some models built on shared VLM backbones can be applied or extended across horizons, creating potential boundary cases. To address this, we will add an explicit discussion in the taxonomy section (and a brief note in the abstract) clarifying the classification criteria, noting that assignment is based on the main contribution rather than all possible uses, and providing examples of works that span lines. This revision will improve transparency without requiring a restructuring of the three lines, which we maintain remain useful for organizing the literature by research objective.
Revision: partial
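The classification rule described in the authors' response, assigning each work by its main contribution while noting other lines it touches, could be recorded as one primary label plus a set of secondary labels. This is a hedged sketch of such a record, not the authors' actual procedure; `Entry`, `group_by_primary`, `boundary_cases`, and the example titles are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List

# Short keys for the three lines of the taxonomy.
LINES = ("components", "control_policy", "task_planner")


@dataclass(frozen=True)
class Entry:
    """One surveyed work: a single primary line plus any lines it also spans."""
    title: str
    primary: str
    also_touches: FrozenSet[str] = frozenset()

    def __post_init__(self) -> None:
        for line in {self.primary, *self.also_touches}:
            if line not in LINES:
                raise ValueError(f"unknown line: {line}")
        if self.primary in self.also_touches:
            raise ValueError("primary line must not repeat as a secondary line")


def group_by_primary(entries: Iterable[Entry]) -> Dict[str, List[str]]:
    """Bin every work under exactly one line, using its main contribution."""
    groups: Dict[str, List[str]] = {line: [] for line in LINES}
    for entry in entries:
        groups[entry.primary].append(entry.title)
    return groups


def boundary_cases(entries: Iterable[Entry]) -> List[str]:
    """List works that span more than one line and need an explicit note."""
    return [entry.title for entry in entries if entry.also_touches]


# Hypothetical entries, for illustration only.
catalog = [
    Entry("work_x", "control_policy", frozenset({"task_planner"})),
    Entry("work_y", "components"),
]
groups = group_by_primary(catalog)
spanning = boundary_cases(catalog)
```

Storing the secondary lines keeps the three-way grouping intact while making the boundary cases raised by the referee easy to list.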
Circularity Check
No circularity: survey taxonomy is an author-proposed organizational structure with no derivations, equations, or self-referential reductions.
full rationale
This is a literature survey paper whose central contribution is a proposed three-line taxonomy of existing VLA work. The taxonomy is presented as an organizing framework rather than derived from any equations, fitted parameters, or first-principles results. No load-bearing steps reduce by construction to the paper's own inputs; all cited works are external. The claim of being the 'first survey' is a factual assertion about coverage, not a mathematical derivation. Self-citations, if present, are not used to justify uniqueness theorems or force the taxonomy. The structure is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Forward citations
Cited by 58 Pith papers
-
FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models
FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indis...
-
4DLidarOpen: An Open 4D FMCW Lidar Dataset for Motion-Aware Autonomous Driving
4DLidarOpen is a new open dataset providing synchronized 4D FMCW Lidar velocity measurements, multi-Lidar and camera data, and 3D bounding-box annotations with track IDs to support benchmarks on 3D detection, BEV segm...
-
RotVLA: Rotational Latent Action for Vision-Language-Action Model
RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
-
Dynamic Execution Commitment of Vision-Language-Action Models
A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.
-
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.
-
Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models
GuardVLA embeds a stealthy backdoor watermark in VLAs via secret messages in visual data and uses a swap-and-detect mechanism for post-release ownership verification that preserves task performance.
-
ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models
ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
-
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
-
CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
CoRAL lets LLMs act as adaptive cost designers for motion planners while using VLM priors and online identification to handle unknown physics, achieving over 50% higher success rates than baselines in unseen contact-r...
-
[Emerging Ideas] Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI
ATI is a tripartite bio-inspired architecture for physical AI that co-designs sensing and inference, shown in a camera prototype to raise accuracy from 53.8% to 88% and cut remote invocations by 43.3%.
-
Deformation-based In-Context Learning for Point Cloud Understanding
DeformPIC deforms query point clouds under prompt guidance for in-context learning, outperforming prior methods with lower Chamfer Distance on reconstruction, denoising, and registration tasks.
-
HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness
HeiSD delivers up to 2.45x faster inference for embodied VLA models by hybridizing speculative decoding with kinematic boundary detection and error-mitigation tricks while preserving task success rates.
-
KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models
KERV integrates kinematic Kalman Filter predictions with speculative decoding in VLA models to achieve 27-37% faster inference while maintaining nearly the same task success rates.
-
RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training
RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.
-
D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
D-VLA introduces plane decoupling and a swimlane asynchronous pipeline to achieve high-concurrency RL training and linear scalability for billion- to trillion-parameter vision-language-action models.
-
D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.
-
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...
-
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
-
How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study
Vision-language models exhibit perceptual fragility and fail to consistently respect privacy constraints when operating in simulated physical environments, with performance declining in cluttered scenes and under conf...
-
How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study
VLMs show consistent deficits in identifying sensitive items in cluttered scenes, adapting to social contexts, and resolving conflicts between commands and privacy constraints in a new physical simulator benchmark.
-
DexSim2Real: Foundation Model-Guided Sim-to-Real Transfer for Generalizable Dexterous Manipulation
DexSim2Real integrates FM-guided domain randomization, cross-attention visuo-tactile RL policies, and LLM-based progressive curricula to reach 78.2% average real-world success on six dexterous tasks with an 8.3% sim-t...
-
Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...
-
Learning-augmented robotic automation for real-world manufacturing
A learning-augmented robotic system automated deformable cable insertion and soldering on a live electric-motor production line for 5 hours 10 minutes, producing 108 motors at 99.4% pass rate with under 20 minutes of ...
-
A Vision-Language-Action Model for Adaptive Ultrasound-Guided Needle Insertion and Needle Tracking
A VLA model with Cross-Depth Fusion tracking head and TraCon register unifies needle tracking and adaptive insertion control, outperforming prior trackers and manual operation in experiments.
-
AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models
AnchorRefine factorizes VLA action generation into a trajectory anchor for coarse planning and residual refinement for local corrections, improving success rates by up to 7.8% in simulation and 18% on real robots acro...
-
ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models
ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic benchmarks.
-
Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment
A contrastive alignment model plus offline preference learning explicitly grounds hierarchical VLA language descriptions to actions and visuals on LanguageTable, achieving performance comparable to fully supervised fi...
-
E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes
E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.
-
Adaptive Action Chunking at Inference-time for Vision-Language-Action Models
Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.
-
Emergent Neural Automaton Policies: Learning Symbolic Structure from Visuomotor Trajectories
ENAP extracts an emergent Mealy automaton from visuomotor trajectories to act as a high-level planner for a low-level residual policy, yielding up to 27% higher success than end-to-end VLA policies in low-data regimes.
-
ThermoAct: Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making
ThermoAct integrates thermal imaging into VLA models via a VLM planner to enable robots to perceive physical properties like heat and improve safety over vision-only systems.
-
FASTER: Rethinking Real-Time Flow VLAs
FASTER uses a horizon-aware flow sampling schedule to compress immediate-action denoising to one step, slashing effective reaction latency in real-robot VLA deployments.
-
FASTER: Rethinking Real-Time Flow VLAs
FASTER adds a Horizon-Aware Schedule to flow VLAs that compresses immediate-action denoising to one step while keeping long-horizon trajectory quality, lowering real-robot reaction latency.
-
VLANeXt: Recipes for Building Strong VLA Models
VLANeXt distills 12 design insights from a unified VLA study into a model that outperforms prior methods on LIBERO benchmarks while releasing code for further exploration.
-
ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models
ActDistill transfers action knowledge from heavy VLA teacher models to lightweight students via graph-encapsulated hierarchies and action-guided dynamic routing, delivering over 50% computation reduction and 1.67x spe...
-
DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models
DeepThinkVLA shows CoT improves VLA models only under decoding and causal alignment, delivering 97% success on LIBERO and 21.7-point gains via hybrid attention and SFT-RL training.
-
LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization
LIBERO-PRO shows VLA models collapse from over 90% to 0% accuracy under perturbations in objects, states, instructions, and environments, exposing memorization instead of genuine comprehension.
-
Block-wise Adaptive Caching for Accelerating Diffusion Policy
BAC accelerates transformer-based Diffusion Policy up to 3x by block-level adaptive feature caching using an Adaptive Caching Scheduler and Bubbling Union Algorithm to control error propagation.
-
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
-
Language Conditioned Multi-Finger Dexterous Manipulation Enabled by Physical Compliance and Switching of Controllers
A hybrid event-driven switching system pairs VLA models with lightweight dexterous policies on a compliant anthropomorphic hand to perform language-conditioned multi-finger tasks with cross-embodiment modularity.
-
Anytime Training with Schedule-Free Spectral Optimization
SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.
-
PRIME: Physically-consistent Robotic Inertial and Motion Estimation for Legged and Humanoid Robots
PRIME is a MAP optimization framework that refines onboard kinematics into dynamically consistent trajectories for legged robots while jointly estimating contact forces and inertial parameters using differentiable smo...
-
DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization
DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.
-
Dynamic Execution Commitment of Vision-Language-Action Models
A3 adaptively selects verifiable action prefixes in VLA models using group-sampled consensus and conditional re-decoding to balance robustness and speed without manual horizon tuning.
-
Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training
Sword improves world model simulators for VLA policies by disentangling visual style from dynamics and bootstrapping latents for better consistency, outperforming baselines on LIBERO in generalization and RL post-trai...
-
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...
-
CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
CoRAL lets LLMs design objective functions for robot motion planners and uses vision-language models plus real-time identification to adapt to unknown physical properties, raising success rates by over 50 percent on n...
-
Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection
A dual VLM-VLA framework for long-horizon robot manipulation achieves 32.4% success on RMBench tasks versus 9.8% for the strongest baseline via structured memory and closed-loop adaptive replanning.
-
Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control
A compact language model trained on scaled synthetic nuclear reactor control data exhibits variance collapse and emergent concentration on a single actuation strategy driven by physical execution success.
-
SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation
SlotVLA uses slot attention to model object-relation representations for multitask robotic manipulation, reducing visual tokens while achieving competitive generalization on the new LIBERO+ benchmark.
-
Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.
-
NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks
NORA is a compact 3B-parameter VLA model trained on 970k robot demonstrations that outperforms larger VLA models in embodied tasks while using significantly less computational resources.
-
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning
SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.
-
Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts
Pre-VLA is a multimodal runtime verifier that predicts safety confidence and advantage scores for action chunks, raising closed-loop success rates on the LIBERO benchmark from 30.79% to 37.62%.
-
Position: Embodied AI Requires a Privacy-Utility Trade-off
Embodied AI requires treating privacy as a lifecycle architectural constraint rather than a stage-local feature, addressed via the proposed SPINE framework with a multi-criterion privacy classification matrix.
-
Large Language Models for Multi-Robot Systems: A Survey
A survey that categorizes LLM uses in multi-robot systems across task allocation, motion planning, action generation, and human interaction, while noting challenges and future research opportunities.
-
Large Language Model Agent: A Survey on Methodology, Applications and Challenges
A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.
-
Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems
A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.
Reference graph
Works this paper leans on
-
[1]
RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control
A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choro- manski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. J. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, L. Lee, T. E. Lee, S. Levine, Y . Lu, H. Michalewski, I. Mordatch, K. ...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Imagenet classification with deep convolutional neural networks,
A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” inNIPS, 2012, pp. 1106– 1114
work page 2012
-
[3]
A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017, pp. 5998–6008
work page 2017
-
[4]
Human-level control through deep reinforcement learning,
V . Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,”Nat., vol. 518, no. 7540, pp. 529–533, 2015
work page 2015
-
[5]
Learning hand- eye coordination for robotic grasping with large-scale data collection,
S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand- eye coordination for robotic grasping with large-scale data collection,” inISER, ser. Springer Proceedings in Advanced Robotics, vol. 1. Springer, 2016, pp. 173–184
work page 2016
-
[6]
Flamingo: a visual language model for few-shot learning,
J. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. L. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan, “Flamingo: a visual language mod...
work page 2022
-
[7]
J. Li, D. Li, S. Savarese, and S. C. H. Hoi, “BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models,” inICML, vol. 202. PMLR, 2023, pp. 19 730– 19 742
work page 2023
-
[8]
H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,”CoRR, vol. abs/2304.08485, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
Inner mono- logue: Embodied reasoning through planning with language models,
W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y . Chebotar, P. Sermanet, T. Jackson, N. Brown, L. Luu, S. Levine, K. Hausman, and B. Ichter, “Inner mono- logue: Embodied reasoning through planning with language models,” inCoRL, vol. 205. PMLR, 2022, pp. 1769–1782
work page 2022
-
[10]
Do as I can, not as I say: Grounding language in robotic affordances,
B. Ichter, A. Brohan, Y . Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Irpan, E. Jang, R. Julian, D. Kalashnikov, S. Levine, Y . Lu, C. Parada, K. Rao, P. Sermanet, A. Toshev, V . Vanhoucke, F. Xia, T. Xiao, P. Xu, M. Yan, N. Brown, M. Ahn, O. Cortes, N. Sievers, C. Tan, S. Xu, D. Reyes, J. Rettinghouse, J. Quiambao, P. Pastor, L. Luu, K....
work page 2022
-
[11]
Palm-e: An embodied multimodal language model,
D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y . Chebotar, P. Sermanet, D. Duckworth, S. Levine, V . Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence, “Palm-e: An embodied multimodal language model,” inICML, vol. 202. PMLR, 2023, pp. 8469–8488
work page 2023
-
[12]
Foundation models in robotics: Applications, challenges, and the future,
R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y . Zhu, S. Song, A. Kapoor, K. Hausman, B. Ichter, D. Driess, J. Wu, C. Lu, and M. Schwager, “Foundation models in robotics: Applications, challenges, and the future,”CoRR, vol. abs/2312.07843, 2023. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 18
-
[13]
Large language models for robotics: Op- portunities, challenges, and perspectives,
J. Wang, Z. Wu, Y . Li, H. Jiang, P. Shu, E. Shi, H. Hu, C. Ma, Y . Liu, X. Wang, Y . Yao, X. Liu, H. Zhao, Z. Liu, H. Dai, L. Zhao, B. Ge, X. Li, T. Liu, and S. Zhang, “Large language models for robotics: Op- portunities, challenges, and perspectives,”CoRR, vol. abs/2401.04334, 2024
-
[14]
Toward general-purpose robots via foundation models: A survey and meta-analysis,
Y . Hu, Q. Xie, V . Jain, J. Francis, J. Patrikar, N. V . Keetha, S. Kim, Y . Xie, T. Zhang, S. Zhao, Y . Q. Chong, C. Wang, K. P. Sycara, M. Johnson-Roberson, D. Batra, X. Wang, S. A. Scherer, Z. Kira, F. Xia, and Y . Bisk, “Toward general-purpose robots via foundation models: A survey and meta-analysis,”CoRR, vol. abs/2312.08782, 2023
-
[15]
Real-world robot applications of foundation models: a review,
K. Kawaharazuka, T. Matsushima, A. Gambardella, J. Guo, C. Paxton, and A. Zeng, “Real-world robot applications of foundation models: a review,”Adv. Robotics, vol. 38, no. 18, pp. 1232–1254, 2024
work page 2024
-
[16]
Decision transformer: Re- inforcement learning via sequence modeling,
L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, “Decision transformer: Re- inforcement learning via sequence modeling,” inNeurIPS, 2021, pp. 15 084–15 097
work page 2021
-
[17]
Offline reinforcement learning as one big sequence modeling problem,
M. Janner, Q. Li, and S. Levine, “Offline reinforcement learning as one big sequence modeling problem,” inNeurIPS, 2021, pp. 1273–1286
work page 2021
-
[18]
S. E. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Gimenez, Y . Sulsky, J. Kay, J. T. Springenberg, T. Eccles, J. Bruce, A. Razavi, A. Edwards, N. Heess, Y . Chen, R. Hadsell, O. Vinyals, M. Bordbar, and N. de Freitas, “A generalist agent,”Trans. Mach. Learn. Res., vol. 2022, 2022
work page 2022
-
[19]
$\pi^{*}_{0.6}$: a VLA That Learns From Experience
A. A. Physical Intelligence and, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, D. Driess, M. Equi, A. Esmail, Y . Fang, C. Finn, C. Glossop, T. God- den, I. Goryachev, L. Groom, H. Hancock, K. Hausman, G. Hussein, B. Ichter, S. Jakubczak, R. Jen, T. Jones, B. Katz, L. Ke, C. Kuchi, M. Lamb, D. LeBlanc,...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
Primitive skill-based robot learning from human evaluative feedback,
A. Hiranaka, M. Hwang, S. Lee, C. Wang, L. Fei-Fei, J. Wu, and R. Zhang, “Primitive skill-based robot learning from human evaluative feedback,” inIROS, 2023, pp. 7817–7824
work page 2023
-
[21]
Reflexion: language agents with verbal reinforcement learning,
N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: language agents with verbal reinforcement learning,” in NeurIPS, 2023
work page 2023
-
[22]
Eureka: Human-level reward design via coding large language models,
Y. J. Ma, W. Liang, G. Wang, D. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar, “Eureka: Human-level reward design via coding large language models,” in ICLR, 2024
work page 2024
-
[23]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in ICML, vol. 139. PMLR, 2021, pp. 8748–8763
work page 2021
-
[24]
R3M: A universal visual representation for robot manipulation,
S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta, “R3M: A universal visual representation for robot manipulation,” in CoRL, vol. 205. PMLR, 2022, pp. 892–909
work page 2022
-
[26]
VIP: towards universal visual reward and representation via value-implicit pre-training,
Y. J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V. Kumar, and A. Zhang, “VIP: towards universal visual reward and representation via value-implicit pre-training,” in ICLR, 2023
work page 2023
-
[27]
Real-world robot learning with masked visual pre-training,
I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell, “Real-world robot learning with masked visual pre-training,” in CoRL, vol. 205. PMLR, 2022, pp. 416–426
work page 2022
-
[28]
BERT: pre-training of deep bidirectional transformers for language understanding,
J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT (1). Association for Computational Linguistics, 2019, pp. 4171–4186
work page 2019
-
[29]
Robot learning with sensorimotor pre-training,
I. Radosavovic, B. Shi, L. Fu, K. Goldberg, T. Darrell, and J. Malik, “Robot learning with sensorimotor pre-training,” in CoRL, vol. 229. PMLR, 2023, pp. 683–693
work page 2023
-
[30]
Cliport: What and where pathways for robotic manipulation,
M. Shridhar, L. Manuelli, and D. Fox, “Cliport: What and where pathways for robotic manipulation,” in CoRL, vol. 164. PMLR, 2021, pp. 894–906
work page 2021
-
[31]
Simple but effective: CLIP embeddings for embodied AI,
A. Khandelwal, L. Weihs, R. Mottaghi, and A. Kembhavi, “Simple but effective: CLIP embeddings for embodied AI,” in CVPR. IEEE, 2022, pp. 14 809–14 818
work page 2022
-
[32]
Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation,
S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, “Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation,” in CVPR. IEEE, 2023, pp. 23 171–23 181
work page 2023
-
[33]
Where are we in the search for an artificial visual cortex for embodied intelligence?
A. Majumdar, K. Yadav, S. Arnaud, Y. J. Ma, C. Chen, S. Silwal, A. Jain, V. Berges, T. Wu, J. Vakil, P. Abbeel, J. Malik, D. Batra, Y. Lin, O. Maksymets, A. Rajeswaran, and F. Meier, “Where are we in the search for an artificial visual cortex for embodied intelligence?” in NeurIPS, 2023
work page 2023
-
[34]
Language-driven representation learning for robotics,
S. Karamcheti, S. Nair, A. S. Chen, T. Kollar, C. Finn, D. Sadigh, and P. Liang, “Language-driven representation learning for robotics,” in RSS, 2023
work page 2023
-
[35]
Dinov2: Learning robust visual features without supervision,
M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P. Huang, S. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features without sup...
work page 2024
-
[36]
OpenVLA: An Open-Source Vision-Language-Action Model
M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn, “Openvla: An open-source vision-language-action model,” CoRR, vol. abs/2406.09246, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[37]
ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation
W. Huang, C. Wang, Y. Li, R. Zhang, and L. Fei-Fei, “Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation,” CoRR, vol. abs/2409.01652, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[38]
Self-supervised learning from images with a joint-embedding predictive architecture,
M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. G. Rabbat, Y. LeCun, and N. Ballas, “Self-supervised learning from images with a joint-embedding predictive architecture,” in CVPR. IEEE, 2023, pp. 15 619–15 629
work page 2023
-
[39]
Theia: Distilling diverse vision foundation models for robot learning,
J. Shang, K. Schmeckpeper, B. B. May, M. V. Minniti, T. Kelestemur, D. Watkins, and L. Herlant, “Theia: Distilling diverse vision foundation models for robot learning,” in CoRL, ser. Proceedings of Machine Learning Research, vol. 270. PMLR, 2024, pp. 724–748
work page 2024
-
[40]
The unsurprising effectiveness of pre-trained vision models for control,
S. Parisi, A. Rajeswaran, S. Purushwalkam, and A. Gupta, “The unsurprising effectiveness of pre-trained vision models for control,” in ICML, vol. 162. PMLR, 2022, pp. 17 359–17 371
work page 2022
-
[41]
A path towards autonomous machine intelligence,
Y. LeCun, “A path towards autonomous machine intelligence,” 2022. [Online]. Available: https://openreview.net/pdf?id=BZ5a1r-kVsf
work page 2022
-
[42]
Distilled feature fields enable few-shot language-guided manipulation,
W. Shen, G. Yang, A. Yu, J. Wong, L. P. Kaelbling, and P. Isola, “Distilled feature fields enable few-shot language-guided manipulation,” in CoRL, vol. 229. PMLR, 2023, pp. 405–424
work page 2023
-
[43]
3d-llm: Injecting the 3d world into large language models,
Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan, “3d-llm: Injecting the 3d world into large language models,” in NeurIPS, 2023
work page 2023
-
[44]
3d gaussian splatting for real-time radiance field rendering,
B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Trans. Graph., vol. 42, no. 4, pp. 139:1–139:14, 2023
work page 2023
-
[45]
Langsplat: 3d language gaussian splatting,
M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister, “Langsplat: 3d language gaussian splatting,” in CVPR. IEEE, 2024, pp. 20 051–20 060
work page 2024
-
[46]
That sounds right: Auditory self-supervision for dynamic robot manipulation,
A. Thankaraj and L. Pinto, “That sounds right: Auditory self-supervision for dynamic robot manipulation,” in CoRL, vol. 229. PMLR, 2023, pp. 1036–1049
work page 2023
-
[47]
Exploring visual pre-training for robot manipulation: Datasets, models and methods,
Y. Jing, X. Zhu, X. Liu, Q. Sima, T. Yang, Y. Feng, and T. Kong, “Exploring visual pre-training for robot manipulation: Datasets, models and methods,” in IROS, 2023, pp. 11 390–11 395
work page 2023
-
[48]
Masked autoencoding for scalable and generalizable decision making,
F. Liu, H. Liu, A. Grover, and P. Abbeel, “Masked autoencoding for scalable and generalizable decision making,” in NeurIPS, 2022
work page 2022
-
[49]
Mastering robot manipulation with multimodal prompts through pretraining and multi-task fine-tuning,
J. Li, Q. Gao, M. Johnston, X. Gao, X. He, H. Shi, S. Shakiah, R. Ghanadan, and W. Y. Wang, “Mastering robot manipulation with multimodal prompts through pretraining and multi-task fine-tuning,” in ICML, 2024
work page 2024
-
[50]
SMART: self-supervised multi-task pretraining with control transformers,
Y. Sun, S. Ma, R. Madaan, R. Bonatti, F. Huang, and A. Kapoor, “SMART: self-supervised multi-task pretraining with control transformers,” in ICLR, 2023
work page 2023
-
[51]
PACT: perception-action causal transformer for autoregressive robotics pre-training,
R. Bonatti, S. Vemprala, S. Ma, F. Frujeri, S. Chen, and A. Kapoor, “PACT: perception-action causal transformer for autoregressive robotics pre-training,” in IROS, 2023, pp. 3621–3627
work page 2023
-
[52]
Video pretraining (VPT): learning to act by watching unlabeled online videos,
B. Baker, I. Akkaya, P. Zhokhov, J. Huizinga, J. Tang, A. Ecoffet, B. Houghton, R. Sampedro, and J. Clune, “Video pretraining (VPT): learning to act by watching unlabeled online videos,” in NeurIPS, 2022
work page 2022
-
[53]
Unleashing large-scale video generative pre-training for visual robot manipulation,
H. Wu, Y. Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong, “Unleashing large-scale video generative pre-training for visual robot manipulation,” in ICLR, 2024
work page 2024
-
[54]
Dream to control: Learning behaviors by latent imagination,
D. Hafner, T. P. Lillicrap, J. Ba, and M. Norouzi, “Dream to control: Learning behaviors by latent imagination,” in ICLR, 2020
work page 2020
-
[55]
Mastering atari with discrete world models,
D. Hafner, T. P. Lillicrap, M. Norouzi, and J. Ba, “Mastering atari with discrete world models,” in ICLR, 2021
work page 2021
-
[56]
Mastering Diverse Domains through World Models
D. Hafner, J. Pasukonis, J. Ba, and T. P. Lillicrap, “Mastering diverse domains through world models,” CoRR, vol. abs/2301.04104, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[57]
Daydreamer: World models for physical robot learning,
P. Wu, A. Escontrela, D. Hafner, P. Abbeel, and K. Goldberg, “Daydreamer: World models for physical robot learning,” in CoRL, vol. 205. PMLR, 2022, pp. 2226–2240
work page 2022
-
[58]
Transformers are sample-efficient world models,
V. Micheli, E. Alonso, and F. Fleuret, “Transformers are sample-efficient world models,” in ICLR, 2023
work page 2023
-
[59]
Transformer-based world models are happy with 100k interactions,
J. Robine, M. Höftmann, T. Uelwer, and S. Harmeling, “Transformer-based world models are happy with 100k interactions,” in ICLR, 2023
work page 2023
-
[60]
K. Nottingham, P. Ammanabrolu, A. Suhr, Y. Choi, H. Hajishirzi, S. Singh, and R. Fox, “Do embodied agents dream of pixelated sheep: Embodied decision making using language guided world modelling,” in ICML, vol. 202. PMLR, 2023, pp. 26 311–26 325
work page 2023
-
[61]
Z. Song, Y. Zhang, and I. King, “No change, no gain: Empowering graph neural networks with expected model change maximization for active learning,” in NeurIPS, 2023
work page 2023
-
[62]
Graph component contrastive learning for concept relatedness estimation,
Y. Ma, Z. Song, X. Hu, J. Li, Y. Zhang, and I. King, “Graph component contrastive learning for concept relatedness estimation,” in AAAI. AAAI Press, 2023, pp. 13 362–13 370
work page 2023
-
[63]
L. Guan, K. Valmeekam, S. Sreedharan, and S. Kambhampati, “Leveraging pre-trained large language models to construct and utilize world models for model-based task planning,” in NeurIPS, 2023
work page 2023
-
[64]
LLM+P: Empowering Large Language Models with Optimal Planning Proficiency
B. Liu, Y. Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas, and P. Stone, “LLM+P: empowering large language models with optimal planning proficiency,” CoRR, vol. abs/2304.11477, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[65]
Reasoning with language model is planning with world model,
S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, and Z. Hu, “Reasoning with language model is planning with world model,” in EMNLP, 2023, pp. 8154–8173
work page 2023
-
[66]
Tree-planner: Efficient close-loop task planning with large language models,
M. Hu, Y. Mu, X. Yu, M. Ding, S. Wu, W. Shao, Q. Chen, B. Wang, Y. Qiao, and P. Luo, “Tree-planner: Efficient close-loop task planning with large language models,” in ICLR, 2024
work page 2024
-
[67]
Large language models as commonsense knowledge for large-scale task planning,
Z. Zhao, W. S. Lee, and D. Hsu, “Large language models as commonsense knowledge for large-scale task planning,” in NeurIPS, 2023
work page 2023
-
[68]
Video generation models as world simulators
OpenAI. (2024) Video generation models as world simulators. [Online]. Available: https://openai.com/index/video-generation-models-as-world-simulators/
work page 2024
-
[69]
Is sora a world simulator? A comprehensive survey on general world models and beyond,
Z. Zhu, X. Wang, W. Zhao, C. Min, N. Deng, M. Dou, Y. Wang, B. Shi, K. Wang, C. Zhang, Y. You, Z. Zhang, D. Zhao, L. Xiao, J. Zhao, J. Lu, and G. Huang, “Is sora a world simulator? A comprehensive survey on general world models and beyond,” CoRR, vol. abs/2405.03520, 2024
-
[70]
Genie: Generative interactive environments,
J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, Y. Aytar, S. Bechtle, F. M. P. Behbahani, S. C. Y. Chan, N. Heess, L. Gonzalez, S. Osindero, S. Ozair, S. E. Reed, J. Zhang, K. Zolna, J. Clune, N. de Freitas, S. Singh, and T. Rocktäschel, “Genie: Generative interactive environmen...
work page 2024
-
[71]
3d-vla: A 3d vision-language-action generative world model,
H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y. Du, Y. Hong, and C. Gan, “3d-vla: A 3d vision-language-action generative world model,” in ICML, 2024
work page 2024
-
[72]
Learning interactive real-world simulators,
S. Yang, Y. Du, S. K. S. Ghasemipour, J. Tompson, L. P. Kaelbling, D. Schuurmans, and P. Abbeel, “Learning interactive real-world simulators,” in ICLR, 2024
work page 2024
-
[73]
Language models meet world models: Embodied experiences enhance language models,
J. Xiang, T. Tao, Y. Gu, T. Shu, Z. Wang, Z. Yang, and Z. Hu, “Language models meet world models: Embodied experiences enhance language models,” in NeurIPS, 2023
work page 2023
-
[74]
Large language models are zero-shot reasoners,
T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” in NeurIPS, 2022
work page 2022
-
[75]
Chain-of-thought prompting elicits reasoning in large language models,
J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in NeurIPS, 2022
work page 2022
-
[76]
Thinkbot: Embodied instruction following with thought chain reasoning
G. Lu, Z. Wang, C. Liu, J. Lu, and Y. Tang, “Thinkbot: Embodied instruction following with thought chain reasoning,” CoRR, vol. abs/2312.07062, 2023
-
[77]
React: Synergizing reasoning and acting in language models,
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” in ICLR, 2023
work page 2023
-
[78]
RAT: retrieval augmented thoughts elicit context-aware reasoning in long-horizon generation,
Z. Wang, A. Liu, H. Lin, J. Li, X. Ma, and Y. Liang, “RAT: retrieval augmented thoughts elicit context-aware reasoning in long-horizon generation,” CoRR, vol. abs/2403.05313, 2024
-
[79]
Robotic Control via Embodied Chain-of-Thought Reasoning
M. Zawalski, W. Chen, K. Pertsch, O. Mees, C. Finn, and S. Levine, “Robotic control via embodied chain-of-thought reasoning,” CoRR, vol. abs/2407.08693, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[80]
Cot-vla: Visual chain-of-thought reasoning for vision-language-action models,
Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, T. Lin, G. Wetzstein, M. Liu, and D. Xiang, “Cot-vla: Visual chain-of-thought reasoning for vision-language-action models,” in CVPR, 2025, pp. 1702–1713
work page 2025