
arxiv: 2405.14093 · v8 · submitted 2024-05-23 · 💻 cs.RO · cs.CL · cs.CV

A Survey on Vision-Language-Action Models for Embodied AI

Pith reviewed 2026-05-24 01:23 UTC · model grok-4.3

classification 💻 cs.RO · cs.CL · cs.CV
keywords vision-language-action models, embodied AI, robotics, taxonomy, survey, datasets, simulators, benchmarks

The pith

The first survey on vision-language-action models organizes them into three research lines for embodied AI.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper surveys the new class of vision-language-action models that generate robot actions from language and visual inputs in physical environments. It structures the literature around three lines of work to make the fast-growing area navigable. The survey also compiles datasets, simulators, and benchmarks while noting open challenges. A reader would care because these models aim to turn large language and vision models into agents that can follow instructions in the real world.

Core claim

The paper presents the first survey on VLAs for embodied AI and supplies a taxonomy that divides the field into three major lines: research on individual components of VLAs, development of VLA-based control policies that predict low-level actions, and high-level task planners that break long-horizon tasks into subtasks to follow general user instructions. It further summarizes relevant datasets, simulators, and benchmarks and discusses challenges and future directions.

What carries the argument

A three-line taxonomy of VLAs that separates work on individual components, low-level control policies, and high-level task planners.

If this is right

  • The taxonomy lets new VLA papers be placed relative to existing work.
  • The listed datasets and simulators give concrete starting points for training and testing VLAs.
  • The identified challenges indicate concrete problems that next VLA designs should address.
  • High-level planners can guide low-level policies on longer tasks, suggesting a path to more general instruction following.
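
The last point, a high-level planner guiding a low-level policy, can be sketched as a simple control loop. This is a minimal illustration under stated assumptions, not the survey's or any cited system's implementation: the names HierarchicalPolicy, plan, act, done, and toy_demo are hypothetical, and the planner and policy are plain callables standing in for an LLM/VLM planner and a VLA control policy.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical types for illustration; none of these names come from the survey.
Observation = dict      # e.g. {"image": ..., "state": ...}
Action = List[float]    # e.g. an end-effector delta


@dataclass
class HierarchicalPolicy:
    # High-level planner: instruction -> ordered subtasks (stand-in for an LLM/VLM).
    plan: Callable[[str], List[str]]
    # Low-level policy: (subtask, observation) -> action (stand-in for a VLA policy).
    act: Callable[[str, Observation], Action]
    # Termination check: is the current subtask complete in this observation?
    done: Callable[[str, Observation], bool]

    def run(self, instruction, observe, step, max_steps_per_subtask=50):
        """Execute subtasks in order; return the subtasks that completed."""
        completed = []
        for subtask in self.plan(instruction):
            for _ in range(max_steps_per_subtask):
                obs = observe()
                if self.done(subtask, obs):
                    completed.append(subtask)
                    break
                step(self.act(subtask, obs))
            else:
                break  # subtask timed out; stop instead of continuing blindly
        return completed


def toy_demo():
    # Toy 1-D world (an assumption for the demo): subtasks are "goto:<x>".
    world = {"pos": 0}
    policy = HierarchicalPolicy(
        plan=lambda instr: [f"goto:{x}" for x in instr.split(",")],
        act=lambda sub, obs: [1.0 if int(sub.split(":")[1]) > obs["pos"] else -1.0],
        done=lambda sub, obs: obs["pos"] == int(sub.split(":")[1]),
    )

    def observe():
        return {"pos": world["pos"]}

    def step(action):
        world["pos"] += int(action[0])

    return policy.run("3,1", observe, step), world["pos"]
```

Keeping the planner and the policy behind separate callables mirrors the split between the second and third research lines: either side can be swapped without touching the loop.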

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy could be used to design evaluation suites that separately test each line.
  • Hybrid models that combine elements from more than one line may become a natural next step once the categories are established.
  • Making the survey's repository the standard reference list would reduce duplication in future VLA papers.
  • The three-line split may influence how funding and conference tracks organize embodied-AI research.

Load-bearing premise

Existing VLA literature can be partitioned into these three lines without major omissions or overlaps that would require a different structure.

What would settle it

Publication of a substantial VLA paper whose method falls outside all three lines or requires splitting or merging the categories to accommodate it.

Figures

Figures reproduced from arXiv: 2405.14093 by Irwin King, Jianye Hao, Yueen Ma, Yuzheng Zhuang, Zixing Song.

Figure 1: General architecture of VLA models. Three representative methods
Figure 2: (a) Venn diagram that outlines the main concepts in embodied AI discussed in this article. (b) Timelines that trace the evolution from unimodal
Figure 3: Some approaches focus on individual components of
Figure 3: Taxonomy of VLA models. The organization of this survey follows this taxonomy.
Figure 4: Illustration of a hierarchical robot policy. The high-level task planner decomposes the user instruction into subtasks, which are then executed step by
Figure 6: Different approaches to connect LLM to multimodal modules in
Figure 7: A brief timeline of pivotal unimodal models leading to the development of vision-language-action models, organized by their publication years. Details
Figure 8: The growing scale of unimodal models over the years.
Figure 9: Timeline of VLA models from 2020 to 2025. Bracketed numbers indicate the publication count for the corresponding year or institute.
Figure 10: Visualization of the VLA research landscape.
Figure 11: Quantitative analysis of VLA development trends.
Figure 12: VLA research output and impact by institution.

Original abstract

Embodied AI is widely recognized as a cornerstone of artificial general intelligence (AGI) because it involves controlling embodied agents to perform tasks in the physical world. Building on the success of large language models (LLMs) and vision-language models (VLMs), a new category of multimodal models -- referred to as vision-language-action (VLA) models -- has emerged to address language-conditioned robotic tasks in embodied AI by leveraging their distinct ability to generate actions. The recent proliferation of VLAs necessitates a comprehensive survey to capture the rapidly evolving landscape. To this end, we present the first survey on VLAs for embodied AI. This work provides a detailed taxonomy of VLAs, organized into three major lines of research. The first line focuses on individual components of VLAs. The second line is dedicated to developing VLA-based control policies adept at predicting low-level actions. The third line comprises high-level task planners capable of decomposing long-horizon tasks into a sequence of subtasks, thereby guiding VLAs to follow more general user instructions. Furthermore, we provide an extensive summary of relevant resources, including datasets, simulators, and benchmarks. Finally, we discuss the challenges facing VLAs and outline promising future directions in embodied AI. A curated repository associated with this survey is available at: https://github.com/yueen-ma/Awesome-VLA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents the first survey on vision-language-action (VLA) models for embodied AI. It proposes a taxonomy dividing the literature into three major lines: individual components of VLAs, VLA-based low-level control policies, and high-level task planners for long-horizon tasks. The paper also compiles resources such as datasets, simulators, and benchmarks, and discusses challenges and future directions.

Significance. If the taxonomy provides a clean partition of the VLA literature, the survey would be a significant organizational contribution to the field by structuring the rapidly growing body of work and highlighting key resources. The inclusion of a curated repository adds to its utility for researchers.

major comments (1)
  1. [Taxonomy (abstract and main taxonomy section)] Taxonomy (as outlined in the abstract and detailed in the body): The three-line taxonomy (individual components; low-level control policies; high-level task planners) risks non-disjoint categories. Many cited works modify a shared VLM backbone for action output and apply it to both short-horizon control and long-horizon decomposition, making single-line assignment arbitrary and potentially leading to overlaps or forced binning. This directly affects the central claim that the taxonomy comprehensively organizes the field without major omissions or overlaps.
minor comments (2)
  1. [Abstract and Introduction] The claim to present the 'first survey' would benefit from a short explicit comparison to prior related surveys on VLMs or embodied AI in the introduction to substantiate novelty.
  2. [Resources section] In the resources summary, include explicit inclusion criteria and note any deliberate omissions for datasets, simulators, and benchmarks to improve transparency.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our survey as the first on VLAs and for the constructive feedback on the taxonomy. We address the major comment point by point below.

Point-by-point responses
  1. Referee: [Taxonomy (abstract and main taxonomy section)] Taxonomy (as outlined in the abstract and detailed in the body): The three-line taxonomy (individual components; low-level control policies; high-level task planners) risks non-disjoint categories. Many cited works modify a shared VLM backbone for action output and apply it to both short-horizon control and long-horizon decomposition, making single-line assignment arbitrary and potentially leading to overlaps or forced binning. This directly affects the central claim that the taxonomy comprehensively organizes the field without major omissions or overlaps.

    Authors: We acknowledge the validity of this observation. The taxonomy is structured around the primary research focus of each work (component-level innovations, low-level action generation, or high-level task decomposition), but some models built on shared VLM backbones can be applied or extended across horizons, which creates boundary cases. To address this, we will add an explicit discussion in the taxonomy section (and a brief note in the abstract) that clarifies the classification criteria, notes that assignment is based on the main contribution rather than all possible uses, and gives examples of works that span lines. This revision will improve transparency without restructuring the three lines, which we maintain are still useful for organizing the literature by research objective.

    revision: partial
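
The classification rule described in the rebuttal (assign each work by its main contribution, and record separately any other lines it touches) can be sketched as a small data structure. This is an illustrative assumption about how such a record might look, not something the survey or its repository defines: Line, Entry, boundary_cases, and counts_by_primary are hypothetical names, and the entries used with them are placeholders rather than real papers.

```python
from collections import Counter
from dataclasses import dataclass, field
from enum import Enum


class Line(Enum):
    COMPONENTS = "individual components"
    CONTROL_POLICY = "low-level control policies"
    TASK_PLANNER = "high-level task planners"


@dataclass
class Entry:
    title: str
    primary: Line                                   # main contribution decides the bin
    also_spans: set = field(default_factory=set)    # other lines the work touches


def boundary_cases(entries):
    """Titles of works that touch at least one line besides their primary one."""
    return [e.title for e in entries if e.also_spans - {e.primary}]


def counts_by_primary(entries):
    """Each work is counted exactly once, under its primary line."""
    return Counter(e.primary for e in entries)
```

With placeholder entries, a work filed under CONTROL_POLICY that also spans TASK_PLANNER is counted once in counts_by_primary and is listed by boundary_cases, which is the transparency the referee asks for.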

Circularity Check

0 steps flagged

No circularity: survey taxonomy is an author-proposed organizational structure with no derivations, equations, or self-referential reductions.

Full rationale

This is a literature survey paper whose central contribution is a proposed three-line taxonomy of existing VLA work. The taxonomy is presented as an organizing framework rather than derived from any equations, fitted parameters, or first-principles results. No load-bearing steps reduce by construction to the paper's own inputs; all cited works are external. The claim of being the 'first survey' is a factual assertion about coverage, not a mathematical derivation. Self-citations, if present, are not used to justify uniqueness theorems or force the taxonomy. The structure is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced because the paper is a literature survey rather than a theoretical or experimental contribution.

pith-pipeline@v0.9.0 · 5781 in / 998 out tokens · 26804 ms · 2026-05-24T01:23:06.138174+00:00 · methodology


Forward citations

Cited by 58 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models

    cs.CV 2026-03 unverdicted novelty 8.0

    FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indis...

  2. 4DLidarOpen: An Open 4D FMCW Lidar Dataset for Motion-Aware Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    4DLidarOpen is a new open dataset providing synchronized 4D FMCW Lidar velocity measurements, multi-Lidar and camera data, and 3D bounding-box annotations with track IDs to support benchmarks on 3D detection, BEV segm...

  3. RotVLA: Rotational Latent Action for Vision-Language-Action Model

    cs.RO 2026-05 unverdicted novelty 7.0

    RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.

  4. Dynamic Execution Commitment of Vision-Language-Action Models

    cs.CV 2026-05 unverdicted novelty 7.0

    A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.

  5. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 7.0

    Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.

  6. Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    GuardVLA embeds a stealthy backdoor watermark in VLAs via secret messages in visual data and uses a swap-and-detect mechanism for post-release ownership verification that preserves task performance.

  7. ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.

  8. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 7.0

    VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...

  9. CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    CoRAL lets LLMs act as adaptive cost designers for motion planners while using VLM priors and online identification to handle unknown physics, achieving over 50% higher success rates than baselines in unseen contact-r...

  10. [Emerging Ideas] Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI

    cs.AI 2026-04 unverdicted novelty 7.0

    ATI is a tripartite bio-inspired architecture for physical AI that co-designs sensing and inference, shown in a camera prototype to raise accuracy from 53.8% to 88% and cut remote invocations by 43.3%.

  11. Deformation-based In-Context Learning for Point Cloud Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    DeformPIC deforms query point clouds under prompt guidance for in-context learning, outperforming prior methods with lower Chamfer Distance on reconstruction, denoising, and registration tasks.

  12. HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness

    cs.RO 2026-03 unverdicted novelty 7.0

    HeiSD delivers up to 2.45x faster inference for embodied VLA models by hybridizing speculative decoding with kinematic boundary detection and error-mitigation tricks while preserving task success rates.

  13. KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models

    cs.RO 2026-03 unverdicted novelty 7.0

    KERV integrates kinematic Kalman Filter predictions with speculative decoding in VLA models to achieve 27-37% faster inference while maintaining nearly the same task success rates.

  14. RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training

    cs.AI 2026-02 unverdicted novelty 7.0

    RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.

  15. D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 6.0

    D-VLA introduces plane decoupling and a swimlane asynchronous pipeline to achieve high-concurrency RL training and linear scalability for billion- to trillion-parameter vision-language-action models.

  16. D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models

    cs.AI 2026-05 unverdicted novelty 6.0

    D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.

  17. Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...

  18. Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation

    cs.RO 2026-05 unverdicted novelty 6.0

    Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.

  19. How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study

    cs.CR 2026-05 unverdicted novelty 6.0

    Vision-language models exhibit perceptual fragility and fail to consistently respect privacy constraints when operating in simulated physical environments, with performance declining in cluttered scenes and under conf...

  20. How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study

    cs.CR 2026-05 unverdicted novelty 6.0

    VLMs show consistent deficits in identifying sensitive items in cluttered scenes, adapting to social contexts, and resolving conflicts between commands and privacy constraints in a new physical simulator benchmark.

  21. DexSim2Real: Foundation Model-Guided Sim-to-Real Transfer for Generalizable Dexterous Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    DexSim2Real integrates FM-guided domain randomization, cross-attention visuo-tactile RL policies, and LLM-based progressive curricula to reach 78.2% average real-world success on six dexterous tasks with an 8.3% sim-t...

  22. Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training

    cs.RO 2026-04 unverdicted novelty 6.0

    DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...

  23. Learning-augmented robotic automation for real-world manufacturing

    cs.RO 2026-04 conditional novelty 6.0

    A learning-augmented robotic system automated deformable cable insertion and soldering on a live electric-motor production line for 5 hours 10 minutes, producing 108 motors at 99.4% pass rate with under 20 minutes of ...

  24. A Vision-Language-Action Model for Adaptive Ultrasound-Guided Needle Insertion and Needle Tracking

    cs.RO 2026-04 unverdicted novelty 6.0

    A VLA model with Cross-Depth Fusion tracking head and TraCon register unifies needle tracking and adaptive insertion control, outperforming prior trackers and manual operation in experiments.

  25. AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    AnchorRefine factorizes VLA action generation into a trajectory anchor for coarse planning and residual refinement for local corrections, improving success rates by up to 7.8% in simulation and 18% on real robots acro...

  26. ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic benchmarks.

  27. Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment

    cs.RO 2026-04 unverdicted novelty 6.0

    A contrastive alignment model plus offline preference learning explicitly grounds hierarchical VLA language descriptions to actions and visuals on LanguageTable, achieving performance comparable to fully supervised fi...

  28. E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes

    cs.CV 2026-04 conditional novelty 6.0

    E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.

  29. Adaptive Action Chunking at Inference-time for Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.

  30. Emergent Neural Automaton Policies: Learning Symbolic Structure from Visuomotor Trajectories

    cs.RO 2026-03 unverdicted novelty 6.0

    ENAP extracts an emergent Mealy automaton from visuomotor trajectories to act as a high-level planner for a low-level residual policy, yielding up to 27% higher success than end-to-end VLA policies in low-data regimes.

  31. ThermoAct: Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making

    cs.RO 2026-03 unverdicted novelty 6.0

    ThermoAct integrates thermal imaging into VLA models via a VLM planner to enable robots to perceive physical properties like heat and improve safety over vision-only systems.

  32. FASTER: Rethinking Real-Time Flow VLAs

    cs.RO 2026-03 conditional novelty 6.0

    FASTER uses a horizon-aware flow sampling schedule to compress immediate-action denoising to one step, slashing effective reaction latency in real-robot VLA deployments.

  33. FASTER: Rethinking Real-Time Flow VLAs

    cs.RO 2026-03 unverdicted novelty 6.0

    FASTER adds a Horizon-Aware Schedule to flow VLAs that compresses immediate-action denoising to one step while keeping long-horizon trajectory quality, lowering real-robot reaction latency.

  34. VLANeXt: Recipes for Building Strong VLA Models

    cs.CV 2026-02 conditional novelty 6.0

    VLANeXt distills 12 design insights from a unified VLA study into a model that outperforms prior methods on LIBERO benchmarks while releasing code for further exploration.

  35. ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models

    cs.CV 2025-11 conditional novelty 6.0

    ActDistill transfers action knowledge from heavy VLA teacher models to lightweight students via graph-encapsulated hierarchies and action-guided dynamic routing, delivering over 50% computation reduction and 1.67x spe...

  36. DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

    cs.LG 2025-10 unverdicted novelty 6.0

    DeepThinkVLA shows CoT improves VLA models only under decoding and causal alignment, delivering 97% success on LIBERO and 21.7-point gains via hybrid attention and SFT-RL training.

  37. LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization

    cs.CV 2025-10 conditional novelty 6.0

    LIBERO-PRO shows VLA models collapse from over 90% to 0% accuracy under perturbations in objects, states, instructions, and environments, exposing memorization instead of genuine comprehension.

  38. Block-wise Adaptive Caching for Accelerating Diffusion Policy

    cs.AI 2025-06 unverdicted novelty 6.0

    BAC accelerates transformer-based Diffusion Policy up to 3x by block-level adaptive feature caching using an Adaptive Caching Scheduler and Bubbling Union Algorithm to control error propagation.

  39. Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    cs.CV 2024-12 unverdicted novelty 6.0

    Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.

  40. Language Conditioned Multi-Finger Dexterous Manipulation Enabled by Physical Compliance and Switching of Controllers

    cs.RO 2024-10 unverdicted novelty 6.0

    A hybrid event-driven switching system pairs VLA models with lightweight dexterous policies on a compliant anthropomorphic hand to perform language-conditioned multi-finger tasks with cross-embodiment modularity.

  41. Anytime Training with Schedule-Free Spectral Optimization

    cs.LG 2026-05 unverdicted novelty 5.0

    SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.

  42. PRIME: Physically-consistent Robotic Inertial and Motion Estimation for Legged and Humanoid Robots

    cs.RO 2026-05 unverdicted novelty 5.0

    PRIME is a MAP optimization framework that refines onboard kinematics into dynamically consistent trajectories for legged robots while jointly estimating contact forces and inertial parameters using differentiable smo...

  43. DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization

    cs.RO 2026-05 unverdicted novelty 5.0

    DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.

  44. Dynamic Execution Commitment of Vision-Language-Action Models

    cs.CV 2026-05 unverdicted novelty 5.0

    A3 adaptively selects verifiable action prefixes in VLA models using group-sampled consensus and conditional re-decoding to balance robustness and speed without manual horizon tuning.

  45. Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training

    cs.CV 2026-05 unverdicted novelty 5.0

    Sword improves world model simulators for VLA policies by disentangling visual style from dynamics and bootstrapping latents for better consistency, outperforming baselines on LIBERO in generalization and RL post-trai...

  46. VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts

    cs.RO 2026-05 unverdicted novelty 5.0

    VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...

  47. CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 5.0

    CoRAL lets LLMs design objective functions for robot motion planners and uses vision-language models plus real-time identification to adapt to unknown physical properties, raising success rates by over 50 percent on n...

  48. Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection

    cs.RO 2026-04 unverdicted novelty 5.0

    A dual VLM-VLA framework for long-horizon robot manipulation achieves 32.4% success on RMBench tasks versus 9.8% for the strongest baseline via structured memory and closed-loop adaptive replanning.

  49. Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control

    cs.AI 2025-12 unverdicted novelty 5.0

    A compact language model trained on scaled synthetic nuclear reactor control data exhibits variance collapse and emergent concentration on a single actuation strategy driven by physical execution success.

  50. SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation

    cs.RO 2025-11 unverdicted novelty 5.0

    SlotVLA uses slot attention to model object-relation representations for multitask robotic manipulation, reducing visual tokens while achieving competitive generalization on the new LIBERO+ benchmark.

  51. Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

    cs.RO 2025-08 unverdicted novelty 5.0

    This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.

  52. NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

    cs.RO 2025-04 unverdicted novelty 5.0

    NORA is a compact 3B-parameter VLA model trained on 970k robot demonstrations that outperforms larger VLA models in embodied tasks while using significantly less computational resources.

  53. SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning

    cs.RO 2025-03 unverdicted novelty 5.0

    SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.

  54. Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts

    cs.CV 2026-05 unverdicted novelty 4.0

    Pre-VLA is a multimodal runtime verifier that predicts safety confidence and advantage scores for action chunks, raising closed-loop success rates on the LIBERO benchmark from 30.79% to 37.62%.

  55. Position: Embodied AI Requires a Privacy-Utility Trade-off

    cs.AI 2026-05 unverdicted novelty 4.0

    Embodied AI requires treating privacy as a lifecycle architectural constraint rather than a stage-local feature, addressed via the proposed SPINE framework with a multi-criterion privacy classification matrix.

  56. Large Language Models for Multi-Robot Systems: A Survey

    cs.RO 2025-02 unverdicted novelty 4.0

    A survey that categorizes LLM uses in multi-robot systems across task allocation, motion planning, action generation, and human interaction, while noting challenges and future research opportunities.

  57. Large Language Model Agent: A Survey on Methodology, Applications and Challenges

    cs.CL 2025-03 accept novelty 3.0

    A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.

  58. Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems

    eess.SY 2026-04 unverdicted novelty 2.0

    A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 51 Pith papers · 47 internal anchors

  1. [1]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, L. Lee, T. E. Lee, S. Levine, Y. Lu, H. Michalewski, I. Mordatch, K. ...

  2. [2]

    Imagenet classification with deep convolutional neural networks

    A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012, pp. 1106–1114

  3. [3]

    Attention is all you need

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017, pp. 5998–6008

  4. [4]

    Human-level control through deep reinforcement learning

    V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nat., vol. 518, no. 7540, pp. 529–533, 2015

  5. [5]

    Learning hand-eye coordination for robotic grasping with large-scale data collection

    S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye coordination for robotic grasping with large-scale data collection,” in ISER, ser. Springer Proceedings in Advanced Robotics, vol. 1. Springer, 2016, pp. 173–184

  6. [6]

    Flamingo: a visual language model for few-shot learning

    J. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. L. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan, “Flamingo: a visual language mod...

  7. [7]

    BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models

    J. Li, D. Li, S. Savarese, and S. C. H. Hoi, “BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models,” in ICML, vol. 202. PMLR, 2023, pp. 19 730–19 742

  8. [8]

    Visual Instruction Tuning

    H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” CoRR, vol. abs/2304.08485, 2023

  9. [9]

    Inner monologue: Embodied reasoning through planning with language models

    W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, P. Sermanet, T. Jackson, N. Brown, L. Luu, S. Levine, K. Hausman, and B. Ichter, “Inner monologue: Embodied reasoning through planning with language models,” in CoRL, vol. 205. PMLR, 2022, pp. 1769–1782

  10. [10]

    Do as I can, not as I say: Grounding language in robotic affordances

    B. Ichter, A. Brohan, Y. Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Irpan, E. Jang, R. Julian, D. Kalashnikov, S. Levine, Y. Lu, C. Parada, K. Rao, P. Sermanet, A. Toshev, V. Vanhoucke, F. Xia, T. Xiao, P. Xu, M. Yan, N. Brown, M. Ahn, O. Cortes, N. Sievers, C. Tan, S. Xu, D. Reyes, J. Rettinghouse, J. Quiambao, P. Pastor, L. Luu, K....

  11. [11]

    Palm-e: An embodied multimodal language model

    D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence, “Palm-e: An embodied multimodal language model,” in ICML, vol. 202. PMLR, 2023, pp. 8469–8488

  12. [12]

    Foundation models in robotics: Applications, challenges, and the future

    R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y. Zhu, S. Song, A. Kapoor, K. Hausman, B. Ichter, D. Driess, J. Wu, C. Lu, and M. Schwager, “Foundation models in robotics: Applications, challenges, and the future,” CoRR, vol. abs/2312.07843, 2023. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 18

  13. [13]

    Large language models for robotics: Opportunities, challenges, and perspectives

    J. Wang, Z. Wu, Y. Li, H. Jiang, P. Shu, E. Shi, H. Hu, C. Ma, Y. Liu, X. Wang, Y. Yao, X. Liu, H. Zhao, Z. Liu, H. Dai, L. Zhao, B. Ge, X. Li, T. Liu, and S. Zhang, “Large language models for robotics: Opportunities, challenges, and perspectives,” CoRR, vol. abs/2401.04334, 2024

  14. [14]

    Toward general-purpose robots via foundation models: A survey and meta-analysis

    Y. Hu, Q. Xie, V. Jain, J. Francis, J. Patrikar, N. V. Keetha, S. Kim, Y. Xie, T. Zhang, S. Zhao, Y. Q. Chong, C. Wang, K. P. Sycara, M. Johnson-Roberson, D. Batra, X. Wang, S. A. Scherer, Z. Kira, F. Xia, and Y. Bisk, “Toward general-purpose robots via foundation models: A survey and meta-analysis,” CoRR, vol. abs/2312.08782, 2023

  15. [15]

    Real-world robot applications of foundation models: a review

    K. Kawaharazuka, T. Matsushima, A. Gambardella, J. Guo, C. Paxton, and A. Zeng, “Real-world robot applications of foundation models: a review,” Adv. Robotics, vol. 38, no. 18, pp. 1232–1254, 2024

  16. [16]

    Decision transformer: Reinforcement learning via sequence modeling

    L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, “Decision transformer: Reinforcement learning via sequence modeling,” in NeurIPS, 2021, pp. 15 084–15 097

  17. [17]

    Offline reinforcement learning as one big sequence modeling problem

    M. Janner, Q. Li, and S. Levine, “Offline reinforcement learning as one big sequence modeling problem,” in NeurIPS, 2021, pp. 1273–1286

  18. [18]

    A generalist agent

    S. E. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Gimenez, Y. Sulsky, J. Kay, J. T. Springenberg, T. Eccles, J. Bruce, A. Razavi, A. Edwards, N. Heess, Y. Chen, R. Hadsell, O. Vinyals, M. Bordbar, and N. de Freitas, “A generalist agent,” Trans. Mach. Learn. Res., vol. 2022, 2022

  19. [19]

    $\pi^{*}_{0.6}$: a VLA That Learns From Experience

    A. A. Physical Intelligence and, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, D. Driess, M. Equi, A. Esmail, Y. Fang, C. Finn, C. Glossop, T. Godden, I. Goryachev, L. Groom, H. Hancock, K. Hausman, G. Hussein, B. Ichter, S. Jakubczak, R. Jen, T. Jones, B. Katz, L. Ke, C. Kuchi, M. Lamb, D. LeBlanc,...

  20. [20]

    Primitive skill-based robot learning from human evaluative feedback

    A. Hiranaka, M. Hwang, S. Lee, C. Wang, L. Fei-Fei, J. Wu, and R. Zhang, “Primitive skill-based robot learning from human evaluative feedback,” in IROS, 2023, pp. 7817–7824

  21. [21]

    Reflexion: language agents with verbal reinforcement learning

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: language agents with verbal reinforcement learning,” in NeurIPS, 2023

  22. [22]

    Eureka: Human-level reward design via coding large language models

    Y. J. Ma, W. Liang, G. Wang, D. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar, “Eureka: Human-level reward design via coding large language models,” in ICLR, 2024

  23. [23]

    Learning transferable visual models from natural language supervision

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in ICML, vol. 139. PMLR, 2021, pp. 8748–8763

  24. [24]

    R3M: A universal visual representation for robot manipulation

    S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta, “R3M: A universal visual representation for robot manipulation,” in CoRL. PMLR, 2022, pp. 892–909

  26. [26]

    VIP: towards universal visual reward and representation via value-implicit pre-training

    Y. J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V. Kumar, and A. Zhang, “VIP: towards universal visual reward and representation via value-implicit pre-training,” in ICLR, 2023

  27. [27]

    Real-world robot learning with masked visual pre-training

    I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell, “Real-world robot learning with masked visual pre-training,” in CoRL, vol. 205. PMLR, 2022, pp. 416–426

  28. [28]

    BERT: pre-training of deep bidirectional transformers for language understanding

    J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT (1). Association for Computational Linguistics, 2019, pp. 4171–4186

  29. [29]

    Robot learning with sensorimotor pre-training

    I. Radosavovic, B. Shi, L. Fu, K. Goldberg, T. Darrell, and J. Malik, “Robot learning with sensorimotor pre-training,” in CoRL, vol. 229. PMLR, 2023, pp. 683–693

  30. [30]

    Cliport: What and where pathways for robotic manipulation

    M. Shridhar, L. Manuelli, and D. Fox, “Cliport: What and where pathways for robotic manipulation,” in CoRL, vol. 164. PMLR, 2021, pp. 894–906

  31. [31]

    Simple but effective: CLIP embeddings for embodied AI

    A. Khandelwal, L. Weihs, R. Mottaghi, and A. Kembhavi, “Simple but effective: CLIP embeddings for embodied AI,” in CVPR. IEEE, 2022, pp. 14 809–14 818

  32. [32]

    Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation

    S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, “Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation,” in CVPR. IEEE, 2023, pp. 23 171–23 181

  33. [33]

    Where are we in the search for an artificial visual cortex for embodied intelligence?

    A. Majumdar, K. Yadav, S. Arnaud, Y. J. Ma, C. Chen, S. Silwal, A. Jain, V. Berges, T. Wu, J. Vakil, P. Abbeel, J. Malik, D. Batra, Y. Lin, O. Maksymets, A. Rajeswaran, and F. Meier, “Where are we in the search for an artificial visual cortex for embodied intelligence?” in NeurIPS, 2023

  34. [34]

    Language-driven representation learning for robotics

    S. Karamcheti, S. Nair, A. S. Chen, T. Kollar, C. Finn, D. Sadigh, and P. Liang, “Language-driven representation learning for robotics,” in RSS, 2023

  35. [35]

    Dinov2: Learning robust visual features without supervision

    M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P. Huang, S. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features without sup...

  36. [36]

    OpenVLA: An Open-Source Vision-Language-Action Model

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn, “Openvla: An open-source vision-language-action model,” CoRR, vol. abs/2406.09246, 2024

  37. [37]

    ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

    W. Huang, C. Wang, Y. Li, R. Zhang, and L. Fei-Fei, “Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation,” CoRR, vol. abs/2409.01652, 2024

  38. [38]

    Self-supervised learning from images with a joint-embedding predictive architecture

    M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. G. Rabbat, Y. LeCun, and N. Ballas, “Self-supervised learning from images with a joint-embedding predictive architecture,” in CVPR. IEEE, 2023, pp. 15 619–15 629

  39. [39]

    Theia: Distilling diverse vision foundation models for robot learning

    J. Shang, K. Schmeckpeper, B. B. May, M. V. Minniti, T. Kelestemur, D. Watkins, and L. Herlant, “Theia: Distilling diverse vision foundation models for robot learning,” in CoRL, ser. Proceedings of Machine Learning Research, vol. 270. PMLR, 2024, pp. 724–748

  40. [40]

    The unsurprising effectiveness of pre-trained vision models for control

    S. Parisi, A. Rajeswaran, S. Purushwalkam, and A. Gupta, “The unsurprising effectiveness of pre-trained vision models for control,” in ICML, vol. 162. PMLR, 2022, pp. 17 359–17 371

  41. [41]

    A path towards autonomous machine intelligence

    Y. LeCun, “A path towards autonomous machine intelligence,” 2022. [Online]. Available: https://openreview.net/pdf?id=BZ5a1r-kVsf

  42. [42]

    Distilled feature fields enable few-shot language-guided manipulation

    W. Shen, G. Yang, A. Yu, J. Wong, L. P. Kaelbling, and P. Isola, “Distilled feature fields enable few-shot language-guided manipulation,” in CoRL, vol. 229. PMLR, 2023, pp. 405–424

  43. [43]

    3d-llm: Injecting the 3d world into large language models

    Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan, “3d-llm: Injecting the 3d world into large language models,” in NeurIPS, 2023

  44. [44]

    3d gaussian splatting for real-time radiance field rendering

    B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Trans. Graph., vol. 42, no. 4, pp. 139:1–139:14, 2023

  45. [45]

    Langsplat: 3d language gaussian splatting

    M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister, “Langsplat: 3d language gaussian splatting,” in CVPR. IEEE, 2024, pp. 20 051–20 060

  46. [46]

    That sounds right: Auditory self-supervision for dynamic robot manipulation

    A. Thankaraj and L. Pinto, “That sounds right: Auditory self-supervision for dynamic robot manipulation,” in CoRL, vol. 229. PMLR, 2023, pp. 1036–1049

  47. [47]

    Exploring visual pre-training for robot manipulation: Datasets, models and methods

    Y. Jing, X. Zhu, X. Liu, Q. Sima, T. Yang, Y. Feng, and T. Kong, “Exploring visual pre-training for robot manipulation: Datasets, models and methods,” in IROS, 2023, pp. 11 390–11 395

  48. [48]

    Masked autoencoding for scalable and generalizable decision making

    F. Liu, H. Liu, A. Grover, and P. Abbeel, “Masked autoencoding for scalable and generalizable decision making,” in NeurIPS, 2022

  49. [49]

    Mastering robot manipulation with multimodal prompts through pretraining and multi-task fine-tuning

    J. Li, Q. Gao, M. Johnston, X. Gao, X. He, H. Shi, S. Shakiah, R. Ghanadan, and W. Y. Wang, “Mastering robot manipulation with multimodal prompts through pretraining and multi-task fine-tuning,” in ICML, 2024

  50. [50]

    SMART: self-supervised multi-task pretraining with control transformers

    Y. Sun, S. Ma, R. Madaan, R. Bonatti, F. Huang, and A. Kapoor, “SMART: self-supervised multi-task pretraining with control transformers,” in ICLR, 2023

  51. [51]

    PACT: perception-action causal transformer for autoregressive robotics pre-training

    R. Bonatti, S. Vemprala, S. Ma, F. Frujeri, S. Chen, and A. Kapoor, “PACT: perception-action causal transformer for autoregressive robotics pre-training,” in IROS, 2023, pp. 3621–3627

  52. [52]

    Video pretraining (VPT): learning to act by watching unlabeled online videos

    B. Baker, I. Akkaya, P. Zhokhov, J. Huizinga, J. Tang, A. Ecoffet, B. Houghton, R. Sampedro, and J. Clune, “Video pretraining (VPT): learning to act by watching unlabeled online videos,” in NeurIPS, 2022

  53. [53]

    Unleashing large-scale video generative pre-training for visual robot manipulation

    H. Wu, Y. Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong, “Unleashing large-scale video generative pre-training for visual robot manipulation,” in ICLR, 2024

  54. [54]

    Dream to control: Learning behaviors by latent imagination

    D. Hafner, T. P. Lillicrap, J. Ba, and M. Norouzi, “Dream to control: Learning behaviors by latent imagination,” in ICLR, 2020

  55. [55]

    Mastering atari with discrete world models

    D. Hafner, T. P. Lillicrap, M. Norouzi, and J. Ba, “Mastering atari with discrete world models,” in ICLR, 2021

  56. [56]

    Mastering Diverse Domains through World Models

    D. Hafner, J. Pasukonis, J. Ba, and T. P. Lillicrap, “Mastering diverse domains through world models,” CoRR, vol. abs/2301.04104, 2023

  57. [57]

    Daydreamer: World models for physical robot learning

    P. Wu, A. Escontrela, D. Hafner, P. Abbeel, and K. Goldberg, “Daydreamer: World models for physical robot learning,” in CoRL, vol. 205. PMLR, 2022, pp. 2226–2240

  58. [58]

    Transformers are sample-efficient world models

    V. Micheli, E. Alonso, and F. Fleuret, “Transformers are sample-efficient world models,” in ICLR, 2023

  59. [59]

    Transformer-based world models are happy with 100k interactions

    J. Robine, M. Höftmann, T. Uelwer, and S. Harmeling, “Transformer-based world models are happy with 100k interactions,” in ICLR, 2023

  60. [60]

    Do embodied agents dream of pixelated sheep: Embodied decision making using language guided world modelling

    K. Nottingham, P. Ammanabrolu, A. Suhr, Y. Choi, H. Hajishirzi, S. Singh, and R. Fox, “Do embodied agents dream of pixelated sheep: Embodied decision making using language guided world modelling,” in ICML, vol. 202. PMLR, 2023, pp. 26 311–26 325

  61. [61]

    No change, no gain: Empowering graph neural networks with expected model change maximization for active learning

    Z. Song, Y. Zhang, and I. King, “No change, no gain: Empowering graph neural networks with expected model change maximization for active learning,” in NeurIPS, 2023

  62. [62]

    Graph component contrastive learning for concept relatedness estimation

    Y. Ma, Z. Song, X. Hu, J. Li, Y. Zhang, and I. King, “Graph component contrastive learning for concept relatedness estimation,” in AAAI. AAAI Press, 2023, pp. 13 362–13 370

  63. [63]

    Leveraging pre-trained large language models to construct and utilize world models for model-based task planning

    L. Guan, K. Valmeekam, S. Sreedharan, and S. Kambhampati, “Leveraging pre-trained large language models to construct and utilize world models for model-based task planning,” in NeurIPS, 2023

  64. [64]

    LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

    B. Liu, Y. Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas, and P. Stone, “LLM+P: empowering large language models with optimal planning proficiency,” CoRR, vol. abs/2304.11477, 2023

  65. [65]

    Reasoning with language model is planning with world model

    S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, and Z. Hu, “Reasoning with language model is planning with world model,” in EMNLP, 2023, pp. 8154–8173

  66. [66]

    Tree-planner: Efficient close-loop task planning with large language models

    M. Hu, Y. Mu, X. Yu, M. Ding, S. Wu, W. Shao, Q. Chen, B. Wang, Y. Qiao, and P. Luo, “Tree-planner: Efficient close-loop task planning with large language models,” in ICLR, 2024

  67. [67]

    Large language models as commonsense knowledge for large-scale task planning

    Z. Zhao, W. S. Lee, and D. Hsu, “Large language models as commonsense knowledge for large-scale task planning,” in NeurIPS, 2023

  68. [68]

    Video generation models as world simulators

    OpenAI. (2024) Video generation models as world simulators. [Online]. Available: https://openai.com/index/video-generation-models-as-world-simulators/

  69. [69]

    Is sora a world simulator? A comprehensive survey on general world models and beyond

    Z. Zhu, X. Wang, W. Zhao, C. Min, N. Deng, M. Dou, Y. Wang, B. Shi, K. Wang, C. Zhang, Y. You, Z. Zhang, D. Zhao, L. Xiao, J. Zhao, J. Lu, and G. Huang, “Is sora a world simulator? A comprehensive survey on general world models and beyond,” CoRR, vol. abs/2405.03520, 2024

  70. [70]

    Genie: Generative interactive environments

    J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, Y. Aytar, S. Bechtle, F. M. P. Behbahani, S. C. Y. Chan, N. Heess, L. Gonzalez, S. Osindero, S. Ozair, S. E. Reed, J. Zhang, K. Zolna, J. Clune, N. de Freitas, S. Singh, and T. Rocktäschel, “Genie: Generative interactive environmen...

  71. [71]

    3d-vla: A 3d vision-language-action generative world model

    H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y. Du, Y. Hong, and C. Gan, “3d-vla: A 3d vision-language-action generative world model,” in ICML, 2024

  72. [72]

    Learning interactive real-world simulators

    S. Yang, Y. Du, S. K. S. Ghasemipour, J. Tompson, L. P. Kaelbling, D. Schuurmans, and P. Abbeel, “Learning interactive real-world simulators,” in ICLR, 2024

  73. [73]

    Language models meet world models: Embodied experiences enhance language models

    J. Xiang, T. Tao, Y. Gu, T. Shu, Z. Wang, Z. Yang, and Z. Hu, “Language models meet world models: Embodied experiences enhance language models,” in NeurIPS, 2023

  74. [74]

    Large language models are zero-shot reasoners

    T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” in NeurIPS, 2022

  75. [75]

    Chain-of-thought prompting elicits reasoning in large language models

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in NeurIPS, 2022

  76. [76]

    Thinkbot: Embodied instruction following with thought chain reasoning

    G. Lu, Z. Wang, C. Liu, J. Lu, and Y. Tang, “Thinkbot: Embodied instruction following with thought chain reasoning,” CoRR, vol. abs/2312.07062, 2023

  77. [77]

    React: Synergizing reasoning and acting in language models

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” in ICLR, 2023

  78. [78]

    RAT: retrieval augmented thoughts elicit context-aware reasoning in long-horizon generation

    Z. Wang, A. Liu, H. Lin, J. Li, X. Ma, and Y. Liang, “RAT: retrieval augmented thoughts elicit context-aware reasoning in long-horizon generation,” CoRR, vol. abs/2403.05313, 2024

  79. [79]

    Robotic Control via Embodied Chain-of-Thought Reasoning

    M. Zawalski, W. Chen, K. Pertsch, O. Mees, C. Finn, and S. Levine, “Robotic control via embodied chain-of-thought reasoning,” CoRR, vol. abs/2407.08693, 2024

  80. [80]

    Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

    Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, T. Lin, G. Wetzstein, M. Liu, and D. Xiang, “Cot-vla: Visual chain-of-thought reasoning for vision-language-action models,” in CVPR, 2025, pp. 1702–1713

Showing first 80 references.