A Survey on Vision-Language-Action Models for Embodied AI

Irwin King; Jianye Hao; Yueen Ma; Yuzheng Zhuang; Zixing Song

arxiv: 2405.14093 · v8 · submitted 2024-05-23 · cs.RO · cs.CL · cs.CV

Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, Irwin King

Pith reviewed 2026-05-24 01:23 UTC · model grok-4.3

classification cs.RO · cs.CL · cs.CV

keywords vision-language-action models · embodied AI · robotics · taxonomy · survey · datasets · simulators · benchmarks

The pith

The first survey on vision-language-action models organizes them into three research lines for embodied AI.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper surveys the new class of vision-language-action models that generate robot actions from language and visual inputs in physical environments. It structures the literature around three lines of work to make the fast-growing area navigable. The survey also compiles datasets, simulators, and benchmarks while noting open challenges. A reader would care because these models aim to turn large language and vision models into agents that can follow instructions in the real world.

Core claim

The paper presents the first survey on VLAs for embodied AI and supplies a taxonomy that divides the field into three major lines: research on individual components of VLAs, development of VLA-based control policies that predict low-level actions, and high-level task planners that break long-horizon tasks into subtasks to follow general user instructions. It further summarizes relevant datasets, simulators, and benchmarks and discusses challenges and future directions.

What carries the argument

A three-line taxonomy of VLAs that separates work on individual components, low-level control policies, and high-level task planners.

If this is right

The taxonomy lets new VLA papers be placed relative to existing work.
The listed datasets and simulators give concrete starting points for training and testing VLAs.
The identified challenges indicate concrete problems that next VLA designs should address.
High-level planners can guide low-level policies on longer tasks, suggesting a path to more general instruction following.
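
The last point, a high-level planner guiding a low-level policy, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the survey's or any cited system's implementation: `HierarchicalAgent`, `toy_planner`, and `toy_policy` are hypothetical names, the planner is a string splitter standing in for an LLM or VLM, and the policy returns a fixed 7-dimensional zero action standing in for a VLA control policy.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Hypothetical interfaces: a planner maps an instruction to subtasks,
# a policy maps (subtask, observation) to one low-level action.
Planner = Callable[[str], List[str]]
Policy = Callable[[str, dict], List[float]]


@dataclass
class HierarchicalAgent:
    planner: Planner
    policy: Policy

    def run(self, instruction: str, observation: dict) -> List[Tuple[str, List[float]]]:
        """Decompose the instruction, then query the policy once per subtask."""
        trace = []
        for subtask in self.planner(instruction):
            action = self.policy(subtask, observation)
            trace.append((subtask, action))
        return trace


def toy_planner(instruction: str) -> List[str]:
    # Stand-in for an LLM/VLM planner: split the instruction on " then ".
    return [part.strip() for part in instruction.split(" then ")]


def toy_policy(subtask: str, observation: dict) -> List[float]:
    # Stand-in for a VLA control policy: a fixed 7-dimensional zero action.
    return [0.0] * 7


agent = HierarchicalAgent(toy_planner, toy_policy)
trace = agent.run("open the drawer then pick up the cup", {"image": None})
```

A real system would call the policy in a closed loop, many times per subtask, with fresh observations; the single call per subtask here only shows how the two levels connect.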

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The taxonomy could be used to design evaluation suites that separately test each line.
Hybrid models that combine elements from more than one line may become a natural next step once the categories are established.
Making the survey's repository the standard reference list would reduce duplication in future VLA papers.
The three-line split may influence how funding and conference tracks organize embodied-AI research.

Load-bearing premise

Existing VLA literature can be partitioned into these three lines without major omissions or overlaps that would require a different structure.

What would settle it

Publication of a substantial VLA paper whose method falls outside all three lines or requires splitting or merging the categories to accommodate it.

Figures

Figures reproduced from arXiv: 2405.14093 by Irwin King, Jianye Hao, Yueen Ma, Yuzheng Zhuang, Zixing Song.

**Figure 2.** (a) Venn diagram that outlines the main concepts in embodied AI discussed in this article. (b) Timelines that trace the evolution from unimodal ...

**Figure 3.** Some approaches focus on individual components of ...

**Figure 3.** Taxonomy of VLA models. The organization of this survey follows this taxonomy.

**Figure 4.** Illustration of a hierarchical robot policy. The high-level task planner decomposes the user instruction into subtasks, which are then executed step by ...

**Figure 6.** Different approaches to connect LLM to multimodal modules in ...

**Figure 7.** A brief timeline of pivotal unimodal models leading to the development of vision-language-action models, organized by their publication years. Details ...

**Figure 8.** The growing scale of unimodal models over the years.

**Figure 9.** Timeline of VLA models from 2020 to 2025. Bracketed numbers indicate the publication count for the corresponding year or institute.

**Figure 10.** Visualization of the VLA research landscape.

**Figure 11.** Quantitative analysis of VLA development trends.

**Figure 12.** VLA research output and impact by institution.

Original abstract

Embodied AI is widely recognized as a cornerstone of artificial general intelligence (AGI) because it involves controlling embodied agents to perform tasks in the physical world. Building on the success of large language models (LLMs) and vision-language models (VLMs), a new category of multimodal models -- referred to as vision-language-action (VLA) models -- has emerged to address language-conditioned robotic tasks in embodied AI by leveraging their distinct ability to generate actions. The recent proliferation of VLAs necessitates a comprehensive survey to capture the rapidly evolving landscape. To this end, we present the first survey on VLAs for embodied AI. This work provides a detailed taxonomy of VLAs, organized into three major lines of research. The first line focuses on individual components of VLAs. The second line is dedicated to developing VLA-based control policies adept at predicting low-level actions. The third line comprises high-level task planners capable of decomposing long-horizon tasks into a sequence of subtasks, thereby guiding VLAs to follow more general user instructions. Furthermore, we provide an extensive summary of relevant resources, including datasets, simulators, and benchmarks. Finally, we discuss the challenges facing VLAs and outline promising future directions in embodied AI. A curated repository associated with this survey is available at: https://github.com/yueen-ma/Awesome-VLA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A first survey on VLAs that collects resources and sketches a three-way taxonomy, though the categories overlap more than the structure suggests.

This paper claims to be the first survey on vision-language-action models for embodied AI and delivers a taxonomy split into individual components, low-level control policies, and high-level task planners, along with a compiled list of datasets, simulators, and benchmarks plus a GitHub repo. That resource list and the repo are the parts that will actually get used. The abstract and structure show a logical grouping of existing work without obvious internal contradictions, and the discussion of challenges like generalization feels standard but relevant for the field. The taxonomy itself is the main organizational move, and it draws from cited prior literature rather than inventing new models or results.

The soft spot is the partition. Many VLA papers adapt a shared VLM backbone for action output and then apply it across both short-horizon control and longer task decomposition, so assigning them cleanly to one line or another often feels forced. The stress-test concern about non-disjoint categories holds up on the description given. This is not a fatal issue for a survey, but it means the claim of a comprehensive three-line organization is a bit neater on paper than in the actual literature.

The work is for people who need a quick map of the VLA space or are starting projects in language-conditioned robotics. It deserves peer review because the resource compilation is concrete and the timing is right for a field that is moving fast, even if the taxonomy needs some tightening in revision.

Referee Report

1 major / 2 minor

Summary. The manuscript presents the first survey on vision-language-action (VLA) models for embodied AI. It proposes a taxonomy dividing the literature into three major lines: individual components of VLAs, VLA-based low-level control policies, and high-level task planners for long-horizon tasks. The paper also compiles resources such as datasets, simulators, and benchmarks, and discusses challenges and future directions.

Significance. If the taxonomy provides a clean partition of the VLA literature, the survey would be a significant organizational contribution to the field by structuring the rapidly growing body of work and highlighting key resources. The inclusion of a curated repository adds to its utility for researchers.

major comments (1)

[Taxonomy (abstract and main taxonomy section)] Taxonomy (as outlined in the abstract and detailed in the body): The three-line taxonomy (individual components; low-level control policies; high-level task planners) risks non-disjoint categories. Many cited works modify a shared VLM backbone for action output and apply it to both short-horizon control and long-horizon decomposition, making single-line assignment arbitrary and potentially leading to overlaps or forced binning. This directly affects the central claim that the taxonomy comprehensively organizes the field without major omissions or overlaps.

minor comments (2)

[Abstract and Introduction] The claim to present the 'first survey' would benefit from a short explicit comparison to prior related surveys on VLMs or embodied AI in the introduction to substantiate novelty.
[Resources section] In the resources summary, include explicit inclusion criteria and note any deliberate omissions for datasets, simulators, and benchmarks to improve transparency.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our survey as the first on VLAs and for the constructive feedback on the taxonomy. We address the major comment point by point below.

Referee: [Taxonomy (abstract and main taxonomy section)] Taxonomy (as outlined in the abstract and detailed in the body): The three-line taxonomy (individual components; low-level control policies; high-level task planners) risks non-disjoint categories. Many cited works modify a shared VLM backbone for action output and apply it to both short-horizon control and long-horizon decomposition, making single-line assignment arbitrary and potentially leading to overlaps or forced binning. This directly affects the central claim that the taxonomy comprehensively organizes the field without major omissions or overlaps.

Authors: We acknowledge the validity of this observation. While the taxonomy is structured around the primary research focus of each work (component-level innovations, low-level action generation, or high-level task decomposition), it is true that some models built on shared VLM backbones can be applied or extended across horizons, creating potential boundary cases. To address this, we will add an explicit discussion in the taxonomy section (and a brief note in the abstract) clarifying the classification criteria, noting that assignment is based on the main contribution rather than all possible uses, and providing examples of works that span lines. This revision will improve transparency without requiring a restructuring of the three lines, which we maintain remain useful for organizing the literature by research objective.

Revision: partial
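
The classification rule described in this response (one primary line per work, with spanning uses recorded separately) could be represented as below. This is a hedged sketch, not the survey's actual data model: `Line`, `Entry`, `boundary_cases`, and the two example works are hypothetical and exist only to show how boundary cases would surface.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set


class Line(Enum):
    # The three research lines named in the taxonomy.
    COMPONENTS = "individual components"
    CONTROL_POLICY = "low-level control policy"
    TASK_PLANNER = "high-level task planner"


@dataclass
class Entry:
    title: str
    primary: Line  # assigned by main contribution
    also_spans: Set[Line] = field(default_factory=set)  # other lines the work touches


def boundary_cases(entries: List[Entry]) -> List[str]:
    """Return titles of works that extend beyond their primary line."""
    return [e.title for e in entries if e.also_spans - {e.primary}]


# Hypothetical entries, for illustration only.
entries = [
    Entry("Work A (hypothetical)", Line.CONTROL_POLICY),
    Entry("Work B (hypothetical)", Line.CONTROL_POLICY, {Line.TASK_PLANNER}),
]
spanning = boundary_cases(entries)
```

Keeping a single `primary` field preserves the three-line structure, while `also_spans` makes the overlap the referee raised explicit rather than hidden.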

Circularity Check

0 steps flagged

No circularity: survey taxonomy is an author-proposed organizational structure with no derivations, equations, or self-referential reductions.

This is a literature survey paper whose central contribution is a proposed three-line taxonomy of existing VLA work. The taxonomy is presented as an organizing framework rather than derived from any equations, fitted parameters, or first-principles results. No load-bearing steps reduce by construction to the paper's own inputs; all cited works are external. The claim of being the 'first survey' is a factual assertion about coverage, not a mathematical derivation. Self-citations, if present, are not used to justify uniqueness theorems or force the taxonomy. The structure is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are introduced because the paper is a literature survey rather than a theoretical or experimental contribution.

Forward citations

Cited by 58 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

FlowHijack: A Dynamics-Aware Backdoor Attack on Flow-Matching Vision-Language-Action Models
cs.CV 2026-03 unverdicted novelty 8.0

FlowHijack is the first dynamics-aware backdoor attack on flow-matching VLAs that achieves high success rates with stealthy triggers while preserving benign performance and making malicious actions kinematically indis...
4DLidarOpen: An Open 4D FMCW Lidar Dataset for Motion-Aware Autonomous Driving
cs.RO 2026-05 unverdicted novelty 7.0

4DLidarOpen is a new open dataset providing synchronized 4D FMCW Lidar velocity measurements, multi-Lidar and camera data, and 3D bounding-box annotations with track IDs to support benchmarks on 3D detection, BEV segm...
RotVLA: Rotational Latent Action for Vision-Language-Action Model
cs.RO 2026-05 unverdicted novelty 7.0

RotVLA models latent actions as continuous SO(n) rotations with triplet-frame supervision and flow-matching to reach 98.2% success on LIBERO and 89.6%/88.5% on RoboTwin2.0 using a 1.7B-parameter model.
Dynamic Execution Commitment of Vision-Language-Action Models
cs.CV 2026-05 unverdicted novelty 7.0

A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
cs.RO 2026-05 unverdicted novelty 7.0

Pace-and-Path Correction decomposes a quadratic cost minimization into orthogonal pace and path channels to correct chunked actions in VLA models, raising success rates by up to 28.8% in dynamic settings.
Towards Backdoor-Based Ownership Verification for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

GuardVLA embeds a stealthy backdoor watermark in VLAs via secret messages in visual data and uses a swap-and-detect mechanism for post-release ownership verification that preserves task performance.
ECHO: Continuous Hierarchical Memory for Vision-Language-Action Models
cs.RO 2026-05 unverdicted novelty 7.0

ECHO organizes VLA experiences into a hierarchical memory tree in hyperbolic space via autoencoder and entailment constraints, delivering a 12.8% success-rate gain on LIBERO-Long over the pi0 baseline.
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
cs.RO 2026-05 unverdicted novelty 7.0

VLA-GSE improves VLA adaptation by initializing generalized shared experts and specialized routed experts via spectral decomposition of the backbone, outperforming full fine-tuning and other PEFT methods on robotic be...
CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 7.0

CoRAL lets LLMs act as adaptive cost designers for motion planners while using VLM priors and online identification to handle unknown physics, achieving over 50% higher success rates than baselines in unseen contact-r...
[Emerging Ideas] Artificial Tripartite Intelligence: A Bio-Inspired, Sensor-First Architecture for Physical AI
cs.AI 2026-04 unverdicted novelty 7.0

ATI is a tripartite bio-inspired architecture for physical AI that co-designs sensing and inference, shown in a camera prototype to raise accuracy from 53.8% to 88% and cut remote invocations by 43.3%.
Deformation-based In-Context Learning for Point Cloud Understanding
cs.CV 2026-04 unverdicted novelty 7.0

DeformPIC deforms query point clouds under prompt guidance for in-context learning, outperforming prior methods with lower Chamfer Distance on reconstruction, denoising, and registration tasks.
HeiSD: Hybrid Speculative Decoding for Embodied Vision-Language-Action Models with Kinematic Awareness
cs.RO 2026-03 unverdicted novelty 7.0

HeiSD delivers up to 2.45x faster inference for embodied VLA models by hybridizing speculative decoding with kinematic boundary detection and error-mitigation tricks while preserving task success rates.
KERV: Kinematic-Rectified Speculative Decoding for Embodied VLA Models
cs.RO 2026-03 unverdicted novelty 7.0

KERV integrates kinematic Kalman Filter predictions with speculative decoding in VLA models to achieve 27-37% faster inference while maintaining nearly the same task success rates.
RL-VLA$^3$: A Flexible and Asynchronous Reinforcement Learning Framework for VLA Training
cs.AI 2026-02 unverdicted novelty 7.0

RL-VLA³ is an asynchronous RL framework for VLA training that delivers up to 85.2% higher throughput than synchronous baselines while preserving identical sample efficiency and scaling to 256 GPUs.
D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
cs.AI 2026-05 unverdicted novelty 6.0

D-VLA introduces plane decoupling and a swimlane asynchronous pipeline to achieve high-concurrency RL training and linear scalability for billion- to trillion-parameter vision-language-action models.
D-VLA: A High-Concurrency Distributed Asynchronous Reinforcement Learning Framework for Vision-Language-Action Models
cs.AI 2026-05 unverdicted novelty 6.0

D-VLA uses plane decoupling and a swimlane pipeline to deliver higher throughput and linear speedup than prior RL frameworks when training billion- and trillion-parameter VLA models on benchmarks like LIBERO.
Overcoming Dynamics-Blindness: Training-Free Pace-and-Path Correction for VLA Models
cs.RO 2026-05 unverdicted novelty 6.0

Pace-and-Path Correction is a closed-form inference-time operator that decomposes a quadratic cost minimization into orthogonal pace compression and path offset channels to correct dynamics-blindness in chunked-action...
Escaping the Diversity Trap in Robotic Manipulation via Anchor-Centric Adaptation
cs.RO 2026-05 unverdicted novelty 6.0

Anchor-Centric Adaptation escapes the diversity trap by prioritizing repeated demonstrations at core anchors over broad coverage, yielding higher success rates under fixed data budgets in robotic manipulation.
How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study
cs.CR 2026-05 unverdicted novelty 6.0

Vision-language models exhibit perceptual fragility and fail to consistently respect privacy constraints when operating in simulated physical environments, with performance declining in cluttered scenes and under conf...
How Far Are VLMs from Privacy Awareness in the Physical World? An Empirical Study
cs.CR 2026-05 unverdicted novelty 6.0

VLMs show consistent deficits in identifying sensitive items in cluttered scenes, adapting to social contexts, and resolving conflicts between commands and privacy constraints in a new physical simulator benchmark.
DexSim2Real: Foundation Model-Guided Sim-to-Real Transfer for Generalizable Dexterous Manipulation
cs.RO 2026-05 unverdicted novelty 6.0

DexSim2Real integrates FM-guided domain randomization, cross-attention visuo-tactile RL policies, and LLM-based progressive curricula to reach 78.2% average real-world success on six dexterous tasks with an 8.3% sim-t...
Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
cs.RO 2026-04 unverdicted novelty 6.0

DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...
Learning-augmented robotic automation for real-world manufacturing
cs.RO 2026-04 conditional novelty 6.0

A learning-augmented robotic system automated deformable cable insertion and soldering on a live electric-motor production line for 5 hours 10 minutes, producing 108 motors at 99.4% pass rate with under 20 minutes of ...
A Vision-Language-Action Model for Adaptive Ultrasound-Guided Needle Insertion and Needle Tracking
cs.RO 2026-04 unverdicted novelty 6.0

A VLA model with Cross-Depth Fusion tracking head and TraCon register unifies needle tracking and adaptive insertion control, outperforming prior trackers and manual operation in experiments.
AnchorRefine: Synergy-Manipulation Based on Trajectory Anchor and Residual Refinement for Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 6.0

AnchorRefine factorizes VLA action generation into a trajectory anchor for coarse planning and residual refinement for local corrections, improving success rates by up to 7.8% in simulation and 18% on real robots acro...
ProGAL-VLA: Grounded Alignment through Prospective Reasoning in Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 6.0

ProGAL-VLA uses 3D graphs, symbolic sub-goals, and a Grounding Alignment Contrastive loss to ground actions on verified embeddings, raising robustness from 30.3% to 71.5% and ambiguity AUROC to 0.81 on robotic benchmarks.
Grounding Hierarchical Vision-Language-Action Models Through Explicit Language-Action Alignment
cs.RO 2026-04 unverdicted novelty 6.0

A contrastive alignment model plus offline preference learning explicitly grounds hierarchical VLA language descriptions to actions and visuals on LanguageTable, achieving performance comparable to fully supervised fi...
E-VLA: Event-Augmented Vision-Language-Action Model for Dark and Blurred Scenes
cs.CV 2026-04 conditional novelty 6.0

E-VLA integrates event streams directly into VLA models via lightweight fusion, raising Pick-Place success from 0% to 60-90% at 20 lux and from 0% to 20-25% under severe motion blur.
Adaptive Action Chunking at Inference-time for Vision-Language-Action Models
cs.RO 2026-04 unverdicted novelty 6.0

Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.
Emergent Neural Automaton Policies: Learning Symbolic Structure from Visuomotor Trajectories
cs.RO 2026-03 unverdicted novelty 6.0

ENAP extracts an emergent Mealy automaton from visuomotor trajectories to act as a high-level planner for a low-level residual policy, yielding up to 27% higher success than end-to-end VLA policies in low-data regimes.
ThermoAct: Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making
cs.RO 2026-03 unverdicted novelty 6.0

ThermoAct integrates thermal imaging into VLA models via a VLM planner to enable robots to perceive physical properties like heat and improve safety over vision-only systems.
FASTER: Rethinking Real-Time Flow VLAs
cs.RO 2026-03 conditional novelty 6.0

FASTER uses a horizon-aware flow sampling schedule to compress immediate-action denoising to one step, slashing effective reaction latency in real-robot VLA deployments.
FASTER: Rethinking Real-Time Flow VLAs
cs.RO 2026-03 unverdicted novelty 6.0

FASTER adds a Horizon-Aware Schedule to flow VLAs that compresses immediate-action denoising to one step while keeping long-horizon trajectory quality, lowering real-robot reaction latency.
VLANeXt: Recipes for Building Strong VLA Models
cs.CV 2026-02 conditional novelty 6.0

VLANeXt distills 12 design insights from a unified VLA study into a model that outperforms prior methods on LIBERO benchmarks while releasing code for further exploration.
ActDistill: General Action-Guided Self-Derived Distillation for Efficient Vision-Language-Action Models
cs.CV 2025-11 conditional novelty 6.0

ActDistill transfers action knowledge from heavy VLA teacher models to lightweight students via graph-encapsulated hierarchies and action-guided dynamic routing, delivering over 50% computation reduction and 1.67x spe...
DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models
cs.LG 2025-10 unverdicted novelty 6.0

DeepThinkVLA shows CoT improves VLA models only under decoding and causal alignment, delivering 97% success on LIBERO and 21.7-point gains via hybrid attention and SFT-RL training.
LIBERO-PRO: Towards Robust and Fair Evaluation of Vision-Language-Action Models Beyond Memorization
cs.CV 2025-10 conditional novelty 6.0

LIBERO-PRO shows VLA models collapse from over 90% to 0% accuracy under perturbations in objects, states, instructions, and environments, exposing memorization instead of genuine comprehension.
Block-wise Adaptive Caching for Accelerating Diffusion Policy
cs.AI 2025-06 unverdicted novelty 6.0

BAC accelerates transformer-based Diffusion Policy up to 3x by block-level adaptive feature caching using an Adaptive Caching Scheduler and Bubbling Union Algorithm to control error propagation.
Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations
cs.CV 2024-12 unverdicted novelty 6.0

Video Prediction Policy conditions robot action learning on future-frame predictions inside fine-tuned video diffusion models, yielding 18.6% relative gains on Calvin ABC-D and 31.6% higher real-world success rates.
Language Conditioned Multi-Finger Dexterous Manipulation Enabled by Physical Compliance and Switching of Controllers
cs.RO 2024-10 unverdicted novelty 6.0

A hybrid event-driven switching system pairs VLA models with lightweight dexterous policies on a compliant anthropomorphic hand to perform language-conditioned multi-finger tasks with cross-embodiment modularity.
Anytime Training with Schedule-Free Spectral Optimization
cs.LG 2026-05 unverdicted novelty 5.0

SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.
PRIME: Physically-consistent Robotic Inertial and Motion Estimation for Legged and Humanoid Robots
cs.RO 2026-05 unverdicted novelty 5.0

PRIME is a MAP optimization framework that refines onboard kinematics into dynamically consistent trajectories for legged robots while jointly estimating contact forces and inertial parameters using differentiable smo...
DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization
cs.RO 2026-05 unverdicted novelty 5.0

DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.
Dynamic Execution Commitment of Vision-Language-Action Models
cs.CV 2026-05 unverdicted novelty 5.0

A3 adaptively selects verifiable action prefixes in VLA models using group-sampled consensus and conditional re-decoding to balance robustness and speed without manual horizon tuning.
Sword: Style-Robust World Models as Simulators via Dynamic Latent Bootstrapping for VLA Policy Post-Training
cs.CV 2026-05 unverdicted novelty 5.0

Sword improves world model simulators for VLA policies by disentangling visual style from dynamics and bootstrapping latents for better consistency, outperforming baselines on LIBERO in generalization and RL post-trai...
VLA-GSE: Boosting Parameter-Efficient Fine-Tuning in VLA with Generalized and Specialized Experts
cs.RO 2026-05 unverdicted novelty 5.0

VLA-GSE uses spectral decomposition of the VLA backbone to create generalized and specialized experts, enabling effective robot task adaptation while updating only 2.51% of parameters and achieving 81.2% zero-shot suc...
CoRAL: Contact-Rich Adaptive LLM-based Control for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 5.0

CoRAL lets LLMs design objective functions for robot motion planners and uses vision-language models plus real-time identification to adapt to unknown physical properties, raising success rates by over 50 percent on n...
Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection
cs.RO 2026-04 unverdicted novelty 5.0

A dual VLM-VLA framework for long-horizon robot manipulation achieves 32.4% success on RMBench tasks versus 9.8% for the strongest baseline via structured memory and closed-loop adaptive replanning.
Agentic Physical AI toward a Domain-Specific Foundation Model for Nuclear Reactor Control
cs.AI 2025-12 unverdicted novelty 5.0

A compact language model trained on scaled synthetic nuclear reactor control data exhibits variance collapse and emergent concentration on a single actuation strategy driven by physical execution success.
SlotVLA: Towards Modeling of Object-Relation Representations in Robotic Manipulation
cs.RO 2025-11 unverdicted novelty 5.0

SlotVLA uses slot attention to model object-relation representations for multitask robotic manipulation, reducing visual tokens while achieving competitive generalization on the new LIBERO+ benchmark.
Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
cs.RO 2025-08 unverdicted novelty 5.0

This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.
NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks
cs.RO 2025-04 unverdicted novelty 5.0

NORA is a compact 3B-parameter VLA model trained on 970k robot demonstrations that outperforms larger VLA models in embodied tasks while using significantly less computational resources.
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning
cs.RO 2025-03 unverdicted novelty 5.0

SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.
Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts
cs.CV 2026-05 unverdicted novelty 4.0

Pre-VLA is a multimodal runtime verifier that predicts safety confidence and advantage scores for action chunks, raising closed-loop success rates on the LIBERO benchmark from 30.79% to 37.62%.
Position: Embodied AI Requires a Privacy-Utility Trade-off
cs.AI 2026-05 unverdicted novelty 4.0

Embodied AI requires treating privacy as a lifecycle architectural constraint rather than a stage-local feature, addressed via the proposed SPINE framework with a multi-criterion privacy classification matrix.
Large Language Models for Multi-Robot Systems: A Survey
cs.RO 2025-02 unverdicted novelty 4.0

A survey that categorizes LLM uses in multi-robot systems across task allocation, motion planning, action generation, and human interaction, while noting challenges and future research opportunities.
Large Language Model Agent: A Survey on Methodology, Applications and Challenges
cs.CL 2025-03 accept novelty 3.0

A survey that deconstructs LLM agent systems via a methodology-centered taxonomy linking design principles to emergent behaviors, applications, and challenges.
Redefining End-of-Life: Intelligent Automation for Electronics Remanufacturing Systems
eess.SY 2026-04 unverdicted novelty 2.0

A literature review of intelligent automation approaches using robotics, AI, and control for disassembly, inspection, sorting, and reprocessing of end-of-life electronics.

Reference graph

Works this paper leans on

300 extracted references · 300 canonical work pages · cited by 51 Pith papers · 47 internal anchors

[1]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. J. Joshi, R. Julian, D. Kalashnikov, Y. Kuang, I. Leal, L. Lee, T. E. Lee, S. Levine, Y. Lu, H. Michalewski, I. Mordatch, K. ...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Imagenet classification with deep convolutional neural networks

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012, pp. 1106–1114

work page 2012
[3]

Attention is all you need

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017, pp. 5998–6008

work page 2017
[4]

Human-level control through deep reinforcement learning

V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,” Nat., vol. 518, no. 7540, pp. 529–533, 2015

work page 2015
[5]

Learning hand-eye coordination for robotic grasping with large-scale data collection

S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand-eye coordination for robotic grasping with large-scale data collection,” in ISER, ser. Springer Proceedings in Advanced Robotics, vol. 1. Springer, 2016, pp. 173–184

work page 2016
[6]

Flamingo: a visual language model for few-shot learning

J. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. L. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan, “Flamingo: a visual language mod...

work page 2022
[7]

BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models

J. Li, D. Li, S. Savarese, and S. C. H. Hoi, “BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models,” in ICML, vol. 202. PMLR, 2023, pp. 19 730–19 742

work page 2023
[8]

Visual Instruction Tuning

H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” CoRR, vol. abs/2304.08485, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[9]

Inner monologue: Embodied reasoning through planning with language models

W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, P. Sermanet, T. Jackson, N. Brown, L. Luu, S. Levine, K. Hausman, and B. Ichter, “Inner monologue: Embodied reasoning through planning with language models,” in CoRL, vol. 205. PMLR, 2022, pp. 1769–1782

work page 2022
[10]

Do as I can, not as I say: Grounding language in robotic affordances

B. Ichter, A. Brohan, Y. Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Irpan, E. Jang, R. Julian, D. Kalashnikov, S. Levine, Y. Lu, C. Parada, K. Rao, P. Sermanet, A. Toshev, V. Vanhoucke, F. Xia, T. Xiao, P. Xu, M. Yan, N. Brown, M. Ahn, O. Cortes, N. Sievers, C. Tan, S. Xu, D. Reyes, J. Rettinghouse, J. Quiambao, P. Pastor, L. Luu, K....

work page 2022
[11]

Palm-e: An embodied multimodal language model

D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y. Chebotar, P. Sermanet, D. Duckworth, S. Levine, V. Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence, “Palm-e: An embodied multimodal language model,” in ICML, vol. 202. PMLR, 2023, pp. 8469–8488

work page 2023
[12]

Foundation models in robotics: Applications, challenges, and the future

R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y. Zhu, S. Song, A. Kapoor, K. Hausman, B. Ichter, D. Driess, J. Wu, C. Lu, and M. Schwager, “Foundation models in robotics: Applications, challenges, and the future,” CoRR, vol. abs/2312.07843, 2023. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 18

work page arXiv 2023
[13]

Large language models for robotics: Opportunities, challenges, and perspectives

J. Wang, Z. Wu, Y. Li, H. Jiang, P. Shu, E. Shi, H. Hu, C. Ma, Y. Liu, X. Wang, Y. Yao, X. Liu, H. Zhao, Z. Liu, H. Dai, L. Zhao, B. Ge, X. Li, T. Liu, and S. Zhang, “Large language models for robotics: Opportunities, challenges, and perspectives,” CoRR, vol. abs/2401.04334, 2024

work page arXiv 2024
[14]

Toward general-purpose robots via foundation models: A survey and meta-analysis

Y. Hu, Q. Xie, V. Jain, J. Francis, J. Patrikar, N. V. Keetha, S. Kim, Y. Xie, T. Zhang, S. Zhao, Y. Q. Chong, C. Wang, K. P. Sycara, M. Johnson-Roberson, D. Batra, X. Wang, S. A. Scherer, Z. Kira, F. Xia, and Y. Bisk, “Toward general-purpose robots via foundation models: A survey and meta-analysis,” CoRR, vol. abs/2312.08782, 2023

work page arXiv 2023
[15]

Real-world robot applications of foundation models: a review

K. Kawaharazuka, T. Matsushima, A. Gambardella, J. Guo, C. Paxton, and A. Zeng, “Real-world robot applications of foundation models: a review,” Adv. Robotics, vol. 38, no. 18, pp. 1232–1254, 2024

work page 2024
[16]

Decision transformer: Reinforcement learning via sequence modeling

L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, “Decision transformer: Reinforcement learning via sequence modeling,” in NeurIPS, 2021, pp. 15 084–15 097

work page 2021
[17]

Offline reinforcement learning as one big sequence modeling problem

M. Janner, Q. Li, and S. Levine, “Offline reinforcement learning as one big sequence modeling problem,” in NeurIPS, 2021, pp. 1273–1286

work page 2021
[18]

A generalist agent

S. E. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Gimenez, Y. Sulsky, J. Kay, J. T. Springenberg, T. Eccles, J. Bruce, A. Razavi, A. Edwards, N. Heess, Y. Chen, R. Hadsell, O. Vinyals, M. Bordbar, and N. de Freitas, “A generalist agent,” Trans. Mach. Learn. Res., vol. 2022, 2022

work page 2022
[19]

$\pi^{*}_{0.6}$: a VLA That Learns From Experience

A. A. Physical Intelligence and, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, D. Driess, M. Equi, A. Esmail, Y. Fang, C. Finn, C. Glossop, T. Godden, I. Goryachev, L. Groom, H. Hancock, K. Hausman, G. Hussein, B. Ichter, S. Jakubczak, R. Jen, T. Jones, B. Katz, L. Ke, C. Kuchi, M. Lamb, D. LeBlanc,...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Primitive skill-based robot learning from human evaluative feedback

A. Hiranaka, M. Hwang, S. Lee, C. Wang, L. Fei-Fei, J. Wu, and R. Zhang, “Primitive skill-based robot learning from human evaluative feedback,” in IROS, 2023, pp. 7817–7824

work page 2023
[21]

Reflexion: language agents with verbal reinforcement learning

N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: language agents with verbal reinforcement learning,” in NeurIPS, 2023

work page 2023
[22]

Eureka: Human-level reward design via coding large language models

Y. J. Ma, W. Liang, G. Wang, D. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar, “Eureka: Human-level reward design via coding large language models,” in ICLR, 2024

work page 2024
[23]

Learning transferable visual models from natural language supervision

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in ICML, vol. 139. PMLR, 2021, pp. 8748–8763

work page 2021
[24]

R3M: A universal visual representation for robot manipulation

S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta, “R3M: A universal visual representation for robot manipulation,” in CoRL, vol. 205. PMLR, 2022, pp. 892–909

work page 2022
[26]

VIP: towards universal visual reward and representation via value-implicit pre-training

Y. J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V. Kumar, and A. Zhang, “VIP: towards universal visual reward and representation via value-implicit pre-training,” in ICLR, 2023

work page 2023
[27]

Real-world robot learning with masked visual pre-training

I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell, “Real-world robot learning with masked visual pre-training,” in CoRL, vol. 205. PMLR, 2022, pp. 416–426

work page 2022
[28]

BERT: pre-training of deep bidirectional transformers for language understanding

J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT (1). Association for Computational Linguistics, 2019, pp. 4171–4186

work page 2019
[29]

Robot learning with sensorimotor pre-training

I. Radosavovic, B. Shi, L. Fu, K. Goldberg, T. Darrell, and J. Malik, “Robot learning with sensorimotor pre-training,” in CoRL, vol. 229. PMLR, 2023, pp. 683–693

work page 2023
[30]

Cliport: What and where pathways for robotic manipulation

M. Shridhar, L. Manuelli, and D. Fox, “Cliport: What and where pathways for robotic manipulation,” in CoRL, vol. 164. PMLR, 2021, pp. 894–906

work page 2021
[31]

Simple but effective: CLIP embeddings for embodied AI

A. Khandelwal, L. Weihs, R. Mottaghi, and A. Kembhavi, “Simple but effective: CLIP embeddings for embodied AI,” in CVPR. IEEE, 2022, pp. 14 809–14 818

work page 2022
[32]

Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation

S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, “Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation,” in CVPR. IEEE, 2023, pp. 23 171–23 181

work page 2023
[33]

Where are we in the search for an artificial visual cortex for embodied intelligence?

A. Majumdar, K. Yadav, S. Arnaud, Y. J. Ma, C. Chen, S. Silwal, A. Jain, V. Berges, T. Wu, J. Vakil, P. Abbeel, J. Malik, D. Batra, Y. Lin, O. Maksymets, A. Rajeswaran, and F. Meier, “Where are we in the search for an artificial visual cortex for embodied intelligence?” in NeurIPS, 2023

work page 2023
[34]

Language-driven representation learning for robotics

S. Karamcheti, S. Nair, A. S. Chen, T. Kollar, C. Finn, D. Sadigh, and P. Liang, “Language-driven representation learning for robotics,” in RSS, 2023

work page 2023
[35]

Dinov2: Learning robust visual features without supervision

M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P. Huang, S. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features without sup...

work page 2024
[36]

OpenVLA: An Open-Source Vision-Language-Action Model

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn, “Openvla: An open-source vision-language-action model,” CoRR, vol. abs/2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

W. Huang, C. Wang, Y. Li, R. Zhang, and L. Fei-Fei, “Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation,” CoRR, vol. abs/2409.01652, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Self-supervised learning from images with a joint-embedding predictive architecture

M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. G. Rabbat, Y. LeCun, and N. Ballas, “Self-supervised learning from images with a joint-embedding predictive architecture,” in CVPR. IEEE, 2023, pp. 15 619–15 629

work page 2023
[39]

Theia: Distilling diverse vision foundation models for robot learning

J. Shang, K. Schmeckpeper, B. B. May, M. V. Minniti, T. Kelestemur, D. Watkins, and L. Herlant, “Theia: Distilling diverse vision foundation models for robot learning,” in CoRL, ser. Proceedings of Machine Learning Research, vol. 270. PMLR, 2024, pp. 724–748

work page 2024
[40]

The unsurprising effectiveness of pre-trained vision models for control

S. Parisi, A. Rajeswaran, S. Purushwalkam, and A. Gupta, “The unsurprising effectiveness of pre-trained vision models for control,” in ICML, vol. 162. PMLR, 2022, pp. 17 359–17 371

work page 2022
[41]

A path towards autonomous machine intelligence

Y. LeCun, “A path towards autonomous machine intelligence,” 2022. [Online]. Available: https://openreview.net/pdf?id=BZ5a1r-kVsf

work page 2022
[42]

Distilled feature fields enable few-shot language-guided manipulation

W. Shen, G. Yang, A. Yu, J. Wong, L. P. Kaelbling, and P. Isola, “Distilled feature fields enable few-shot language-guided manipulation,” in CoRL, vol. 229. PMLR, 2023, pp. 405–424

work page 2023
[43]

3d-llm: Injecting the 3d world into large language models

Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan, “3d-llm: Injecting the 3d world into large language models,” in NeurIPS, 2023

work page 2023
[44]

3d gaussian splatting for real-time radiance field rendering

B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Trans. Graph., vol. 42, no. 4, pp. 139:1–139:14, 2023

work page 2023
[45]

Langsplat: 3d language gaussian splatting

M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister, “Langsplat: 3d language gaussian splatting,” in CVPR. IEEE, 2024, pp. 20 051–20 060

work page 2024
[46]

That sounds right: Auditory self-supervision for dynamic robot manipulation

A. Thankaraj and L. Pinto, “That sounds right: Auditory self-supervision for dynamic robot manipulation,” in CoRL, vol. 229. PMLR, 2023, pp. 1036–1049

work page 2023
[47]

Exploring visual pre-training for robot manipulation: Datasets, models and methods

Y. Jing, X. Zhu, X. Liu, Q. Sima, T. Yang, Y. Feng, and T. Kong, “Exploring visual pre-training for robot manipulation: Datasets, models and methods,” in IROS, 2023, pp. 11 390–11 395

work page 2023
[48]

Masked autoencoding for scalable and generalizable decision making

F. Liu, H. Liu, A. Grover, and P. Abbeel, “Masked autoencoding for scalable and generalizable decision making,” in NeurIPS, 2022

work page 2022
[49]

Mastering robot manipulation with multimodal prompts through pretraining and multi-task fine-tuning

J. Li, Q. Gao, M. Johnston, X. Gao, X. He, H. Shi, S. Shakiah, R. Ghanadan, and W. Y. Wang, “Mastering robot manipulation with multimodal prompts through pretraining and multi-task fine-tuning,” in ICML, 2024

work page 2024
[50]

SMART: self-supervised multi-task pretraining with control transformers

Y. Sun, S. Ma, R. Madaan, R. Bonatti, F. Huang, and A. Kapoor, “SMART: self-supervised multi-task pretraining with control transformers,” in ICLR, 2023

work page 2023
[51]

PACT: perception-action causal transformer for autoregressive robotics pre-training

R. Bonatti, S. Vemprala, S. Ma, F. Frujeri, S. Chen, and A. Kapoor, “PACT: perception-action causal transformer for autoregressive robotics pre-training,” in IROS, 2023, pp. 3621–3627

work page 2023
[52]

Video pretraining (VPT): learning to act by watching unlabeled online videos

B. Baker, I. Akkaya, P. Zhokhov, J. Huizinga, J. Tang, A. Ecoffet, B. Houghton, R. Sampedro, and J. Clune, “Video pretraining (VPT): learning to act by watching unlabeled online videos,” in NeurIPS, 2022

work page 2022
[53]

Unleashing large-scale video generative pre-training for visual robot manipulation

H. Wu, Y. Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong, “Unleashing large-scale video generative pre-training for visual robot manipulation,” in ICLR, 2024

work page 2024
[54]

Dream to control: Learning behaviors by latent imagination

D. Hafner, T. P. Lillicrap, J. Ba, and M. Norouzi, “Dream to control: Learning behaviors by latent imagination,” in ICLR, 2020

work page 2020
[55]

Mastering atari with discrete world models

D. Hafner, T. P. Lillicrap, M. Norouzi, and J. Ba, “Mastering atari with discrete world models,” in ICLR, 2021

work page 2021
[56]

Mastering Diverse Domains through World Models

D. Hafner, J. Pasukonis, J. Ba, and T. P. Lillicrap, “Mastering diverse domains through world models,” CoRR, vol. abs/2301.04104, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[57]

Daydreamer: World models for physical robot learning

P. Wu, A. Escontrela, D. Hafner, P. Abbeel, and K. Goldberg, “Daydreamer: World models for physical robot learning,” in CoRL, vol. 205. PMLR, 2022, pp. 2226–2240

work page 2022
[58]

Transformers are sample-efficient world models

V. Micheli, E. Alonso, and F. Fleuret, “Transformers are sample-efficient world models,” in ICLR, 2023

work page 2023
[59]

Transformer-based world models are happy with 100k interactions

J. Robine, M. Höftmann, T. Uelwer, and S. Harmeling, “Transformer-based world models are happy with 100k interactions,” in ICLR, 2023

work page 2023
[60]

Do embodied agents dream of pixelated sheep: Embodied decision making using language guided world modelling

K. Nottingham, P. Ammanabrolu, A. Suhr, Y. Choi, H. Hajishirzi, S. Singh, and R. Fox, “Do embodied agents dream of pixelated sheep: Embodied decision making using language guided world modelling,” in ICML, vol. 202. PMLR, 2023, pp. 26 311–26 325

work page 2023
[61]

No change, no gain: Empowering graph neural networks with expected model change maximization for active learning

Z. Song, Y. Zhang, and I. King, “No change, no gain: Empowering graph neural networks with expected model change maximization for active learning,” in NeurIPS, 2023

work page 2023
[62]

Graph component contrastive learning for concept relatedness estimation

Y. Ma, Z. Song, X. Hu, J. Li, Y. Zhang, and I. King, “Graph component contrastive learning for concept relatedness estimation,” in AAAI. AAAI Press, 2023, pp. 13 362–13 370

work page 2023
[63]

Leveraging pre-trained large language models to construct and utilize world models for model-based task planning

L. Guan, K. Valmeekam, S. Sreedharan, and S. Kambhampati, “Leveraging pre-trained large language models to construct and utilize world models for model-based task planning,” in NeurIPS, 2023

work page 2023
[64]

LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

B. Liu, Y. Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas, and P. Stone, “LLM+P: empowering large language models with optimal planning proficiency,” CoRR, vol. abs/2304.11477, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[65]

Reasoning with language model is planning with world model

S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, and Z. Hu, “Reasoning with language model is planning with world model,” in EMNLP, 2023, pp. 8154–8173

work page 2023
[66]

Tree-planner: Efficient close-loop task planning with large language models

M. Hu, Y. Mu, X. Yu, M. Ding, S. Wu, W. Shao, Q. Chen, B. Wang, Y. Qiao, and P. Luo, “Tree-planner: Efficient close-loop task planning with large language models,” in ICLR, 2024

work page 2024
[67]

Large language models as commonsense knowledge for large-scale task planning

Z. Zhao, W. S. Lee, and D. Hsu, “Large language models as commonsense knowledge for large-scale task planning,” in NeurIPS, 2023

work page 2023
[68]

Video generation models as world simulators

OpenAI. (2024) Video generation models as world simulators. [Online]. Available: https://openai.com/index/video-generation-models-as-world-simulators/

work page 2024
[69]

Is sora a world simulator? A comprehensive survey on general world models and beyond

Z. Zhu, X. Wang, W. Zhao, C. Min, N. Deng, M. Dou, Y. Wang, B. Shi, K. Wang, C. Zhang, Y. You, Z. Zhang, D. Zhao, L. Xiao, J. Zhao, J. Lu, and G. Huang, “Is sora a world simulator? A comprehensive survey on general world models and beyond,” CoRR, vol. abs/2405.03520, 2024

work page arXiv 2024
[70]

Genie: Generative interactive environments

J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, Y. Aytar, S. Bechtle, F. M. P. Behbahani, S. C. Y. Chan, N. Heess, L. Gonzalez, S. Osindero, S. Ozair, S. E. Reed, J. Zhang, K. Zolna, J. Clune, N. de Freitas, S. Singh, and T. Rocktäschel, “Genie: Generative interactive environmen...

work page 2024
[71]

3d-vla: A 3d vision-language-action generative world model

H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y. Du, Y. Hong, and C. Gan, “3d-vla: A 3d vision-language-action generative world model,” in ICML, 2024

work page 2024
[72]

Learning interactive real-world simulators

S. Yang, Y. Du, S. K. S. Ghasemipour, J. Tompson, L. P. Kaelbling, D. Schuurmans, and P. Abbeel, “Learning interactive real-world simulators,” in ICLR, 2024

work page 2024
[73]

Language models meet world models: Embodied experiences enhance language models

J. Xiang, T. Tao, Y. Gu, T. Shu, Z. Wang, Z. Yang, and Z. Hu, “Language models meet world models: Embodied experiences enhance language models,” in NeurIPS, 2023

work page 2023
[74]

Large language models are zero-shot reasoners

T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” in NeurIPS, 2022

work page 2022
[75]

Chain-of-thought prompting elicits reasoning in large language models

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in NeurIPS, 2022

work page 2022
[76]

Thinkbot: Embodied instruction following with thought chain reasoning

G. Lu, Z. Wang, C. Liu, J. Lu, and Y. Tang, “Thinkbot: Embodied instruction following with thought chain reasoning,” CoRR, vol. abs/2312.07062, 2023

work page arXiv 2023
[77]

React: Synergizing reasoning and acting in language models

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” in ICLR, 2023

work page 2023
[78]

RAT: retrieval augmented thoughts elicit context-aware reasoning in long-horizon generation

Z. Wang, A. Liu, H. Lin, J. Li, X. Ma, and Y. Liang, “RAT: retrieval augmented thoughts elicit context-aware reasoning in long-horizon generation,” CoRR, vol. abs/2403.05313, 2024

work page arXiv 2024
[79]

Robotic Control via Embodied Chain-of-Thought Reasoning

M. Zawalski, W. Chen, K. Pertsch, O. Mees, C. Finn, and S. Levine, “Robotic control via embodied chain-of-thought reasoning,” CoRR, vol. abs/2407.08693, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[80]

Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, T. Lin, G. Wetzstein, M. Liu, and D. Xiang, “Cot-vla: Visual chain-of-thought reasoning for vision-language-action models,” in CVPR, 2025, pp. 1702–1713

work page 2025

Showing first 80 references.

[1] [1]

RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

A. Brohan, N. Brown, J. Carbajal, Y . Chebotar, X. Chen, K. Choro- manski, T. Ding, D. Driess, A. Dubey, C. Finn, P. Florence, C. Fu, M. G. Arenas, K. Gopalakrishnan, K. Han, K. Hausman, A. Herzog, J. Hsu, B. Ichter, A. Irpan, N. J. Joshi, R. Julian, D. Kalashnikov, Y . Kuang, I. Leal, L. Lee, T. E. Lee, S. Levine, Y . Lu, H. Michalewski, I. Mordatch, K. ...

work page internal anchor Pith review Pith/arXiv arXiv 2023

[2] [2]

Imagenet classification with deep convolutional neural networks,

A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” inNIPS, 2012, pp. 1106– 1114

work page 2012

[3] [3]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in NIPS, 2017, pp. 5998–6008

work page 2017

[4] [4]

Human-level control through deep reinforcement learning,

V . Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. A. Riedmiller, A. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis, “Human-level control through deep reinforcement learning,”Nat., vol. 518, no. 7540, pp. 529–533, 2015

work page 2015

[5] [5]

Learning hand- eye coordination for robotic grasping with large-scale data collection,

S. Levine, P. Pastor, A. Krizhevsky, and D. Quillen, “Learning hand- eye coordination for robotic grasping with large-scale data collection,” inISER, ser. Springer Proceedings in Advanced Robotics, vol. 1. Springer, 2016, pp. 173–184

work page 2016

[6] [6]

Flamingo: a visual language model for few-shot learning,

J. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y . Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M. Monteiro, J. L. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira, O. Vinyals, A. Zisserman, and K. Simonyan, “Flamingo: a visual language mod...

work page 2022

[7] [7]

BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models,

J. Li, D. Li, S. Savarese, and S. C. H. Hoi, “BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models,” inICML, vol. 202. PMLR, 2023, pp. 19 730– 19 742

work page 2023

[8] [8]

Visual Instruction Tuning

H. Liu, C. Li, Q. Wu, and Y . J. Lee, “Visual instruction tuning,”CoRR, vol. abs/2304.08485, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[9] [9]

Inner mono- logue: Embodied reasoning through planning with language models,

W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y . Chebotar, P. Sermanet, T. Jackson, N. Brown, L. Luu, S. Levine, K. Hausman, and B. Ichter, “Inner mono- logue: Embodied reasoning through planning with language models,” inCoRL, vol. 205. PMLR, 2022, pp. 1769–1782

work page 2022

[10] [10]

Do as I can, not as I say: Grounding language in robotic affordances,

B. Ichter, A. Brohan, Y . Chebotar, C. Finn, K. Hausman, A. Herzog, D. Ho, J. Ibarz, A. Irpan, E. Jang, R. Julian, D. Kalashnikov, S. Levine, Y . Lu, C. Parada, K. Rao, P. Sermanet, A. Toshev, V . Vanhoucke, F. Xia, T. Xiao, P. Xu, M. Yan, N. Brown, M. Ahn, O. Cortes, N. Sievers, C. Tan, S. Xu, D. Reyes, J. Rettinghouse, J. Quiambao, P. Pastor, L. Luu, K....

work page 2022

[11] [11]

Palm-e: An embodied multimodal language model,

D. Driess, F. Xia, M. S. M. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, W. Huang, Y . Chebotar, P. Sermanet, D. Duckworth, S. Levine, V . Vanhoucke, K. Hausman, M. Toussaint, K. Greff, A. Zeng, I. Mordatch, and P. Florence, “Palm-e: An embodied multimodal language model,” inICML, vol. 202. PMLR, 2023, pp. 8469–8488

work page 2023

[12] [12]

Foundation models in robotics: Applications, challenges, and the future,

R. Firoozi, J. Tucker, S. Tian, A. Majumdar, J. Sun, W. Liu, Y . Zhu, S. Song, A. Kapoor, K. Hausman, B. Ichter, D. Driess, J. Wu, C. Lu, and M. Schwager, “Foundation models in robotics: Applications, challenges, and the future,”CoRR, vol. abs/2312.07843, 2023. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 18

work page arXiv 2023

[13] [13]

Large language models for robotics: Op- portunities, challenges, and perspectives,

J. Wang, Z. Wu, Y . Li, H. Jiang, P. Shu, E. Shi, H. Hu, C. Ma, Y . Liu, X. Wang, Y . Yao, X. Liu, H. Zhao, Z. Liu, H. Dai, L. Zhao, B. Ge, X. Li, T. Liu, and S. Zhang, “Large language models for robotics: Op- portunities, challenges, and perspectives,”CoRR, vol. abs/2401.04334, 2024

work page arXiv 2024

[14] [14]

Toward general-purpose robots via foundation models: A survey and meta-analysis,

Y . Hu, Q. Xie, V . Jain, J. Francis, J. Patrikar, N. V . Keetha, S. Kim, Y . Xie, T. Zhang, S. Zhao, Y . Q. Chong, C. Wang, K. P. Sycara, M. Johnson-Roberson, D. Batra, X. Wang, S. A. Scherer, Z. Kira, F. Xia, and Y . Bisk, “Toward general-purpose robots via foundation models: A survey and meta-analysis,”CoRR, vol. abs/2312.08782, 2023

work page arXiv 2023

[15] [15]

Real-world robot applications of foundation models: a review,

K. Kawaharazuka, T. Matsushima, A. Gambardella, J. Guo, C. Paxton, and A. Zeng, “Real-world robot applications of foundation models: a review,”Adv. Robotics, vol. 38, no. 18, pp. 1232–1254, 2024

work page 2024

[16] [16]

Decision transformer: Re- inforcement learning via sequence modeling,

L. Chen, K. Lu, A. Rajeswaran, K. Lee, A. Grover, M. Laskin, P. Abbeel, A. Srinivas, and I. Mordatch, “Decision transformer: Re- inforcement learning via sequence modeling,” inNeurIPS, 2021, pp. 15 084–15 097

work page 2021

[17] [17]

Offline reinforcement learning as one big sequence modeling problem,

M. Janner, Q. Li, and S. Levine, “Offline reinforcement learning as one big sequence modeling problem,” inNeurIPS, 2021, pp. 1273–1286

work page 2021

[18] [18]

A generalist agent,

S. E. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Gimenez, Y . Sulsky, J. Kay, J. T. Springenberg, T. Eccles, J. Bruce, A. Razavi, A. Edwards, N. Heess, Y . Chen, R. Hadsell, O. Vinyals, M. Bordbar, and N. de Freitas, “A generalist agent,”Trans. Mach. Learn. Res., vol. 2022, 2022

work page 2022

[19] [19]

$\pi^{*}_{0.6}$: a VLA That Learns From Experience

A. A. Physical Intelligence and, R. Aniceto, A. Balakrishna, K. Black, K. Conley, G. Connors, J. Darpinian, K. Dhabalia, J. DiCarlo, D. Driess, M. Equi, A. Esmail, Y . Fang, C. Finn, C. Glossop, T. God- den, I. Goryachev, L. Groom, H. Hancock, K. Hausman, G. Hussein, B. Ichter, S. Jakubczak, R. Jen, T. Jones, B. Katz, L. Ke, C. Kuchi, M. Lamb, D. LeBlanc,...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

Primitive skill-based robot learning from human evaluative feedback,

A. Hiranaka, M. Hwang, S. Lee, C. Wang, L. Fei-Fei, J. Wu, and R. Zhang, “Primitive skill-based robot learning from human evaluative feedback,” inIROS, 2023, pp. 7817–7824

work page 2023

[21] [21]

Reflexion: language agents with verbal reinforcement learning,

N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: language agents with verbal reinforcement learning,” in NeurIPS, 2023

work page 2023

[22]

Eureka: Human-level reward design via coding large language models

Y. J. Ma, W. Liang, G. Wang, D. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar, “Eureka: Human-level reward design via coding large language models,” in ICLR, 2024

[23]

Learning transferable visual models from natural language supervision

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in ICML, vol. 139. PMLR, 2021, pp. 8748–8763

[24]

R3M: A universal visual representation for robot manipulation

S. Nair, A. Rajeswaran, V. Kumar, C. Finn, and A. Gupta, “R3M: A universal visual representation for robot manipulation,” in CoRL. PMLR, 2022, pp. 892–909

[26]

VIP: towards universal visual reward and representation via value-implicit pre-training

Y. J. Ma, S. Sodhani, D. Jayaraman, O. Bastani, V. Kumar, and A. Zhang, “VIP: towards universal visual reward and representation via value-implicit pre-training,” in ICLR, 2023

[27]

Real-world robot learning with masked visual pre-training

I. Radosavovic, T. Xiao, S. James, P. Abbeel, J. Malik, and T. Darrell, “Real-world robot learning with masked visual pre-training,” in CoRL, vol. 205. PMLR, 2022, pp. 416–426

[28]

BERT: pre-training of deep bidirectional transformers for language understanding

J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in NAACL-HLT (1). Association for Computational Linguistics, 2019, pp. 4171–4186

[29]

Robot learning with sensorimotor pre-training

I. Radosavovic, B. Shi, L. Fu, K. Goldberg, T. Darrell, and J. Malik, “Robot learning with sensorimotor pre-training,” in CoRL, vol. 229. PMLR, 2023, pp. 683–693

[30]

Cliport: What and where pathways for robotic manipulation

M. Shridhar, L. Manuelli, and D. Fox, “Cliport: What and where pathways for robotic manipulation,” in CoRL, vol. 164. PMLR, 2021, pp. 894–906

[31]

Simple but effective: CLIP embeddings for embodied AI

A. Khandelwal, L. Weihs, R. Mottaghi, and A. Kembhavi, “Simple but effective: CLIP embeddings for embodied AI,” in CVPR. IEEE, 2022, pp. 14 809–14 818

[32]

Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation

S. Y. Gadre, M. Wortsman, G. Ilharco, L. Schmidt, and S. Song, “Cows on pasture: Baselines and benchmarks for language-driven zero-shot object navigation,” in CVPR. IEEE, 2023, pp. 23 171–23 181

[33]

Where are we in the search for an artificial visual cortex for embodied intelligence?

A. Majumdar, K. Yadav, S. Arnaud, Y. J. Ma, C. Chen, S. Silwal, A. Jain, V. Berges, T. Wu, J. Vakil, P. Abbeel, J. Malik, D. Batra, Y. Lin, O. Maksymets, A. Rajeswaran, and F. Meier, “Where are we in the search for an artificial visual cortex for embodied intelligence?” in NeurIPS, 2023

[34]

Language-driven representation learning for robotics

S. Karamcheti, S. Nair, A. S. Chen, T. Kollar, C. Finn, D. Sadigh, and P. Liang, “Language-driven representation learning for robotics,” in RSS, 2023

[35]

Dinov2: Learning robust visual features without supervision

M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P. Huang, S. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “Dinov2: Learning robust visual features without sup...

[36]

OpenVLA: An Open-Source Vision-Language-Action Model

M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. P. Foster, G. Lam, P. Sanketi, Q. Vuong, T. Kollar, B. Burchfiel, R. Tedrake, D. Sadigh, S. Levine, P. Liang, and C. Finn, “Openvla: An open-source vision-language-action model,” CoRR, vol. abs/2406.09246, 2024

[37]

ReKep: Spatio-Temporal Reasoning of Relational Keypoint Constraints for Robotic Manipulation

W. Huang, C. Wang, Y. Li, R. Zhang, and L. Fei-Fei, “Rekep: Spatio-temporal reasoning of relational keypoint constraints for robotic manipulation,” CoRR, vol. abs/2409.01652, 2024

[38]

Self-supervised learning from images with a joint-embedding predictive architecture

M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. G. Rabbat, Y. LeCun, and N. Ballas, “Self-supervised learning from images with a joint-embedding predictive architecture,” in CVPR. IEEE, 2023, pp. 15 619–15 629

[39]

Theia: Distilling diverse vision foundation models for robot learning

J. Shang, K. Schmeckpeper, B. B. May, M. V. Minniti, T. Kelestemur, D. Watkins, and L. Herlant, “Theia: Distilling diverse vision foundation models for robot learning,” in CoRL, ser. Proceedings of Machine Learning Research, vol. 270. PMLR, 2024, pp. 724–748

[40]

The unsurprising effectiveness of pre-trained vision models for control

S. Parisi, A. Rajeswaran, S. Purushwalkam, and A. Gupta, “The unsurprising effectiveness of pre-trained vision models for control,” in ICML, vol. 162. PMLR, 2022, pp. 17 359–17 371

[41]

A path towards autonomous machine intelligence

Y. LeCun, “A path towards autonomous machine intelligence,” 2022. [Online]. Available: https://openreview.net/pdf?id=BZ5a1r-kVsf

[42]

Distilled feature fields enable few-shot language-guided manipulation

W. Shen, G. Yang, A. Yu, J. Wong, L. P. Kaelbling, and P. Isola, “Distilled feature fields enable few-shot language-guided manipulation,” in CoRL, vol. 229. PMLR, 2023, pp. 405–424

[43]

3d-llm: Injecting the 3d world into large language models

Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan, “3d-llm: Injecting the 3d world into large language models,” in NeurIPS, 2023

[44]

3d gaussian splatting for real-time radiance field rendering

B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis, “3d gaussian splatting for real-time radiance field rendering,” ACM Trans. Graph., vol. 42, no. 4, pp. 139:1–139:14, 2023

[45]

Langsplat: 3d language gaussian splatting

M. Qin, W. Li, J. Zhou, H. Wang, and H. Pfister, “Langsplat: 3d language gaussian splatting,” in CVPR. IEEE, 2024, pp. 20 051–20 060

[46]

That sounds right: Auditory self-supervision for dynamic robot manipulation

A. Thankaraj and L. Pinto, “That sounds right: Auditory self-supervision for dynamic robot manipulation,” in CoRL, vol. 229. PMLR, 2023, pp. 1036–1049

[47]

Exploring visual pre-training for robot manipulation: Datasets, models and methods

Y. Jing, X. Zhu, X. Liu, Q. Sima, T. Yang, Y. Feng, and T. Kong, “Exploring visual pre-training for robot manipulation: Datasets, models and methods,” in IROS, 2023, pp. 11 390–11 395

[48]

Masked autoencoding for scalable and generalizable decision making

F. Liu, H. Liu, A. Grover, and P. Abbeel, “Masked autoencoding for scalable and generalizable decision making,” in NeurIPS, 2022

[49]

Mastering robot manipulation with multimodal prompts through pretraining and multi-task fine-tuning

J. Li, Q. Gao, M. Johnston, X. Gao, X. He, H. Shi, S. Shakiah, R. Ghanadan, and W. Y. Wang, “Mastering robot manipulation with multimodal prompts through pretraining and multi-task fine-tuning,” in ICML, 2024

[50]

SMART: self-supervised multi-task pretraining with control transformers

Y. Sun, S. Ma, R. Madaan, R. Bonatti, F. Huang, and A. Kapoor, “SMART: self-supervised multi-task pretraining with control transformers,” in ICLR, 2023

[51]

PACT: perception-action causal transformer for autoregressive robotics pre-training

R. Bonatti, S. Vemprala, S. Ma, F. Frujeri, S. Chen, and A. Kapoor, “PACT: perception-action causal transformer for autoregressive robotics pre-training,” in IROS, 2023, pp. 3621–3627

[52]

Video pretraining (VPT): learning to act by watching unlabeled online videos

B. Baker, I. Akkaya, P. Zhokhov, J. Huizinga, J. Tang, A. Ecoffet, B. Houghton, R. Sampedro, and J. Clune, “Video pretraining (VPT): learning to act by watching unlabeled online videos,” in NeurIPS, 2022

[53]

Unleashing large-scale video generative pre-training for visual robot manipulation

H. Wu, Y. Jing, C. Cheang, G. Chen, J. Xu, X. Li, M. Liu, H. Li, and T. Kong, “Unleashing large-scale video generative pre-training for visual robot manipulation,” in ICLR, 2024

[54]

Dream to control: Learning behaviors by latent imagination

D. Hafner, T. P. Lillicrap, J. Ba, and M. Norouzi, “Dream to control: Learning behaviors by latent imagination,” in ICLR, 2020

[55]

Mastering atari with discrete world models

D. Hafner, T. P. Lillicrap, M. Norouzi, and J. Ba, “Mastering atari with discrete world models,” in ICLR, 2021

[56]

Mastering Diverse Domains through World Models

D. Hafner, J. Pasukonis, J. Ba, and T. P. Lillicrap, “Mastering diverse domains through world models,” CoRR, vol. abs/2301.04104, 2023.

[57]

Daydreamer: World models for physical robot learning

P. Wu, A. Escontrela, D. Hafner, P. Abbeel, and K. Goldberg, “Daydreamer: World models for physical robot learning,” in CoRL, vol. 205. PMLR, 2022, pp. 2226–2240

[58]

Transformers are sample-efficient world models

V. Micheli, E. Alonso, and F. Fleuret, “Transformers are sample-efficient world models,” in ICLR, 2023

[59]

Transformer-based world models are happy with 100k interactions

J. Robine, M. Höftmann, T. Uelwer, and S. Harmeling, “Transformer-based world models are happy with 100k interactions,” in ICLR, 2023

[60]

Do embodied agents dream of pixelated sheep: Embodied decision making using language guided world modelling

K. Nottingham, P. Ammanabrolu, A. Suhr, Y. Choi, H. Hajishirzi, S. Singh, and R. Fox, “Do embodied agents dream of pixelated sheep: Embodied decision making using language guided world modelling,” in ICML, vol. 202. PMLR, 2023, pp. 26 311–26 325

[61]

No change, no gain: Empowering graph neural networks with expected model change maximization for active learning

Z. Song, Y. Zhang, and I. King, “No change, no gain: Empowering graph neural networks with expected model change maximization for active learning,” in NeurIPS, 2023

[62]

Graph component contrastive learning for concept relatedness estimation

Y. Ma, Z. Song, X. Hu, J. Li, Y. Zhang, and I. King, “Graph component contrastive learning for concept relatedness estimation,” in AAAI. AAAI Press, 2023, pp. 13 362–13 370

[63]

Leveraging pre-trained large language models to construct and utilize world models for model-based task planning

L. Guan, K. Valmeekam, S. Sreedharan, and S. Kambhampati, “Leveraging pre-trained large language models to construct and utilize world models for model-based task planning,” in NeurIPS, 2023

[64]

LLM+P: Empowering Large Language Models with Optimal Planning Proficiency

B. Liu, Y. Jiang, X. Zhang, Q. Liu, S. Zhang, J. Biswas, and P. Stone, “LLM+P: empowering large language models with optimal planning proficiency,” CoRR, vol. abs/2304.11477, 2023

[65]

Reasoning with language model is planning with world model

S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, and Z. Hu, “Reasoning with language model is planning with world model,” in EMNLP, 2023, pp. 8154–8173

[66]

Tree-planner: Efficient close-loop task planning with large language models

M. Hu, Y. Mu, X. Yu, M. Ding, S. Wu, W. Shao, Q. Chen, B. Wang, Y. Qiao, and P. Luo, “Tree-planner: Efficient close-loop task planning with large language models,” in ICLR, 2024

[67]

Large language models as commonsense knowledge for large-scale task planning

Z. Zhao, W. S. Lee, and D. Hsu, “Large language models as commonsense knowledge for large-scale task planning,” in NeurIPS, 2023

[68]

Video generation models as world simulators

OpenAI. (2024) Video generation models as world simulators. [Online]. Available: https://openai.com/index/video-generation-models-as-world-simulators/

[69]

Is sora a world simulator? A comprehensive survey on general world models and beyond

Z. Zhu, X. Wang, W. Zhao, C. Min, N. Deng, M. Dou, Y. Wang, B. Shi, K. Wang, C. Zhang, Y. You, Z. Zhang, D. Zhao, L. Xiao, J. Zhao, J. Lu, and G. Huang, “Is sora a world simulator? A comprehensive survey on general world models and beyond,” CoRR, vol. abs/2405.03520, 2024

[70]

Genie: Generative interactive environments

J. Bruce, M. D. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, Y. Aytar, S. Bechtle, F. M. P. Behbahani, S. C. Y. Chan, N. Heess, L. Gonzalez, S. Osindero, S. Ozair, S. E. Reed, J. Zhang, K. Zolna, J. Clune, N. de Freitas, S. Singh, and T. Rocktäschel, “Genie: Generative interactive environmen...

[71]

3d-vla: A 3d vision-language-action generative world model

H. Zhen, X. Qiu, P. Chen, J. Yang, X. Yan, Y. Du, Y. Hong, and C. Gan, “3d-vla: A 3d vision-language-action generative world model,” in ICML, 2024

[72]

Learning interactive real-world simulators

S. Yang, Y. Du, S. K. S. Ghasemipour, J. Tompson, L. P. Kaelbling, D. Schuurmans, and P. Abbeel, “Learning interactive real-world simulators,” in ICLR, 2024

[73]

Language models meet world models: Embodied experiences enhance language models

J. Xiang, T. Tao, Y. Gu, T. Shu, Z. Wang, Z. Yang, and Z. Hu, “Language models meet world models: Embodied experiences enhance language models,” in NeurIPS, 2023

[74]

Large language models are zero-shot reasoners

T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” in NeurIPS, 2022

[75]

Chain-of-thought prompting elicits reasoning in large language models

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” in NeurIPS, 2022

[76]

Thinkbot: Embodied instruction following with thought chain reasoning

G. Lu, Z. Wang, C. Liu, J. Lu, and Y. Tang, “Thinkbot: Embodied instruction following with thought chain reasoning,” CoRR, vol. abs/2312.07062, 2023

[77]

React: Synergizing reasoning and acting in language models

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” in ICLR, 2023

[78]

RAT: retrieval augmented thoughts elicit context-aware reasoning in long-horizon generation

Z. Wang, A. Liu, H. Lin, J. Li, X. Ma, and Y. Liang, “RAT: retrieval augmented thoughts elicit context-aware reasoning in long-horizon generation,” CoRR, vol. abs/2403.05313, 2024

[79]

Robotic Control via Embodied Chain-of-Thought Reasoning

M. Zawalski, W. Chen, K. Pertsch, O. Mees, C. Finn, and S. Levine, “Robotic control via embodied chain-of-thought reasoning,” CoRR, vol. abs/2407.08693, 2024

[80]

Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, T. Lin, G. Wetzstein, M. Liu, and D. Xiang, “Cot-vla: Visual chain-of-thought reasoning for vision-language-action models,” in CVPR, 2025, pp. 1702–1713