pith. machine review for the scientific record.

arxiv: 2508.13073 · v2 · pith:KJDRCWTRnew · submitted 2025-08-18 · 💻 cs.RO · cs.CV

Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey

Pith reviewed 2026-05-17 20:24 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords Vision-Language-Action models · robotic manipulation · large vision-language models · taxonomy · survey · embodied AI · VLA models · multimodal models

The pith

Large VLM-based VLA models for robotic manipulation can be systematically classified into monolithic and hierarchical architectures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey establishes a taxonomy for large vision-language-action models used in robotic manipulation by dividing them into monolithic designs, which integrate components at different levels, and hierarchical designs that separate planning from execution. A sympathetic reader would care because these models promise to enable robots to handle novel environments where rule-based methods fail, and a clear classification helps make sense of the rapidly growing field. The review examines how these models integrate with reinforcement learning, human video learning, and world models, while highlighting datasets, benchmarks, and future directions like memory mechanisms and multi-agent systems. It aims to resolve inconsistencies in prior taxonomies and provide a unified view to reduce research fragmentation.

Core claim

The paper claims that large VLM-based VLA models for robotic manipulation are best understood through two principal architectural paradigms: monolithic models that encompass single-system and dual-system designs with differing levels of integration, and hierarchical models that explicitly decouple planning from execution using interpretable intermediate representations. This structure allows for an in-depth examination of integrations with advanced domains and the synthesis of characteristics across models.

What carries the argument

The taxonomy of monolithic (single-system and dual-system) versus hierarchical models, which organizes the field by levels of integration and separation of planning and execution.
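One way to picture the structure described above is as a small label type. The following is a minimal sketch, not anything the survey provides: the names `Paradigm`, `MonolithicDesign`, and `TaxonomyLabel` are hypothetical, and the only facts taken from the text are the two paradigms and the single-system/dual-system split inside the monolithic one.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Paradigm(Enum):
    MONOLITHIC = "monolithic"
    HIERARCHICAL = "hierarchical"


class MonolithicDesign(Enum):
    SINGLE_SYSTEM = "single-system"
    DUAL_SYSTEM = "dual-system"


@dataclass(frozen=True)
class TaxonomyLabel:
    """Hypothetical label for one model under the two-paradigm taxonomy."""
    paradigm: Paradigm
    design: Optional[MonolithicDesign] = None  # only used for monolithic models

    def __post_init__(self):
        # A monolithic label must say which of the two designs it is.
        if self.paradigm is Paradigm.MONOLITHIC and self.design is None:
            raise ValueError("monolithic label needs a single-system or dual-system design")
        # A hierarchical label has no monolithic sub-design.
        if self.paradigm is Paradigm.HIERARCHICAL and self.design is not None:
            raise ValueError("hierarchical label takes no monolithic design")

    def path(self) -> str:
        """Return the label as a slash-separated path, e.g. 'monolithic/dual-system'."""
        if self.design is None:
            return self.paradigm.value
        return f"{self.paradigm.value}/{self.design.value}"
```

Under this sketch, `TaxonomyLabel(Paradigm.MONOLITHIC, MonolithicDesign.DUAL_SYSTEM).path()` gives `"monolithic/dual-system"`, while a monolithic label without a design is rejected at construction time.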

If this is right

  • Models can be more easily compared and developed based on their architectural type.
  • Integration with reinforcement learning and world models becomes a key area for advancing capabilities.
  • Future work will focus on memory mechanisms, 4D perception, and efficient adaptation.
  • Research fragmentation decreases as inconsistencies in taxonomies are resolved.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Applying this taxonomy to emerging models could reveal new hybrid architectures not yet considered.
  • This classification might help in designing benchmarks that test specific aspects of monolithic versus hierarchical approaches.
  • Connections to broader embodied AI could lead to standardized evaluation protocols across related fields.

Load-bearing premise

The proposed taxonomy into monolithic and hierarchical models, along with the listed integration domains, fully captures the current state of the field without major omissions or problematic overlaps.

What would settle it

The discovery of a significant number of VLA models that cannot be classified into either the monolithic or hierarchical categories, or that show substantial overlap between categories, would challenge the taxonomy's utility.
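The test described above amounts to checking whether a set of category assignments forms a partition. A minimal sketch of such an audit follows; the function name `audit_partition`, the input format, and the placeholder model names are all assumptions made for illustration, and no real model is classified here.

```python
from collections import Counter

CATEGORIES = ("monolithic", "hierarchical")


def audit_partition(assignments):
    """Audit category assignments for gaps and overlaps.

    `assignments` maps a model name to the set of categories a reviewer
    assigned to it (hypothetical input format).
    """
    valid = set(CATEGORIES)
    # Models that fit neither category challenge exhaustiveness.
    unclassified = sorted(m for m, cats in assignments.items() if not (cats & valid))
    # Models that fit both categories challenge non-overlap.
    overlapping = sorted(m for m, cats in assignments.items() if len(cats & valid) > 1)
    counts = Counter(c for cats in assignments.values() for c in cats if c in valid)
    total = len(assignments)
    problem_rate = (len(unclassified) + len(overlapping)) / total if total else 0.0
    return {
        "unclassified": unclassified,
        "overlapping": overlapping,
        "counts": dict(counts),
        "problem_rate": problem_rate,
    }


# Placeholder data only; these are not real models or real classifications.
example = {
    "model_a": {"monolithic"},
    "model_b": {"hierarchical"},
    "model_c": {"monolithic", "hierarchical"},
    "model_d": set(),
}
report = audit_partition(example)
```

With the placeholder data, `report["unclassified"]` is `["model_d"]` and `report["overlapping"]` is `["model_c"]`, so half of the four entries fall outside a clean partition. What rate would count as a "significant number" is a judgment the text leaves open.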

Original abstract

Robotic manipulation, a key frontier in robotics and embodied AI, requires precise motor control and multimodal understanding, yet traditional rule-based methods fail to scale or generalize in unstructured, novel environments. In recent years, Vision-Language-Action (VLA) models, built upon Large Vision-Language Models (VLMs) pretrained on vast image-text datasets, have emerged as a transformative paradigm. This survey provides the first systematic, taxonomy-oriented review of large VLM-based VLA models for robotic manipulation. We begin by clearly defining large VLM-based VLA models and delineating two principal architectural paradigms: (1) monolithic models, encompassing single-system and dual-system designs with differing levels of integration; and (2) hierarchical models, which explicitly decouple planning from execution via interpretable intermediate representations. Building on this foundation, we present an in-depth examination of large VLM-based VLA models: (1) integration with advanced domains, including reinforcement learning, training-free optimization, learning from human videos, and world model integration; (2) synthesis of distinctive characteristics, consolidating architectural traits, operational strengths, and the datasets and benchmarks that support their development; (3) identification of promising directions, including memory mechanisms, 4D perception, efficient adaptation, multi-agent cooperation, and other emerging capabilities. This survey consolidates recent advances to resolve inconsistencies in existing taxonomies, mitigate research fragmentation, and fill a critical gap through the systematic integration of studies at the intersection of large VLMs and robotic manipulation. We provide a regularly updated project page to document ongoing progress: https://github.com/JiuTian-VL/Large-VLM-based-VLA-for-Robotic-Manipulation

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. This survey claims to deliver the first systematic, taxonomy-oriented review of large VLM-based Vision-Language-Action (VLA) models for robotic manipulation. It defines such models, delineates two principal paradigms—monolithic models (single-system and dual-system) versus hierarchical models that explicitly decouple planning from execution via intermediate representations—and examines their integration with reinforcement learning, training-free optimization, human videos, and world models. The work further synthesizes architectural traits, strengths, datasets, and benchmarks, identifies future directions including memory mechanisms, 4D perception, efficient adaptation, and multi-agent cooperation, and positions itself as resolving inconsistencies in prior taxonomies while filling a critical gap, with an accompanying regularly updated GitHub project page.

Significance. If the taxonomy is shown to be both exhaustive and non-overlapping, the survey would constitute a timely consolidation of a fast-moving intersection between large VLMs and robotic manipulation. It would help mitigate fragmentation by offering a structured lens on architectural choices and integration strategies, and the maintained project page is a concrete strength that supports ongoing utility for the community.

major comments (2)
  1. [Abstract / Taxonomy definition] Abstract and opening taxonomy delineation: the central claim that the monolithic (single/dual-system) versus hierarchical split resolves inconsistencies in existing taxonomies rests on the assertion that these categories are exhaustive and non-overlapping, yet no enumeration of reviewed models, no count of papers per category, and no explicit discussion of boundary cases (e.g., end-to-end models that still emit intermediate 4D or memory tokens) is supplied. Without this, the utility of the taxonomy cannot be evaluated.
  2. [Integration with advanced domains] Integration domains section: the four listed domains (RL, training-free optimization, human videos, world models) are presented as key integration areas, but the manuscript supplies no explicit justification or coverage check showing that these domains capture the dominant variants without significant omissions or forced overlaps that would undermine the taxonomy's claimed resolution of fragmentation.
minor comments (2)
  1. [Abstract] The abstract states that the survey 'consolidates recent advances' but does not indicate the time window or search methodology used; adding a brief methods paragraph would improve transparency for a systematic review.
  2. [Synthesis of distinctive characteristics] Figure or table captions that map specific models to the proposed taxonomy categories would aid readability; currently the synthesis of characteristics appears to rely on prose alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify how to better demonstrate the taxonomy's utility. We address each major point below and will incorporate revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract / Taxonomy definition] Abstract and opening taxonomy delineation: the central claim that the monolithic (single/dual-system) versus hierarchical split resolves inconsistencies in existing taxonomies rests on the assertion that these categories are exhaustive and non-overlapping, yet no enumeration of reviewed models, no count of papers per category, and no explicit discussion of boundary cases (e.g., end-to-end models that still emit intermediate 4D or memory tokens) is supplied. Without this, the utility of the taxonomy cannot be evaluated.

    Authors: We agree that an explicit enumeration and counts would make the taxonomy's scope and non-overlapping nature more transparent. In the revision we will add a summary table listing representative models under each subcategory (monolithic single-system, monolithic dual-system, and hierarchical), with approximate paper counts drawn from the surveyed literature. We will also add a dedicated paragraph discussing boundary cases, including models that emit intermediate 4D or memory tokens yet remain architecturally monolithic, to clarify distinctions and any residual overlaps. revision: yes

  2. Referee: [Integration with advanced domains] Integration domains section: the four listed domains (RL, training-free optimization, human videos, world models) are presented as key integration areas, but the manuscript supplies no explicit justification or coverage check showing that these domains capture the dominant variants without significant omissions or forced overlaps that would undermine the taxonomy's claimed resolution of fragmentation.

    Authors: The four domains were chosen because they correspond to the most frequently explored integration strategies in the current VLA literature. To make this explicit, the revised section will include a short justification paragraph, a coverage summary indicating the proportion of surveyed works falling into each domain, and a brief note on potential overlaps (e.g., RL combined with world models) as well as emerging areas such as multi-agent cooperation that are already flagged in the future-directions section. revision: yes
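The coverage summary promised in the second response could be computed along the following lines. This is a sketch under stated assumptions: the function `coverage_summary`, the input format, and the paper ids are hypothetical, and only the four domain names come from the text. A work may touch several domains, so the shares need not sum to one.

```python
from collections import Counter
from itertools import combinations

DOMAINS = (
    "reinforcement learning",
    "training-free optimization",
    "learning from human videos",
    "world models",
)


def coverage_summary(works):
    """Summarize how surveyed works spread over the four integration domains.

    `works` maps a paper id to the set of domains it touches (hypothetical
    input format). Returns per-domain shares, pairwise overlap counts, and
    the papers that fall into none of the four domains.
    """
    total = len(works)
    per_domain = Counter()
    pair_overlap = Counter()
    uncovered = []
    for paper, domains in works.items():
        hits = sorted(d for d in domains if d in DOMAINS)
        if not hits:
            uncovered.append(paper)
        per_domain.update(hits)
        # Count each pair of domains that co-occur in one work.
        pair_overlap.update(combinations(hits, 2))
    share = {d: per_domain[d] / total for d in DOMAINS} if total else {}
    return share, dict(pair_overlap), sorted(uncovered)


# Placeholder data only; these ids do not refer to real papers.
works = {
    "paper_1": {"reinforcement learning"},
    "paper_2": {"reinforcement learning", "world models"},
    "paper_3": {"multi-agent cooperation"},
    "paper_4": {"learning from human videos"},
}
share, overlaps, uncovered = coverage_summary(works)
```

Here `paper_2` registers as an overlap between reinforcement learning and world models, the combination the authors mention, and `paper_3` is reported as uncovered because multi-agent cooperation is not one of the four domains.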

Circularity Check

0 steps flagged

Survey taxonomy organizes external literature without self-referential derivation or fitted predictions

Full rationale

This is a review paper whose central contribution is a proposed taxonomy (monolithic single/dual-system vs. hierarchical models) and synthesis of integration domains drawn from the existing literature. No new quantitative results, equations, or parameters are derived from the authors' own fitted values or self-citations. The abstract explicitly frames the work as consolidating external advances to resolve inconsistencies, which is standard survey practice and does not reduce any claim to an input by construction. No load-bearing self-citation chains or ansatzes are present in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

As a survey the central claim rests on the assumption that prior literature can be partitioned into the stated paradigms and that the selected integrations represent the main research threads; no new free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Large VLM-based VLA models are those built upon pretrained large vision-language models for robotic manipulation tasks.
    This scoping definition determines which papers are included in the review.

pith-pipeline@v0.9.0 · 5622 in / 1236 out tokens · 47375 ms · 2026-05-17T20:24:02.001566+00:00 · methodology


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Dynamic Execution Commitment of Vision-Language-Action Models

    cs.CV 2026-05 unverdicted novelty 7.0

    A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.

  2. CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies

    cs.CV 2026-04 unverdicted novelty 7.0

    CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.

  3. GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning

    cs.RO 2026-04 unverdicted novelty 7.0

    GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.

  4. RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

    cs.RO 2026-05 unverdicted novelty 6.0

    A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

  5. ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation

    cs.RO 2026-05 unverdicted novelty 6.0

    ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.

  6. Task-Aware Scanning Parameter Configuration for Robotic Inspection Using Vision Language Embeddings and Hyperdimensional Computing

    cs.RO 2026-05 unverdicted novelty 6.0

    ScanHD achieves 92.7% exact accuracy and 98.1% Win@1 accuracy in recommending discrete scanning parameters from instructions and images on a new real-world dataset.

  7. FASTER: Rethinking Real-Time Flow VLAs

    cs.RO 2026-03 conditional novelty 6.0

    FASTER uses a horizon-aware flow sampling schedule to compress immediate-action denoising to one step, slashing effective reaction latency in real-robot VLA deployments.

  8. Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions

    cs.CV 2025-11 conditional novelty 6.0

    VLMs show systematic drops in counting accuracy as visual and linguistic complexity rise, with modest gains from targeted attention reweighting in the decoder.

  9. A Semantic Autonomy Framework for VLM-Integrated Indoor Mobile Robots: Hybrid Deterministic Reasoning and Cross-Robot Adaptive Memory

    cs.RO 2026-05 unverdicted novelty 5.0

    The Semantic Autonomy Stack combines a seven-step parametric resolver handling 88% of instructions in under 0.1 ms with VLM escalation and a five-category cross-robot memory system, achieving 100% accuracy and 103,000...

  10. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.

  11. HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

    cs.CV 2026-04 unverdicted novelty 5.0

    HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.

  12. EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development

    cs.RO 2026-04 unverdicted novelty 5.0

    EmbodiedClaw automates embodied AI development workflows through conversation, reducing manual effort and improving consistency and reproducibility.

  13. Evolvable Embodied Agent for Robotic Manipulation via Long Short-Term Reflection and Optimization

    cs.RO 2026-04 unverdicted novelty 5.0

    EEAgent with LSTRO sets new state-of-the-art results on six VIMA-Bench robotic manipulation tasks by dynamically refining prompts through reflection on successes and failures.

  14. CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment

    cs.RO 2026-04 unverdicted novelty 5.0

    CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer ...

  15. From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data

    cs.RO 2026-04 accept novelty 5.0

    A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.

  16. RESample: A Robust Data Augmentation Framework via Exploratory Sampling for Robotic Manipulation

    cs.RO 2025-10 unverdicted novelty 5.0

    RESample uses exploratory sampling guided by a lightweight Coverage Function to expand VLA training data coverage, yielding 12% performance gains on LIBERO and real-world tasks with 10-20% added samples.

  17. XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments

    cs.CV 2026-04 unverdicted novelty 4.0

    XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...

  18. TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training

    cs.AI 2026-04 unverdicted novelty 4.0

    TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.

  19. Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines

    cs.RO 2026-04 unverdicted novelty 3.0

    A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data...

Reference graph

Works this paper leans on

243 extracted references · 243 canonical work pages · cited by 18 Pith papers · 28 internal anchors

  1. [1]

    A survey on robotics with foundation models: toward embodied ai,

    Z. Xu, K. Wu, J. Wen, J. Li, N. Liu, Z. Che, and J. Tang, “A survey on robotics with foundation models: toward embodied ai,” arXiv:2402.02385, 2024

  2. [2]

    A Survey on Vision-Language-Action Models for Embodied AI

    Y. Ma, Z. Song, Y. Zhuang, J. Hao, and I. King, “A survey on vision-language-action models for embodied ai,” arXiv:2405.14093, 2024

  3. [3]

    Metaurban: a simulation platform for embodied ai in urban spaces,

    W. Wu, H. He, Y. Wang, C. Duan, J. He, Z. Liu, Q. Li, and B. Zhou, “Metaurban: a simulation platform for embodied ai in urban spaces,” in ICLR, 2025

  4. [4]

    Generative artificial intelligence in robotic manipulation: a survey,

    K. Zhang, P. Yun, J. Cen, J. Cai, D. Zhu, H. Yuan, C. Zhao, T. Feng, M. Y. Wang, Q. Chen et al., “Generative artificial intelligence in robotic manipulation: a survey,” arXiv:2503.03464, 2025

  5. [5]

    Aligning cyber space with physical world: A comprehensive survey on embodied ai,

    Y. Liu, W. Chen, Y. Bai, X. Liang, G. Li, W. Gao, and L. Lin, “Aligning cyber space with physical world: a comprehensive survey on embodied ai,” arXiv:2407.06886, 2024

  6. [6]

    A survey of embodied ai in healthcare: techniques, applications, and opportunities,

    Y. Liu, X. Cao, T. Chen, Y. Jiang, J. You, M. Wu, X. Wang, M. Feng, Y. Jin, and J. Chen, “A survey of embodied ai in healthcare: techniques, applications, and opportunities,” INFORM FUSION, vol. 119, p. 103033, 2025

  7. [7]

    Machine learning meets advanced robotic manipulation,

    S. Nahavandi, R. Alizadehsani, D. Nahavandi, C. P. Lim, K. Kelly, and F. Bello, “Machine learning meets advanced robotic manipulation,” INFORM FUSION, vol. 105, p. 102221, 2024

  8. [8]

    Vision-language-action models: Concepts, progress, applications and challenges

    R. Sapkota, Y. Cao, K. I. Roumeliotis, and M. Karkee, “Vision-language-action models: concepts, progress, applications and challenges,” arXiv:2505.04769, 2025

  9. [9]

    Trends and challenges in robot manipulation,

    A. Billard and D. Kragic, “Trends and challenges in robot manipulation,” Science, vol. 364, p. eaat8414, 2019

  10. [10]

    Any-point trajectory modeling for policy learning,

    C. Wen, X. Lin, J. So, K. Chen, Q. Dou, Y. Gao, and P. Abbeel, “Any-point trajectory modeling for policy learning,” in RSS, 2024, p. 92

  11. [11]

    Instruction-driven history-aware policies for robotic manipulations,

    P.-L. Guhur, S. Chen, R. G. Pinel, M. Tapaswi, I. Laptev, and C. Schmid, “Instruction-driven history-aware policies for robotic manipulations,” in CoRL, 2023, pp. 175–187

  12. [12]

    Flow as the cross-domain manipulation interface,

    M. Xu, Z. Xu, Y. Xu, C. Chi, G. Wetzstein, M. Veloso, and S. Song, “Flow as the cross-domain manipulation interface,” in CoRL, 2024

  13. [13]

    Rt-trajectory: Robotic task generalization via hindsight trajectory sketches,

    J. Gu, S. Kirmani, P. Wohlhart, Y. Lu, M. G. Arenas, K. Rao, W. Yu, C. Fu, K. Gopalakrishnan, Z. Xu et al., “Rt-trajectory: Robotic task generalization via hindsight trajectory sketches,” in ICLR, 2024, pp. 2475–2499

  14. [14]

    Polytouch: A robust multi-modal tactile sensor for contact-rich manipulation using tactile-diffusion policies,

    J. Zhao, N. Kuppuswamy, S. Feng, B. Burchfiel, and E. Adelson, “Polytouch: A robust multi-modal tactile sensor for contact-rich manipulation using tactile-diffusion policies,” in ICRA, 2025

  15. [15]

    No plan but everything under control: Robustly solving sequential tasks with dynamically composed gradient descent,

    V. Mengers and O. Brock, “No plan but everything under control: Robustly solving sequential tasks with dynamically composed gradient descent,” in ICRA, 2025

  16. [16]

    Star: Learning diverse robot skill abstractions through rotation-augmented vector quantization,

    H. Li, Q. Lv, R. Shao, X. Deng, Y. Li, J. Hao, and L. Nie, “Star: Learning diverse robot skill abstractions through rotation-augmented vector quantization,” in ICML, 2025

  17. [17]

    Lion: empowering multimodal large language model with dual-level visual knowledge,

    G. Chen, L. Shen, R. Shao, X. Deng, and L. Nie, “Lion: empowering multimodal large language model with dual-level visual knowledge,” in CVPR, 2024, pp. 26540–26550

  18. [18]

    Improved baselines with visual instruction tuning,

    H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” in CVPR, 2024, pp. 26296–26306

  19. [19]

    Instructblip: towards general-purpose vision-language models with instruction tuning,

    W. Dai, J. Li, D. Li, A. Tiong, J. Zhao, W. Wang, and S. Hoi, “Instructblip: towards general-purpose vision-language models with instruction tuning,” in NeurIPS, 2023, pp. 49250–49267

  20. [20]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P. Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: a versatile vision-language model for understanding, localization, text reading, and beyond,” arXiv:2308.12966, 2023

  21. [21]

    Visual instruction tuning,

    H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in NeurIPS, 2023, pp. 34892–34916

  22. [22]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Z. Chen, W. Wang, Y. Cao, Y. Liu, Z. Gao, E. Cui, J. Zhu, S. Ye, H. Tian, Z. Liu et al., “Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling,” arXiv:2412.05271, 2024

  23. [23]

    Monkey: image resolution and text label are important things for large multi-modal models,

    Z. Li, B. Yang, Q. Liu, Z. Ma, S. Zhang, J. Yang, Y. Sun, Y. Liu, and X. Bai, “Monkey: image resolution and text label are important things for large multi-modal models,” in CVPR, 2024, pp. 26763–26773

  24. [24]

    Falcon: Resolving visual redundancy and fragmentation in high-resolution multimodal large language models via visual registers,

    R. Zhang, R. Shao, G. Chen, M. Zhang, K. Zhou, W. Guan, and L. Nie, “Falcon: Resolving visual redundancy and fragmentation in high-resolution multimodal large language models via visual registers,” in ICCV, 2025

  25. [25]

    Mome: Mixture of multimodal experts for generalist multimodal large language models,

    L. Shen, G. Chen, R. Shao, W. Guan, and L. Nie, “Mome: Mixture of multimodal experts for generalist multimodal large language models,” in NeurIPS, 2024

  26. [26]

    Openvla: an open-source vision-language-action model,

    M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P. Sanketi et al., “Openvla: an open-source vision-language-action model,” in CoRL, 2024, pp. 2679–2713

  27. [27]

    Rt-2: vision-language-action models transfer web knowledge to robotic control,

    B. Zitkovich, T. Yu, S. Xu, P. Xu, T. Xiao, F. Xia, others, and K. Han, “Rt-2: vision-language-action models transfer web knowledge to robotic control,” in CoRL, 2023, pp. 2165–2183

  28. [28]

    Rt-h: action hierarchies using language,

    S. Belkhale, T. Ding, T. Xiao, P. Sermanet, Q. Vuong, J. Tompson, Y. Chebotar, D. Dwibedi, and D. Sadigh, “Rt-h: action hierarchies using language,” in RSS, 2024

  29. [29]

    π0: A vision-language-action flow model for general robot control,

    K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter et al., “π0: A vision-language-action flow model for general robot control,” in RSS, 2025

  30. [31]

    SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

    M. Shukor, D. Aubakirova, F. Capuano, P. Kooijmans, S. Palma, A. Zouitine, M. Aractingi, C. Pascal, M. Russi, A. Marafioti et al., “Smolvla: a vision-language-action model for affordable and efficient robotics,” arXiv:2506.01844, 2025

  31. [32]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    J. Bjorck, F. Castañeda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang et al., “Gr00t n1: an open foundation model for generalist humanoid robots,” arXiv:2503.14734, 2025

  32. [33]

    Cot-vla: visual chain-of-thought reasoning for vision-language-action models,

    Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, T.-Y. Lin, G. Wetzstein, M.-Y. Liu, and D. Xiang, “Cot-vla: visual chain-of-thought reasoning for vision-language-action models,” in CVPR, 2025, pp. 1702–1713

  33. [34]

    HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

    J. Liu, H. Chen, P. An, Z. Liu, R. Zhang, C. Gu, X. Li, Z. Guo, S. Chen, M. Liu et al., “Hybridvla: collaborative diffusion and autoregression in a unified vision-language-action model,” arXiv:2503.10631, 2025

  34. [35]

    BridgeVLA: Input-output alignment for efficient 3d manipulation learning with vision-language models

    P. Li, Y. Chen, H. Wu, X. Ma, X. Wu, Y. Huang, L. Wang, T. Kong, and T. Tan, “Bridgevla: input-output alignment for efficient 3d manipulation learning with vision-language models,” arXiv:2506.07961, 2025

  35. [36]

    Deer-vla: dynamic inference of multimodal large language models for efficient robot execution,

    Y. Yue, Y. Wang, B. Kang, Y. Han, S. Wang, S. Song, J. Feng, and G. Huang, “Deer-vla: dynamic inference of multimodal large language models for efficient robot execution,” in NeurIPS, 2024, pp. 56619–56643

  36. [37]

    Knowledge insulating vision-language-action models: Train fast, run fast, generalize better

    J. T. S. Danny Driess, L. Y. Brian Ichter, K. P. Adrian Li-Bell, H. W. Allen Z. Ren, L. X. S. Quan Vuong, and S. Levine, “Knowledge insulating vision-language-action models: train fast, run fast, generalize better,” arXiv:2505.23705, 2025

  37. [38]

    WorldVLA: Towards Autoregressive Action World Model

    J. Cen, C. Yu, H. Yuan, Y. Jiang, S. Huang, J. Guo, X. Li, Y. Song, H. Luo, F. Wang et al., “Worldvla: towards autoregressive action world model,” arXiv:2506.21539, 2025

  38. [39]

    Rewind: language-guided rewards teach robot policies without new demonstrations,

    J. Zhang, Y. Luo, A. Anwar, S. A. Sontakke, J. J. Lim, J. Thomason, E. Biyik, and J. Zhang, “Rewind: language-guided rewards teach robot policies without new demonstrations,” arXiv:2505.10911, 2025

  39. [40]

    Fast-in-slow: a dual-system foundation model unifying fast manipulation within slow reasoning,

    H. Chen, J. Liu, C. Gu, Z. Liu, R. Zhang, X. Li, X. He, Y. Guo, C.-W. Fu, S. Zhang et al., “Fast-in-slow: a dual-system foundation model unifying fast manipulation within slow reasoning,” arXiv:2506.01953, 2025

  40. [41]

    Real-Time Execution of Action Chunking Flow Policies

    K. Black, M. Y. Galliker, and S. Levine, “Real-time execution of action chunking flow policies,” arXiv:2506.07339, 2025

  41. [42]

    Mitigating the human-robot domain discrepancy in visual pre-training for robotic manipulation,

    J. Zhou, T. Ma, K. Y. Lin, Z. Wang, R. Qiu, and J. Liang, “Mitigating the human-robot domain discrepancy in visual pre-training for robotic manipulation,” in CVPR, 2025, pp. 22551–22561

  42. [43]

    World4omni: a zero-shot framework from image generation world model to robotic manipulation,

    H. Chen, B. Wang, J. Guo, T. Zhang, Y. Hou, X. Huang, C. Tie, and L. Shao, “World4omni: a zero-shot framework from image generation world model to robotic manipulation,” arXiv:2506.23919, 2025

  43. [44]

    Fine-tuning vision-language-action models: optimizing speed and success,

    M. J. Kim, C. Finn, and P. Liang, “Fine-tuning vision-language-action models: optimizing speed and success,” in RSS, 2025

  44. [45]

    NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks

    C.-Y. Hung, Q. Sun, P. Hong, A. Zadeh, C. Li, U. Tan, N. Majumder, S. Poria et al., “Nora: a small open-sourced generalist vision language action model for embodied tasks,” arXiv:2504.19854, 2025

  45. [46]

    Maniplvm-r1: reinforcement learning for reasoning in embodied manipulation with large vision-language models,

    Z. Song, G. Ouyang, M. Li, Y. Ji, C. Wang, Z. Xu, Z. Zhang, X. Zhang, Q. Jiang, Z. Chen et al., “Maniplvm-r1: reinforcement learning for reasoning in embodied manipulation with large vision-language models,” arXiv:2505.16517, 2025

  46. [47]

    Embodied-reasoner: synergizing visual search, reasoning, and action for embodied interactive tasks,

    W. Zhang, M. Wang, G. Liu, X. Huixin, Y. Jiang, Y. Shen, G. Hou, Z. Zheng, H. Zhang, X. Li et al., “Embodied-reasoner: synergizing visual search, reasoning, and action for embodied interactive tasks,” arXiv:2503.21696, 2025

  47. [48]

    Hamster: hierarchical action models for open-world robot manipulation,

    Y. Li, Y. Deng, J. Zhang, J. Jang, M. Memmel, R. Yu, C. R. Garrett, F. Ramos, D. Fox, A. Li et al., “Hamster: hierarchical action models for open-world robot manipulation,” in ICLR, 2025

  48. [49]

    A0: an affordance-aware hierarchical model for general robotic manipulation,

    R. Xu, J. Zhang, M. Guo, Y. Wen, H. Yang, M. Lin, J. Huang, Z. Li, K. Zhang, L. Wang et al., “A0: an affordance-aware hierarchical model for general robotic manipulation,” arXiv:2504.12636, 2025

  49. [50]

    Rekep: spatio-temporal reasoning of relational keypoint constraints for robotic manipulation,

    W. Huang, C. Wang, Y. Li, R. Zhang, and L. Fei-Fei, “Rekep: spatio-temporal reasoning of relational keypoint constraints for robotic manipulation,” in CoRL, 2024

  50. [51]

    A survey of vision-language pre-trained models,

    Y. Du, Z. Liu, J. Li, and W. X. Zhao, “A survey of vision-language pre-trained models,” in IJCAI, 2022

  51. [52]

    Vision-language models for vision tasks: A survey,

    J. Zhang, J. Huang, S. Jin, and S. Lu, “Vision-language models for vision tasks: A survey,” TPAMI, vol. 46, pp. 5625–5644, 2024

  52. [53]

    Multimodal large language models: A survey,

    J. Wu, W. Gan, Z. Chen, S. Wan, and P. S. Yu, “Multimodal large language models: A survey,” in BigData, 2023, pp. 2247–2256

  53. [54]

    A survey of state of the art large vision language models: Alignment, benchmark, evaluations and challenges,

    Z. Li, X. Wu, H. Du, F. Liu, H. Nghiem, and G. Shi, “A survey of state of the art large vision language models: Alignment, benchmark, evaluations and challenges,” arXiv:2501.02189, 2025

  54. [55]

    Large vision-language model alignment and misalignment: A survey through the lens of explainability,

    D. Shu, H. Zhao, J. Hu, W. Liu, A. Payani, L. Cheng, and M. Du, “Large vision-language model alignment and misalignment: A survey through the lens of explainability,” arXiv:2501.01346, 2025

  55. [56]

    A survey on diffusion policy for robotic manipulation: Taxonomy, analysis, and future directions,

    M. Song, X. Deng, Z. Zhou, J. Wei, W. Guan, and L. Nie, “A survey on diffusion policy for robotic manipulation: Taxonomy, analysis, and future directions,” Authorea Preprints, 2025

  56. [57]

    Diffusion models for robotic manipulation: A survey,

    R. Wolf, Y. Shi, S. Liu, and R. Rayyes, “Diffusion models for robotic manipulation: A survey,” arXiv:2504.08438, 2025

  57. [58]

    Openhelix: A short survey, empirical analysis, and open-source dual-system vla model for robotic manipulation,

    C. Cui, P. Ding, W. Song, S. Bai, X. Tong, Z. Ge, R. Suo, W. Zhou, Y. Liu, B. Jia et al., “Openhelix: A short survey, empirical analysis, and open-source dual-system vla model for robotic manipulation,” arXiv:2505.03912, 2025

  58. [59]

    A survey of embodied learning for object-centric robotic manipulation,

    Y. Zheng, L. Yao, Y. Su, Y. Zhang, Y. Wang, S. Zhao, Y. Zhang, and L.-P. Chau, “A survey of embodied learning for object-centric robotic manipulation,” MIR, vol. 22, pp. 588–626, 2025

  59. [60]

    Lion-fs: Fast & slow video-language thinker as online video assistant,

    W. Li, B. Hu, R. Shao, L. Shen, and L. Nie, “Lion-fs: Fast & slow video-language thinker as online video assistant,” in CVPR, 2025

  60. [61]

    Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks,

    Z. Li, Y. Xie, R. Shao, G. Chen, D. Jiang, and L. Nie, “Optimus-1: Hybrid multimodal memory empowered agents excel in long-horizon tasks,” in NeurIPS, 2024

  61. [62]

    Optimus-2: Multimodal minecraft agent with goal-observation-action conditioned policy,

    ——, “Optimus-2: Multimodal minecraft agent with goal-observation-action conditioned policy,” in CVPR, 2025

  62. [63]

    Cat: Enhancing multimodal large language model to answer questions in dynamic audio-visual scenarios,

    Q. Ye, Z. Yu, R. Shao, X. Xie, P. Torr, and X. Cao, “Cat: Enhancing multimodal large language model to answer questions in dynamic audio-visual scenarios,” in ECCV, 2024

  63. [64]

    Cat+: investigating and enhancing audio-visual understanding in large language models,

    Q. Ye, Z. Yu, R. Shao, Y. Cui, X. Kang, X. Liu, P. Torr, and X. Cao, “Cat+: investigating and enhancing audio-visual understanding in large language models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025

  64. [65]

    Drivevlm: the convergence of autonomous driving and large vision-language models,

    X. Tian, J. Gu, B. Li, Y. Liu, Y. Wang, Z. Zhao, K. Zhan, P. Jia, X. Lang, and H. Zhao, “Drivevlm: the convergence of autonomous driving and large vision-language models,” in CoRL, 2025, pp. 4698–4726

  65. [66]

    Cogagent: a visual language model for gui agents,

    W. Hong, W. Wang, Q. Lv, J. Xu, W. Yu, J. Ji, Y. Wang, Z. Wang, Y. Dong, M. Ding et al., “Cogagent: a visual language model for gui agents,” in CVPR, 2024, pp. 14281–14290

  66. [67]

    Less is more: Empowering gui agent with context-aware simplification,

    G. Chen, X. Zhou, R. Shao, Y. Lyu, K. Zhou, S. Wang, W. Li, Y. Li, Z. Qi, and L. Nie, “Less is more: Empowering gui agent with context-aware simplification,” in ICCV, 2025

  67. [68]

    Puma: Layer-pruned language model for efficient unified multimodal retrieval with modality-adaptive learning,

    Y. Lyu, R. Shao, G. Chen, Y. Zhu, W. Guan, and L. Nie, “Puma: Layer-pruned language model for efficient unified multimodal retrieval with modality-adaptive learning,” in ACM MM, 2025

  68. [69]

    Emosym: A symbiotic framework for unified emotional understanding and generation via latent reasoning,

    Y. Zhu, Y. Lyu, Z. Yu, R. Shao, K. Zhou, and L. Nie, “Emosym: A symbiotic framework for unified emotional understanding and generation via latent reasoning,” in ACM MM, 2025

  69. [70]

    Gui-explorer: Autonomous exploration and mining of transition-aware knowledge for gui agent,

    B. Xie, R. Shao, G. Chen, K. Zhou, Y. Li, J. Liu, M. Zhang, and L. Nie, “Gui-explorer: Autonomous exploration and mining of transition-aware knowledge for gui agent,” in ACL, 2025

  70. [71]

    Robust sequential deepfake detection,

    R. Shao, T. Wu, and Z. Liu, “Robust sequential deepfake detection,” International Journal of Computer Vision, vol. 133, pp. 3278–3295, 2025

  71. [72]

    Detecting and grounding multi-modal media manipulation and beyond,

    R. Shao, T. Wu, J. Wu, L. Nie, and Z. Liu, “Detecting and grounding multi-modal media manipulation and beyond,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, pp. 5556–5574, 2024

  72. [73]

    Detecting and grounding multi-modal media manipulation,

    R. Shao, T. Wu, and Z. Liu, “Detecting and grounding multi-modal media manipulation,” in CVPR, 2023

  73. [74]

    Multi-adversarial discriminative deep domain generalization for face presentation attack detection,

    R. Shao, X. Lan, J. Li, and P. C. Yuen, “Multi-adversarial discriminative deep domain generalization for face presentation attack detection,” in CVPR, 2019

  74. [75]

    Deepfake-adapter: Dual-level adapter for deepfake detection,

    R. Shao, T. Wu, L. Nie, and Z. Liu, “Deepfake-adapter: Dual-level adapter for deepfake detection,” International Journal of Computer Vision, vol. 133, pp. 3613–3628, 2025

  75. [76]

    Spa-bench: A comprehensive benchmark for smartphone agent evaluation,

    J. Chen, D. Yuen, B. Xie, Y. Yang, G. Chen, Z. Wu, L. Yixing, X. Zhou, W. Liu, S. Wang, K. Zhou, R. Shao, L. Nie, Y. Wang, J. HAO, J. Wang, and K. Shao, “Spa-bench: A comprehensive benchmark for smartphone agent evaluation,” in ICLR, 2025

  76. [77]

    Llava-onevision: easy visual task transfer,

    B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu, and C. Li, “Llava-onevision: easy visual task transfer,” TMLR, 2025

  77. [78]

    Qwen2.5-VL Technical Report

    S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang et al., “Qwen2.5-vl technical report,” arXiv:2502.13923, 2025

  78. [79]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    W. Huang, B. Jia, Z. Zhai, S. Cao, Z. Ye, F. Zhao, Z. Xu, Y. Hu, and S. Lin, “Vision-r1: incentivizing reasoning capability in multimodal large language models,” arXiv:2503.06749, 2025

  79. [80]

    Cliport: what and where pathways for robotic manipulation,

    M. Shridhar, L. Manuelli, and D. Fox, “Cliport: what and where pathways for robotic manipulation,” in CoRL, 2022, pp. 894–906

  80. [81]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, others, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in ICML, 2021, pp. 8748–8763

Showing first 80 references.