Large VLM-based Vision-Language-Action Models for Robotic Manipulation: A Survey
Pith reviewed 2026-05-17 20:24 UTC · model grok-4.3
The pith
Large VLM-based VLA models for robotic manipulation can be systematically classified into monolithic and hierarchical architectures.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that large VLM-based VLA models for robotic manipulation are best understood through two principal architectural paradigms: monolithic models, which encompass single-system and dual-system designs with differing levels of integration, and hierarchical models, which explicitly decouple planning from execution using interpretable intermediate representations. This structure then frames an examination of how the models integrate with advanced domains (reinforcement learning, training-free optimization, learning from human videos, and world models) and a synthesis of characteristics across models.
What carries the argument
The taxonomy of monolithic (single-system and dual-system) versus hierarchical models, which organizes the field by levels of integration and separation of planning and execution.
If this is right
- Models can be more easily compared and developed based on their architectural type.
- Integration with reinforcement learning and world models becomes a key area for advancing capabilities.
- Future work will focus on memory mechanisms, 4D perception, and efficient adaptation.
- Research fragmentation decreases as inconsistencies in taxonomies are resolved.
Where Pith is reading between the lines
- Applying this taxonomy to emerging models could reveal new hybrid architectures not yet considered.
- This classification might help in designing benchmarks that test specific aspects of monolithic versus hierarchical approaches.
- Connections to broader embodied AI could lead to standardized evaluation protocols across related fields.
Load-bearing premise
The proposed taxonomy into monolithic and hierarchical models, along with the listed integration domains, fully captures the current state of the field without major omissions or problematic overlaps.
What would settle it
The discovery of a significant number of VLA models that cannot be classified into either the monolithic or hierarchical categories, or that show substantial overlap between categories, would challenge the taxonomy's utility.
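The test described above can be made concrete as a small audit over a model-to-category assignment. The following is a minimal sketch, not a procedure from the paper: the category labels mirror the survey's taxonomy, while `audit_taxonomy`, `TaxonomyAudit`, and the placeholder model names are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass, field

# Labels mirror the survey's taxonomy; the exact strings are an assumption.
CATEGORIES = (
    "monolithic/single-system",
    "monolithic/dual-system",
    "hierarchical",
)


@dataclass
class TaxonomyAudit:
    counts: dict = field(default_factory=dict)        # models per category
    unclassified: list = field(default_factory=list)  # exhaustiveness failures
    overlapping: list = field(default_factory=list)   # non-overlap failures


def audit_taxonomy(assignments):
    """Check an assignment {model: set of categories} for gaps and overlaps."""
    valid = set(CATEGORIES)
    report = TaxonomyAudit(counts={c: 0 for c in CATEGORIES})
    for model in sorted(assignments):
        labels = set(assignments[model])
        unknown = labels - valid
        if unknown:
            raise ValueError(f"{model}: unknown categories {sorted(unknown)}")
        if not labels:
            report.unclassified.append(model)   # fits no category
        elif len(labels) > 1:
            report.overlapping.append(model)    # fits more than one category
        for label in labels:
            report.counts[label] += 1
    return report


# Hypothetical placeholder models, not taken from the survey.
example = {
    "model_a": {"monolithic/single-system"},
    "model_b": {"hierarchical"},
    "model_c": {"monolithic/dual-system", "hierarchical"},
    "model_d": set(),
}
report = audit_taxonomy(example)
print(report)
```

A large `unclassified` or `overlapping` list relative to the number of audited models would be the kind of evidence that challenges the taxonomy. What counts as "significant" is left open here, as it is in the text above.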
Original abstract
Robotic manipulation, a key frontier in robotics and embodied AI, requires precise motor control and multimodal understanding, yet traditional rule-based methods fail to scale or generalize in unstructured, novel environments. In recent years, Vision-Language-Action (VLA) models, built upon Large Vision-Language Models (VLMs) pretrained on vast image-text datasets, have emerged as a transformative paradigm. This survey provides the first systematic, taxonomy-oriented review of large VLM-based VLA models for robotic manipulation. We begin by clearly defining large VLM-based VLA models and delineating two principal architectural paradigms: (1) monolithic models, encompassing single-system and dual-system designs with differing levels of integration; and (2) hierarchical models, which explicitly decouple planning from execution via interpretable intermediate representations. Building on this foundation, we present an in-depth examination of large VLM-based VLA models: (1) integration with advanced domains, including reinforcement learning, training-free optimization, learning from human videos, and world model integration; (2) synthesis of distinctive characteristics, consolidating architectural traits, operational strengths, and the datasets and benchmarks that support their development; (3) identification of promising directions, including memory mechanisms, 4D perception, efficient adaptation, multi-agent cooperation, and other emerging capabilities. This survey consolidates recent advances to resolve inconsistencies in existing taxonomies, mitigate research fragmentation, and fill a critical gap through the systematic integration of studies at the intersection of large VLMs and robotic manipulation. We provide a regularly updated project page to document ongoing progress: https://github.com/JiuTian-VL/Large-VLM-based-VLA-for-Robotic-Manipulation
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This survey claims to deliver the first systematic, taxonomy-oriented review of large VLM-based Vision-Language-Action (VLA) models for robotic manipulation. It defines such models, delineates two principal paradigms—monolithic models (single-system and dual-system) versus hierarchical models that explicitly decouple planning from execution via intermediate representations—and examines their integration with reinforcement learning, training-free optimization, human videos, and world models. The work further synthesizes architectural traits, strengths, datasets, and benchmarks, identifies future directions including memory mechanisms, 4D perception, efficient adaptation, and multi-agent cooperation, and positions itself as resolving inconsistencies in prior taxonomies while filling a critical gap, with an accompanying regularly updated GitHub project page.
Significance. If the taxonomy is shown to be both exhaustive and non-overlapping, the survey would constitute a timely consolidation of a fast-moving intersection between large VLMs and robotic manipulation. It would help mitigate fragmentation by offering a structured lens on architectural choices and integration strategies, and the maintained project page is a concrete strength that supports ongoing utility for the community.
major comments (2)
- [Abstract / Taxonomy definition] Abstract and opening taxonomy delineation: the central claim that the monolithic (single/dual-system) versus hierarchical split resolves inconsistencies in existing taxonomies rests on the assertion that these categories are exhaustive and non-overlapping, yet no enumeration of reviewed models, no count of papers per category, and no explicit discussion of boundary cases (e.g., end-to-end models that still emit intermediate 4D or memory tokens) is supplied. Without this, the utility of the taxonomy cannot be evaluated.
- [Integration with advanced domains] Integration domains section: the four listed domains (RL, training-free optimization, human videos, world models) are presented as key integration areas, but the manuscript supplies no explicit justification or coverage check showing that these domains capture the dominant variants without significant omissions or forced overlaps that would undermine the taxonomy's claimed resolution of fragmentation.
minor comments (2)
- [Abstract] The abstract states that the survey 'consolidates recent advances' but does not indicate the time window or search methodology used; adding a brief methods paragraph would improve transparency for a systematic review.
- [Synthesis of distinctive characteristics] A figure or table that maps specific models to the proposed taxonomy categories would aid readability; currently the synthesis of characteristics appears to rely on prose alone.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify how to better demonstrate the taxonomy's utility. We address each major point below and will incorporate revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract / Taxonomy definition] Abstract and opening taxonomy delineation: the central claim that the monolithic (single/dual-system) versus hierarchical split resolves inconsistencies in existing taxonomies rests on the assertion that these categories are exhaustive and non-overlapping, yet no enumeration of reviewed models, no count of papers per category, and no explicit discussion of boundary cases (e.g., end-to-end models that still emit intermediate 4D or memory tokens) is supplied. Without this, the utility of the taxonomy cannot be evaluated.
  Authors: We agree that an explicit enumeration and counts would make the taxonomy's scope and non-overlapping nature more transparent. In the revision we will add a summary table listing representative models under each subcategory (monolithic single-system, monolithic dual-system, and hierarchical), with approximate paper counts drawn from the surveyed literature. We will also add a dedicated paragraph discussing boundary cases, including models that emit intermediate 4D or memory tokens yet remain architecturally monolithic, to clarify distinctions and any residual overlaps.
  Revision: yes
- Referee: [Integration with advanced domains] Integration domains section: the four listed domains (RL, training-free optimization, human videos, world models) are presented as key integration areas, but the manuscript supplies no explicit justification or coverage check showing that these domains capture the dominant variants without significant omissions or forced overlaps that would undermine the taxonomy's claimed resolution of fragmentation.
  Authors: The four domains were chosen because they correspond to the most frequently explored integration strategies in the current VLA literature. To make this explicit, the revised section will include a short justification paragraph, a coverage summary indicating the proportion of surveyed works falling into each domain, and a brief note on potential overlaps (e.g., RL combined with world models) as well as emerging areas such as multi-agent cooperation that are already flagged in the future-directions section.
  Revision: yes
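The coverage summary promised in the second response could be computed along the following lines. This is a hedged sketch under stated assumptions: the domain names come from the abstract, but `coverage_summary` and the tagged works are hypothetical, and the survey does not describe how its own coverage figures would be derived.

```python
from collections import Counter
from itertools import combinations

# Domain names follow the abstract; the exact strings are an assumption.
DOMAINS = (
    "reinforcement learning",
    "training-free optimization",
    "learning from human videos",
    "world models",
)


def coverage_summary(tags):
    """Summarize {work: set of domains} as shares, pairwise overlaps, and gaps."""
    valid = set(DOMAINS)
    per_domain = Counter()
    pair_overlap = Counter()
    uncovered = []
    for work in sorted(tags):
        known = set(tags[work]) & valid
        if not known:
            uncovered.append(work)  # falls outside all four domains
        per_domain.update(known)
        # Count each unordered pair of domains that one work combines.
        pair_overlap.update(combinations(sorted(known), 2))
    total = len(tags)
    share = {d: (per_domain[d] / total if total else 0.0) for d in DOMAINS}
    return share, dict(pair_overlap), uncovered


# Hypothetical tagged works, for illustration only.
tagged = {
    "work_1": {"reinforcement learning"},
    "work_2": {"reinforcement learning", "world models"},
    "work_3": {"multi-agent cooperation"},
    "work_4": {"learning from human videos"},
}
share, overlaps, uncovered = coverage_summary(tagged)
print(share, overlaps, uncovered)
```

Reporting `uncovered` alongside the shares addresses the referee's concern about omissions, and the pairwise counts expose overlaps such as RL combined with world models.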
Circularity Check
Survey taxonomy organizes external literature without self-referential derivation or fitted predictions
This is a review paper whose central contribution is a proposed taxonomy (monolithic single/dual-system vs. hierarchical models) and synthesis of integration domains drawn from the existing literature. No new quantitative results, equations, or parameters are derived from the authors' own fitted values or self-citations. The abstract explicitly frames the work as consolidating external advances to resolve inconsistencies, which is standard survey practice and does not reduce any claim to an input by construction. No load-bearing self-citation chains or ansatzes are present in the provided text.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large VLM-based VLA models are those built upon pretrained large vision-language models for robotic manipulation tasks.
Forward citations
Cited by 18 Pith papers
- Dynamic Execution Commitment of Vision-Language-Action Models
  A3 determines the execution horizon in VLA models as the longest prefix of actions that passes consensus-based verification and sequential consistency checks.
- CF-VLA: Efficient Coarse-to-Fine Action Generation for Vision-Language-Action Policies
  CF-VLA uses a coarse initialization over endpoint velocity followed by single-step refinement to achieve strong performance with low inference steps on CALVIN, LIBERO, and real-robot tasks.
- GaLa: Hypergraph-Guided Visual Language Models for Procedural Planning
  GaLa uses hypergraph representations of objects and a TriView encoder with contrastive learning to improve vision-language models on procedural planning benchmarks.
- RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
  A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
- ConsisVLA-4D: Advancing Spatiotemporal Consistency in Efficient 3D-Perception and 4D-Reasoning for Robotic Manipulation
  ConsisVLA-4D adds cross-view semantic alignment, cross-object geometric fusion, and cross-scene dynamic reasoning to VLA models, delivering 21.6% and 41.5% gains plus 2.3x and 2.4x speedups on LIBERO and real-world tasks.
- Task-Aware Scanning Parameter Configuration for Robotic Inspection Using Vision Language Embeddings and Hyperdimensional Computing
  ScanHD achieves 92.7% exact accuracy and 98.1% Win@1 accuracy in recommending discrete scanning parameters from instructions and images on a new real-world dataset.
- FASTER: Rethinking Real-Time Flow VLAs
  FASTER uses a horizon-aware flow sampling schedule to compress immediate-action denoising to one step, slashing effective reaction latency in real-robot VLA deployments.
- Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions
  VLMs show systematic drops in counting accuracy as visual and linguistic complexity rise, with modest gains from targeted attention reweighting in the decoder.
- A Semantic Autonomy Framework for VLM-Integrated Indoor Mobile Robots: Hybrid Deterministic Reasoning and Cross-Robot Adaptive Memory
  The Semantic Autonomy Stack combines a seven-step parametric resolver handling 88% of instructions in under 0.1 ms with VLM escalation and a five-category cross-robot memory system, achieving 100% accuracy and 103,000...
- HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
  HiVLA decouples VLM-based semantic planning from DiT-based motor control via structured plans and cascaded cross-attention to outperform end-to-end VLA baselines in long-horizon and fine-grained manipulation.
- HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
  HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.
- EmbodiedClaw: Conversational Workflow Execution for Embodied AI Development
  EmbodiedClaw automates embodied AI development workflows through conversation, reducing manual effort and improving consistency and reproducibility.
- Evolvable Embodied Agent for Robotic Manipulation via Long Short-Term Reflection and Optimization
  EEAgent with LSTRO sets new state-of-the-art results on six VIMA-Bench robotic manipulation tasks by dynamically refining prompts through reflection on successes and failures.
- CoEnv: Driving Embodied Multi-Agent Collaboration via Compositional Environment
  CoEnv introduces a compositional environment that integrates real and simulated spaces for multi-agent robotic collaboration, using real-to-sim reconstruction, VLM action synthesis, and validated sim-to-real transfer ...
- From Video to Control: A Survey of Learning Manipulation Interfaces from Temporal Visual Data
  A survey introduces an interface-centric taxonomy for video-to-control methods in robotic manipulation and identifies the robotics integration layer as the central open challenge.
- XEmbodied: A Foundation Model with Enhanced Geometric and Physical Cues for Large-Scale Embodied Environments
  XEmbodied is a foundation model that integrates 3D geometric and physical signals into VLMs using a 3D Adapter and Efficient Image-Embodied Adapter, plus progressive curriculum and RL post-training, to improve spatial...
- TorchUMM: A Unified Multimodal Model Codebase for Evaluation, Analysis, and Post-training
  TorchUMM is the first unified codebase and benchmark suite for standardized evaluation of diverse unified multimodal models on understanding, generation, and editing tasks.
- Vision-Language-Action in Robotics: A Survey of Datasets, Benchmarks, and Data Engines
  A survey of VLA robotics research identifies data infrastructure as the primary bottleneck and distills four open challenges in representation alignment, multimodal supervision, reasoning assessment, and scalable data...
Reference graph
Works this paper leans on
-
[1]
A survey on robotics with foundation models: toward embodied ai
Z. Xu, K. Wu, J. Wen, J. Li, N. Liu, Z. Che, and J. Tang, “A survey on robotics with foundation models: toward embodied ai,” arXiv:2402.02385, 2024
-
[2]
A Survey on Vision-Language-Action Models for Embodied AI
Y. Ma, Z. Song, Y. Zhuang, J. Hao, and I. King, “A survey on vision-language-action models for embodied ai,” arXiv:2405.14093, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[3]
Metaurban: a simulation platform for embodied ai in urban spaces,
W. Wu, H. He, Y. Wang, C. Duan, J. He, Z. Liu, Q. Li, and B. Zhou, “Metaurban: a simulation platform for embodied ai in urban spaces,” in ICLR, 2025
work page 2025
-
[4]
Generative artificial intelligence in robotic manipulation: a survey,
K. Zhang, P . Yun, J. Cen, J. Cai, D. Zhu, H. Yuan, C. Zhao, T. Feng, M. Y. Wang, Q. Chen et al. , “Generative artificial intelligence in robotic manipulation: a survey,” arXiv:2503.03464, 2025
-
[5]
Aligning cyber space with physical world: a comprehensive survey on embodied ai,
Y. Liu, W. Chen, Y. Bai, X. Liang, G. Li, W. Gao, and L. Lin, “Aligning cyber space with physical world: a comprehensive survey on embodied ai,” arXiv:2407.06886, 2024
-
[6]
A survey of embodied ai in healthcare: techniques, applications, and opportunities,
Y. Liu, X. Cao, T. Chen, Y. Jiang, J. You, M. Wu, X. Wang, M. Feng, Y. Jin, and J. Chen, “A survey of embodied ai in healthcare: techniques, applications, and opportunities,” INFORM FUSION, vol. 119, p. 103033, 2025
work page 2025
-
[7]
Machine learning meets advanced robotic manipu- lation,
S. Nahavandi, R. Alizadehsani, D. Nahavandi, C. P . Lim, K. Kelly, and F. Bello, “Machine learning meets advanced robotic manipu- lation,” INFORM FUSION, vol. 105, p. 102221, 2024
work page 2024
-
[8]
Vision-language-action models: Concepts, progress, applications and challenges
R. Sapkota, Y. Cao, K. I. Roumeliotis, and M. Karkee, “Vision- language-action models: concepts, progress, applications and challenges,” arXiv:2505.04769, 2025
-
[9]
Trends and challenges in robot manip- ulation,
A. Billard and D. Kragic, “Trends and challenges in robot manip- ulation,” Science, vol. 364, p. eaat8414, 2019
work page 2019
-
[10]
Any-point trajectory modeling for policy learning,
C. Wen, X. Lin, J. So, K. Chen, Q. Dou, Y. Gao, and P . Abbeel, “Any-point trajectory modeling for policy learning,” in RSS, 2024, p. 92
work page 2024
-
[11]
Instruction-driven history-aware policies for robotic manipulations,
P .-L. Guhur, S. Chen, R. G. Pinel, M. Tapaswi, I. Laptev, and C. Schmid, “Instruction-driven history-aware policies for robotic manipulations,” in CoRL, 2023, pp. 175–187
work page 2023
-
[12]
Flow as the cross-domain manipulation interface,
M. Xu, Z. Xu, Y. Xu, C. Chi, G. Wetzstein, M. Veloso, and S. Song, “Flow as the cross-domain manipulation interface,” in CoRL, 2024
work page 2024
-
[13]
Rt-trajectory: Robotic task generalization via hindsight trajectory sketches,
J. Gu, S. Kirmani, P . Wohlhart, Y. Lu, M. G. Arenas, K. Rao, W. Yu, C. Fu, K. Gopalakrishnan, Z. Xu et al., “Rt-trajectory: Robotic task generalization via hindsight trajectory sketches,” in ICLR, 2024, pp. 2475–2499
work page 2024
-
[14]
J. Zhao, N. Kuppuswamy, S. Feng, B. Burchfiel, and E. Adelson, “Polytouch: A robust multi-modal tactile sensor for contact-rich manipulation using tactile-diffusion policies,” in ICRA, 2025
work page 2025
-
[15]
V . Mengers and O. Brock, “No plan but everything under control: Robustly solving sequential tasks with dynamically composed gradient descent,” in ICRA, 2025
work page 2025
-
[16]
Star: Learning diverse robot skill abstractions through rotation- augmented vector quantization,
H. Li, Q. Lv, R. Shao, X. Deng, Y. Li, J. Hao, and L. Nie, “Star: Learning diverse robot skill abstractions through rotation- augmented vector quantization,” in ICML, 2025
work page 2025
-
[17]
Lion: empow- ering multimodal large language model with dual-level visual knowledge,
G. Chen, L. Shen, R. Shao, X. Deng, and L. Nie, “Lion: empow- ering multimodal large language model with dual-level visual knowledge,” in CVPR, 2024, pp. 26540–26550
work page 2024
-
[18]
Improved baselines with visual instruction tuning,
H. Liu, C. Li, Y. Li, and Y. J. Lee, “Improved baselines with visual instruction tuning,” in CVPR, 2024, pp. 26296–26306
work page 2024
-
[19]
Instructblip: towards general-purpose vision-language models with instruction tuning,
W. Dai, J. Li, D. Li, A. Tiong, J. Zhao, W. Wang, and S. Hoi, “Instructblip: towards general-purpose vision-language models with instruction tuning,” in NeurIPS, 2023, pp. 49250–49267
work page 2023
-
[20]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
J. Bai, S. Bai, S. Yang, S. Wang, S. Tan, P . Wang, J. Lin, C. Zhou, and J. Zhou, “Qwen-vl: a versatile vision-language model for understanding, localization, text reading, and be- yond,” arXiv:2308.12966, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[21]
H. Liu, C. Li, Q. Wu, and Y. J. Lee, “Visual instruction tuning,” in NeurIPS, 2023, pp. 34892–34916
work page 2023
-
[22]
Z. Chen, W. Wang, Y. Cao, Y. Liu, Z. Gao, E. Cui, J. Zhu, S. Ye, H. Tian, Z. Liu et al. , “Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling,” arXiv:2412.05271, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[23]
Monkey: image resolution and text label are important things for large multi-modal models,
Z. Li, B. Yang, Q. Liu, Z. Ma, S. Zhang, J. Yang, Y. Sun, Y. Liu, and X. Bai, “Monkey: image resolution and text label are important things for large multi-modal models,” in CVPR, 2024, pp. 26763– 26773
work page 2024
-
[24]
R. Zhang, R. Shao, G. Chen, M. Zhang, K. Zhou, W. Guan, and L. Nie, “Falcon: Resolving visual redundancy and fragmentation in high-resolution multimodal large language models via visual registers,” in ICCV, 2025
work page 2025
-
[25]
Mome: Mixture of multimodal experts for generalist multimodal large language models,
L. Shen, G. Chen, R. Shao, W. Guan, and L. Nie, “Mome: Mixture of multimodal experts for generalist multimodal large language models,” in NeurIPS, 2024
work page 2024
-
[26]
Openvla: an open-source vision-language-action model,
M. J. Kim, K. Pertsch, S. Karamcheti, T. Xiao, A. Balakrishna, S. Nair, R. Rafailov, E. Foster, G. Lam, P . Sanketiet al., “Openvla: an open-source vision-language-action model,” in CoRL, 2024, pp. 2679–2713
work page 2024
-
[27]
Rt-2: vision-language-action models transfer web knowledge to robotic control,
B. Zitkovich, T. Yu, S. Xu, P . Xu, T. Xiao, F. Xia, others, and K. Han, “Rt-2: vision-language-action models transfer web knowledge to robotic control,” in CoRL, 2023, pp. 2165–2183
work page 2023
-
[28]
Rt-h: action hierarchies using language,
S. Belkhale, T. Ding, T. Xiao, P . Sermanet, Q. Vuong, J. Tompson, Y. Chebotar, D. Dwibedi, and D. Sadigh, “Rt-h: action hierarchies using language,” in RSS, 2024
work page 2024
-
[29]
π0: A vision- language-action flow model for general robot control,
K. Black, N. Brown, D. Driess, A. Esmail, M. Equi, C. Finn, N. Fusai, L. Groom, K. Hausman, B. Ichter et al. , “ π0: A vision- language-action flow model for general robot control,” in RSS, 2025
work page 2025
-
[31]
SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics
M. Shukor, D. Aubakirova, F. Capuano, P . Kooijmans, S. Palma, A. Zouitine, M. Aractingi, C. Pascal, M. Russi, A. Marafioti et al. , “Smolvla: a vision-language-action model for affordable and efficient robotics,” arXiv:2506.01844, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[32]
GR00T N1: An Open Foundation Model for Generalist Humanoid Robots
J. Bjorck, F. Casta ˜neda, N. Cherniadev, X. Da, R. Ding, L. Fan, Y. Fang, D. Fox, F. Hu, S. Huang et al., “Gr00t n1: an open foun- dation model for generalist humanoid robots,” arXiv:2503.14734, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
Cot-vla: visual chain-of-thought reasoning for vision- language-action models,
Q. Zhao, Y. Lu, M. J. Kim, Z. Fu, Z. Zhang, Y. Wu, Z. Li, Q. Ma, S. Han, C. Finn, A. Handa, T.-Y. Lin, G. Wetzstein, M.-Y. Liu, and D. Xiang, “Cot-vla: visual chain-of-thought reasoning for vision- language-action models,” in CVPR, 2025, pp. 1702–1713
work page 2025
-
[34]
HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model
J. Liu, H. Chen, P . An, Z. Liu, R. Zhang, C. Gu, X. Li, Z. Guo, S. Chen, M. Liu et al., “Hybridvla: collaborative diffusion and autoregression in a unified vision-language-action model,” arXiv:2503.10631, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
P . Li, Y. Chen, H. Wu, X. Ma, X. Wu, Y. Huang, L. Wang, T. Kong, and T. Tan, “Bridgevla: input-output alignment for efficient 3d manipulation learning with vision-language models,” arXiv:2506.07961, 2025
-
[36]
Deer-vla: dynamic inference of multimodal large language models for efficient robot execution,
Y. Yue, Y. Wang, B. Kang, Y. Han, S. Wang, S. Song, J. Feng, and G. Huang, “Deer-vla: dynamic inference of multimodal large language models for efficient robot execution,” in NeurIPS, 2024, pp. 56619–56643
work page 2024
-
[37]
J. T. S. Danny Driess, L. Y. Brian Ichter, K. P . Adrian Li-Bell, H. W. Allen Z. Ren, L. X. S. Quan Vuong, and S. Levine, “Knowledge insulating vision-language-action models: train fast, run fast, generalize better,” arXiv:2505.23705, 2025
-
[38]
WorldVLA: Towards Autoregressive Action World Model
J. Cen, C. Yu, H. Yuan, Y. Jiang, S. Huang, J. Guo, X. Li, Y. Song, H. Luo, F. Wang et al., “Worldvla: towards autoregressive action world model,” arXiv:2506.21539, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[39]
J. Zhang, Y. Luo, A. Anwar, S. A. Sontakke, J. J. Lim, J. Thomason, E. Biyik, and J. Zhang, “Rewind: language-guided rewards teach JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 18 robot policies without new demonstrations,” arXiv:2505.10911, 2025
-
[40]
Fast-in-slow: a dual-system founda- tion model unifying fast manipulation within slow reasoning,
H. Chen, J. Liu, C. Gu, Z. Liu, R. Zhang, X. Li, X. He, Y. Guo, C.-W. Fu, S. Zhang et al. , “Fast-in-slow: a dual-system founda- tion model unifying fast manipulation within slow reasoning,” arXiv:2506.01953, 2025
-
[41]
Real-Time Execution of Action Chunking Flow Policies
K. Black, M. Y. Galliker, and S. Levine, “Real-time execution of action chunking flow policies,” arXiv:2506.07339, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[42]
Mitigat- ing the human-robot domain discrepancy in visual pre-training for robotic manipulation,
J. Zhou, T. Ma, K. Y. Lin, Z. Wang, R. Qiu, and J. Liang, “Mitigat- ing the human-robot domain discrepancy in visual pre-training for robotic manipulation,” in CVPR, 2025, pp. 22551–22561
work page 2025
-
[43]
World4omni: a zero-shot framework from image gen- eration world model to robotic manipulation,
H. Chen, B. Wang, J. Guo, T. Zhang, Y. Hou, X. Huang, C. Tie, and L. Shao, “World4omni: a zero-shot framework from image gen- eration world model to robotic manipulation,” arXiv:2506.23919, 2025
-
[44]
Fine-tuning vision-language- action models: optimizing speed and success,
M. J. Kim, C. Finn, and P . Liang, “Fine-tuning vision-language- action models: optimizing speed and success,” in RSS, 2025
work page 2025
-
[45]
NORA: A Small Open-Sourced Generalist Vision Language Action Model for Embodied Tasks
C.-Y. Hung, Q. Sun, P . Hong, A. Zadeh, C. Li, U. Tan, N. Majumder, S. Poria et al. , “Nora: a small open-sourced generalist vision language action model for embodied tasks,” arXiv:2504.19854, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[46]
Z. Song, G. Ouyang, M. Li, Y. Ji, C. Wang, Z. Xu, Z. Zhang, X. Zhang, Q. Jiang, Z. Chen et al. , “Maniplvm-r1: reinforcement learning for reasoning in embodied manipulation with large vision-language models,” arXiv:2505.16517, 2025
-
[47]
Embodied-reasoner: synergizing visual search, reasoning, and action for embodied interactive tasks,
W. Zhang, M. Wang, G. Liu, X. Huixin, Y. Jiang, Y. Shen, G. Hou, Z. Zheng, H. Zhang, X. Li et al., “Embodied-reasoner: synergizing visual search, reasoning, and action for embodied interactive tasks,” arXiv:2503.21696, 2025
-
[48]
Hamster: hierarchical action models for open-world robot manipulation,
Y. Li, Y. Deng, J. Zhang, J. Jang, M. Memmel, R. Yu, C. R. Garrett, F. Ramos, D. Fox, A. Li et al. , “Hamster: hierarchical action models for open-world robot manipulation,” in ICLR, 2025
work page 2025
-
[49]
A0: an affordance-aware hierarchical model for general robotic manipulation,
R. Xu, J. Zhang, M. Guo, Y. Wen, H. Yang, M. Lin, J. Huang, Z. Li, K. Zhang, L. Wang et al. , “A0: an affordance-aware hierarchical model for general robotic manipulation,” arXiv:2504.12636, 2025
-
[50]
Rekep: spatio-temporal reasoning of relational keypoint constraints for robotic manipulation,
W. Huang, C. Wang, Y. Li, R. Zhang, and L. Fei-Fei, “Rekep: spatio-temporal reasoning of relational keypoint constraints for robotic manipulation,” in CoRL, 2024
work page 2024
-
[51]
A survey of vision-language pre-trained models,
Y. Du, Z. Liu, J. Li, and W. X. Zhao, “A survey of vision-language pre-trained models,” in IJCAI, 2022
work page 2022
-
[52]
Vision-language models for vision tasks: A survey,
J. Zhang, J. Huang, S. Jin, and S. Lu, “Vision-language models for vision tasks: A survey,” TP AMI, vol. 46, pp. 5625–5644, 2024
work page 2024
-
[53]
Multimodal large language models: A survey,
J. Wu, W. Gan, Z. Chen, S. Wan, and P . S. Yu, “Multimodal large language models: A survey,” in BigData, 2023, pp. 2247–2256
work page 2023
-
[54]
Z. Li, X. Wu, H. Du, F. Liu, H. Nghiem, and G. Shi, “A survey of state of the art large vision language models: Alignment, benchmark, evaluations and challenges,” arXiv:2501.02189, 2025
-
[55]
Large vision-language model alignment and misalignment: A survey through the lens of explainability,
D. Shu, H. Zhao, J. Hu, W. Liu, A. Payani, L. Cheng, and M. Du, “Large vision-language model alignment and misalignment: A survey through the lens of explainability,”arXiv:2501.01346, 2025
-
[56]
A survey on diffusion policy for robotic manipulation: Taxonomy, analysis, and future directions,
M. Song, X. Deng, Z. Zhou, J. Wei, W. Guan, and L. Nie, “A survey on diffusion policy for robotic manipulation: Taxonomy, analysis, and future directions,” Authorea Preprints, 2025
work page 2025
-
[57]
Diffusion models for robotic manipulation: A survey,
R. Wolf, Y. Shi, S. Liu, and R. Rayyes, “Diffusion models for robotic manipulation: A survey,” arXiv:2504.08438, 2025
-
[58]
C. Cui, P . Ding, W. Song, S. Bai, X. Tong, Z. Ge, R. Suo, W. Zhou, Y. Liu, B. Jia et al., “Openhelix: A short survey, empirical analysis, and open-source dual-system vla model for robotic manipula- tion,” arXiv:2505.03912, 2025
-
[59]
A survey of embodied learning for object-centric robotic manipulation,
Y. Zheng, L. Yao, Y. Su, Y. Zhang, Y. Wang, S. Zhao, Y. Zhang, and L.-P . Chau, “A survey of embodied learning for object-centric robotic manipulation,” MIR, vol. 22, pp. 588–626, 2025
work page 2025
-
[60]
Lion-fs: Fast & slow video-language thinker as online video assistant,
W. Li, B. Hu, R. Shao, L. Shen, and L. Nie, “Lion-fs: Fast & slow video-language thinker as online video assistant,” in CVPR, 2025
work page 2025
-
[61]
Optimus-1: Hybrid multimodal memory empowered agents excel in long- horizon tasks,
Z. Li, Y. Xie, R. Shao, G. Chen, D. Jiang, and L. Nie, “Optimus-1: Hybrid multimodal memory empowered agents excel in long- horizon tasks,” in NeurIPS, 2024
work page 2024
-
[62]
Optimus-2: Multimodal minecraft agent with goal- observation-action conditioned policy,
——, “Optimus-2: Multimodal minecraft agent with goal- observation-action conditioned policy,” in CVPR, 2025
work page 2025
-
[63]
Q. Ye, Z. Yu, R. Shao, X. Xie, P . Torr, and X. Cao, “Cat: Enhancing multimodal large language model to answer questions in dy- namic audio-visual scenarios,” in ECCV, 2024
work page 2024
-
[64]
Cat+: investigating and enhancing audio-visual understanding in large language models,
Q. Ye, Z. Yu, R. Shao, Y. Cui, X. Kang, X. Liu, P. Torr, and X. Cao, “Cat+: investigating and enhancing audio-visual understanding in large language models,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025
-
[65]
Drivevlm: the convergence of autonomous driving and large vision-language models,
X. Tian, J. Gu, B. Li, Y. Liu, Y. Wang, Z. Zhao, K. Zhan, P. Jia, X. Lang, and H. Zhao, “Drivevlm: the convergence of autonomous driving and large vision-language models,” in CoRL, 2025, pp. 4698–4726
-
[66]
Cogagent: a visual language model for gui agents,
W. Hong, W. Wang, Q. Lv, J. Xu, W. Yu, J. Ji, Y. Wang, Z. Wang, Y. Dong, M. Ding et al., “Cogagent: a visual language model for gui agents,” in CVPR, 2024, pp. 14281–14290
-
[67]
Less is more: Empowering gui agent with context-aware simplification,
G. Chen, X. Zhou, R. Shao, Y. Lyu, K. Zhou, S. Wang, W. Li, Y. Li, Z. Qi, and L. Nie, “Less is more: Empowering gui agent with context-aware simplification,” in ICCV, 2025
-
[68]
Y. Lyu, R. Shao, G. Chen, Y. Zhu, W. Guan, and L. Nie, “Puma: Layer-pruned language model for efficient unified multimodal retrieval with modality-adaptive learning,” in ACM MM, 2025
-
[69]
Y. Zhu, Y. Lyu, Z. Yu, R. Shao, K. Zhou, and L. Nie, “Emosym: A symbiotic framework for unified emotional understanding and generation via latent reasoning,” in ACM MM, 2025
-
[70]
Gui-explorer: Autonomous exploration and mining of transition-aware knowledge for gui agent,
B. Xie, R. Shao, G. Chen, K. Zhou, Y. Li, J. Liu, M. Zhang, and L. Nie, “Gui-explorer: Autonomous exploration and mining of transition-aware knowledge for gui agent,” in ACL, 2025
-
[71]
Robust sequential deepfake detection,
R. Shao, T. Wu, and Z. Liu, “Robust sequential deepfake detection,” International Journal of Computer Vision, vol. 133, pp. 3278–3295, 2025
-
[72]
Detecting and grounding multi-modal media manipulation and beyond,
R. Shao, T. Wu, J. Wu, L. Nie, and Z. Liu, “Detecting and grounding multi-modal media manipulation and beyond,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, pp. 5556–5574, 2024
-
[73]
Detecting and grounding multi-modal media manipulation,
R. Shao, T. Wu, and Z. Liu, “Detecting and grounding multi-modal media manipulation,” in CVPR, 2023
-
[74]
R. Shao, X. Lan, J. Li, and P. C. Yuen, “Multi-adversarial discriminative deep domain generalization for face presentation attack detection,” in CVPR, 2019
-
[75]
Deepfake-adapter: Dual-level adapter for deepfake detection,
R. Shao, T. Wu, L. Nie, and Z. Liu, “Deepfake-adapter: Dual-level adapter for deepfake detection,” International Journal of Computer Vision, vol. 133, pp. 3613–3628, 2025
-
[76]
Spa-bench: A comprehensive benchmark for smartphone agent evaluation,
J. Chen, D. Yuen, B. Xie, Y. Yang, G. Chen, Z. Wu, L. Yixing, X. Zhou, W. Liu, S. Wang, K. Zhou, R. Shao, L. Nie, Y. Wang, J. HAO, J. Wang, and K. Shao, “Spa-bench: A comprehensive benchmark for smartphone agent evaluation,” in ICLR, 2025
-
[77]
Llava-onevision: easy visual task transfer,
B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, P. Zhang, Y. Li, Z. Liu, and C. Li, “Llava-onevision: easy visual task transfer,” TMLR, 2025
-
[78]
S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang et al., “Qwen2.5-vl technical report,” arXiv:2502.13923, 2025
-
[79]
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
W. Huang, B. Jia, Z. Zhai, S. Cao, Z. Ye, F. Zhao, Z. Xu, Y. Hu, and S. Lin, “Vision-r1: incentivizing reasoning capability in multimodal large language models,” arXiv:2503.06749, 2025
-
[80]
Cliport: what and where pathways for robotic manipulation,
M. Shridhar, L. Manuelli, and D. Fox, “Cliport: what and where pathways for robotic manipulation,” in CoRL, 2022, pp. 894–906
-
[81]
Learning transferable visual models from natural language supervision,
A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, others, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in ICML, 2021, pp. 8748–8763