hub Canonical reference
Less is more: Empowering GUI agent with context-aware simplification
Canonical reference. 100% of citing Pith papers cite this work as background.
citation-role summary
citation-polarity summary
roles
background 5
polarities
background 5
representative citing papers
AT-VLA proposes adaptive tactile injection and a dual-stream tactile reaction mechanism to enhance VLA models for contact-rich robotic manipulation with real-time responses.
VLA-Corrector adds a detect-and-correct inference layer using a latent vision monitor and online gradient guidance to enable adaptive action horizons in chunked VLA policies.
UniTacVLA builds a state-aware and dynamics-aware tactile prior via a unified latent space, tactile chain-of-thought, and a mixed real/predicted feedback controller to boost dexterous manipulation performance.
Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.
T-Rex introduces a large tactile dataset and MoT architecture that achieves over 30% higher success rates than baselines on 12 tasks requiring force control and deformable object handling.
HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
LaST-R1 introduces an RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.
TwinRL expands RL exploration via digital twin reconstruction and twin RL warm-up to guide real-world learning, reaching near-100% success with 20 minutes of on-robot time across four tasks.
IGen generates realistic visuomotor training data including actions and temporally coherent visuals from unstructured open-world images via 3D reconstruction and VLM reasoning.
VLA benchmark success rates cannot distinguish semantic generalization from physical reasoning due to an identifiability gap in current evaluation protocols.
A training-free fusion layer enables stale VLM selections to improve a real-time planner's trajectory scoring for urban sidewalk navigation, yielding 30% ADE reduction in challenging scenarios.
This survey organizes large VLM-based VLA models for robotic manipulation into monolithic and hierarchical paradigms, reviews their integrations and datasets, and outlines future directions.
BehaviorVLA learns long-horizon behavioral representations via a causal Mamba encoder and a phase-conditioned decoder, reporting SOTA results of 58% on RoboTwin 2.0, 98% on LIBERO, and 4.36 on CALVIN, and matching OpenVLA-OFT performance with 50% of the data in sim-to-real transfer.
citing papers explorer
- Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots
  Embodied.cpp introduces a portable C++ inference runtime with modular layers for deploying VLA and WAM models on heterogeneous robots, reporting 100% and 91% task success on two models plus memory reduction on a WAM benchmark.
- AT-VLA: Adaptive Tactile Injection for Enhanced Feedback Reaction in Vision-Language-Action Models
  AT-VLA proposes adaptive tactile injection and a dual-stream tactile reaction mechanism to enhance VLA models for contact-rich robotic manipulation with real-time responses.
- VLA-Corrector: Lightweight Detect-and-Correct Inference for Adaptive Action Horizon
  VLA-Corrector adds a detect-and-correct inference layer using a latent vision monitor and online gradient guidance to enable adaptive action horizons in chunked VLA policies.
- UniTacVLA: Unified Tactile Understanding and Prediction in Vision Language Action Models
  UniTacVLA builds a state-aware and dynamics-aware tactile prior via a unified latent space, tactile chain-of-thought, and a mixed real/predicted feedback controller to boost dexterous manipulation performance.
- Vesta: A Generalist Embodied Reasoning Model
  Vesta is a unified embodied generalist model that outperforms specialist baselines by over 20% on average and improves real-world robotic task success by over 35%.
- T-Rex: Tactile-Reactive Dexterous Manipulation
  T-Rex introduces a large tactile dataset and MoT architecture that achieves over 30% higher success rates than baselines on 12 tasks requiring force control and deformable object handling.
- HarmoWAM: Harmonizing Generalizable and Precise Manipulation via Adaptive World Action Models
  HarmoWAM unifies predictive and reactive control in world action models via an adaptive gating mechanism to deliver improved zero-shot generalization and precision in robotic manipulation.
- LaST-R1: Reinforcing Robotic Manipulation via Adaptive Physical Latent Reasoning
  LaST-R1 introduces an RL post-training method called LAPO that optimizes latent Chain-of-Thought reasoning in vision-language-action models, yielding 99.9% success on LIBERO and up to 22.5% real-world gains.
- Adaptive Action Chunking at Inference-time for Vision-Language-Action Models
  Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.
- TwinRL: Digital Twin-Driven Reinforcement Learning for Real-World Robotic Manipulation
  TwinRL expands RL exploration via digital twin reconstruction and twin RL warm-up to guide real-world learning, reaching near-100% success with 20 minutes of on-robot time across four tasks.
- Position: Vision-Language-Action Models Cannot Be Verified to Perform Physical Reasoning
  VLA benchmark success rates cannot distinguish semantic generalization from physical reasoning due to an identifiability gap in current evaluation protocols.
- Slow Brain, Fast Planner: Latency-Resilient VLM-Augmented Urban Navigation
  A training-free fusion layer enables stale VLM selections to improve a real-time planner's trajectory scoring for urban sidewalk navigation, yielding 30% ADE reduction in challenging scenarios.
- From Abstraction to Instantiation: Learning Behavioral Representation for Vision-Language-Action Model
  BehaviorVLA learns long-horizon behavioral representations via a causal Mamba encoder and a phase-conditioned decoder, reporting SOTA results of 58% on RoboTwin 2.0, 98% on LIBERO, and 4.36 on CALVIN, and matching OpenVLA-OFT performance with 50% of the data in sim-to-real transfer.