HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
2 Pith papers cite this work. Polarity classification is still indexing.
abstract
While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert in the low-level part, equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
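The data flow described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `SubtaskPlan`, `crop`, `cross_attend`, and `cascaded_cross_attention` are hypothetical names, and the attention shown is a plain single-head dot-product version with no learned projections and no flow-matching objective. It only shows a structured plan (instruction plus bounding box), the object-centric crop derived from it, and the fusion order named in the abstract: global context, then crops, then skill semantics.

```python
import math
from dataclasses import dataclass


@dataclass
class SubtaskPlan:
    """Hypothetical container for one step of the planner's structured output."""
    instruction: str  # subtask instruction in natural language
    bbox: tuple       # target box as (x0, y0, x1, y1) in pixels (assumed layout)


def crop(image, bbox):
    """Cut the object-centric region out of an image stored as rows of pixels."""
    x0, y0, x1, y1 = bbox
    return [row[x0:x1] for row in image[y0:y1]]


def cross_attend(queries, context):
    """Single-head dot-product cross-attention with a residual connection.

    Simplifying assumption: no learned query/key/value projections.
    """
    dim = len(queries[0])
    fused = []
    for q in queries:
        # Scaled similarity between this query and every context token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        peak = max(scores)  # subtract the maximum for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        # Weighted average of the context tokens.
        mixed = [sum(w * k[i] for w, k in zip(weights, context)) / total
                 for i in range(dim)]
        # Residual: add the attended context to the original query.
        fused.append([qi + mi for qi, mi in zip(q, mixed)])
    return fused


def cascaded_cross_attention(action_tokens, global_ctx, crop_ctx, skill_ctx):
    """Fuse the three context sources one after another, in the stated order."""
    tokens = action_tokens
    for context in (global_ctx, crop_ctx, skill_ctx):
        tokens = cross_attend(tokens, context)
    return tokens


# Usage with toy values (all numbers are illustrative).
plan = SubtaskPlan("pick up the red block", (1, 0, 3, 2))
image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
patch = crop(image, plan.bbox)  # rows 0-1, columns 1-2 of the toy image
```

Because each stage attends to only one context source, the stages can be reordered or replaced independently, which is one way to read the abstract's claim that both components can be improved separately.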
fields
cs.RO 2
years
2026 2
verdicts
UNVERDICTED 2
representative citing papers
AHA-WAM is a dual-DiT asynchronous world-action model with horizon-adaptive offset training and OVCR routing that reports 92.8% success on RoboTwin and 78.3% on real tasks at 24.17 Hz without robot pretraining.
citing papers explorer
- VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation
VoLoAgent uses a VLM to steer heterogeneous robot capabilities as interruptible tools for long-horizon manipulation and introduces the RoboVoLo benchmark, claiming substantial outperformance over single VLA/VLM or tool-based systems with real-robot validation.
- AHA-WAM: Asynchronous Horizon-Adaptive World-Action Modeling with Observation-Guided Context Routing
AHA-WAM is a dual-DiT asynchronous world-action model with horizon-adaptive offset training and OVCR routing that reports 92.8% success on RoboTwin and 78.3% on real tasks at 24.17 Hz without robot pretraining.