Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong · 2025 · cs.RO · arXiv 2502.19417

Canonical reference. 100% of citing Pith papers cite this work as background.

47 Pith papers citing it

Background 100% of classified citations

abstract

Generalist robots that can perform a range of different tasks in open-world settings must be able to not only reason about the steps needed to accomplish their goals, but also process complex instructions, prompts, and even feedback during task execution. Intricate instructions (e.g., "Could you make me a vegetarian sandwich?" or "I don't like that one") require not just the ability to physically perform the individual steps, but the ability to situate complex commands and feedback in the physical world. In this work, we describe a system that uses vision-language models in a hierarchical structure, first reasoning over complex prompts and user feedback to deduce the most appropriate next step to fulfill the task, and then performing that step with low-level actions. In contrast to direct instruction following methods that can fulfill simple commands ("pick up the cup"), our system can reason through complex prompts and incorporate situated feedback during task execution ("that's not trash"). We evaluate our system across three robotic platforms, including single-arm, dual-arm, and dual-arm mobile robots, demonstrating its ability to handle tasks such as cleaning messy tables, making sandwiches, and grocery shopping. Videos are available at https://www.pi.website/research/hirobot
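The two-level structure described in the abstract can be illustrated with a toy control loop. This is only a sketch under stated assumptions, not the authors' implementation: both policies are rule-based stand-ins for the vision-language models, and every name here (`Observation`, `high_level_policy`, `low_level_policy`, `run_episode`, the `pick up the ...` command format) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Set, Tuple

PREFIX = "pick up the "  # hypothetical command format, for illustration only


@dataclass
class Observation:
    # Stand-in for camera images: names of objects currently in view.
    visible_objects: List[str]


def high_level_policy(obs: Observation, prompt: str, rejected: Set[str]) -> str:
    """Toy stand-in for the high-level model: choose the next simple command.

    A real system would condition a VLM on the images, the prompt, and the
    user's feedback; this stub ignores the prompt and only skips objects
    the user has rejected.
    """
    for obj in obs.visible_objects:
        if obj not in rejected:
            return PREFIX + obj
    return "done"


def low_level_policy(command: str) -> List[str]:
    """Toy stand-in for the low-level policy: expand a command into actions."""
    target = command[len(PREFIX):]
    return [f"reach({target})", f"grasp({target})", f"lift({target})"]


def run_episode(
    objects: List[str],
    prompt: str,
    user_feedback: Callable[[str], Optional[str]],
    max_steps: int = 10,
) -> List[Tuple[str, List[str]]]:
    """Alternate between high-level re-planning and low-level execution."""
    obs = Observation(list(objects))
    rejected: Set[str] = set()
    log: List[Tuple[str, List[str]]] = []
    for _ in range(max_steps):
        command = high_level_policy(obs, prompt, rejected)
        if command == "done":
            break
        target = command[len(PREFIX):]
        # Situated feedback (e.g. "that's not trash") triggers re-planning
        # before the command is executed.
        if user_feedback(command) is not None:
            rejected.add(target)
            continue
        log.append((command, low_level_policy(command)))
        obs.visible_objects.remove(target)
    return log


def feedback(command: str) -> Optional[str]:
    # Hypothetical user who objects whenever the robot targets the keys.
    return "that's not trash" if "keys" in command else None


episode = run_episode(["wrapper", "keys", "cup"], "clean the table", feedback)
for command, actions in episode:
    print(command, "->", actions)
```

In this toy run the keys are skipped after the user's objection, while the wrapper and cup are handled in order. The point of the split is that feedback only has to change the high-level command; the low-level policy stays a simple command-to-action mapping.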

citation-role summary

background 9

citation-polarity summary

background 9

representative citing papers

Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots

cs.RO · 2026-07-02 · unverdicted · novelty 7.0

Embodied.cpp introduces a portable C++ inference runtime with modular layers for deploying VLA and WAM models on heterogeneous robots, reporting 100% and 91% task success on two models plus memory reduction on a WAM benchmark.

Adapting Generalist Robot Policies with Semantic Reinforcement Learning

cs.RO · 2026-06-30 · unverdicted · novelty 7.0

SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.

VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation

cs.RO · 2026-06-05 · unverdicted · novelty 7.0

VoLoAgent uses a VLM to steer heterogeneous robot capabilities as interruptible tools for long-horizon manipulation and introduces the RoboVoLo benchmark, claiming substantial outperformance over single VLA/VLM or tool-based systems with real-robot validation.

AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation

cs.RO · 2026-05-21 · unverdicted · novelty 7.0

AwareVLN introduces a structural reasoning module and automatic data engine with progress division to equip VLN agents with self-awareness of agent state and task progress, outperforming prior methods on Habitat datasets.

GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations

cs.RO · 2026-05-21 · unverdicted · novelty 7.0

GesVLA encodes gesture features directly into the latent space of VLA models using a dual-VLM architecture and a rendering-based data pipeline, yielding improved target grounding in real robotic tasks.

RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation

cs.RO · 2026-05-17 · unverdicted · novelty 7.0

RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.

Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment

cs.RO · 2026-04-27 · unverdicted · novelty 7.0

VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with marginal task degradation.

π₀.₇: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

QuadAgent: A Responsive Agent System for Vision-Language Guided Quadrotor Agile Flight

cs.RO · 2026-04-03 · unverdicted · novelty 7.0

QuadAgent uses an asynchronous multi-agent architecture with an Impression Graph for scene memory and vision-based avoidance to enable training-free vision-language guided agile quadrotor flight, outperforming baselines in simulations and achieving real-world speeds up to 5 m/s.

VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

cs.RO · 2026-03-23 · unverdicted · novelty 7.0

VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.

UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

cs.RO · 2026-02-23 · unverdicted · novelty 7.0

UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.

LA4VLA: Learning to Act without Seeing via Language-Action Pretraining

cs.RO · 2026-06-25 · unverdicted · novelty 6.0

LA4VLA creates a 33K language-action dataset from existing demos and shows that pretraining on language-action pairs before or alongside vision-language-action training boosts success rates in sim and real robot tasks.

Geometric Action Model for Robot Policy Learning

cs.RO · 2026-06-15 · unverdicted · novelty 6.0

GAM splits a geometric foundation model to enable language-conditioned future geometry prediction and action decoding for robot policies, claiming superior performance on manipulation benchmarks.

DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners?

cs.RO · 2026-06-10 · unverdicted · novelty 6.0

DIRECT is a multimodal-context router that allocates test-time compute across chain-of-thought depth, model size, and memory history for VLM embodied planners, improving the success-cost Pareto frontier and matching stronger models at up to 65% lower latency on benchmarks and a physical Franka arm.

Learning to Assist: Collaborative VLAs for Implicit Human-Robot Collaboration

cs.RO · 2026-06-10 · conditional · novelty 6.0

VLA models with inference-time steering mitigate action leakage in implicit human-robot collaboration, supporting longer horizons and yielding faster, more reliable assembly than shorter-horizon baselines in a 16-person study.

What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents

cs.RO · 2026-06-09 · unverdicted · novelty 6.0

A systematic study of hierarchical VLA agents identifies design principles that improve robot manipulation performance over flat and naive hierarchical baselines in simulation and real-world experiments.

LARA: Latent Action Representation Alignment for Vision-Language-Action Models

cs.CV · 2026-06-05 · unverdicted · novelty 6.0 · 2 refs

LARA jointly optimizes LAM and VLA models via representation alignment to improve robotic manipulation performance using human videos.

AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

cs.RO · 2026-06-04 · unverdicted · novelty 6.0

AffordanceVLA proposes a VLA model with affordance-aware modules (Which2Act, Where2Act, How2Act) in a Mixture-of-Transformer trained in three stages to improve robotic manipulation.

Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation

cs.RO · 2026-06-02 · unverdicted · novelty 6.0

ERVLA trains on a 978k-trajectory embodied CoT corpus using reasoning as supervision with dropout, then predicts actions without CoT at test time, reaching 86.9% on LIBERO-Plus and 53.2% on VLABench.

See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs

cs.RO · 2026-06-01 · unverdicted · novelty 6.0

S2 improves generalization in vision-language-action models by using goal-preserving refined language guidance and explicit visual evidence budgets, raising mean subtask success from 54.2% to 79.0% on eight real-robot tasks compared to π₀.₅.

Action with Visual Primitives

cs.RO · 2026-05-21 · unverdicted · novelty 6.0

The AVP architecture has a VLM emit visual-primitive tokens to condition a flow-matching action expert, yielding a 27.61% higher success rate than π₀.₅ on real-robot pick-and-place tasks.

Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models

cs.RO · 2026-05-13 · unverdicted · novelty 6.0

VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.

G-Zero: Self-Play for Open-Ended Generation from Zero Data

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

G-Zero uses the Hint-δ intrinsic reward to drive co-evolution between a Proposer and Generator via GRPO and DPO, providing a theoretical suboptimality guarantee for self-improvement from internal dynamics alone.

SEIF: Self-Evolving Reinforcement Learning for Instruction Following

cs.CL · 2026-05-08 · conditional · novelty 6.0

SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.

citing papers explorer

Showing 36 of 36 citing papers after filters.

Embodied.cpp: A Portable Inference Runtime of Embodied AI Models on Heterogeneous Robots cs.RO · 2026-07-02 · unverdicted · none · ref 20 · internal anchor
Embodied.cpp introduces a portable C++ inference runtime with modular layers for deploying VLA and WAM models on heterogeneous robots, reporting 100% and 91% task success on two models plus memory reduction on a WAM benchmark.
Adapting Generalist Robot Policies with Semantic Reinforcement Learning cs.RO · 2026-06-30 · unverdicted · none · ref 52 · internal anchor
SARL optimizes language prompt inputs to generalist vision-language-action policies through online RL to solve complex long-horizon tasks by composing existing skills.
VoLo: A Physical Orchestrator for Open-Vocabulary Long-Horizon Manipulation cs.RO · 2026-06-05 · unverdicted · none · ref 63 · internal anchor
VoLoAgent uses a VLM to steer heterogeneous robot capabilities as interruptible tools for long-horizon manipulation and introduces the RoboVoLo benchmark, claiming substantial outperformance over single VLA/VLM or tool-based systems with real-robot validation.
AwareVLN: Reasoning with Self-awareness for Vision-Language Navigation cs.RO · 2026-05-21 · unverdicted · none · ref 36 · internal anchor
AwareVLN introduces a structural reasoning module and automatic data engine with progress division to equip VLN agents with self-awareness of agent state and task progress, outperforming prior methods on Habitat datasets.
GesVLA: Gesture-Aware Vision-Language-Action Model Embedded Representations cs.RO · 2026-05-21 · unverdicted · none · ref 15 · internal anchor
GesVLA encodes gesture features directly into the latent space of VLA models using a dual-VLM architecture and a rendering-based data pipeline, yielding improved target grounding in real robotic tasks.
RoboFlow4D: A Lightweight Flow World Model Toward Real-Time Flow-Guided Robotic Manipulation cs.RO · 2026-05-17 · unverdicted · none · ref 24 · internal anchor
RoboFlow4D is an end-to-end lightweight flow world model that predicts multi-frame 3D flows from visual observations and textual instructions to provide explicit planning for real-time robotic manipulation.
Characterizing Vision-Language-Action Models across XPUs: Constraints and Acceleration for On-Robot Deployment cs.RO · 2026-04-27 · unverdicted · none · ref 18 · internal anchor
VLA models exhibit a compute-bound VLM phase followed by a memory-bound action phase on edge hardware; DP-Cache and V-AEFusion reduce redundancy and enable pipeline parallelism for up to 6x speedup on NPUs with marginal task degradation.
QuadAgent: A Responsive Agent System for Vision-Language Guided Quadrotor Agile Flight cs.RO · 2026-04-03 · unverdicted · none · ref 24 · internal anchor
QuadAgent uses an asynchronous multi-agent architecture with an Impression Graph for scene memory and vision-based avoidance to enable training-free vision-language guided agile quadrotor flight, outperforming baselines in simulations and achieving real-world speeds up to 5 m/s.
VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models cs.RO · 2026-03-23 · unverdicted · none · ref 33 · internal anchor
VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.
UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models cs.RO · 2026-02-23 · unverdicted · none · ref 31 · internal anchor
UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.
LA4VLA: Learning to Act without Seeing via Language-Action Pretraining cs.RO · 2026-06-25 · unverdicted · none · ref 33 · internal anchor
LA4VLA creates a 33K language-action dataset from existing demos and shows that pretraining on language-action pairs before or alongside vision-language-action training boosts success rates in sim and real robot tasks.
Geometric Action Model for Robot Policy Learning cs.RO · 2026-06-15 · unverdicted · none · ref 24 · internal anchor
GAM splits a geometric foundation model to enable language-conditioned future geometry prediction and action decoding for robot policies, claiming superior performance on manipulation benchmarks.
DIRECT: When and Where Should You Allocate Test-Time Compute in Embodied Planners? cs.RO · 2026-06-10 · unverdicted · none · ref 4 · internal anchor
DIRECT is a multimodal-context router that allocates test-time compute across chain-of-thought depth, model size, and memory history for VLM embodied planners, improving the success-cost Pareto frontier and matching stronger models at up to 65% lower latency on benchmarks and a physical Franka arm.
Learning to Assist: Collaborative VLAs for Implicit Human-Robot Collaboration cs.RO · 2026-06-10 · conditional · none · ref 19 · internal anchor
VLA models with inference-time steering mitigate action leakage in implicit human-robot collaboration, supporting longer horizons and yielding faster, more reliable assembly than shorter-horizon baselines in a 16-person study.
What Matters in Orchestrating Robot Policies: A Systematic Study of Hierarchical VLA Agents cs.RO · 2026-06-09 · unverdicted · none · ref 11 · internal anchor
A systematic study of hierarchical VLA agents identifies design principles that improve robot manipulation performance over flat and naive hierarchical baselines in simulation and real-world experiments.
AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding cs.RO · 2026-06-04 · unverdicted · none · ref 56 · internal anchor
AffordanceVLA proposes a VLA model with affordance-aware modules (Which2Act, Where2Act, How2Act) in a Mixture-of-Transformer trained in three stages to improve robotic manipulation.
Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation cs.RO · 2026-06-02 · unverdicted · none · ref 44 · internal anchor
ERVLA trains on a 978k-trajectory embodied CoT corpus using reasoning as supervision with dropout, then predicts actions without CoT at test time, reaching 86.9% on LIBERO-Plus and 53.2% on VLABench.
See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs cs.RO · 2026-06-01 · unverdicted · none · ref 6 · internal anchor
S2 improves generalization in vision-language-action models by using goal-preserving refined language guidance and explicit visual evidence budgets, raising mean subtask success from 54.2% to 79.0% on eight real-robot tasks compared to π₀.₅.
Action with Visual Primitives cs.RO · 2026-05-21 · unverdicted · none · ref 15 · internal anchor
The AVP architecture has a VLM emit visual-primitive tokens to condition a flow-matching action expert, yielding a 27.61% higher success rate than π₀.₅ on real-robot pick-and-place tasks.
Towards Long-horizon Embodied Agents with Tool-Aligned Vision-Language-Action Models cs.RO · 2026-05-13 · unverdicted · none · ref 27 · internal anchor
VLAs-as-Tools pairs a VLM planner with specialized VLA executors via a new interface and Tool-Aligned Post-Training to raise long-horizon robot success rates on LIBERO-Long and RoboTwin benchmarks.
ExpressMM: Expressive Mobile Manipulation Behaviors in Human-Robot Interactions cs.RO · 2026-04-07 · unverdicted · none · ref 12 · internal anchor
ExpressMM integrates high-level language-guided planning with low-level vision-language-action policies to enable expressive and interruptible mobile manipulation behaviors in human-robot collaboration, shown effective in an assembly task via audience evaluations.
ThermoAct: Thermal-Aware Vision-Language-Action Models for Robotic Perception and Decision-Making cs.RO · 2026-03-26 · unverdicted · none · ref 18 · internal anchor
ThermoAct integrates thermal imaging into VLA models via a VLM planner to enable robots to perceive physical properties like heat and improve safety over vision-only systems.
OxyGen: Unified KV Cache Management for VLA Inference under Multi-Task Parallelism cs.RO · 2026-03-15 · unverdicted · none · ref 35 · internal anchor
OxyGen unifies KV cache management in MoT VLAs to enable cross-task KV sharing and cross-frame continuous batching, delivering up to 3.7x speedup with 200+ tokens/s language and 70 Hz action on on-device platforms.
World Action Models are Zero-shot Policies cs.RO · 2026-02-17 · unverdicted · none · ref 74 · internal anchor
DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.
Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control cs.RO · 2026-02-13 · unverdicted · none · ref 2 · internal anchor
Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and generalization tasks.
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy cs.RO · 2025-10-15 · unverdicted · none · ref 34 · internal anchor
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
Ctrl-World: A Controllable Generative World Model for Robot Manipulation cs.RO · 2025-10-11 · unverdicted · none · ref 39 · internal anchor
A controllable world model trained on the DROID dataset generates consistent multi-view robot trajectories for over 20 seconds and improves generalist policy success rates by 44.7% via imagined trajectory fine-tuning.
Wall-OSS-0.5 Technical Report cs.RO · 2026-05-29 · unverdicted · none · ref 83 · internal anchor
Wall-OSS-0.5 is a 4B VLA model pretrained across many embodiments that achieves zero-shot real-robot performance on a 17-task suite and outperforms π₀.₅ after fine-tuning.
Extending Embodied Question Answering from Perception to Decision cs.RO · 2026-05-25 · unverdicted · none · ref 46 · internal anchor
Introduces EQA-Decision dataset with 4M+ QA pairs across four embodied reasoning dimensions and RoboDecision baseline for joint perception-reasoning-decision evaluation.
DyGRO-VLA: Cross-Task Scaling of Vision-Language-Action Models via Dynamic Grouped Residual Optimization cs.RO · 2026-05-17 · unverdicted · none · ref 39 · internal anchor
DyGRO-VLA is a two-stage optimization framework for cross-task scaling of Vision-Language-Action models via dynamic grouped residual optimization in RL.
RoboAgent: Chaining Basic Capabilities for Embodied Task Planning cs.RO · 2026-04-09 · unverdicted · none · ref 89 · internal anchor
RoboAgent chains basic vision-language capabilities inside a single VLM via a scheduler and trains it in three stages (behavior cloning, DAgger, RL) to improve embodied task planning.
A Survey on Vision-Language-Action Models: An Action Tokenization Perspective cs.RO · 2025-07-02 · unverdicted · none · ref 25 · internal anchor
The survey frames VLA models as pipelines that generate progressively grounded action tokens and classifies those tokens into eight types to guide future development.
SafeVLA: Towards Safety Alignment of Vision-Language-Action Model via Constrained Learning cs.RO · 2025-03-05 · unverdicted · none · ref 47 · internal anchor
SafeVLA applies constrained reinforcement learning via CMDP min-max optimization to VLAs, cutting safety violation costs by 83.58% while preserving task success on long-horizon mobile manipulation tasks.
EA-WM: Event-Aware World Models with Task-Specification Grounding for Long-Horizon Manipulation cs.RO · 2026-06-11 · unverdicted · none · ref 13 · internal anchor
EA-WM adds task-specification-grounded event prediction and verification to frozen visual-feature world models for improved long-horizon robot manipulation planning.
JOIN: Anchor-Grasp-Conditioned Joining via Opposition, Inference, and Navigation for Bimanual Assistive Manipulation cs.RO · 2026-06-09 · unverdicted · none · ref 24 · internal anchor
JOIN decomposes bimanual joining into plan-drive-grasp phases and uses a VLM to let a mobile manipulator complete tasks with a pre-grasped anchor arm, achieving 19/20 success versus 14/20 for baselines on representative ADLs.
AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement cs.RO · 2026-04-12 · unreviewed · ref 28 · internal anchor
