hub

Hamster: Hierarchical action models for open-world robot manipulation

· 2025 · arXiv 2502.05485

14 Pith papers cite this work. Polarity classification is still indexing.

14 Pith papers citing it

read on arXiv browse 14 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 1 method 1

citation-polarity summary

background 1 use method 1

representative citing papers

Spatially Prompted Visual Trajectory Prediction for Egocentric Manipulation

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

The paper introduces SP-VTP as a new setting for egocentric manipulation, releases the EgoSPT dataset with first-frame spatial annotations, and proposes the SPOT model that outperforms non-prompted baselines on cross-scene trajectory prediction.

VP-VLA: Visual Prompting as an Interface for Vision-Language-Action Models

cs.RO · 2026-03-23 · unverdicted · novelty 7.0

VP-VLA decouples high-level reasoning from low-level control in VLA models by rendering spatial anchors as visual prompts directly in the RGB observation space, outperforming end-to-end baselines.

UniLACT: Depth-Aware RGB Latent Action Learning for Vision-Language-Action Models

cs.RO · 2026-02-23 · unverdicted · novelty 7.0

UniLACT improves VLA models by adding depth-aware unified latent action pretraining that outperforms RGB-only baselines on seen and unseen manipulation tasks.

World Action Models are Zero-shot Policies

cs.RO · 2026-02-17 · unverdicted · novelty 6.0

DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment transfer with 10-30 minutes of data.

Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control

cs.RO · 2026-02-13 · unverdicted · novelty 6.0

Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and generalization tasks.

LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation

cs.RO · 2025-11-04 · unverdicted · novelty 6.0

LACY is a VLM framework jointly trained on L2A, A2L, and L2C tasks that uses an active augmentation cycle to self-improve robotic manipulation policies, reporting a 56.46% average success rate gain in simulation and real-world experiments.

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

cs.RO · 2025-10-15 · unverdicted · novelty 6.0

InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

cs.LG · 2025-04-22 · unverdicted · novelty 6.0

π_{0.5} is a VLA model that achieves long-horizon dexterous manipulation in entirely new homes through co-training on heterogeneous tasks and multi-source data including web and semantic predictions.

Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

cs.RO · 2025-02-26 · unverdicted · novelty 6.0

A hierarchical VLA architecture lets robots follow complex instructions and situated feedback by separating high-level reasoning from low-level control.

HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

cs.CV · 2026-04-15 · unverdicted · novelty 5.0 · 2 refs

HiVLA decouples VLM-based semantic planning with visual grounding from a cascaded cross-attention DiT action expert, outperforming end-to-end VLAs on long-horizon and fine-grained manipulation.

ROSClaw: A Hierarchical Semantic-Physical Framework for Heterogeneous Multi-Agent Collaboration

cs.RO · 2026-04-06 · unverdicted · novelty 5.0

ROSClaw is a hierarchical framework that unifies vision-language model control with e-URDF-based sim-to-real mapping and closed-loop data collection to enable semantic-physical collaboration among heterogeneous multi-agent robots.

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

cs.CV · 2025-07-22 · unverdicted · novelty 5.0

ThinkAct introduces reinforced visual latent planning in a dual VLA system to enable better long-horizon reasoning and adaptation for embodied tasks.

GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning

cs.CV · 2026-06-16 · unverdicted · novelty 4.0

GeneralVLA-2 introduces GeoFuse-MV3D for improved multi-view 3D reconstruction and a governed memory system, demonstrating modest gains on 3D object and task benchmarks.

AnySlot: Goal-Conditioned Vision-Language-Action Policies for Zero-Shot Slot-Level Placement

cs.RO · 2026-04-12

citing papers explorer

Showing 5 of 5 citing papers after filters.

LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation cs.RO · 2025-11-04 · unverdicted · none · ref 27
LACY is a VLM framework jointly trained on L2A, A2L, and L2C tasks that uses an active augmentation cycle to self-improve robotic manipulation policies, reporting a 56.46% average success rate gain in simulation and real-world experiments.
InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy cs.RO · 2025-10-15 · unverdicted · none · ref 21
InternVLA-M1 uses spatially guided pre-training on 2.3M examples followed by action post-training to deliver up to 17% gains on robot manipulation benchmarks and 20.6% on unseen objects.
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization cs.LG · 2025-04-22 · unverdicted · none · ref 47
π_{0.5} is a VLA model that achieves long-horizon dexterous manipulation in entirely new homes through co-training on heterogeneous tasks and multi-source data including web and semantic predictions.
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models cs.RO · 2025-02-26 · unverdicted · none · ref 20
A hierarchical VLA architecture lets robots follow complex instructions and situated feedback by separating high-level reasoning from low-level control.
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning cs.CV · 2025-07-22 · unverdicted · none · ref 21
ThinkAct introduces reinforced visual latent planning in a dual VLA system to enable better long-horizon reasoning and adaptation for embodied tasks.

Hamster: Hierarchical action models for open-world robot manipulation

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer