HAMLET: Switch your Vision-Language-Action Model into a History-Aware Policy

2025 · cs.RO · arXiv 2510.00695

8 Pith papers cite this work. Polarity classification is still indexing.

abstract

Inherently, robotic manipulation tasks are history-dependent: leveraging past context could be beneficial. However, most existing Vision-Language-Action models (VLAs) have been designed without considering this aspect, i.e., they rely solely on the current observation, ignoring preceding context. In this paper, we propose HAMLET, a scalable framework to adapt VLAs to attend to the historical context during action prediction. Specifically, we introduce moment tokens that compactly encode perceptual information at each timestep. Their representations are initialized with time-contrastive learning, allowing them to better capture temporally distinctive aspects. Next, we employ a lightweight memory module that integrates the moment tokens across past timesteps into memory features, which are then leveraged for action prediction. Through empirical evaluation, we show that HAMLET successfully transforms a state-of-the-art VLA into a history-aware policy, especially demonstrating significant improvements on long-horizon tasks that require historical context. In particular, on top of GR00T N1.5, HAMLET achieves an average success rate of 76.4% on history-dependent real-world tasks, surpassing the baseline performance by 47.2%. Furthermore, HAMLET pushes prior art performance from 64.1% to 66.4% on RoboCasa Kitchen (100-demo setup) and from 95.6% to 97.7% on LIBERO, highlighting its effectiveness even under generic robot-manipulation benchmarks.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control

cs.RO · 2026-05-21 · conditional · novelty 7.0

EvoScene-VLA maintains an action-updated scene prior across control chunks in VLA policies, raising success rates on RoboTwin tasks from 87.2% to 89.1% fixed and 86.1% to 88.5% randomized while outperforming baselines on a real robot.

DSSP: Diffusion State Space Policy with Full-History Encoding

cs.RO · 2026-05-14 · conditional · novelty 7.0

DSSP is a history-conditioned diffusion state space policy that uses SSMs to encode full observation streams with an auxiliary dynamics objective and hierarchical fusion, achieving SOTA results with reduced model size in robot manipulation.

Towards Generalizable Robotic Manipulation in Dynamic Environments

cs.CV · 2026-03-16 · unverdicted · novelty 7.0

DOMINO dataset and PUMA architecture enable better dynamic robotic manipulation by incorporating motion history, delivering 6.3% higher success rates than prior VLA models.

WarmPrior: Straightening Flow-Matching Policies with Temporal Priors

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Replacing Gaussian noise with a temporally grounded prior from recent actions straightens flow-matching paths and improves success rates in robotic manipulation and prior-space RL.

RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

cs.RO · 2026-05-11 · unverdicted · novelty 6.0

RoboMemArena is a new large-scale robotic memory benchmark with real-world tasks, and PrediMem is a dual VLA system that outperforms baselines by managing memory buffers with predictive coding.

Adaptive Action Chunking at Inference-time for Vision-Language-Action Models

cs.RO · 2026-04-05 · unverdicted · novelty 6.0

Adaptive Action Chunking uses action entropy to dynamically adjust chunk sizes in VLA models, improving performance on simulated and real robotic manipulation tasks.

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

cs.AI · 2026-01-22 · conditional · novelty 6.0

Single-stage fine-tuning of a video model to generate actions as latent frames plus future states and values yields state-of-the-art robot policy performance on LIBERO, RoboCasa, and bimanual tasks.

RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

cs.RO · 2026-03-04

citing papers explorer

Showing 8 of 8 citing papers.

EvoScene-VLA: Evolving Scene Beliefs Inside the Action Decoder for Chunked Robot Control
cs.RO · 2026-05-21 · conditional · none · ref 15 · internal anchor

DSSP: Diffusion State Space Policy with Full-History Encoding
cs.RO · 2026-05-14 · conditional · none · ref 30 · internal anchor

Towards Generalizable Robotic Manipulation in Dynamic Environments
cs.CV · 2026-03-16 · unverdicted · none · ref 26 · internal anchor

WarmPrior: Straightening Flow-Matching Policies with Temporal Priors
cs.LG · 2026-05-13 · unverdicted · none · ref 5 · internal anchor

RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark
cs.RO · 2026-05-11 · unverdicted · none · ref 11 · internal anchor

Adaptive Action Chunking at Inference-time for Vision-Language-Action Models
cs.RO · 2026-04-05 · unverdicted · none · ref 18 · internal anchor

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning
cs.AI · 2026-01-22 · conditional · none · ref 21 · internal anchor

RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies
cs.RO · 2026-03-04 · unreviewed · ref 24 · internal anchor
