UAOR: Uncertainty-aware Observation Reinjection for Vision-Language-Action Models

Bowen Fang; Jiabing Yang; Jing Liu; Kai Wang; Liang Wang; Nianfeng Liu; Peiyan Li; Qisen Ma; Tao Yu; Xiangnan Wu

arXiv: 2602.18020 · v2 · pith:SYOY6FNA · submitted 2026-02-20 · cs.CV · cs.RO

Jiabing Yang, Yixiang Chen, Yuan Xu, Peiyan Li, Zichen Wen, Bowen Fang, Tao Yu, Xiangnan Wu, Qisen Ma, Kai Wang, Ziheng He, Yingda Li, Zhengbo Zhang, Jing Liu, Nianfeng Liu, Yan Huang, Liang Wang

Classification: cs.CV, cs.RO

Keywords: models, observation, UAOR, action, additional, cues, existing, feed-forward

Abstract

Vision-Language-Action (VLA) models leverage pretrained Vision-Language Models (VLMs) as backbones to map images and instructions to actions, demonstrating remarkable potential for generalizable robotic manipulation. To enhance performance, existing methods often incorporate extra observation cues (e.g., depth maps, point clouds) or auxiliary modules (e.g., object detectors, encoders) to enable more precise and reliable task execution, yet these typically require costly data collection and additional training. Inspired by the finding that Feed-Forward Networks (FFNs) in language models can act as "key-value memory", we propose Uncertainty-aware Observation Reinjection (UAOR), an effective, training-free, and plug-and-play module for VLA models. Specifically, when the current language model layer exhibits high uncertainty, measured by Action Entropy, UAOR reinjects key observation information into the next layer's FFN through attention retrieval. This mechanism directly augments the hidden states with observation evidence at high-uncertainty layers, enabling more accurate and reliable action generation. Comprehensive experiments show that our method consistently improves diverse VLA models across simulation and real-world tasks with minimal overhead. Notably, UAOR eliminates the need for additional observation cues or modules, making it a versatile and practical plug-in for existing VLA pipelines. The project page is at https://uaor.jiabingyang.cn.
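
The abstract describes the mechanism only at a high level. It names an entropy gate per layer, followed by attention-based retrieval of observation features into the next layer's FFN, but it gives no formulas. The following NumPy sketch is one possible reading and is not the authors' implementation.

All of its specifics are assumptions made for illustration:

- Action Entropy is taken as the Shannon entropy of the action-token distribution, obtained by projecting a layer's hidden state through the LM head.
- The FFN is a plain two-matrix ReLU block, read as a key-value memory. Observation token states are appended as extra keys and values.
- The threshold `tau`, the mixing weight `alpha`, and the functions `action_entropy`, `ffn_with_reinjection`, and `uaor_forward` are hypothetical names.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the max before exponentiating for numerical stability.
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)


def action_entropy(hidden, lm_head, action_ids):
    """ASSUMED uncertainty score: Shannon entropy (in nats) of the distribution
    over action tokens, read out from one layer's hidden state via the LM head."""
    logits = hidden @ lm_head[:, action_ids]  # shape (num_action_tokens,)
    p = softmax(logits)
    return float(-np.sum(p * np.log(p + 1e-12)))


def ffn_with_reinjection(h, w_in, w_out, obs, alpha=0.5, reinject=False):
    """FFN viewed as a key-value memory: rows of w_in are keys, rows of w_out
    are values. If reinject is True, observation token states `obs` act as
    extra keys/values retrieved by scaled dot-product attention (HYPOTHETICAL)."""
    coeffs = np.maximum(w_in @ h, 0.0)  # ReLU key-match scores, shape (m,)
    out = coeffs @ w_out                # weighted sum of value vectors, shape (d,)
    if reinject:
        scores = obs @ h / np.sqrt(h.shape[-1])  # similarity to each observation token
        out = out + alpha * (softmax(scores) @ obs)
    return out


def uaor_forward(h, layers, lm_head, action_ids, obs, tau=1.0, alpha=0.5):
    """Toy pass over FFN-only residual blocks (self-attention and norms omitted).
    Returns the final hidden state and, per layer, whether reinjection fired."""
    reinject_next = False
    fired = []
    for w_in, w_out in layers:
        fired.append(reinject_next)
        h = h + ffn_with_reinjection(h, w_in, w_out, obs, alpha, reinject_next)
        # High entropy at this layer triggers reinjection in the NEXT layer's FFN.
        reinject_next = action_entropy(h, lm_head, action_ids) > tau
    return h, fired


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, m, n, vocab = 8, 16, 5, 32
    layers = [(0.1 * rng.normal(size=(m, d)), 0.1 * rng.normal(size=(m, d)))
              for _ in range(4)]
    lm_head = rng.normal(size=(d, vocab))
    obs = rng.normal(size=(n, d))         # stand-in for observation token states
    action_ids = list(range(24, 32))      # hypothetical action-token slice
    h_final, fired = uaor_forward(rng.normal(size=d), layers, lm_head,
                                  action_ids, obs, tau=1.5)
    print("reinjection per layer:", fired)
```

Gating on the previous layer's entropy keeps the extra attention read off the common path. Layers whose action readout is already confident run the unmodified FFN, which fits the abstract's claim of minimal overhead. A real VLA backbone would also need self-attention, normalization, and the model's actual action-token vocabulary, all of which are omitted here.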

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OmniCoT: A Benchmark for Global and Multi-Step Panoramic Reasoning
cs.CV 2026-06 unverdicted novelty 7.0

OmniCoT is a new panoramic reasoning benchmark with 6.7K eval, 1K real, and 14.3K training examples plus a two-stage SFT+GRPO training method to enforce global 360-degree consistency.
ProbeAct: Probe-Guided Training-Free Failure Recovery in Vision-Language-Action Models
cs.RO 2026-06 unverdicted novelty 7.0

PROBEACT is a plug-and-play intervention framework that combines hidden-state probing, kinematic failure detection, and CBF-based correction to boost success rates of pre-trained VLA models on the LIBERO-plus benchmar...
E-TTS: A New Embodied Test-Time Scaling Framework for Robotic Manipulation
cs.RO 2026-06 unverdicted novelty 6.0

E-TTS introduces a plug-and-play test-time scaling method for embodied tasks that unifies reasoning-action sampling with history buffers and closed-loop refinement to improve performance on manipulation benchmarks.
Open-Loop Planning, Closed-Loop Verification: Speculative Verification for VLA
cs.RO 2026-04 unverdicted novelty 6.0

SV-VLA uses infrequent heavy VLA planning of action chunks plus a lightweight closed-loop verifier to achieve both efficiency and robustness in dynamic robot control.