Contrastive Representation Regularization for Vision-Language-Action Models

2025 · cs.RO · arXiv 2510.01711

3 Pith papers cite this work. Polarity classification is still indexing.

abstract

Vision-Language-Action (VLA) models have shown strong capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive information. To address this issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for VLA models, designed to bridge the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states by using relative distances between the states as soft supervision. Complementing the original action prediction objective, RS-CL enhances control-relevant representation learning while remaining lightweight and fully compatible with standard VLA training pipelines. Our empirical results demonstrate that RS-CL substantially improves the performance of state-of-the-art VLA models: it raises the prior art to 69.7%, achieving state-of-the-art performance on the RoboCasa-Kitchen benchmark, and boosts success rates from 45.0% to 58.3% on challenging real-robot manipulation tasks.
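The idea of using relative distances between proprioceptive states as soft supervision can be illustrated with a small soft-label contrastive loss. This is a minimal sketch under stated assumptions, not the paper's formulation: the function name `soft_state_contrastive_loss`, the cosine similarity, the Euclidean state distance, and the temperatures `tau` and `beta` are all hypothetical choices made here for illustration.

```python
import math


def _softmax(values):
    # Numerically stable softmax over a list of floats.
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def _cosine(a, b):
    # Cosine similarity between two vectors given as lists.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)


def _l2(a, b):
    # Euclidean distance between two vectors given as lists.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def soft_state_contrastive_loss(embeddings, states, tau=0.1, beta=1.0):
    """Cross-entropy between state-derived soft targets and embedding similarities.

    Assumption (not from the source): for each anchor i, the target weight on
    sample j is softmax_j(-||s_i - s_j|| / beta), and the predicted weight is
    softmax_j(cos(z_i, z_j) / tau), both taken over j != i.
    """
    n = len(embeddings)
    if n < 2 or n != len(states):
        raise ValueError("need at least two samples and matching lengths")
    total = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        # Soft targets: nearby robot states receive more weight.
        targets = _softmax([-_l2(states[i], states[j]) / beta for j in others])
        # Predicted distribution from representation similarity.
        probs = _softmax(
            [_cosine(embeddings[i], embeddings[j]) / tau for j in others]
        )
        total += -sum(t * math.log(p + 1e-12) for t, p in zip(targets, probs))
    return total / n


# Toy batch: two clusters of 1-D "proprioceptive states".
states = [[0.0], [0.1], [5.0], [5.1]]
aligned = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0], [0.1, 0.99]]
misaligned = [[1.0, 0.0], [0.0, 1.0], [0.99, 0.1], [0.1, 0.99]]

loss_aligned = soft_state_contrastive_loss(aligned, states)
loss_misaligned = soft_state_contrastive_loss(misaligned, states)
```

In this toy batch, embeddings whose neighborhood structure matches the state distances yield a lower loss than embeddings with two samples swapped. In a training pipeline, a term of this kind would presumably be added to the action prediction objective with some weighting, which the abstract does not specify.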

representative citing papers

Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning

cs.CV · 2026-05-28 · unverdicted · novelty 6.0

Inverse dynamics prediction is added as an auxiliary task to reduce state aliasing in VLA models by directly supervising the vision encoder on action-relevant visual distinctions using only standard observation-action pairs.

QuoVLA: Quotient Space for Vision-Language-Action Models

cs.CV · 2026-05-24 · unverdicted · novelty 5.0

QuoVLA introduces a quotient-space framework that compresses VLM latents into action-sufficient representations via quantization and dual-branch design for better VLA generalization.

Understanding the Impact of Geometric Foundation Models on Vision-Language-Action Models

cs.CV · 2026-05-23 · unverdicted · novelty 3.0

The paper quantifies the geometric gap in current VLAs via linear probing and compares three architectures for injecting geometry from GFMs while analyzing impacts of data, cameras, and reconstruction quality.

citing papers explorer

Mitigating State Aliasing in Vision-Language-Action Models via Inverse Dynamics Learning cs.CV · 2026-05-28 · unverdicted · none · ref 45 · internal anchor
QuoVLA: Quotient Space for Vision-Language-Action Models cs.CV · 2026-05-24 · unverdicted · none · ref 14 · internal anchor
Understanding the Impact of Geometric Foundation Models on Vision-Language-Action Models cs.CV · 2026-05-23 · unverdicted · none · ref 26 · internal anchor
