CAST: Counterfactual Labels Improve Instruction Following in Vision-Language-Action Models

Arjun Bhorkar; Catherine Glossop; Dhruv Shah; Sergey Levine; William Chen

arxiv: 2508.13446 · v2 · pith:UWXKYFBWnew · submitted 2025-08-19 · 💻 cs.RO

Catherine Glossop, William Chen, Arjun Bhorkar, Dhruv Shah, Sergey Levine

classification 💻 cs.RO

keywords datasets, language, robot, counterfactual, diversity, existing, follow, instructions

Generalist robots should be able to understand and follow user instructions. Despite providing a powerful architecture for mapping open-vocabulary language instructions to robot actions, current vision-language-action (VLA) models struggle to follow fine-grained commands. One cause for this is a lack of semantic diversity and language grounding in existing robot datasets and, specifically, a lack of fine-grained task diversity for similar observations. To address this, we present a novel method to augment existing robot datasets by leveraging vision-language models to create counterfactual labels. By augmenting existing datasets with these labels, we increase the diversity and granularity of language grounding for robot datasets, ultimately improving the language-following capabilities of VLAs. We evaluate the resulting model's ability to follow language instructions, ranging from simple object-centric commands to complex referential tasks, by conducting vision-language navigation experiments in 3 different indoor and outdoor environments. Our experiments show that counterfactual relabeling (without additional data collection) significantly improves instruction-following in VLA policies, outperforming state-of-the-art methods and doubling the success rate compared to VLAs trained on unaugmented data. We also evaluate our method for manipulation VLAs and find a similar gain in performance on tasks with distractors.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

NORM-Nav: Zero-Shot Mobile Robot Navigation with Natural Language Behavioral Constraints
cs.RO 2026-05 unverdicted novelty 6.0

NORM-Nav is a zero-shot framework that parses natural language behavioral constraints with an LLM, grounds them via vision-LiDAR, and encodes them as multi-layer costmaps for grid-based robot navigation.

Sentinel-VLA: A Metacognitive VLA Model with Active Status Monitoring for Dynamic Reasoning and Error Recovery
cs.RO 2026-05 unverdicted novelty 6.0

Sentinel-VLA introduces a metacognitive VLA model with a sentinel module for real-time status monitoring, dynamic reasoning, and error recovery, plus a self-evolving continual learning method, raising real-world task ...

Steerable Vision-Language-Action Policies for Embodied Reasoning and Hierarchical Control
cs.RO 2026-02 unverdicted novelty 6.0

Steerable VLAs trained on rich synthetic commands at subtask, motion, and pixel levels enable VLMs to steer robot behavior more effectively, outperforming prior hierarchical baselines on real-world manipulation and ge...

Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning
cs.RO 2026-02 unverdicted novelty 6.0

R&B-EnCoRe uses self-supervised importance-weighted variational inference to distill action-predictive reasoning datasets that improve VLA performance on manipulation, navigation, and driving tasks without external verifiers.