Sparse Autoencoders Reveal Interpretable and Steerable Features in VLA Models

Aiden Swann; Hugo Buurmeijer; Lachlain McGranahan; Mac Schwager; Monroe Kennedy III

arxiv: 2603.19183 · v2 · pith:X6WU3FC3new · submitted 2026-03-19 · 💻 cs.RO

classification 💻 cs.RO

keywords features, across, general, model, sparse, activations, autoencoders, directions

Vision-Language-Action (VLA) models have emerged as a promising approach for general-purpose robot manipulation. However, little research has mechanistically explored when and why they generalize across objects, scenes, and instructions. To probe internal representations, we train Sparse Autoencoders (SAEs) on the VLA's hidden-layer activations. SAEs learn sparse dictionaries over model activations, often revealing features that correspond to interpretable directions in the model's representation space. We identify SAE features corresponding to motion primitives and semantic concepts, including features that are general across episodes and causally steerable. We propose a metric to categorize features as general transferable primitives or episode-specific memorizations, offering a promising glimpse towards VLA generalization. We validate these findings through steering experiments on both the LIBERO simulation benchmark and on real-world DROID hardware. We find that amplifying general and semantic features induces behaviors consistent with their meanings, whereas ablating them destroys model performance. Furthermore, we demonstrate steering as a way to control behavior in unpromptable directions. Together, these results provide mechanistic evidence that VLAs can learn reusable internal features linking perception, language, and action across tasks and scenes. Our project page is located at https://drvla.github.io
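
The abstract's core mechanism (encode an activation into sparse features, then amplify or ablate one feature) can be illustrated with a minimal sketch. This is not the authors' implementation: the ReLU encoder, the additive steering rule, the function names, and the toy identity-like dictionary are all assumptions chosen for illustration, and a real SAE would be trained on VLA hidden-layer activations rather than hand-written.

```python
# Toy sparse autoencoder (SAE) sketch in pure Python.
# ASSUMPTIONS: ReLU encoder, linear decoder, additive steering along a
# decoder direction. None of these details are confirmed by the abstract.

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def sae_encode(x, W_enc, b_enc):
    """Feature activations f = ReLU(W_enc x + b_enc)."""
    pre = [p + b for p, b in zip(matvec(W_enc, x), b_enc)]
    return [max(0.0, p) for p in pre]

def sae_decode(f, W_dec, b_dec):
    """Reconstruction x_hat = sum_i f[i] * W_dec[i] + b_dec.

    W_dec[i] is the direction in activation space assigned to feature i.
    """
    x_hat = list(b_dec)
    for f_i, direction in zip(f, W_dec):
        for j, d in enumerate(direction):
            x_hat[j] += f_i * d
    return x_hat

def steer(x, W_dec, feature_idx, alpha):
    """Amplify a feature by adding alpha times its decoder direction to x."""
    return [v + alpha * d for v, d in zip(x, W_dec[feature_idx])]

def ablate(x, W_enc, b_enc, W_dec, feature_idx):
    """Remove a feature's current contribution along its decoder direction."""
    f = sae_encode(x, W_enc, b_enc)
    return steer(x, W_dec, feature_idx, -f[feature_idx])

# Hypothetical toy dictionary: 2 features over a 3-dimensional activation.
W_enc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_enc = [0.0, 0.0]
W_dec = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
b_dec = [0.0, 0.0, 0.0]

x = [2.0, -1.0, 0.5]                      # stand-in for a hidden activation
features = sae_encode(x, W_enc, b_enc)    # only feature 0 is active
x_amplified = steer(x, W_dec, 0, 3.0)     # push further along feature 0
x_ablated = ablate(x, W_enc, b_enc, W_dec, 0)
```

In this toy setup the steered activation would be written back into the model's forward pass in place of `x`; how and at which layer that is done in the paper is not stated in the abstract.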

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Point Tracking Improves World Action Models
cs.RO 2026-05 unverdicted novelty 7.0

JOPAT jointly models pixels, point tracks, and actions in a diffusion transformer and reports gains over pixel-only baselines on long-horizon robot tasks with occlusion and off-screen motion.

Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies
cs.RO 2026-05 conditional novelty 7.0

Event-grounded SAE analysis in VLA policies produces stronger causal effects on robot behavior than standard methods by anchoring features to clustered end-effector keyframes across simulations and real-robot tests.