Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies

Aditya Chatterjee; Pranav Kumar; Rohan Paleja; Xinchen Jin

arxiv: 2605.17204 · v1 · pith:FNUFVVQNnew · submitted 2026-05-17 · 💻 cs.RO · cs.AI


Pith reviewed 2026-05-20 13:38 UTC · model grok-4.3

classification: 💻 cs.RO · cs.AI

keywords: sparse autoencoders · vision-language-action · mechanistic interpretability · event grounding · causal effects · robot policies · keyframe clustering


The pith

Grounding sparse autoencoder features to robot behavior events yields stronger causal interventions in vision-language-action policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an interpretability method for vision-language-action policies that connects internal model features to actual robot behaviors. Instead of using language contexts, it identifies key events in a task by grouping end-effector positions according to visual appearance, robot state, and timing. These events then serve as anchors for ranking and intervening on sparse autoencoder features. Tests across simulation and real robots show this event-based approach gives better results for understanding and controlling the policies.

Core claim

Event-grounded ranking yields the strongest causal effects on OpenVLA and transfers to the continuous action chunks of π0.5, showing that anchoring SAE analysis to closed-loop behavioral events offers a practical starting point for VLA interpretability.
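As a rough illustration of what ranking features against behavioral events can look like, the sketch below scores each SAE feature by how much more it activates inside an event window than outside it. The scoring rule (mean inside minus mean outside) and the names `event_aligned_scores` and `rank_features` are assumptions made for this example; the paper's exact ranking definition is not given on this page.

```python
import numpy as np

def event_aligned_scores(acts: np.ndarray, event_mask: np.ndarray) -> np.ndarray:
    """Score each feature by mean activation inside minus outside the event.

    acts: (T, F) SAE activations over T timesteps (hypothetical input).
    event_mask: (T,) boolean, True where a timestep belongs to the event cluster.
    NOTE: this contrast score is an assumed stand-in, not the paper's formula.
    """
    if event_mask.all() or not event_mask.any():
        raise ValueError("event_mask must contain both event and non-event timesteps")
    inside = acts[event_mask].mean(axis=0)
    outside = acts[~event_mask].mean(axis=0)
    return inside - outside

def rank_features(scores: np.ndarray, k: int) -> np.ndarray:
    """Return the indices of the k highest-scoring features, best first."""
    return np.argsort(-scores, kind="stable")[:k]

# Toy data: feature 3 pulses during the event, feature 1 is always on.
T, F = 20, 5
acts = np.zeros((T, F))
mask = np.zeros(T, dtype=bool)
mask[8:12] = True
acts[8:12, 3] = 1.0
acts[:, 1] = 0.5
scores = event_aligned_scores(acts, mask)
top = rank_features(scores, k=2)
```

In this toy case the always-on feature scores zero even though its mean activation is high, which is the kind of distinction an event-conditioned score draws and a plain task-mean ranking does not.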

What carries the argument

An event-grounded interpretability pipeline that anchors SAE features to behaviorally salient events, identified by clustering end-effector keyframes with visual, state, and temporal cues.
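A minimal sketch of multi-cue keyframe clustering, assuming each keyframe comes with a visual embedding, a robot-state vector, and a normalized timestamp. The cue weights, the z-scoring, and the use of plain k-means are illustrative assumptions; the page does not say which clustering algorithm or weighting the authors use.

```python
import numpy as np

def build_cue_features(visual, state, t, weights=(1.0, 1.0, 1.0)):
    """Concatenate z-scored visual, state, and temporal cues, one row per keyframe."""
    def zscore(x):
        x = np.asarray(x, dtype=float)
        if x.ndim == 1:
            x = x[:, None]
        std = x.std(axis=0)
        return (x - x.mean(axis=0)) / np.where(std > 0, std, 1.0)
    parts = [w * zscore(x) for w, x in zip(weights, (visual, state, t))]
    return np.concatenate(parts, axis=1)

def assign(x, centers):
    """Label each row of x with the index of its nearest center."""
    dists = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def kmeans(x, k, iters=50):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [x[0]]
    for _ in range(1, k):
        d = np.min([np.sum((x - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(x[np.argmax(d)])
    centers = np.stack(centers)
    for _ in range(iters):
        labels = assign(x, centers)
        new_centers = np.stack([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign(x, centers), centers

# Synthetic keyframes from two well-separated "events" (made-up data).
rng = np.random.default_rng(0)
n = 30
visual = np.vstack([rng.normal(0, 0.1, (n, 4)), rng.normal(3, 0.1, (n, 4))])
state = np.vstack([rng.normal(0, 0.1, (n, 3)), rng.normal(1, 0.1, (n, 3))])
t = np.concatenate([np.linspace(0.0, 0.4, n), np.linspace(0.6, 1.0, n)])
labels, _ = kmeans(build_cue_features(visual, state, t), k=2)
```

The number of clusters `k` and the cue weights are exactly the kind of free parameters a sensitivity analysis would need to vary.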

If this is right

Event-grounded ranking produces stronger causal effects than alternatives when intervening in OpenVLA.
The approach transfers to models with continuous action outputs like π0.5.
The SAE is a sparse but imperfect basis for interventions: its usability varies with architecture and intervention site.
Aggressive interventions expose safety and interpretability boundaries in VLA systems.
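The interventions listed above can be pictured as rescaling a single SAE feature by a factor alpha, where alpha = 0 is a zero-out and alpha = 1 is the identity (the sweep described in the Figure 4 caption). The sketch below is an assumption-laden illustration: `TinySAE` is a hypothetical ReLU autoencoder with random weights, and the paper's SAE variant and hook sites may differ.

```python
import numpy as np

class TinySAE:
    """Hypothetical ReLU sparse autoencoder with random weights (illustration only)."""
    def __init__(self, d_model: int, n_features: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.w_enc = rng.normal(size=(d_model, n_features)) / np.sqrt(d_model)
        self.b_enc = np.zeros(n_features)
        self.w_dec = rng.normal(size=(n_features, d_model)) / np.sqrt(n_features)
        self.b_dec = np.zeros(d_model)

    def encode(self, h):
        return np.maximum((h - self.b_dec) @ self.w_enc + self.b_enc, 0.0)

def scale_feature(sae, h, feature, alpha):
    """Rescale one feature's decoder contribution to h by alpha.

    alpha = 1 leaves h unchanged; alpha = 0 removes the feature's contribution.
    Only the change in that feature's contribution is added to h, so the SAE
    reconstruction error carried by h is left untouched.
    """
    z = sae.encode(h)
    delta = (alpha - 1.0) * z[..., feature, None] * sae.w_dec[feature]
    return h + delta

sae = TinySAE(d_model=8, n_features=16)
h = np.random.default_rng(1).normal(size=(4, 8))  # stand-in hidden states
h_identity = scale_feature(sae, h, feature=3, alpha=1.0)
h_zeroed = scale_feature(sae, h, feature=3, alpha=0.0)
```

Editing the hidden state by a delta, rather than replacing it with the SAE reconstruction, is one common design choice for keeping the intervention from also injecting reconstruction error; whether the authors do this is not stated here.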

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extensions could target SAE features that go beyond immediate action coordinates to capture longer-term planning.
Finer-grained event definitions might allow more precise closed-loop testing of feature effects.
Targeted safety measures for interventions could enable use in higher-risk robot applications.

Load-bearing premise

Clustering end-effector keyframes with visual, state, and temporal cues reliably identifies behaviorally salient events causally linked to SAE features rather than incidental correlations.

What would settle it

Closed-loop robot rollouts showing no stronger causal effects for event-grounded feature rankings than for text-context rankings would disprove the main claimed advantage.
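The comparison described above reduces to a difference in closed-loop success rates between rankings. A minimal sketch follows, assuming binary per-rollout outcomes and a percentile bootstrap for the interval; the outcome arrays are made-up placeholders, not results from the paper, and the function names are hypothetical.

```python
import numpy as np

def delta_sr_pp(baseline, intervened):
    """Change in success rate in percentage points (intervened minus baseline)."""
    return 100.0 * (np.mean(intervened) - np.mean(baseline))

def bootstrap_ci(baseline, intervened, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for delta_sr_pp, resampling rollouts."""
    rng = np.random.default_rng(seed)
    baseline = np.asarray(baseline, dtype=float)
    intervened = np.asarray(intervened, dtype=float)
    stats = np.empty(n_boot)
    for i in range(n_boot):
        b = rng.choice(baseline, size=baseline.size, replace=True)
        v = rng.choice(intervened, size=intervened.size, replace=True)
        stats[i] = delta_sr_pp(b, v)
    lo, hi = np.quantile(stats, [(1 - level) / 2, (1 + level) / 2])
    return float(lo), float(hi)

# Made-up rollout outcomes (1 = success). NOT data from the paper.
baseline = np.array([1] * 45 + [0] * 5)
event_grounded = np.array([1] * 20 + [0] * 30)
text_context = np.array([1] * 40 + [0] * 10)

d_event = delta_sr_pp(baseline, event_grounded)
d_text = delta_sr_pp(baseline, text_context)
ci_event = bootstrap_ci(baseline, event_grounded)
ci_text = bootstrap_ci(baseline, text_context)
```

If the intervals for the two rankings overlap heavily on real rollouts, the claimed advantage of event grounding would not be supported at that sample size.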

Figures

Figures reproduced from arXiv: 2605.17204 by Aditya Chatterjee, Pranav Kumar, Rohan Paleja, Xinchen Jin.

**Figure 2.** Example π0.5 keyframe clusters in end-effector position space for four LIBERO tasks; each color is a task-local event cluster. [Plot text: panels (a) event-aligned (pulse, step-up, step-down), (b) window-mean, (c) task-mean, (d) random-alive; axes: activation vs. timestep.]

**Figure 4.** Simulator dose-response on LIBERO-10: π0.5 action-expert per-layer ∆SR (percentage points) as α_f sweeps from 0 (zero-out) to 1 (identity), one line per ranking.

**Figure 5.** Real hardware dose-response: end-effector distance to the prompted chip cluster vs SAE…

**Figure 6.** Chip-approach setup with counterbalanced red/yellow piles; orange dashed arrow shows a perturbed trajectory. [Plot text: "Rollout preference by condition"; conditions: Random, Prompt Red, Prompt Yellow, Steer Red, Steer Yellow; y-axis: preference score d_yellow − d_red (cm); annotations: closer to red, closer to yellow.]

**Figure 8.** Event-aligned feature score heatmaps (OpenVLA, four LIBERO suites): rows are event…

**Figure 9.** Four representative OpenVLA task-local event clusters; each row shows one keyframe…

**Figure 10.** This figure shows representative zero-out failures on both VLA architectures, evidencing…

**Figure 11.** Additional paired rollout snapshots for suite-level zero-out; each row ablates one event…

**Figure 12.** This figure shows per-feature ∆SR (percentage points) under zero-out for the three SAE streams. Left: OpenVLA layer 31 (top-1 to top-5). Middle: π0.5 PG, layer-averaged across {0, 5, 11, 16} (top-1 to top-3). Right: π0.5 AE, layer-averaged across {0, 5, 11, 17} (top-1 to top-3). Each panel uses an independent y-scale to preserve within-stream structure. [Table text: columns Layer, Ranking, top-1, top-2, top-3; first row: layer 0, Event-aligned, −1.4…]

**Figure 13.** Qualitative comparison on LIBERO-Object "pick up the cream cheese and place it in the basket" (f24287, layer 31, α = 150). Top: baseline succeeds. Middle: decoder-vector steering drives the arm far off-task, then locks into that pose. Bottom: matched random-vector control produces erratic off-task motion. The decoder direction collapses behavior far more than the random vector (0% vs 52% SR), confirming di…

**Figure 14.** Per-rank single-feature zero-out closed-loop success rate for…
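The Figure 13 caption contrasts decoder-vector steering with a matched random-vector control. A hedged sketch of that comparison is below: it adds alpha times a unit-norm decoder direction to the hidden state, and builds a control direction with the same norm but random orientation. Treating "matched" as norm-matched is an assumption, and the arrays are random stand-ins rather than real activations or decoder rows.

```python
import numpy as np

def steer(h, direction, alpha):
    """Additive steering: add alpha * direction to every row of h."""
    return h + alpha * direction

def norm_matched_random(direction, seed=0):
    """Random direction rescaled to the norm of `direction` (assumed control)."""
    rng = np.random.default_rng(seed)
    r = rng.normal(size=direction.shape)
    return r * (np.linalg.norm(direction) / np.linalg.norm(r))

rng = np.random.default_rng(0)
h = rng.normal(size=(4, 16))             # hypothetical hidden states
decoder_vec = rng.normal(size=16)        # hypothetical SAE decoder row
decoder_dir = decoder_vec / np.linalg.norm(decoder_vec)
alpha = 150.0                            # strength quoted in the Figure 13 caption

h_steered = steer(h, decoder_dir, alpha)
h_control = steer(h, norm_matched_random(decoder_dir), alpha)
```

Because both perturbations have the same magnitude, any difference in downstream behavior can be attributed to direction rather than size, which is the point of the control.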

Original abstract

Vision-Language-Action (VLA) policies translate language and visual inputs into robot actions, where their hidden representations directly shape closed-loop behavior. However, mechanistic interpretability tools from language and vision-language models do not transfer cleanly to VLAs: outputs are robot actions rather than human-readable tokens, and interventions can only be tested via expensive closed-loop rollouts. We propose an event-grounded interpretability pipeline that anchors SAE feature analysis to behavioral events rather than text contexts. End-effector keyframes are clustered within each task using visual, state, and temporal cues, linking SAE features to behaviorally salient events and, via optional VLM annotations, to semantic context. To our knowledge, our pipeline is among the first to ground SAE-based VLA analysis in closed-loop behavioral events. Across two simulation architectures and a real-robot study, event-grounded ranking yields the strongest causal effects on OpenVLA and transfers to the continuous action chunks of π0.5. SAE is a sparse but imperfect intervention basis: usability varies with architecture and intervention site, and aggressive intervention reveals safety and interpretability limits. Overall, event-grounded SAE analysis emerges as a practical starting point for behavior-anchored VLA interpretability, motivating future work on SAE features beyond action-aligned coordinates, finer-grained closed-loop evaluation, and safe interventions for high-stakes VLA deployments. Code is available at https://github.com/xc-j/Event-SAE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Event-grounded SAE via keyframe clustering gives stronger causal interventions than text baselines on OpenVLA and transfers to π0.5, but the clustering step needs tighter validation to rule out incidental patterns.


The main point is that this paper adapts sparse autoencoders to vision-language-action policies by anchoring features to behavioral events extracted from end-effector keyframe clusters, and the resulting ranking produces stronger causal effects than prior methods while carrying over to continuous action chunks in π0.5. They run the pipeline on two simulation setups plus a real-robot study and release the code, which is straightforward to use for follow-up work.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces an event-grounded interpretability pipeline for Vision-Language-Action (VLA) policies that anchors Sparse Autoencoder (SAE) feature analysis to behavioral events. End-effector keyframes are clustered using visual, state, and temporal cues within each task, with optional VLM annotations for semantic context. The central claim is that event-grounded ranking produces the strongest causal effects under interventions on OpenVLA, with transfer to the continuous action chunks of π0.5, demonstrated across two simulation architectures and a real-robot study. The work also notes SAE limitations as an intervention basis and calls for future work on finer-grained evaluation and safe interventions.

Significance. If the results hold, the paper provides a practical, behavior-anchored approach to mechanistic interpretability for VLAs, addressing the mismatch between standard MI tools (designed for token outputs) and closed-loop robotic actions. The emphasis on reproducibility via released code, the cross-architecture transfer results, and the explicit discussion of intervention safety limits are strengths that could inform safer VLA deployments. This is a timely contribution given the rapid adoption of VLAs in robotics.

major comments (3)

[Abstract] The abstract states that event-grounded ranking yields the strongest causal effects on OpenVLA and transfers to π0.5, yet reports no quantitative details on effect sizes, statistical controls, confidence intervals, or p-values. This information is load-bearing for evaluating whether the superiority claim is robust or sensitive to the chosen event definitions.
[§3.2, Keyframe clustering] The method clusters end-effector keyframes on visual/state/temporal cues but provides no quantitative validation such as cluster purity against expert labels, sensitivity to cue weighting, or direct comparison against action-only change-point detection. Without these checks, it remains unclear whether the clusters capture causally salient events tied to SAE features or merely incidental co-occurrences, directly affecting the central claim.
[§5, Intervention results] The reported causal effects and transfer to continuous action chunks lack explicit baselines, intervention site details, and statistical comparisons. This makes it difficult to assess whether the event-grounded approach is generally superior or specific to the post-hoc event definitions used.
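For concreteness, the action-only change-point baseline the referee asks for could be as simple as flagging timesteps where the commanded action jumps. The sketch below is one hypothetical version: the L2 threshold, the minimum gap, and the 7-D toy action trace are all assumptions, and nothing here is taken from the manuscript.

```python
import numpy as np

def action_change_points(actions, threshold, min_gap=5):
    """Return timesteps where consecutive actions differ by more than `threshold`.

    The difference is measured with the L2 norm. Detections closer than
    `min_gap` steps to the previously kept detection are suppressed.
    """
    actions = np.asarray(actions, dtype=float)
    jumps = np.linalg.norm(np.diff(actions, axis=0), axis=1)
    points, last = [], -min_gap
    for t in np.flatnonzero(jumps > threshold) + 1:
        if t - last >= min_gap:
            points.append(int(t))
            last = t
    return points

# Toy 7-D action trace (made up): reach, then grasp at t=20, then lift at t=40.
actions = np.zeros((60, 7))
actions[20:, 6] = 1.0   # gripper command closes at t=20 and stays closed
actions[40:, 2] = 0.5   # z-velocity switches on at t=40
points = action_change_points(actions, threshold=0.3)
```

Keyframes found this way use no visual or temporal cues, so comparing downstream intervention effects against them would isolate what the multi-cue clustering adds.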

minor comments (2)

[Abstract] The abstract could explicitly name the non-event-grounded baselines used for comparison to improve readability.
[§3] Notation for SAE feature indices and intervention magnitudes could be introduced with a single equation early in §3 for consistency.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important aspects for strengthening the clarity and robustness of our claims. We appreciate the positive assessment of the work's timeliness and reproducibility focus. Below we respond point-by-point to the major comments, indicating revisions where we agree the manuscript should be updated.


Referee: [Abstract] The abstract states that event-grounded ranking yields the strongest causal effects on OpenVLA and transfers to π0.5, yet reports no quantitative details on effect sizes, statistical controls, confidence intervals, or p-values. This information is load-bearing for evaluating whether the superiority claim is robust or sensitive to the chosen event definitions.

Authors: We agree that the abstract would be strengthened by including summary quantitative metrics. The full manuscript reports effect sizes (e.g., task success rate drops under targeted interventions) and comparisons across methods in §5, along with results aggregated over multiple seeds. In the revision we will add concise quantitative statements to the abstract, such as average causal effect magnitudes and the number of evaluation trials, while noting that closed-loop robotic settings limit conventional p-value reporting; confidence intervals from repeated rollouts will be referenced.

Revision: yes

Referee: [§3.2, Keyframe clustering] The method clusters end-effector keyframes on visual/state/temporal cues but provides no quantitative validation such as cluster purity against expert labels, sensitivity to cue weighting, or direct comparison against action-only change-point detection. Without these checks, it remains unclear whether the clusters capture causally salient events tied to SAE features or merely incidental co-occurrences, directly affecting the central claim.

Authors: This is a fair critique. The current manuscript validates clusters primarily through their downstream impact on SAE feature interventions and qualitative examples in §3.2. We will add a quantitative comparison to action-only change-point detection baselines, demonstrating that the multi-cue clustering produces features with measurably stronger causal effects. We will also include sensitivity analysis to cue weighting. Full cluster purity against expert labels is feasible for a subset of tasks and will be reported; however, exhaustive expert annotation across all evaluated tasks was outside the original experimental scope.

Revision: partial

Referee: [§5, Intervention results] The reported causal effects and transfer to continuous action chunks lack explicit baselines, intervention site details, and statistical comparisons. This makes it difficult to assess whether the event-grounded approach is generally superior or specific to the post-hoc event definitions used.

Authors: We thank the referee for this observation. Section 5 already includes explicit baselines (standard SAE activation ranking and random interventions) and specifies intervention sites by layer and feature index for both OpenVLA and π0.5. We will revise the section to make these baselines and sites more prominent in the text and tables, and add statistical comparisons (mean and standard deviation across seeds and tasks) for the reported causal effects and action-chunk transfer metrics. This will clarify that the superiority holds relative to the chosen baselines rather than being an artifact of event definitions alone.

Revision: yes
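The cluster-purity check discussed in the exchange above has a standard definition: the fraction of items whose expert label matches the majority label of their cluster. A small sketch, assuming per-keyframe cluster ids and expert labels are available; the labels below are invented for illustration and the function name is hypothetical.

```python
import numpy as np

def cluster_purity(cluster_ids, expert_labels):
    """Fraction of keyframes whose expert label is the majority label of their cluster."""
    cluster_ids = np.asarray(cluster_ids)
    expert_labels = np.asarray(expert_labels)
    correct = 0
    for c in np.unique(cluster_ids):
        # Count how often each expert label occurs inside cluster c.
        _, counts = np.unique(expert_labels[cluster_ids == c], return_counts=True)
        correct += counts.max()
    return correct / cluster_ids.size

# Invented example: 8 keyframes, 3 clusters, one mislabeled keyframe in cluster 0.
clusters = [0, 0, 0, 1, 1, 1, 2, 2]
labels = ["reach", "reach", "grasp", "grasp", "grasp", "grasp", "place", "place"]
purity = cluster_purity(clusters, labels)
```

Purity rises trivially as the number of clusters grows, so it is usually reported alongside the cluster count or a chance-corrected measure.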

Circularity Check

0 steps flagged

No significant circularity in derivation or claims.


The paper describes an empirical interpretability pipeline that clusters end-effector keyframes using visual/state/temporal cues and then ranks SAE features by their causal effects on policy rollouts. No equations, derivations, or first-principles results are presented that reduce by construction to fitted inputs or self-referential definitions. Central claims rest on experimental intervention outcomes across OpenVLA and π0.5 rather than on any renaming, ansatz smuggling, or uniqueness theorem imported from prior self-citations. The method is data-driven and externally testable via closed-loop rollouts, satisfying the criteria for a self-contained empirical contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the domain assumption that behavioral events extracted from end-effector trajectories provide a meaningful anchor for SAE features; clustering hyperparameters and intervention strength thresholds are likely free parameters chosen during development.

free parameters (1)

keyframe clustering parameters
Number of clusters, distance thresholds, or temporal windows used to group end-effector keyframes within each task.

axioms (1)

domain assumption: Interventions on SAE features produce measurable causal changes in closed-loop robot behavior that can be attributed to the linked events.
The pipeline's value depends on this causal link holding in practice.

pith-pipeline@v0.9.0 · 5797 in / 1331 out tokens · 49955 ms · 2026-05-20T13:38:29.627210+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "End-effector keyframes are clustered within each task using visual, state, and temporal cues, linking SAE features to behaviorally salient events"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "event-aligned ranking... scores SAE activations conditioned on these external behavioral events"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.



work page 2025

[8] [8]

Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space

Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Goldberg. Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space. InProceedings of the 2022 conference on empirical methods in natural language processing, pages 30–45, 2022

work page 2022

[9] [9]

Analyzing transformers in embedding space

Guy Dar, Mor Geva, Ankit Gupta, and Jonathan Berant. Analyzing transformers in embedding space. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16124–16170, 2023

work page 2023

[10] [10]

Towards monosemanticity: Decomposing language models with dictionary learning.Transformer Circuits Thread, 2(5):6, 2023

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, et al. Towards monosemanticity: Decomposing language models with dictionary learning.Transformer Circuits Thread, 2(5):6, 2023

work page 2023

[11] [11]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoen- coders find highly interpretable features in language models.arXiv preprint arXiv:2309.08600, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

Language models can ex- plain neurons in language models

Steven Bills, Nick Cammarata, Dan Mossing, Henk Tillman, Leo Gao, Gabriel Goh, Ilya Sutskever, Jan Leike, Jeff Wu, and William Saunders. Language models can ex- plain neurons in language models. https://openaipublic.blob.core.windows.net/ neuron-explainer/paper/index.html, 2023

work page 2023

[13] [13]

Automatically inter- preting millions of features in large language models

Gonçalo Santos Paulo, Alex Troy Mallen, Caden Juang, and Nora Belrose. Automatically inter- preting millions of features in large language models. InForty-second International Conference on Machine Learning, 2025. URLhttps://openreview.net/forum?id=EemtbhJOXc

work page 2025

[14] [14]

Sparse autoencoders reveal interpretable and steerable features in vla models.arXiv preprint arXiv:2603.19183, 2026

Aiden Swann, Lachlain McGranahan, Hugo Buurmeijer, Monroe Kennedy III, and Mac Schwa- ger. Sparse autoencoders reveal interpretable and steerable features in vla models.arXiv preprint arXiv:2603.19183, 2026. 10

work page arXiv 2026

[15] [15]

Mechanistic interpretability for steering vision-language-action models.arXiv preprint arXiv:2509.00328, 2025

Bear Häon, Kaylene Stocking, Ian Chuang, and Claire Tomlin. Mechanistic interpretability for steering vision-language-action models.arXiv preprint arXiv:2509.00328, 2025

work page arXiv 2025

[16] [16]

VLAs are Confined yet Capable of Generalizing to Novel Instructions

Quanyi Li. Task reconstruction and extrapolation for π0 using text latent.arXiv preprint arXiv:2505.03500, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] [17]

Anwar, and Manzoor A

Momin Ahmad Khan, Novak Boskov, Fatima M. Anwar, and Manzoor A. Khan. Control- ling vision–language–action policies through sparse latent directions. InMechanistic Inter- pretability Workshop at NeurIPS 2025, 2025. URLhttps://openreview.net/forum?id= wtf3ww1EOL

work page 2025

[18] [18]

Not all features are created equal: A mechanistic study of vision-language-action models.arXiv preprint arXiv:2603.19233, 2026

Bryce Grant, Xijia Zhao, and Peng Wang. Not all features are created equal: A mechanistic study of vision-language-action models.arXiv preprint arXiv:2603.19233, 2026

work page arXiv 2026

[19] [19]

Observing and controlling features in vision-language-action models.arXiv preprint arXiv:2603.05487, 2026

Hugo Buurmeijer, Carmen Amo Alonso, Aiden Swann, and Marco Pavone. Observing and controlling features in vision-language-action models.arXiv preprint arXiv:2603.05487, 2026

work page arXiv 2026

[20] [20]

Toy Models of Superposition

Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition.arXiv preprint arXiv:2209.10652, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[21] [21]

Daniel Freeman, Theodore R

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosema...

work page 2024

[22] [22]

Zhao, and Chelsea Finn

Lucy Xiaoyang Shi, Archit Sharma, Tony Z. Zhao, and Chelsea Finn. Waypoint-based imitation learning for robotic manipulation. In7th Annual Conference on Robot Learning, 2023. URL https://openreview.net/forum?id=X0cmlTh1Vl

work page 2023

[23] [23]

Locating and editing factual associations in gpt.Advances in neural information processing systems, 35:17359–17372, 2022

Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt.Advances in neural information processing systems, 35:17359–17372, 2022

work page 2022

[24] [24]

Dissecting recall of factual associations in auto-regressive language models

Mor Geva, Jasmijn Bastings, Katja Filippova, and Amir Globerson. Dissecting recall of factual associations in auto-regressive language models. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 12216–12235, 2023

work page 2023

[25] [25]

Towards interpreting visual information processing in vision-language models

Clement Neo, Luke Ong, Philip Torr, Mor Geva, David Krueger, and Fazl Barez. Towards interpreting visual information processing in vision-language models. InThe Thirteenth Inter- national Conference on Learning Representations, 2025. URLhttps://openreview.net/ forum?id=chanJGoa7f

work page 2025

[26] [26]

Beyond logit lens: Contextual embeddings for robust halluci- nation detection & grounding in vlms

Anirudh Phukan, Divyansh Divyansh, Harshit Kumar Morj, Vaishnavi Vaishnavi, Apoorv Saxena, and Koustava Goswami. Beyond logit lens: Contextual embeddings for robust halluci- nation detection & grounding in vlms. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Tech...

work page 2025

[27] [27]

A is for absorption: Studying feature splitting and absorption in sparse autoencoders

David Chanin, James Wilken-Smith, Tomáš Dulka, Hardik Bhatnagar, and Joseph Isaac Bloom. A is for absorption: Studying feature splitting and absorption in sparse autoencoders. In Interpretable AI: Past, Present and Future, 2024. URLhttps://openreview.net/forum? id=Wzav8fesTL

work page 2024

[28] [28]

SAE-v: Interpreting multimodal models for enhanced alignment

Hantao Lou, Changye Li, Jiaming Ji, and Yaodong Yang. SAE-v: Interpreting multimodal models for enhanced alignment. InForty-second International Conference on Machine Learning,

work page

[29] [29]

URLhttps://openreview.net/forum?id=S4HPn5Bo6k

work page

[30] [30]

Sparse autoencoders learn monosemantic features in vision-language models

Mateusz Pach, Shyamgopal Karthik, Quentin Bouniot, Serge Belongie, and Zeynep Akata. Sparse autoencoders learn monosemantic features in vision-language models. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems, 2025. URL https: //openreview.net/forum?id=DaNnkQJSQf. 11

work page 2025

[31] [31]

Kakade, and Stephanie Gil

Isabel Papadimitriou, Huangyuan Su, Thomas Fel, Sham M. Kakade, and Stephanie Gil. Inter- preting the linear structure of vision-language model embedding spaces. InSecond Conference on Language Modeling, 2025. URLhttps://openreview.net/forum?id=qPsmGjpq1j

work page 2025

[32] [32]

Demystifying robot diffusion policies: Action memorization and a simple lookup table alternative

Chengyang He, Xu Liu, Gadiel Mark Sznaier Camps, Joseph Bruno, Guillaume Adrien Sar- toretti, and Mac Schwager. Demystifying robot diffusion policies: Action memorization and a simple lookup table alternative. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=PL0tJOfm7I

work page 2026

[33] [33]

When vision overrides language: Evaluating and mitigating counterfactual failures in vlas.arXiv preprint arXiv:2602.17659, 2026

Yu Fang, Yuchun Feng, Dong Jing, Jiaqi Liu, Yue Yang, Zhenyu Wei, Daniel Szafir, and Mingyu Ding. When vision overrides language: Evaluating and mitigating counterfactual failures in vlas.arXiv preprint arXiv:2602.17659, 2026

work page arXiv 2026

[34] [34]

Batchtopk sparse autoencoders

Bart Bussmann, Patrick Leask, and Neel Nanda. Batchtopk sparse autoencoders. InNeurIPS 2024 Workshop on Scientific Methods for Understanding Deep Learning, 2024. URLhttps: //openreview.net/forum?id=d4dpOCqybL

work page 2024

[35] [35]

Robo2vlm: Improving visual question answering using large-scale robot manipulation data

Kaiyuan Chen, Shuangyu Xie, Zehan Ma, Pannag R Sanketi, and Ken Goldberg. Robo2vlm: Improving visual question answering using large-scale robot manipulation data. InThe Thirty- ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track

work page

[36] [36]

A survey on sparse autoencoders: Interpreting the internal mechanisms of large language models

Dong Shu, Xuansheng Wu, Haiyan Zhao, Daking Rai, Ziyu Yao, Ninghao Liu, and Mengnan Du. A survey on sparse autoencoders: Interpreting the internal mechanisms of large language models. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Findings of the Association for Computational Linguistics: EMNLP 2025, pages 1690...

work page doi:10.18653/v1/2025.findings-emnlp.89 2025

[37] [37]

Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

work page 2023

[38] [38]

<task description>

Zipeng Fu, Tony Z. Zhao, and Chelsea Finn. Mobile ALOHA: Learning bimanual mobile manipulation using low-cost whole-body teleoperation. In8th Annual Conference on Robot Learning, 2024. URLhttps://openreview.net/forum?id=FO6tePGRZj. 12 A SAE Training Details We train one BatchTopK SAE per policy stream, LIBERO suite, and layer using closed-loop activation ...

work page 2024