
arxiv: 2605.17204 · v1 · pith:FNUFVVQN · submitted 2026-05-17 · 💻 cs.RO · cs.AI

Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies

Pith reviewed 2026-05-20 13:38 UTC · model grok-4.3

keywords: sparse autoencoders · vision-language-action · mechanistic interpretability · event grounding · causal effects · robot policies · keyframe clustering

The pith

Grounding sparse autoencoder features to robot behavior events yields stronger causal interventions in vision-language-action policies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an interpretability method for vision-language-action policies that connects internal model features to actual robot behaviors. Instead of using language contexts, it identifies key events in a task by clustering end-effector keyframes according to visual appearance, robot state, and timing. These events then serve as anchors for ranking and intervening on sparse autoencoder features. Across two simulated architectures and a real-robot study, event-grounded ranking yields the strongest causal effects on OpenVLA and transfers to π0.5.

Core claim

Event-grounded ranking yields the strongest causal effects on OpenVLA and transfers to the continuous action chunks of π0.5, showing that anchoring SAE analysis to closed-loop behavioral events offers a practical starting point for VLA interpretability.

What carries the argument

Event-grounded interpretability pipeline anchoring SAE features to behaviorally salient events identified via end-effector keyframe clustering with visual, state, and temporal cues.
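
As a rough illustration of the clustering step named above, the sketch below groups keyframes by a weighted concatenation of visual, state, and temporal cues using plain k-means. The choice of k-means, the cue weights, and every name in the code are assumptions made for illustration; this summary does not specify the paper's actual clustering procedure or hyperparameters.

```python
import random

def build_cue_vector(visual, state, t, w_visual=1.0, w_state=1.0, w_time=1.0):
    """Concatenate weighted cues into one vector (weights are hypothetical)."""
    return [w_visual * v for v in visual] + [w_state * s for s in state] + [w_time * t]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center for each keyframe.
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Toy keyframes (made-up values): two early "approach" frames, two late "place" frames.
keyframes = [
    build_cue_vector([0.1, 0.0], [0.0, 0.0, 1.0], 0.05),
    build_cue_vector([0.2, 0.1], [0.1, 0.0, 1.0], 0.10),
    build_cue_vector([5.0, 5.1], [4.0, 4.0, 0.0], 0.90),
    build_cue_vector([5.2, 4.9], [4.1, 3.9, 0.0], 0.95),
]
labels = kmeans(keyframes, k=2)
```

Cluster label values are arbitrary; only the grouping matters. Changing the three weights is one simple way to probe how sensitive the resulting events are to each cue.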

If this is right

  • Event-grounded ranking produces stronger causal effects than alternatives when intervening in OpenVLA.
  • The approach transfers to models with continuous action outputs like π0.5.
  • The SAE serves as a sparse yet imperfect basis for interventions; its usability varies by architecture and intervention site.
  • Aggressive interventions expose safety and interpretability boundaries in VLA systems.
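
A minimal sketch of what a single-feature intervention of this kind could look like, assuming a ReLU encoder and an edit that only changes the chosen feature's contribution while leaving the rest of the activation (including SAE reconstruction error) untouched. The scaling convention, 0 for zero-out and 1 for identity, follows the Figure 4 caption; the function names, the encoder form, and the toy weights are hypothetical.

```python
def relu(x):
    return x if x > 0.0 else 0.0

def encode(h, W_enc, b_enc):
    """z_j = ReLU(w_j . h + b_j); W_enc holds one row per SAE feature."""
    return [relu(sum(w * x for w, x in zip(row, h)) + b) for row, b in zip(W_enc, b_enc)]

def scale_feature(h, W_enc, b_enc, W_dec, feature, alpha):
    """Rescale one feature's contribution to h by alpha.

    alpha = 1.0 leaves h unchanged; alpha = 0.0 removes the feature (zero-out).
    Assumption: the edit is applied as a delta along the feature's decoder row.
    """
    z = encode(h, W_enc, b_enc)
    delta = (alpha - 1.0) * z[feature]
    return [x + delta * d for x, d in zip(h, W_dec[feature])]

# Toy 2-feature SAE with identity weights (illustrative only).
W_enc = [[1.0, 0.0], [0.0, 1.0]]
b_enc = [0.0, 0.0]
W_dec = [[1.0, 0.0], [0.0, 1.0]]
h = [2.0, 3.0]

zeroed = scale_feature(h, W_enc, b_enc, W_dec, feature=0, alpha=0.0)
halved = scale_feature(h, W_enc, b_enc, W_dec, feature=0, alpha=0.5)
same = scale_feature(h, W_enc, b_enc, W_dec, feature=0, alpha=1.0)
```

Applying the edit as a delta rather than replacing the activation with the SAE reconstruction keeps alpha = 1 an exact no-op, which makes dose-response sweeps easier to read.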

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extensions could target SAE features that go beyond immediate action coordinates to capture longer-term planning.
  • Finer-grained event definitions might allow more precise closed-loop testing of feature effects.
  • Targeted safety measures for interventions could enable use in higher-risk robot applications.

Load-bearing premise

Clustering end-effector keyframes with visual, state, and temporal cues reliably identifies behaviorally salient events causally linked to SAE features rather than incidental correlations.

What would settle it

Closed-loop robot rollouts showing no stronger causal effects for event-grounded feature rankings than for text-context rankings would disprove the main claimed advantage.
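
The comparison described above can be sketched as a difference in closed-loop success rate (∆SR, in percentage points) between baseline and intervened rollouts, computed once per ranking. The rollout outcomes below are made-up placeholders, not results from the paper, and the function names are hypothetical.

```python
def success_rate(outcomes):
    """Success rate in percent from a list of 0/1 rollout outcomes."""
    return 100.0 * sum(outcomes) / len(outcomes)

def delta_sr(baseline, intervened):
    """Change in success rate (percentage points) caused by the intervention."""
    return success_rate(intervened) - success_rate(baseline)

def event_ranking_is_stronger(baseline, event_rollouts, text_rollouts):
    """True if the event-grounded ranking shifts success rate more than the text-context one."""
    return abs(delta_sr(baseline, event_rollouts)) > abs(delta_sr(baseline, text_rollouts))

# Synthetic outcomes for illustration only (1 = success, 0 = failure).
baseline = [1] * 9 + [0] * 1
event_rollouts = [1] * 4 + [0] * 6   # features picked by event-grounded ranking, zeroed out
text_rollouts = [1] * 7 + [0] * 3    # features picked by text-context ranking, zeroed out

d_event = delta_sr(baseline, event_rollouts)
d_text = delta_sr(baseline, text_rollouts)
stronger = event_ranking_is_stronger(baseline, event_rollouts, text_rollouts)
```

If `stronger` came out false across tasks and seeds on real rollouts, that would be the disconfirming outcome described above.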

Figures

Figures reproduced from arXiv: 2605.17204 by Aditya Chatterjee, Pranav Kumar, Rohan Paleja, Xinchen Jin.

Figure 1: This figure shows the 4 stages of the event-grounded SAE pipeline: (1) SAE training …

Figure 2: Example π0.5 keyframe clusters in end-effector position space for four LIBERO tasks; each color is a task-local event cluster.
[Plot residue: panels (a) event-aligned (pulse, step-up, step-down), (b) window-mean (ranked high, ranked low), (c) task-mean (ranked high, ranked low), (d) random-alive (randomly picked); axes: activation vs. timestep.]

Figure 4: Simulator dose-response on LIBERO-10: π0.5 action-expert per-layer ∆SR (percentage points) as αf sweeps from 0 (zero-out) to 1 (identity), one line per ranking.
[Spilled body text: "Case study: ranking overlap. We also characterize what each ranking selects ( …"]

Figure 5: Real hardware dose-response: end-effector distance to the prompted chip cluster vs. SAE …

Figure 6: Chip-approach setup with counterbalanced red/yellow piles; orange dashed arrow shows a perturbed trajectory.
[Plot residue: "Rollout preference by condition"; conditions: Random, Prompt Red, Prompt Yellow, Steer Red, Steer Yellow; axis: preference score dyellow − dred (cm), from −6 (closer to red) to 6 (closer to yellow).]

Figure 8: Event-aligned feature score heatmaps (OpenVLA, four LIBERO suites): rows are event …

Figure 9: Four representative OpenVLA task-local event clusters; each row shows one keyframe …

Figure 10: This figure shows representative zero-out failures on both VLA architectures, evidencing …

Figure 11: Additional paired rollout snapshots for suite-level zero-out; each row ablates one event …

Figure 12: This figure shows per-feature ∆SR (percentage points) under zero-out for the three SAE streams. Left: OpenVLA layer 31 (top-1 to top-5). Middle: π0.5 PG, layer-averaged across {0, 5, 11, 16} (top-1 to top-3). Right: π0.5 AE, layer-averaged across {0, 5, 11, 17} (top-1 to top-3). Each panel uses an independent y-scale to preserve within-stream structure.
[Table residue: columns Layer, Ranking, top-1, top-2, top-3; first row: layer 0, Event-aligned, −1.4…]

Figure 13: Qualitative comparison on LIBERO-Object "pick up the cream cheese and place it in the basket" (f24287, layer 31, α = 150). Top: baseline succeeds. Middle: decoder-vector steering drives the arm far off-task, then locks into that pose. Bottom: matched random-vector control produces erratic off-task motion. The decoder direction collapses behavior far more than the random vector (0% vs 52% SR), confirming di…

Figure 14: Per-rank single-feature zero-out closed-loop success rate for …
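
The Figure 13 caption contrasts decoder-vector steering with a matched random-vector control. A minimal sketch of that comparison follows, assuming that steering adds α times a feature's decoder vector to the hidden activation and that "matched" means equal L2 norm. Both are assumptions for illustration; the caption does not spell out either detail.

```python
import math
import random

def norm(v):
    """L2 norm of a vector."""
    return math.sqrt(sum(x * x for x in v))

def steer(h, direction, alpha):
    """Add alpha * direction to the activation h."""
    return [x + alpha * d for x, d in zip(h, direction)]

def matched_random_direction(direction, seed=0):
    """Random Gaussian direction rescaled to the same L2 norm as `direction`."""
    rng = random.Random(seed)
    r = [rng.gauss(0.0, 1.0) for _ in direction]
    scale = norm(direction) / norm(r)
    return [scale * x for x in r]

# Toy decoder vector and activation (illustrative values only).
decoder_vec = [3.0, 4.0]
h = [0.0, 0.0]

steered = steer(h, decoder_vec, alpha=150.0)
control_vec = matched_random_direction(decoder_vec)
control = steer(h, control_vec, alpha=150.0)
```

Because both perturbations have the same magnitude, any difference in closed-loop behavior between `steered` and `control` can be attributed to the direction rather than to the size of the edit.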
Original abstract

Vision-Language-Action (VLA) policies translate language and visual inputs into robot actions, where their hidden representations directly shape closed-loop behavior. However, mechanistic interpretability tools from language and vision-language models do not transfer cleanly to VLAs: outputs are robot actions rather than human-readable tokens, and interventions can only be tested via expensive closed-loop rollouts. We propose an event-grounded interpretability pipeline that anchors SAE feature analysis to behavioral events rather than text contexts. End-effector keyframes are clustered within each task using visual, state, and temporal cues, linking SAE features to behaviorally salient events and, via optional VLM annotations, to semantic context. To our knowledge, our pipeline is among the first to ground SAE-based VLA analysis in closed-loop behavioral events. Across two simulation architectures and a real-robot study, event-grounded ranking yields the strongest causal effects on OpenVLA and transfers to the continuous action chunks of π0.5. SAE is a sparse but imperfect intervention basis: usability varies with architecture and intervention site, and aggressive intervention reveals safety and interpretability limits. Overall, event-grounded SAE analysis emerges as a practical starting point for behavior-anchored VLA interpretability, motivating future work on SAE features beyond action-aligned coordinates, finer-grained closed-loop evaluation, and safe interventions for high-stakes VLA deployments. Code is available at https://github.com/xc-j/Event-SAE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces an event-grounded interpretability pipeline for Vision-Language-Action (VLA) policies that anchors Sparse Autoencoder (SAE) feature analysis to behavioral events. End-effector keyframes are clustered using visual, state, and temporal cues within each task, with optional VLM annotations for semantic context. The central claim is that event-grounded ranking produces the strongest causal effects under interventions on OpenVLA, with transfer to the continuous action chunks of π0.5, demonstrated across two simulation architectures and a real-robot study. The work also notes SAE limitations as an intervention basis and calls for future work on finer-grained evaluation and safe interventions.

Significance. If the results hold, the paper provides a practical, behavior-anchored approach to mechanistic interpretability for VLAs, addressing the mismatch between standard MI tools (designed for token outputs) and closed-loop robotic actions. The emphasis on reproducibility via released code, the cross-architecture transfer results, and the explicit discussion of intervention safety limits are strengths that could inform safer VLA deployments. This is a timely contribution given the rapid adoption of VLAs in robotics.

major comments (3)
  1. [Abstract] The abstract states that event-grounded ranking yields the strongest causal effects on OpenVLA and transfers to π0.5, yet reports no quantitative details on effect sizes, statistical controls, confidence intervals, or p-values. This information is load-bearing for evaluating whether the superiority claim is robust or sensitive to the chosen event definitions.
  2. [§3.2, Keyframe clustering] The method clusters end-effector keyframes on visual/state/temporal cues but provides no quantitative validation such as cluster purity against expert labels, sensitivity to cue weighting, or direct comparison against action-only change-point detection. Without these checks, it remains unclear whether the clusters capture causally salient events tied to SAE features or merely incidental co-occurrences, directly affecting the central claim.
  3. [§5, Intervention results] The reported causal effects and transfer to continuous action chunks lack explicit baselines, intervention site details, and statistical comparisons. This makes it difficult to assess whether the event-grounded approach is generally superior or specific to the post-hoc event definitions used.
minor comments (2)
  1. [Abstract] The abstract could explicitly name the non-event-grounded baselines used for comparison to improve readability.
  2. [§3] Notation for SAE feature indices and intervention magnitudes could be introduced with a single equation early in §3 for consistency.
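
One possible form of the single equation requested in minor comment 2 is given below. The notation is assumed here rather than taken from the manuscript: h is the activation at the intervention site, z_f the activation of SAE feature f, d_f its decoder vector, and α_f the intervention magnitude.

```latex
z = \mathrm{Enc}(h), \qquad
\tilde{h} = h + (\alpha_f - 1)\, z_f\, d_f
```

Under this convention α_f = 1 leaves the activation unchanged and α_f = 0 removes feature f's contribution (zero-out), matching the endpoints named in the Figure 4 caption.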

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which highlights important aspects for strengthening the clarity and robustness of our claims. We appreciate the positive assessment of the work's timeliness and reproducibility focus. Below we respond point-by-point to the major comments, indicating revisions where we agree the manuscript should be updated.

  1. Referee: [Abstract] The abstract states that event-grounded ranking yields the strongest causal effects on OpenVLA and transfers to π0.5, yet reports no quantitative details on effect sizes, statistical controls, confidence intervals, or p-values. This information is load-bearing for evaluating whether the superiority claim is robust or sensitive to the chosen event definitions.

    Authors: We agree that the abstract would be strengthened by including summary quantitative metrics. The full manuscript reports effect sizes (e.g., task success rate drops under targeted interventions) and comparisons across methods in §5, along with results aggregated over multiple seeds. In the revision we will add concise quantitative statements to the abstract, such as average causal effect magnitudes and the number of evaluation trials, while noting that closed-loop robotic settings limit conventional p-value reporting; confidence intervals from repeated rollouts will be referenced.

    Revision: yes

  2. Referee: [§3.2, Keyframe clustering] The method clusters end-effector keyframes on visual/state/temporal cues but provides no quantitative validation such as cluster purity against expert labels, sensitivity to cue weighting, or direct comparison against action-only change-point detection. Without these checks, it remains unclear whether the clusters capture causally salient events tied to SAE features or merely incidental co-occurrences, directly affecting the central claim.

    Authors: This is a fair critique. The current manuscript validates clusters primarily through their downstream impact on SAE feature interventions and qualitative examples in §3.2. We will add a quantitative comparison to action-only change-point detection baselines, demonstrating that the multi-cue clustering produces features with measurably stronger causal effects. We will also include sensitivity analysis to cue weighting. Full cluster purity against expert labels is feasible for a subset of tasks and will be reported; however, exhaustive expert annotation across all evaluated tasks was outside the original experimental scope.

    Revision: partial

  3. Referee: [§5, Intervention results] The reported causal effects and transfer to continuous action chunks lack explicit baselines, intervention site details, and statistical comparisons. This makes it difficult to assess whether the event-grounded approach is generally superior or specific to the post-hoc event definitions used.

    Authors: We thank the referee for this observation. Section 5 already includes explicit baselines (standard SAE activation ranking and random interventions) and specifies intervention sites by layer and feature index for both OpenVLA and π0.5. We will revise the section to make these baselines and sites more prominent in the text and tables, and add statistical comparisons (mean and standard deviation across seeds and tasks) for the reported causal effects and action-chunk transfer metrics. This will clarify that the superiority holds relative to the chosen baselines rather than being an artifact of event definitions alone.

    Revision: yes
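
The confidence intervals from repeated rollouts mentioned in the responses above could, for example, be obtained with a percentile bootstrap over binary rollout outcomes. The sketch below is one generic way to do that; it is not the authors' stated procedure, the outcome data are synthetic, and the function name and defaults are hypothetical.

```python
import random

def bootstrap_ci(outcomes, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI (in percent) for a success rate from 0/1 outcomes."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample rollouts with replacement and recompute the success rate each time.
    rates = sorted(
        100.0 * sum(rng.choice(outcomes) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo_idx = int((1.0 - level) / 2.0 * n_resamples)
    hi_idx = int((1.0 + level) / 2.0 * n_resamples) - 1
    return rates[lo_idx], rates[hi_idx]

# Synthetic example: 14 successes in 20 rollouts (not data from the paper).
outcomes = [1] * 14 + [0] * 6
low, high = bootstrap_ci(outcomes)
```

With only a few dozen rollouts per condition the interval will be wide, which is itself useful context when comparing rankings whose ∆SR values differ by a few percentage points.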

Circularity Check

0 steps flagged

No significant circularity in derivation or claims.


The paper describes an empirical interpretability pipeline that clusters end-effector keyframes using visual/state/temporal cues and then ranks SAE features by their causal effects on policy rollouts. No equations, derivations, or first-principles results are presented that reduce by construction to fitted inputs or self-referential definitions. Central claims rest on experimental intervention outcomes across OpenVLA and π0.5 rather than on any renaming, ansatz smuggling, or uniqueness theorem imported from prior self-citations. The method is data-driven and externally testable via closed-loop rollouts, satisfying the criteria for a self-contained empirical contribution with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on the domain assumption that behavioral events extracted from end-effector trajectories provide a meaningful anchor for SAE features; clustering hyperparameters and intervention strength thresholds are likely free parameters chosen during development.

free parameters (1)
  • keyframe clustering parameters
    Number of clusters, distance thresholds, or temporal windows used to group end-effector keyframes within each task.
axioms (1)
  • domain assumption: Interventions on SAE features produce measurable causal changes in closed-loop robot behavior that can be attributed to the linked events.
    The pipeline's value depends on this causal link holding in practice.



A. SAE Training Details

We train one BatchTopK SAE per policy stream, LIBERO suite, and layer using closed-loop activation ...
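
For readers unfamiliar with the BatchTopK activation mentioned above, the sketch below shows the general idea as it is usually described: instead of keeping the top k features per sample, keep the k × batch_size largest pre-activations across the whole batch and zero the rest. This is a generic illustration with toy numbers; the paper's actual SAE code and hyperparameters are not shown in this excerpt, and the function name is hypothetical.

```python
def batch_topk(pre_acts, k):
    """Keep the k * batch_size largest positive entries across the batch; zero the rest.

    pre_acts: list of rows, one row of feature pre-activations per sample.
    """
    batch = len(pre_acts)
    # Flatten to (value, sample index, feature index) triples.
    flat = [(v, i, j) for i, row in enumerate(pre_acts) for j, v in enumerate(row)]
    keep = sorted(flat, key=lambda t: t[0], reverse=True)[: k * batch]
    out = [[0.0] * len(row) for row in pre_acts]
    for v, i, j in keep:
        if v > 0.0:  # never keep non-positive activations
            out[i][j] = v
    return out

# Toy batch of 2 samples x 3 features (made-up values).
pre_acts = [
    [0.9, 0.1, 0.8],
    [0.2, 0.05, 0.3],
]
sparse = batch_topk(pre_acts, k=1)
```

In this toy batch both surviving activations come from the first sample, which illustrates the main difference from per-sample TopK: sparsity is fixed on average over the batch, not per input.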