arxiv: 2606.01095 · v1 · pith:WIRQ766Unew · submitted 2026-05-31 · 💻 cs.RO · cs.AI

Beyond Task Success: Behavioral and Representational Diagnostics for WAM and VLA

Pith reviewed 2026-06-28 17:19 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords vision-language-action · world-action models · robotic manipulation · behavioral diagnostics · representational analysis · policy evaluation · LIBERO benchmark

The pith

WAMs improve object-level robot behavior and target selectivity over VLAs, but gains vary by architecture and raise inference cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks whether adding future prediction through world-action models produces behaviorally meaningful changes in robotic manipulation policies, or whether it only adds computation without altering control-relevant actions or internal states. It introduces a diagnostic framework that runs behavioral rollouts to track action consistency, object progress, and distractor resistance, then applies sparse autoencoders to label internal features as memorized, reactive, or predictive. Across seven policies on two benchmarks the results indicate that sequential WAMs most clearly encode future structure while auxiliary versions compress it and joint versions entangle it, with all WAM variants showing higher runtime cost than direct VLAs. A reader would care because final task success alone can mask whether a model actually plans ahead or simply reacts better to visible objects.

Core claim

Success alone hides key differences: WAMs often improve object-level behavior and target selectivity, but their gains depend on architecture and incur higher inference cost. Sequential WAMs show the clearest predictive structure, while auxiliary and joint WAMs respectively compress or entangle future information.

What carries the argument

Model-agnostic diagnostic framework that pairs behavioral rollout analysis (action dynamics consistency, target-object progress, distractor disturbance, runtime cost) with sparse-autoencoder feature analysis that classifies representations as memorized, reactive, or predictive.
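
The source does not spell out how action dynamics consistency is computed. As one plausible proxy only, the sketch below scores a rollout by the mean squared second difference of its action sequence, where lower means smoother. The function name `action_roughness`, the choice of second differences, and the toy trajectories are assumptions for illustration, not the paper's metric.

```python
def action_roughness(actions):
    """Mean squared second finite difference of an action sequence.

    actions: list of per-timestep action vectors (lists of floats).
    Assumed proxy for action-dynamics consistency; lower means smoother.
    """
    if len(actions) < 3:
        return 0.0
    total, count = 0.0, 0
    # Slide a three-step window over the rollout and accumulate
    # the squared discrete acceleration of every action dimension.
    for prev, cur, nxt in zip(actions, actions[1:], actions[2:]):
        for p, c, n in zip(prev, cur, nxt):
            total += (n - 2.0 * c + p) ** 2
            count += 1
    return total / count


# Toy 1-D rollouts (hypothetical data): a steady ramp and the same ramp with jitter.
smooth = [[0.1 * t] for t in range(10)]
jittery = [[0.1 * t + (0.05 if t % 2 else -0.05)] for t in range(10)]

smooth_score = action_roughness(smooth)
jittery_score = action_roughness(jittery)
```

A score like this is only comparable across policies that share an action space and control frequency, so any real use would need per-benchmark normalization.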

If this is right

  • Sequential WAMs preserve the clearest future-oriented structure in their representations.
  • Auxiliary WAMs tend to compress future information relative to direct prediction.
  • Joint WAMs tend to entangle future information with current observations.
  • WAM architectures improve object-level behavior and target selectivity compared with direct VLAs.
  • All tested WAM variants incur higher inference cost than the corresponding VLAs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future WAM designs could prioritize sequential prediction to keep future representations behaviorally actionable while controlling compute cost.
  • Policy evaluation suites should include the behavioral and representational diagnostics as standard checks rather than relying on success rate alone.
  • The same diagnostic pair could be applied to test whether predictive representations improve robustness when object dynamics change mid-task.

Load-bearing premise

The chosen behavioral metrics and representation classifiers capture improvements that actually matter for control performance.

What would settle it

A head-to-head test in which a WAM and a VLA produce identical scores on action consistency, target progress, distractor resistance, and representation category yet still differ in final task success would falsify the claim that the diagnostics reveal control-relevant differences.

Figures

Figures reproduced from arXiv: 2606.01095 by Bin Zhu, Hung Mai, Tuan Do.

Figure 1
Figure 1: Overview of our WAM–VLA analysis framework. We compare direct VLA policies and WAM variants through two complementary lenses: behavioral rollout diagnostics and sparse-autoencoder feature-space analysis. The framework evaluates not only task success, but also motion smoothness, target-object progress, distractor stability, and future-oriented internal representations.
Figure 2
Figure 2: Overview of the three World-Action Model (WAM) paradigms.
From §3.1 Preliminary: We study robot policies that map observations $o_t \in \mathcal{O}$, robot or environment states $\tilde{s}_t \in \mathcal{S}$, and task instructions $g \in \mathcal{G}$ to executed actions $a_t \in \mathcal{A}$. Many policies predict action chunks $C_t$ = …
Figure 3
Figure 3: Phase-aligned sparse feature activations. Two Cosmos LIBERO-10 rollouts are shown with task phases and selected SAE activations. General features show phase-specific patterns, e.g., f1644 during carry/lift and f1828 during grasp, while the memorized feature f1701 appears only in one rollout near place/release.
From §4.4 Discussion: Our results show that comparing WAMs and VLAs only by success rate misses an importa…
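
The Figure 3 caption contrasts general features with a memorized feature that appears in only one rollout, which suggests a simple coverage check. The sketch below counts the fraction of rollouts in which a feature's activation exceeds a threshold. The name `rollout_coverage`, the threshold value, and the toy traces are assumptions; the paper's actual criterion for labeling a feature as memorized may differ.

```python
def rollout_coverage(traces, threshold=0.1):
    """Fraction of rollouts in which a feature's activation ever exceeds threshold.

    traces: list of rollouts, each a list of activations for one SAE feature.
    threshold: assumed activation cutoff (hypothetical value).
    """
    if not traces:
        return 0.0
    active = sum(1 for trace in traces if any(a > threshold for a in trace))
    return active / len(traces)


# Hypothetical activation traces over three rollouts.
general_feature = [[0.0, 0.8, 0.2], [0.0, 0.7, 0.1], [0.1, 0.9, 0.0]]
memorized_feature = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.9], [0.0, 0.0, 0.0]]

general_cov = rollout_coverage(general_feature)
memorized_cov = rollout_coverage(memorized_feature)
```

Low coverage alone would not separate a memorized feature from a rare but legitimate one, so a check like this is at best a first filter.
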
Figure 4
Figure 4: Phase-aligned sparse feature activations of Lingbot-VA.
Figure 5
Figure 5: Phase-aligned sparse feature activations of VLA-JEPA.
Figure 6
Figure 6: Phase-aligned sparse feature activations of X-VLA.
To support phase interpretation, we estimate a dominant phase for each feature. Let $\Phi = \{\text{approach}, \text{grasp}, \text{lift/carry}, \text{release/place}\}$.
Figure 7
Figure 7: Phase-aligned sparse feature activations of π0.
Figure 8
Figure 8: Phase-aligned sparse feature activations of π0.5.
For each timestep, we estimate a soft phase distribution $P(\phi_t = \phi \mid \text{rollout})$ from rollout states when available, and from a weaker normalized-time proxy otherwise. The feature-level phase probability is computed by activation-weighted aggregation:
$$P(\phi \mid j) = \frac{\sum_{e} \sum_{t} f_j\big(x_t^{(e)}\big)\, P\big(\phi_t^{(e)} = \phi \mid \text{rollout}\big)}{\sum_{e} \sum_{t} f_j\big(x_t^{(e)}\big) + \epsilon}.$$
The predicted phase is $\hat{\phi}$…
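
As a rough illustration of the activation-weighted aggregation in the Figure 8 text, the sketch below computes P(ϕ | j) for one feature from per-timestep activations and soft phase distributions. The function names, the nested-list data layout, and the toy numbers are assumptions made for this example. Taking the argmax as the dominant phase is also an assumption, since the source's definition of the predicted phase is truncated.

```python
PHASES = ["approach", "grasp", "lift/carry", "release/place"]


def feature_phase_probability(activations, phase_probs, eps=1e-8):
    """Activation-weighted phase distribution P(phi | j) for one SAE feature.

    activations: list of episodes, each a list of non-negative activations f_j(x_t).
    phase_probs: same layout, each entry a list of len(PHASES) soft phase probabilities.
    """
    numer = [0.0] * len(PHASES)
    denom = 0.0
    for acts, probs in zip(activations, phase_probs):
        for a, p in zip(acts, probs):
            denom += a
            for k in range(len(PHASES)):
                numer[k] += a * p[k]
    # eps keeps the ratio finite for features that never fire.
    return [n / (denom + eps) for n in numer]


def dominant_phase(prob):
    """Assumed rule: pick the phase with the largest aggregated probability."""
    return PHASES[max(range(len(prob)), key=prob.__getitem__)]


# Two toy episodes (hypothetical) with one-hot phase labels per timestep.
acts = [[0.0, 2.0, 1.0], [0.0, 3.0]]
probs = [
    [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]],
    [[1, 0, 0, 0], [0, 1, 0, 0]],
]
p = feature_phase_probability(acts, probs)
phase = dominant_phase(p)
```

In this toy case the feature carries activation mass 5 during grasp and 1 during lift/carry, so grasp receives roughly 5/6 of the probability.
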
Figure 9
Figure 9: SAE-behavior probe deltas. Each bar reports the change in probe performance after removing predictive SAE statistics, computed as $\Delta = \text{score}_{\text{all}} - \text{score}_{\text{w/o pred.}}$. Positive values indicate that predictive SAE statistics improve behavioral prediction.
For each episode, we compute activation mass by feature type, active feature ratios, activation-weighted future consistency score (FCS), activatio…
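
The probe delta in the Figure 9 caption can be sketched with any probe that is scored twice, once on all episode statistics and once with the predictive columns removed. The source excerpt does not say which probe the authors use, so the nearest-centroid classifier below is a deliberately simple stand-in, and `centroid_probe_accuracy`, `drop_columns`, `probe_delta`, and the toy data are all hypothetical.

```python
def centroid_probe_accuracy(train_x, train_y, test_x, test_y):
    """Accuracy of a nearest-centroid probe (stand-in for the paper's probe)."""
    sums, counts = {}, {}
    for x, y in zip(train_x, train_y):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

    def predict(x):
        # Choose the label whose centroid is closest in squared distance.
        return min(
            centroids,
            key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])),
        )

    hits = sum(1 for x, y in zip(test_x, test_y) if predict(x) == y)
    return hits / len(test_y)


def drop_columns(rows, cols):
    """Return rows with the given column indices removed."""
    cols = set(cols)
    return [[v for i, v in enumerate(r) if i not in cols] for r in rows]


def probe_delta(train_x, train_y, test_x, test_y, predictive_cols):
    """Delta = score_all - score_without_predictive, as in the caption."""
    score_all = centroid_probe_accuracy(train_x, train_y, test_x, test_y)
    score_wo = centroid_probe_accuracy(
        drop_columns(train_x, predictive_cols), train_y,
        drop_columns(test_x, predictive_cols), test_y,
    )
    return score_all - score_wo


# Toy episodes: column 0 stands in for a predictive SAE statistic, column 1 for the rest.
train_x = [[0.0, 0.2], [0.1, 0.6], [1.0, 0.4], [0.9, 0.8]]
train_y = [0, 0, 1, 1]
test_x = [[0.05, 0.7], [0.95, 0.3]]
test_y = [0, 1]
delta = probe_delta(train_x, train_y, test_x, test_y, predictive_cols=[0])
```

The toy data is built so that only column 0 carries the label, which makes the delta positive by construction; real deltas would be far smaller and need confidence intervals.
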
Figure 10
Figure 10: Real-vs-null future consistency. Real future-stream features retain higher median FCS than random decoder directions and temporal/episode shuffles, while the current stream does not show the same positive future-aligned structure.
From §C.6 Multi-seed training ablation: To verify that the SAE features used in our analysis are not artifacts of a particular random initialization, we train multiple SAEs with 5 ind…
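
Figure 10 compares real features against null baselines such as temporal shuffles. The excerpt does not define FCS, so the sketch below substitutes a stand-in score (the Pearson correlation between a feature's activation at time t and a signal h steps later) purely to show how a temporal-shuffle null can be built. `lagged_score`, `shuffle_null_median`, the horizon `h`, and the toy data are all hypothetical.

```python
import math
import random
import statistics


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def lagged_score(activation, signal, h):
    """Stand-in for FCS: correlation of activation[t] with signal[t + h]."""
    return pearson(activation[: len(activation) - h], signal[h:])


def shuffle_null_median(activation, signal, h, n_shuffles=200, seed=0):
    """Median stand-in score after shuffling the activation trace in time."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_shuffles):
        perm = activation[:]
        rng.shuffle(perm)
        scores.append(lagged_score(perm, signal, h))
    return statistics.median(scores)


# Toy feature that fires exactly h steps ahead of the signal it anticipates.
h = 3
signal = [math.sin(0.3 * t) for t in range(40)]
activation = signal[h:] + [0.0] * h
real_score = lagged_score(activation, signal, h)
null_score = shuffle_null_median(activation, signal, h)
```

The shuffle destroys temporal order while keeping the activation histogram, so a gap between the real score and the null median reflects timing rather than activation magnitude.
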
Figure 11
Figure 11: Multi-seed SAE stability ablation. We train SAEs from 5 independent random seeds on the same activation data and visualize the stable top-5 features for a fixed rollout. Activations are normalized for visualization and ordered by cross-seed consistency. Both X-VLA and Cosmos exhibit recurring temporal activation patterns across independent SAE initializations, suggesting that the selected features reflect…
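
The Figure 11 caption orders features by cross-seed consistency without giving a formula in this excerpt. One way such a score could be computed, offered here as an assumption, is to take a feature's activation trace from a reference seed, find its best-correlated trace in each other seed, and average those best matches. `cross_seed_consistency` and the toy traces are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    if va == 0 or vb == 0:
        return 0.0
    return cov / math.sqrt(va * vb)


def cross_seed_consistency(ref_trace, other_seeds):
    """Mean, over the other seeds, of the best correlation with ref_trace.

    ref_trace: activation trace of one feature on a fixed rollout (reference seed).
    other_seeds: list of seeds, each a list of feature traces on the same rollout.
    """
    best = [
        max(pearson(ref_trace, trace) for trace in seed_traces)
        for seed_traces in other_seeds
    ]
    return sum(best) / len(best)


# Toy traces: each other seed contains one near-copy of the reference pattern.
ref = [0.0, 1.0, 2.0, 1.0, 0.0]
seed_b = [[0.0, 2.0, 4.0, 2.0, 0.0], [1.0, 1.0, 0.0, 1.0, 1.0]]
seed_c = [[0.0, 1.0, 2.0, 1.0, 0.1], [5.0, 4.0, 3.0, 2.0, 1.0]]
consistency = cross_seed_consistency(ref, [seed_b, seed_c])
```

Matching on activation traces rather than decoder weights sidesteps the fact that feature indices are arbitrary across seeds, at the cost of depending on the chosen rollout.
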

Original abstract

Vision-language-action (VLA) policies and World-Action Models (WAM) represent two increasingly important paradigms for robotic manipulation. However, it remains unclear whether future prediction in WAMs leads to behaviorally meaningful improvements beyond final task success. In this paper, we ask whether WAMs merely add future prediction, or whether they change robot behavior and internal representations in ways that are actionable for control. We introduce a model-agnostic diagnostic framework that compares WAMs and VLAs through two complementary lenses: behavioral rollout analysis and sparse-autoencoder-based feature analysis. The behavioral protocol measures action dynamics consistency, target-object progress, distractor disturbance, and runtime cost. The feature-space protocol characterizes internal representations as memorized, reactive, or predictive, revealing whether models encode future-oriented structure. Across LIBERO and RoboTwin2.0, we evaluate 7 policies spanning direct VLAs and joint, sequential, and auxiliary WAMs. Our results show that success alone hides key differences: WAMs often improve object-level behavior and target selectivity, but their gains depend on architecture and incur higher inference cost. Sequential WAMs show the clearest predictive structure, while auxiliary and joint WAMs respectively compress or entangle future information. These findings suggest future directions for WAMs design to preserve behaviorally actionable future representations for efficient manipulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that task success rates alone obscure important differences between direct VLAs and three WAM variants (joint, sequential, auxiliary). It introduces a model-agnostic diagnostic framework consisting of a behavioral protocol (action-dynamics consistency, target-object progress, distractor disturbance, runtime cost) and a sparse-autoencoder feature protocol that classifies internal representations as memorized, reactive, or predictive. Experiments on LIBERO and RoboTwin2.0 across seven policies indicate that WAMs can improve object-level behavior and target selectivity in an architecture-dependent manner, that sequential WAMs exhibit the clearest predictive structure, and that auxiliary/joint variants respectively compress or entangle future information, albeit at higher inference cost.

Significance. If the two diagnostic protocols can be shown to track control-relevant quantities, the work supplies concrete, architecture-specific guidance for WAM design that goes beyond aggregate success rates. The explicit comparison of integration strategies and the use of sparse autoencoders to probe representational structure are potentially useful contributions to the VLA/WAM literature.

major comments (3)
  1. [Behavioral protocol (§3)] Behavioral protocol (abstract and §3): no correlation, ablation, or held-out validation is reported showing that action-dynamics consistency, target-object progress, or distractor disturbance scores predict downstream control quality, expert preference, or performance on unseen tasks. Without such evidence the claim that these metrics reveal 'behaviorally meaningful' and 'actionable for control' differences remains unanchored.
  2. [Feature-space protocol (§4)] Feature-space protocol (abstract and §4): the classification of representations into memorized/reactive/predictive categories via sparse autoencoders is presented without quantitative checks (e.g., reconstruction fidelity on future frames, causal intervention tests, or correlation with rollout metrics) that the categories correspond to control-relevant future information rather than dataset artifacts.
  3. [Results and discussion] Results interpretation (abstract): the statements that 'sequential WAMs show the clearest predictive structure' and that 'auxiliary and joint WAMs respectively compress or entangle future information' rest directly on the unvalidated protocols; if the protocols measure non-actionable quantities, these architecture-specific conclusions do not support design recommendations.
minor comments (2)
  1. [Methods] The abstract and methods should explicitly state the number of seeds, exact hyper-parameters of the sparse autoencoder, and the precise definition of each behavioral metric so that the protocols can be reproduced.
  2. [Behavioral protocol] Runtime cost is listed as a behavioral metric; it would be clearer to separate computational overhead from behavioral quality metrics in the presentation of results.
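
The validation that major comment 1 asks for could start with a rank correlation between a diagnostic score and task success across policies. The sketch below implements Spearman's rho with average ranks for ties. The function names are hypothetical, and the numbers are made up for illustration; they are not results from the paper.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / (vx * vy) ** 0.5


# Made-up per-policy scores (NOT from the paper): one diagnostic and one success rate.
toy_target_progress = [0.42, 0.55, 0.61, 0.48, 0.70, 0.66, 0.59]
toy_success_rate = [0.50, 0.62, 0.71, 0.58, 0.90, 0.80, 0.65]
rho = spearman(toy_target_progress, toy_success_rate)
```

With only a handful of policies, a correlation like this would have wide uncertainty, so a held-out task split or a bootstrap interval would be needed before reading much into it.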

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, clarifying the exploratory nature of the proposed diagnostics while acknowledging the need for stronger validation evidence.

  1. Referee: [Behavioral protocol (§3)] Behavioral protocol (abstract and §3): no correlation, ablation, or held-out validation is reported showing that action-dynamics consistency, target-object progress, or distractor disturbance scores predict downstream control quality, expert preference, or performance on unseen tasks. Without such evidence the claim that these metrics reveal 'behaviorally meaningful' and 'actionable for control' differences remains unanchored.

    Authors: The behavioral metrics were designed from first principles of manipulation control to capture aspects such as action stability and target selectivity that success rates overlook. Experiments on LIBERO and RoboTwin2.0 demonstrate architecture-dependent patterns consistent with WAM design choices. We agree that explicit correlations or held-out ablations would provide stronger anchoring and will add a dedicated limitations subsection discussing validation strategies and potential extensions. revision: partial

  2. Referee: [Feature-space protocol (§4)] Feature-space protocol (abstract and §4): the classification of representations into memorized/reactive/predictive categories via sparse autoencoders is presented without quantitative checks (e.g., reconstruction fidelity on future frames, causal intervention tests, or correlation with rollout metrics) that the categories correspond to control-relevant future information rather than dataset artifacts.

    Authors: The SAE categorization relies on differential activation across temporal windows to distinguish feature types. While causal interventions and future-frame reconstruction checks are absent, the observed patterns align with both behavioral results and architectural priors. We will incorporate additional quantitative checks, such as future-frame reconstruction fidelity, into the revised feature analysis section. revision: partial

  3. Referee: [Results and discussion] Results interpretation (abstract): the statements that 'sequential WAMs show the clearest predictive structure' and that 'auxiliary and joint WAMs respectively compress or entangle future information' rest directly on the unvalidated protocols; if the protocols measure non-actionable quantities, these architecture-specific conclusions do not support design recommendations.

    Authors: These statements are presented as observations derived from the diagnostics rather than prescriptive design rules. We will revise the abstract and discussion to moderate the language, explicitly framing the findings as exploratory and noting that stronger validation is required before they inform concrete design choices. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical protocols and external benchmarks

The paper introduces behavioral and feature-space diagnostic protocols and applies them to evaluate existing VLA and WAM policies on LIBERO and RoboTwin2.0 benchmarks. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All comparisons rest on independent external task success rates and model-agnostic analysis rather than quantities defined in terms of the paper's own outputs. The central claims therefore remain non-circular by the stated criteria.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

No free parameters or invented entities; relies on standard domain assumptions about benchmark validity and the utility of sparse autoencoders for distinguishing representation types.

axioms (2)
  • domain assumption: LIBERO and RoboTwin2.0 benchmarks are representative of real robotic manipulation challenges and suitable for measuring target selectivity and distractor effects.
    Invoked when stating results across these benchmarks.
  • domain assumption: Sparse autoencoder features can be meaningfully labeled as memorized, reactive, or predictive.
    Central to the feature-space protocol described in the abstract.

pith-pipeline@v0.9.1-grok · 5771 in / 1271 out tokens · 22916 ms · 2026-06-28T17:19:18.329955+00:00 · methodology
