pith. sign in

arxiv: 2605.18287 · v1 · pith:S7BET5EGnew · submitted 2026-05-18 · 💻 cs.CV · cs.RO

StableVLA: Towards Robust Vision-Language-Action Models without Extra Data

Pith reviewed 2026-05-20 11:50 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords Vision-Language-ActionRobustnessInformation BottleneckAdapter ModuleVisual DisturbancesLightweight AdaptationRobotics Models
0
0 comments X

The pith

A small information bottleneck adapter improves VLA robustness to unseen visual disturbances by 30 percent without extra data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language-action models suffer sharp performance drops when visual conditions introduce disturbances absent from training data. The paper introduces the IB-Adapter, a compact module based on information theory that filters noise from visual inputs. Adding this module raises baseline performance by an average of 30 percent while using fewer than 10 million parameters. The resulting StableVLA model, built on a 0.5 billion parameter backbone with no pre-training on large embodiment datasets, reaches robustness levels competitive with 7 billion parameter state-of-the-art models. The approach leaves long-horizon task accuracy intact and succeeds against both synthetic and physical visual corruptions.

Core claim

The paper establishes that an Information Bottleneck Adapter can be inserted into VLA models to selectively filter potential noise from visual inputs according to information-theoretic criteria, producing an average 30 percent robustness gain over baselines, fewer than 10 million added parameters, and no extra data or augmentation, while enabling a 14 times smaller backbone model to achieve competitive robustness to 7B-scale VLAs without Open X-Embodiment pre-training and while preserving long-horizon accuracy.

What carries the argument

The Information Bottleneck Adapter, a lightweight module that applies the information bottleneck principle to compress visual features and discard noise while retaining task-relevant information.

If this is right

  • Robustness gains apply to existing VLA baselines without any new data collection or augmentation.
  • Much smaller backbone sizes can deliver comparable robustness, lowering deployment hardware costs.
  • Long-horizon task performance remains stable or improves together with the robustness gains.
  • The same adapter works under both synthetic corruptions and physical real-world visual disturbances.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robotic systems could reach high robustness with far less pre-training data and compute.
  • The bottleneck filtering idea might extend to other vision-language models that encounter noisy inputs.
  • Direct comparisons on additional physical robot platforms would reveal how far the noise removal generalizes.

Load-bearing premise

The information bottleneck can be tuned to remove only visual noise while keeping every detail required for correct long-horizon action sequences.

What would settle it

A controlled test in which the IB-Adapter is inserted but either long-horizon accuracy falls or robustness to a suite of visual disturbances shows no improvement.

read the original abstract

It is infeasible to encompass all possible disturbances within the training dataset. This raises a critical question regarding the robustness of Vision-Language-Action (VLA) models when encountering unseen real-world visual disturbances, particularly under imperfect visual conditions. In this work, we conduct a systematic study based on recent state-of-the-art VLA models and reveal a significant performance drop when visual disturbances absent from the training data are introduced. To mitigate this issue, we propose a lightweight adapter module grounded in information theory, termed the Information Bottleneck Adapter (IB-Adapter), which selectively filters potential noise from visual inputs. Without requiring any extra data or augmentation strategies, IB-Adapter consistently improves over the baseline by an average of 30%, while adding fewer than 10M parameters, demonstrating notable efficiency and effectiveness. Furthermore, even with a 14x smaller backbone (0.5B parameters) and no pre-training on the Open X-Embodiment dataset, our model StableVLA achieves robustness competitive with 7B-scale state-of-the-art VLAs. With negligible parameter overhead (<10M), our approach maintains accuracy on long-horizon tasks and surpasses OpenPi under both synthetic and physical visual corruptions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript examines the vulnerability of Vision-Language-Action (VLA) models to unseen visual disturbances absent from training data and proposes the Information Bottleneck Adapter (IB-Adapter), a lightweight module (<10M parameters) that uses information-theoretic principles to filter noise from visual inputs. Without extra data or augmentations, the adapter is claimed to yield an average 30% performance gain over baselines; the resulting StableVLA model (0.5B backbone, no Open X-Embodiment pre-training) is reported to match the robustness of 7B-scale VLAs on both synthetic and physical corruptions while preserving long-horizon accuracy.

Significance. If the empirical gains prove robust and the mechanism is shown to generalize beyond incidental regularization, the work would be significant for practical VLA deployment under imperfect real-world visuals. The data-free efficiency and competitive smaller-model results would offer a practical route to robustness without the cost of large-scale pre-training or augmentation pipelines.

major comments (2)
  1. [Abstract] Abstract: the central claim of an average 30% improvement and competitive robustness for the 0.5B StableVLA rests on empirical results, yet the manuscript supplies no details on experimental setup, exact baselines, disturbance types/severities, number of trials, or error bars; this omission directly affects confidence in the reported gains and the long-horizon maintenance assertion.
  2. [IB-Adapter] IB-Adapter description: the claim that the information-bottleneck objective selectively removes unseen visual noise while preserving I(Z; action) is load-bearing for attributing the 30% gain to the adapter rather than parameter addition or implicit regularization, but the manuscript does not specify the concrete loss (variational bound, estimator for mutual informations, or training procedure on clean trajectories only), leaving generalization to out-of-distribution corruptions as an unverified assumption.
minor comments (2)
  1. Add a table or figure summarizing the precise visual corruption types, severity levels, and per-task metrics with standard deviations to support the aggregate 30% figure.
  2. Clarify whether the IB-Adapter is inserted only at inference or also during fine-tuning, and how its parameters are optimized without any auxiliary data.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment below and indicate the changes we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of an average 30% improvement and competitive robustness for the 0.5B StableVLA rests on empirical results, yet the manuscript supplies no details on experimental setup, exact baselines, disturbance types/severities, number of trials, or error bars; this omission directly affects confidence in the reported gains and the long-horizon maintenance assertion.

    Authors: We agree that the abstract would benefit from additional context to support the reported gains. The full experimental details—including baselines (OpenVLA, RT-2, and 7B-scale VLAs), disturbance types and severities (synthetic corruptions such as Gaussian noise and blur, plus physical corruptions such as lighting changes), evaluation on long-horizon tasks, and results averaged over multiple trials with standard deviations—are provided in Section 4 and the supplementary material. In the revised manuscript we will expand the abstract to briefly reference the evaluation protocol, the use of multiple trials, and the reporting of error bars, thereby increasing confidence in the 30% average improvement and the preservation of long-horizon accuracy. revision: yes

  2. Referee: [IB-Adapter] IB-Adapter description: the claim that the information-bottleneck objective selectively removes unseen visual noise while preserving I(Z; action) is load-bearing for attributing the 30% gain to the adapter rather than parameter addition or implicit regularization, but the manuscript does not specify the concrete loss (variational bound, estimator for mutual informations, or training procedure on clean trajectories only), leaving generalization to out-of-distribution corruptions as an unverified assumption.

    Authors: We concur that a precise description of the objective is necessary to attribute the gains specifically to the information-bottleneck mechanism. The IB-Adapter is trained exclusively on clean trajectories using a variational approximation to the information-bottleneck objective that minimizes a variational upper bound on I(Z; X) while maximizing a lower bound on I(Z; A) via a task-specific action head; the implementation employs a KL-divergence term together with a reconstruction loss. In the revised manuscript we will add the exact loss formulation, the estimator details, the value of the trade-off coefficient, and explicit confirmation that training uses only clean data, thereby clarifying how the adapter promotes robustness to unseen corruptions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical robustness gains rest on direct evaluation, not self-referential definitions or fitted predictions

full rationale

The paper's central claims consist of an empirical study showing performance drops under unseen visual disturbances, followed by the introduction of an IB-Adapter module whose benefits are demonstrated through direct comparisons on baseline VLA models, synthetic/physical corruptions, and long-horizon tasks. No derivation chain is presented that reduces a 'prediction' or first-principles result to its own inputs by construction; the information-bottleneck grounding is invoked as a design principle for the adapter rather than a mathematical identity that forces the reported 30% gains. The method adds parameters and is trained on the original trajectories, with all reported improvements arising from explicit experimental measurement rather than renaming or self-citation load-bearing. This is a standard empirical contribution whose validity can be checked against external benchmarks without internal reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the effectiveness of the newly proposed IB-Adapter and the premise that information theory can be directly applied to filter visual noise in this domain without side effects.

axioms (1)
  • domain assumption The information bottleneck principle can be applied via a lightweight adapter to selectively filter noise from visual inputs while retaining task-relevant features in VLA models.
    Grounding of the IB-Adapter in information theory as stated in the abstract.
invented entities (1)
  • IB-Adapter no independent evidence
    purpose: Lightweight module to filter potential noise from visual inputs in VLA models.
    New module introduced to address robustness without extra data.

pith-pipeline@v0.9.0 · 5776 in / 1316 out tokens · 62132 ms · 2026-05-20T11:50:22.616289+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 13 internal anchors

  1. [1]

    Alemi, Ian Fischer, Joshua V

    Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. In5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URLhttps://openreview.net/forum?id=HyxQzBceg

  2. [2]

    Xcit: Cross-covariance image transformers.Advances in neural information processing systems, 34:20014–20027, 2021

    Alaaeldin Ali, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. Xcit: Cross-covariance image transformers.Advances in neural information processing systems, 34:20014–20027, 2021

  3. [3]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  4. [4]

    Are transformers more robust than cnns?Advances in neural information processing systems, 34:26831–26843, 2021

    Yutong Bai, Jieru Mei, Alan L Yuille, and Cihang Xie. Are transformers more robust than cnns?Advances in neural information processing systems, 34:26831–26843, 2021

  5. [5]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

    Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734, 2025

  6. [6]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

  7. [7]

    Rt-1: Robotics transformer for real-world control at scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, et al. Rt-1: Robotics transformer for real-world control at scale. InRobotics: Science and Systems, 2023

  8. [8]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chilam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation.CoRR, abs/2410.06158, 2024. doi: 10.48550/ARXIV.2410.06158. URL https://doi.org/10.48550/arXiv.2410.06158

  9. [9]

    Diffusion policy: Visuomotor policy learning via action diffusion.Int

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.Int. J. Robotics Res., 44(10-11):1684–1704,

  10. [10]

    Diffusion policy: Visuomotor policy learning via action diffusion

    doi: 10.1177/02783649241273668. URLhttps://doi.org/10.1177/02783649241273668

  11. [11]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  12. [12]

    Agibot world colosseum

    AgiBot World Colosseum contributors. Agibot world colosseum. https://github.com/OpenDriveLab/ AgiBot-World, 2024

  13. [13]

    HumanNet: Scaling Human-centric Video Learning to One Million Hours

    Yufan Deng and Daquan Zhou. Humannet: Scaling human-centric video learning to one million hours, 2026. URL https://arxiv.org/abs/2605.06747

  14. [14]

    Rethinking video generation model for the embodied world.arXiv preprint arXiv:2601.15282, 2026

    Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, and Daquan Zhou. Rethinking video generation model for the embodied world.arXiv preprint arXiv:2601.15282, 2026

  15. [15]

    Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation

    Ria Doshi, Homer Rich Walke, Oier Mees, Sudeep Dasari, and Sergey Levine. Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation. In Pulkit Agrawal, Oliver Kroemer, and Wolfram Burgard, editors,Conference on Robot Learning, 6-9 November 2024, Munich, Germany, Proceedings of Machine Learning Research, pages 496–51...

  16. [16]

    Towards human-level intelligence via human-like whole-body manipulation.arXiv preprint arXiv:2507.17141, 2025

    Guang Gao, Jianan Wang, Jinbo Zuo, Junnan Jiang, Jingfan Zhang, Xianwen Zeng, Yuejiang Zhu, Lianyang Ma, Ke Chen, Minhua Sheng, et al. Towards human-level intelligence via human-like whole-body manipulation.arXiv preprint arXiv:2507.17141, 2025

  17. [17]

    Benchmarking neural network robustness to common corruptions and perturbations.Proceedings of the International Conference on Learning Representations, 2019

    Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations.Proceedings of the International Conference on Learning Representations, 2019

  18. [18]

    The many faces of robustness: A critical analysis of out-of-distribution generalization.ICCV, 2021

    Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization.ICCV, 2021. 11

  19. [19]

    Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch...

  20. [20]

    Prismatic vlms: Investigating the design space of visually-conditioned language models

    Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. InInternational Conference on Machine Learning (ICML), 2024

  21. [21]

    Sanketi, Archit Sharma, Cody Simpson, Quan Vuong, Homer Rich Walke, Blake Wulfe, Ted Xiao, Jonathan Heewon Yang, Arefeh Yavary, Tony Z

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, You...

  22. [22]

    URLhttps://doi.org/10.15607/RSS.2024.XX.120

    doi: 10.15607/RSS.2024.XX.120. URLhttps://doi.org/10.15607/RSS.2024.XX.120

  23. [23]

    Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Paul Foster, Pannag R. Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. InCoRL, 2024

  24. [24]

    Fine-tuning vision-language-action models: Optimizing speed and success.CoRR, 2025

    Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success.CoRR, 2025

  25. [25]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  26. [26]

    Roboflamingo-plus: Fusion of depth and RGB perception with vision-language models for enhanced robotic manipulation

    Xiaojian Li, Sheng Wang, Chao Chen, Hailong Wei, Yudong Shi, and Hangjie Mo. Roboflamingo-plus: Fusion of depth and RGB perception with vision-language models for enhanced robotic manipulation. InIEEE International Conference on Real-time Computing and Robotics, RCAR 2025, Toyama, Japan, June 1-6, 2025, pages 311–

  27. [27]

    doi: 10.1109/RCAR65431.2025.11139480

    IEEE, 2025. doi: 10.1109/RCAR65431.2025.11139480. URLhttps://doi.org/10.1109/RCAR65431.2025. 11139480

  28. [28]

    Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776–44791, 2023

  29. [29]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  30. [30]

    NVILA: Efficient Frontier Visual Language Models

    Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, et al. Nvila: Efficient frontier visual language models.arXiv preprint arXiv:2412.04468, 2024

  31. [31]

    Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022

    Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language- conditioned policy learning for long-horizon robot manipulation tasks.IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022

  32. [32]

    Ecker, Matthias Bethge, and Wieland Brendel

    Claudio Michaelis, Benjamin Mitzkus, Robert Geirhos, Evgenia Rusak, Oliver Bringmann, Alexander S. Ecker, Matthias Bethge, and Wieland Brendel. Benchmarking robustness in object detection: Autonomous driving when winter is coming.arXiv preprint arXiv:1907.07484, 2019. 12

  33. [33]

    Robotwin: Dual-arm robot benchmark with generative digital twins (early version)

    Yao Mu, Tianxing Chen, Shijia Peng, Zanxin Chen, Zeyu Gao, Yude Zou, Lunkai Lin, Zhiqiang Xie, and Ping Luo. Robotwin: Dual-arm robot benchmark with generative digital twins (early version). InEuropean Conference on Computer Vision, pages 264–273. Springer, 2024

  34. [34]

    Open x-embodiment: Robotic learning datasets and rt-x models : Open x-embodiment collaboration

    Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, et al. Open x-embodiment: Robotic learning datasets and rt-x models : Open x-embodiment collaboration. InICRA, 2024

  35. [35]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

  36. [36]

    Vision transformers are robust learners

    Sayak Paul and Pin-Yu Chen. Vision transformers are robust learners. InProceedings of the AAAI conference on Artificial Intelligence, volume 36, pages 2071–2081, 2022

  37. [37]

    Octo: An Open-Source Generalist Robot Policy

    Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

  38. [38]

    The information bottleneck method

    Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method.arXiv preprint physics/0004057, 2000

  39. [39]

    Domain randomization for transferring deep neural networks from simulation to the real world

    Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017

  40. [40]

    Augmax: Adversarial composition of random augmentations for robust training

    Haotao Wang, Chaowei Xiao, Jean Kossaifi, Zhiding Yu, Anima Anandkumar, and Zhangyang Wang. Augmax: Adversarial composition of random augmentations for robust training. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors,Advances in Neural Information Processing Systems 34: Annual Conference on N...

  41. [41]

    Vla-adapter: An effective paradigm for tiny-scale vision-language-action model.CoRR, 2025

    Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, Siteng Huang, Yifan Tang, Wenhui Wang, Ru Zhang, Jianyi Liu, and Donglin Wang. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model.CoRR, 2025

  42. [42]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024

  43. [43]

    Robotic control via embodied chain-of-thought reasoning

    Michal Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. In Pulkit Agrawal, Oliver Kroemer, and Wolfram Burgard, editors, Conference on Robot Learning, 6-9 November 2024, Munich, Germany, Proceedings of Machine Learning Research, pages 3157–3181. PMLR, 2024. URLhttps:...

  44. [44]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023

  45. [45]

    Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn

    Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Kostas E. Bekris, Kris Hauser, Sylvia L. Herbert, and Jingjin Yu, editors,Robotics: Science and Systems XIX, Daegu, Republic of Korea, July 10-14, 2023, 2023. doi: 10.15607/RSS.2023.XIX.016. URLhttps://doi.org/10.15607/RSS.20...

  46. [46]

    Understanding the robustness in vision transformers

    Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Animashree Anandkumar, Jiashi Feng, and Jose M Alvarez. Understanding the robustness in vision transformers. InInternational conference on machine learning, pages 27378–27394. PMLR, 2022

  47. [47]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models.arXiv preprint arXiv:2504.10479, 2025

  48. [48]

    Rt-2: Vision-language-action models transfer web knowledge to robotic control

    Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InCoRL, 2023. 13 StableVLA: Towards Robust Vision-Language-Action Models without Extra Data Appendix A Theoretical Derivation. . . . . . . . . . . ...

  49. [49]

    attributes this property to the self-attention mechanism, which promotesvisual groupingwhere tokens aggregate into semantic clusters. This phenomenon is theoretically grounded in the Information Bottleneck (IB) principle [35], which optimizes the trade-off between input compression and relevant information preservation. Notably, [43] proves that under Gau...

  50. [50]

    introduced Cross-Covariance Attention to compute channel-wise interactions, significantly reducing computational complexity. FAN [43] further establishes that this mechanism acts as subspace clustering; by applying the IB principle to the channel dimension, the model identifies coherent semantic subspaces while suppressing noisy channels.StableVLAextends ...