StableVLA: Towards Robust Vision-Language-Action Models without Extra Data
Pith reviewed 2026-05-20 11:50 UTC · model grok-4.3
The pith
A small information bottleneck adapter improves VLA robustness to unseen visual disturbances by an average of 30 percent without extra data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper reports that an Information Bottleneck Adapter (IB-Adapter) inserted into a VLA model selectively filters potential noise from visual inputs according to information-theoretic criteria. The headline result is an average 30 percent robustness gain over baselines, obtained with fewer than 10 million added parameters and no extra data or augmentation. The paper further reports that a backbone 14 times smaller (0.5B parameters), with no Open X-Embodiment pre-training, reaches robustness competitive with 7B-scale VLAs while preserving long-horizon accuracy.
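The size figures in this claim can be checked with simple arithmetic. The sketch below uses only numbers stated in the abstract (7B, 0.5B, fewer than 10M); the variable names are illustrative, and 10M is treated as an upper bound on the adapter size rather than its exact value.

```python
# Figures quoted in the abstract; names are illustrative, not from the paper.
LARGE_BACKBONE_PARAMS = 7e9    # 7B-scale VLA
SMALL_BACKBONE_PARAMS = 0.5e9  # StableVLA backbone
ADAPTER_PARAMS_UPPER = 10e6    # "fewer than 10M", used as an upper bound

size_ratio = LARGE_BACKBONE_PARAMS / SMALL_BACKBONE_PARAMS
adapter_overhead = ADAPTER_PARAMS_UPPER / SMALL_BACKBONE_PARAMS

print(f"backbone size ratio: {size_ratio:.0f}x")
print(f"adapter overhead on the 0.5B backbone: under {adapter_overhead:.0%}")
```

On these figures the adapter adds under 2 percent to the 0.5B backbone, which is consistent with the abstract's description of the overhead as negligible.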
What carries the argument
The Information Bottleneck Adapter, a lightweight module that applies the information bottleneck principle to compress visual features and discard noise while retaining task-relevant information.
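For orientation, the information bottleneck objective that the paper quotes alongside its Proposition 3.1 can be written as follows. Reading $X_v$ as the visual input, $Z$ as the compressed representation produced by the adapter, and $S$ as the task-relevant target is an interpretation, since this page does not define the symbols.

```latex
\min_{\phi(Z \mid X_v)} \; \mathcal{L}_{IB} = I(X_v; Z) - \beta \, I(Z; S)
```

The first term penalizes information that $Z$ keeps about the raw visual input; the second, weighted by $\beta$, rewards information that $Z$ keeps about the target. A larger $\beta$ favors retention over compression.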
If this is right
- Robustness gains apply to existing VLA baselines without any new data collection or augmentation.
- Much smaller backbone sizes can deliver comparable robustness, lowering deployment hardware costs.
- Long-horizon task performance remains stable or improves together with the robustness gains.
- The same adapter works under both synthetic corruptions and physical real-world visual disturbances.
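A synthetic corruption of the kind referred to here can be sketched as additive Gaussian noise at increasing severity. This is a minimal illustration, not the paper's evaluation protocol: the function name, the severity-to-sigma values, and the image representation (nested lists of floats in [0, 1]) are all assumptions.

```python
import random

# Hypothetical severity levels (standard deviation of the noise); the paper's
# actual corruption types and severities are not given on this page.
SEVERITY_SIGMA = {1: 0.04, 2: 0.08, 3: 0.12, 4: 0.18, 5: 0.26}


def add_gaussian_noise(image, severity, seed=0):
    """Return a noisy copy of `image` (rows of floats in [0, 1])."""
    rng = random.Random(seed)  # seeded so a corruption is reproducible
    sigma = SEVERITY_SIGMA[severity]
    noisy = []
    for row in image:
        # Add zero-mean noise, then clamp back into the valid pixel range.
        noisy.append([min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row])
    return noisy


clean = [[0.5] * 4 for _ in range(3)]  # tiny 3x4 grayscale "image"
corrupted = add_gaussian_noise(clean, severity=3)
```

Fixing the seed keeps each corrupted input identical across the baseline and the adapter-equipped model, so any difference in success rate comes from the model rather than from the noise draw.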
Where Pith is reading between the lines
- Robotic systems could reach high robustness with far less pre-training data and compute.
- The bottleneck filtering idea might extend to other vision-language models that encounter noisy inputs.
- Direct comparisons on additional physical robot platforms would reveal how far the noise removal generalizes.
Load-bearing premise
The information bottleneck can be tuned to remove only visual noise while keeping every detail required for correct long-horizon action sequences.
What would settle it
A controlled test in which the IB-Adapter is inserted but either long-horizon accuracy falls or robustness to a suite of visual disturbances shows no improvement.
Original abstract
It is infeasible to encompass all possible disturbances within the training dataset. This raises a critical question regarding the robustness of Vision-Language-Action (VLA) models when encountering unseen real-world visual disturbances, particularly under imperfect visual conditions. In this work, we conduct a systematic study based on recent state-of-the-art VLA models and reveal a significant performance drop when visual disturbances absent from the training data are introduced. To mitigate this issue, we propose a lightweight adapter module grounded in information theory, termed the Information Bottleneck Adapter (IB-Adapter), which selectively filters potential noise from visual inputs. Without requiring any extra data or augmentation strategies, IB-Adapter consistently improves over the baseline by an average of 30%, while adding fewer than 10M parameters, demonstrating notable efficiency and effectiveness. Furthermore, even with a 14x smaller backbone (0.5B parameters) and no pre-training on the Open X-Embodiment dataset, our model StableVLA achieves robustness competitive with 7B-scale state-of-the-art VLAs. With negligible parameter overhead (<10M), our approach maintains accuracy on long-horizon tasks and surpasses OpenPi under both synthetic and physical visual corruptions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines the vulnerability of Vision-Language-Action (VLA) models to unseen visual disturbances absent from training data and proposes the Information Bottleneck Adapter (IB-Adapter), a lightweight module (<10M parameters) that uses information-theoretic principles to filter noise from visual inputs. Without extra data or augmentations, the adapter is claimed to yield an average 30% performance gain over baselines; the resulting StableVLA model (0.5B backbone, no Open X-Embodiment pre-training) is reported to match the robustness of 7B-scale VLAs on both synthetic and physical corruptions while preserving long-horizon accuracy.
Significance. If the empirical gains prove robust and the mechanism is shown to generalize beyond incidental regularization, the work would be significant for practical VLA deployment under imperfect real-world visuals. The data-free efficiency and competitive smaller-model results would offer a practical route to robustness without the cost of large-scale pre-training or augmentation pipelines.
major comments (2)
- [Abstract] The central claim of an average 30% improvement and competitive robustness for the 0.5B StableVLA rests on empirical results, yet the manuscript supplies no details on experimental setup, exact baselines, disturbance types and severities, number of trials, or error bars. This omission directly affects confidence in the reported gains and in the claim that long-horizon accuracy is maintained.
- [IB-Adapter] The claim that the information-bottleneck objective selectively removes unseen visual noise while preserving I(Z; action) is load-bearing for attributing the 30% gain to the adapter rather than to parameter addition or implicit regularization. The manuscript does not specify the concrete loss (variational bound, estimator for the mutual information terms, or training procedure on clean trajectories only), so generalization to out-of-distribution corruptions remains an unverified assumption.
minor comments (2)
- Add a table or figure summarizing the precise visual corruption types, severity levels, and per-task metrics with standard deviations to support the aggregate 30% figure.
- Clarify whether the IB-Adapter is inserted only at inference or also during fine-tuning, and how its parameters are optimized without any auxiliary data.
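The summary table requested in the first minor comment could be produced by a small aggregation step like the following. It is a sketch under stated assumptions: the success rates are made-up placeholders, not results from the paper, and the helper name `summarize` is hypothetical. It also makes explicit whether a gain is measured in absolute points or relative to the baseline, which an aggregate "30%" figure leaves open.

```python
from statistics import mean, stdev


def summarize(success_rates):
    """Mean and sample standard deviation of per-trial success rates."""
    return mean(success_rates), stdev(success_rates)


# Placeholder per-trial success rates (fractions) for one corruption type.
# These are illustrative numbers only, not measurements from the paper.
baseline_trials = [0.40, 0.50, 0.45]
adapter_trials = [0.70, 0.80, 0.75]

base_mean, base_std = summarize(baseline_trials)
adpt_mean, adpt_std = summarize(adapter_trials)

absolute_gain = adpt_mean - base_mean      # difference in success rate
relative_gain = absolute_gain / base_mean  # gain as a fraction of the baseline

print(f"baseline: {base_mean:.2f} ± {base_std:.2f}")
print(f"adapter:  {adpt_mean:.2f} ± {adpt_std:.2f}")
print(f"absolute gain: {absolute_gain:.2f}, relative gain: {relative_gain:.0%}")
```

With these placeholders the absolute and relative gains differ by more than a factor of two, which is why a per-task table should state which convention the aggregate uses.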
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major comment below and indicate the changes we will make to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract] The central claim of an average 30% improvement and competitive robustness for the 0.5B StableVLA rests on empirical results, yet the manuscript supplies no details on experimental setup, exact baselines, disturbance types and severities, number of trials, or error bars. This omission directly affects confidence in the reported gains and in the claim that long-horizon accuracy is maintained.
  Authors: We agree that the abstract would benefit from additional context to support the reported gains. The full experimental details are provided in Section 4 and the supplementary material: baselines (OpenVLA, RT-2, and 7B-scale VLAs), disturbance types and severities (synthetic corruptions such as Gaussian noise and blur, plus physical corruptions such as lighting changes), evaluation on long-horizon tasks, and results averaged over multiple trials with standard deviations. In the revised manuscript we will expand the abstract to briefly reference the evaluation protocol, the use of multiple trials, and the reporting of error bars, thereby increasing confidence in the 30% average improvement and the preservation of long-horizon accuracy.
  revision: yes
- Referee: [IB-Adapter] The claim that the information-bottleneck objective selectively removes unseen visual noise while preserving I(Z; action) is load-bearing for attributing the 30% gain to the adapter rather than to parameter addition or implicit regularization. The manuscript does not specify the concrete loss (variational bound, estimator for the mutual information terms, or training procedure on clean trajectories only), so generalization to out-of-distribution corruptions remains an unverified assumption.
  Authors: We concur that a precise description of the objective is necessary to attribute the gains specifically to the information-bottleneck mechanism. The IB-Adapter is trained exclusively on clean trajectories using a variational approximation to the information-bottleneck objective: it minimizes a variational upper bound on I(Z; X) while maximizing a lower bound on I(Z; A) via a task-specific action head. The implementation employs a KL-divergence term together with a reconstruction loss. In the revised manuscript we will add the exact loss formulation, the estimator details, the value of the trade-off coefficient, and explicit confirmation that training uses only clean data, thereby clarifying how the adapter promotes robustness to unseen corruptions.
  revision: yes
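The objective described in this response resembles the deep variational information bottleneck of Alemi et al. [1]. A minimal sketch follows, assuming a diagonal-Gaussian encoder with a standard-normal prior and a squared-error action term. The function names and the choice of squared error are assumptions; the paper's exact loss, estimators, and placement of the trade-off coefficient are not stated on this page.

```python
import math


def kl_to_standard_normal(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0 for m, lv in zip(mu, logvar))


def vib_loss(mu, logvar, predicted_action, target_action, beta):
    """Task loss plus beta-weighted compression penalty.

    Averaged over inputs, the KL term upper-bounds I(Z; X); the action
    error stands in for the negative of a variational lower bound on I(Z; A).
    """
    task = sum((p - t) ** 2 for p, t in zip(predicted_action, target_action))
    task /= len(target_action)
    return task + beta * kl_to_standard_normal(mu, logvar)


# An encoder that outputs exactly the prior carries no information about X,
# so the KL penalty vanishes and only the action error remains.
loss = vib_loss(mu=[0.0, 0.0], logvar=[0.0, 0.0],
                predicted_action=[0.1, -0.2], target_action=[0.0, 0.0], beta=1e-3)
```

In this sketch `beta` weights the compression term, as in Alemi et al. An objective that instead weights the relevance term is equivalent up to an overall rescaling, with the coefficient inverted.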
Circularity Check
No circularity: empirical robustness gains rest on direct evaluation, not self-referential definitions or fitted predictions
Full rationale
The paper's central claims consist of an empirical study showing performance drops under unseen visual disturbances, followed by the introduction of an IB-Adapter module whose benefits are demonstrated through direct comparisons on baseline VLA models, synthetic and physical corruptions, and long-horizon tasks. No derivation chain is presented that reduces a 'prediction' or first-principles result to its own inputs by construction; the information-bottleneck grounding is invoked as a design principle for the adapter rather than a mathematical identity that forces the reported 30% gains. The method adds parameters and is trained on the original trajectories, and all reported improvements arise from explicit experimental measurement rather than from renaming or load-bearing self-citation. This is a standard empirical contribution whose validity can be checked against external benchmarks without internal reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The information bottleneck principle can be applied via a lightweight adapter to selectively filter noise from visual inputs while retaining task-relevant features in VLA models.
invented entities (1)
- IB-Adapter: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: $\min_{\phi(Z \mid X_v)} \mathcal{L}_{IB} = I(X_v; Z) - \beta \, I(Z; S)$ ... $Z = V \cdot \sigma(\beta Q^\top K)$ (Proposition 3.1)
- IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean, alpha_pin_under_high_calibration (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: Fused IB-Adapter ... dual-pathway architecture ... Stochastic Pathway Dropout
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Alexander A. Alemi, Ian Fischer, Joshua V. Dillon, and Kevin Murphy. Deep variational information bottleneck. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. OpenReview.net, 2017. URL https://openreview.net/forum?id=HyxQzBceg
- [2] Alaaeldin Ali, Hugo Touvron, Mathilde Caron, Piotr Bojanowski, Matthijs Douze, Armand Joulin, Ivan Laptev, Natalia Neverova, Gabriel Synnaeve, Jakob Verbeek, et al. Xcit: Cross-covariance image transformers. Advances in neural information processing systems, 34:20014–20027, 2021
- [3] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025
- [4] Yutong Bai, Jieru Mei, Alan L Yuille, and Cihang Xie. Are transformers more robust than cnns? Advances in neural information processing systems, 34:26831–26843, 2021
- [5] Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025
- [6] Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024
- [7] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, et al. Rt-1: Robotics transformer for real-world control at scale. In Robotics: Science and Systems, 2023
- [8] Chilam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation. CoRR, abs/2410.06158, 2024. doi: 10.48550/ARXIV.2410.06158. URL https://doi.org/10.48550/arXiv.2410.06158
- [9] Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. Int. J. Robotics Res., 44(10-11):1684–1704. doi: 10.1177/02783649241273668. URL https://doi.org/10.1177/02783649241273668
- [10] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025
- [11] AgiBot World Colosseum contributors. Agibot world colosseum. https://github.com/OpenDriveLab/AgiBot-World, 2024
- [12] Yufan Deng and Daquan Zhou. Humannet: Scaling human-centric video learning to one million hours, 2026. URL https://arxiv.org/abs/2605.06747
- [13] Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, and Daquan Zhou. Rethinking video generation model for the embodied world. arXiv preprint arXiv:2601.15282, 2026
- [14] Ria Doshi, Homer Rich Walke, Oier Mees, Sudeep Dasari, and Sergey Levine. Scaling cross-embodied learning: One policy for manipulation, navigation, locomotion and aviation. In Pulkit Agrawal, Oliver Kroemer, and Wolfram Burgard, editors, Conference on Robot Learning, 6-9 November 2024, Munich, Germany, Proceedings of Machine Learning Research, pages 496–51...
- [15] Guang Gao, Jianan Wang, Jinbo Zuo, Junnan Jiang, Jingfan Zhang, Xianwen Zeng, Yuejiang Zhu, Lianyang Ma, Ke Chen, Minhua Sheng, et al. Towards human-level intelligence via human-like whole-body manipulation. arXiv preprint arXiv:2507.17141, 2025
- [16] Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. Proceedings of the International Conference on Learning Representations, 2019
- [17] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. ICCV, 2021
- [18] Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch...
- [19] Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Prismatic vlms: Investigating the design space of visually-conditioned language models. In International Conference on Machine Learning (ICML), 2024
- [20] Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, You... doi: 10.15607/RSS.2024.XX.120. URL https://doi.org/10.15607/RSS.2024.XX.120
- [21] Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Paul Foster, Pannag R. Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. In CoRL, 2024
- [22] Moo Jin Kim, Chelsea Finn, and Percy Liang. Fine-tuning vision-language-action models: Optimizing speed and success. CoRR, 2025
- [23] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024
- [24] Xiaojian Li, Sheng Wang, Chao Chen, Hailong Wei, Yudong Shi, and Hangjie Mo. Roboflamingo-plus: Fusion of depth and RGB perception with vision-language models for enhanced robotic manipulation. In IEEE International Conference on Real-time Computing and Robotics, RCAR 2025, Toyama, Japan, June 1-6, 2025, pages 311–... IEEE, 2025. doi: 10.1109/RCAR65431.2025.11139480. URL https://doi.org/10.1109/RCAR65431.2025.11139480
- [25] Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023
- [26] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023
- [27] Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, et al. Nvila: Efficient frontier visual language models. arXiv preprint arXiv:2412.04468, 2024
- [28] Oier Mees, Lukas Hermann, Erick Rosete-Beas, and Wolfram Burgard. Calvin: A benchmark for language-conditioned policy learning for long-horizon robot manipulation tasks. IEEE Robotics and Automation Letters (RA-L), 7(3):7327–7334, 2022
- [29] Claudio Michaelis, Benjamin Mitzkus, Robert Geirhos, Evgenia Rusak, Oliver Bringmann, Alexander S. Ecker, Matthias Bethge, and Wieland Brendel. Benchmarking robustness in object detection: Autonomous driving when winter is coming. arXiv preprint arXiv:1907.07484, 2019
- [30] Yao Mu, Tianxing Chen, Shijia Peng, Zanxin Chen, Zeyu Gao, Yude Zou, Lunkai Lin, Zhiqiang Xie, and Ping Luo. Robotwin: Dual-arm robot benchmark with generative digital twins (early version). In European Conference on Computer Vision, pages 264–273. Springer, 2024
- [31] Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration. In ICRA, 2024
- [32] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023
- [33] Sayak Paul and Pin-Yu Chen. Vision transformers are robust learners. In Proceedings of the AAAI conference on Artificial Intelligence, volume 36, pages 2071–2081, 2022
- [34] Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy. arXiv preprint arXiv:2405.12213, 2024
- [35] Naftali Tishby, Fernando C Pereira, and William Bialek. The information bottleneck method. arXiv preprint physics/0004057, 2000
- [36] Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ international conference on intelligent robots and systems (IROS), pages 23–30. IEEE, 2017
- [37] Haotao Wang, Chaowei Xiao, Jean Kossaifi, Zhiding Yu, Anima Anandkumar, and Zhangyang Wang. Augmax: Adversarial composition of random augmentations for robust training. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan, editors, Advances in Neural Information Processing Systems 34: Annual Conference on N...
- [38] Yihao Wang, Pengxiang Ding, Lingxiao Li, Can Cui, Zirui Ge, Xinyang Tong, Wenxuan Song, Han Zhao, Wei Zhao, Pengxu Hou, Siteng Huang, Yifan Tang, Wenhui Wang, Ru Zhang, Jianyi Liu, and Donglin Wang. Vla-adapter: An effective paradigm for tiny-scale vision-language-action model. CoRR, 2025
- [39] Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024
- [40] Michal Zawalski, William Chen, Karl Pertsch, Oier Mees, Chelsea Finn, and Sergey Levine. Robotic control via embodied chain-of-thought reasoning. In Pulkit Agrawal, Oliver Kroemer, and Wolfram Burgard, editors, Conference on Robot Learning, 6-9 November 2024, Munich, Germany, Proceedings of Machine Learning Research, pages 3157–3181. PMLR, 2024. URL https:...
- [41] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF international conference on computer vision, pages 11975–11986, 2023
- [42] Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. In Kostas E. Bekris, Kris Hauser, Sylvia L. Herbert, and Jingjin Yu, editors, Robotics: Science and Systems XIX, Daegu, Republic of Korea, July 10-14, 2023, 2023. doi: 10.15607/RSS.2023.XIX.016. URL https://doi.org/10.15607/RSS.20...
- [43] Daquan Zhou, Zhiding Yu, Enze Xie, Chaowei Xiao, Animashree Anandkumar, Jiashi Feng, and Jose M Alvarez. Understanding the robustness in vision transformers. In International conference on machine learning, pages 27378–27394. PMLR, 2022
- [44] Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025
- [45] Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In CoRL, 2023
- attributes this property to the self-attention mechanism, which promotes visual grouping where tokens aggregate into semantic clusters. This phenomenon is theoretically grounded in the Information Bottleneck (IB) principle [35], which optimizes the trade-off between input compression and relevant information preservation. Notably, [43] proves that under Gau...
- introduced Cross-Covariance Attention to compute channel-wise interactions, significantly reducing computational complexity. FAN [43] further establishes that this mechanism acts as subspace clustering; by applying the IB principle to the channel dimension, the model identifies coherent semantic subspaces while suppressing noisy channels. StableVLA extends ...
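The channel-wise attention mentioned here, which this page also quotes in the form Z = V · σ(β QᵀK), can be sketched in plain Python. This is an illustration under assumptions rather than the paper's implementation: σ is taken to be a softmax normalized over each column of the d × d channel map, β is treated as a scalar temperature, and the function name is hypothetical.

```python
import math


def channel_attention(Q, K, V, beta=1.0):
    """Compute Z = V @ softmax(beta * Q^T K) for N x d inputs (lists of rows)."""
    n, d = len(Q), len(Q[0])
    # Channel-by-channel scores: S[j][i] = beta * sum over tokens of Q[t][j] * K[t][i].
    S = [[beta * sum(Q[t][j] * K[t][i] for t in range(n)) for i in range(d)]
         for j in range(d)]
    # Softmax over each column, so every output channel mixes input channels
    # with non-negative weights that sum to one.
    A = [[0.0] * d for _ in range(d)]
    for i in range(d):
        col_max = max(S[j][i] for j in range(d))  # subtract max for stability
        exps = [math.exp(S[j][i] - col_max) for j in range(d)]
        total = sum(exps)
        for j in range(d):
            A[j][i] = exps[j] / total
    # The attention map is d x d, so the cost is linear in the number of tokens.
    Z = [[sum(V[t][j] * A[j][i] for j in range(d)) for i in range(d)]
         for t in range(n)]
    return Z, A


Q = [[1.0, 0.0], [0.5, 2.0], [0.0, 1.0]]
K = [[0.2, 1.0], [1.0, 0.0], [0.3, 0.3]]
V = [[3.0, 3.0], [1.0, 1.0], [2.0, 2.0]]  # identical channels per token
Z, A = channel_attention(Q, K, V, beta=0.5)
```

Because each column of the map is a set of convex weights, a token whose channels all hold the same value passes through unchanged, which gives a quick sanity check on the normalization.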