Event-VLA: Action-Conditioned Event Fusion for Robust Vision-Language-Action Model

Hanqing Wang; Jiaxin Liu; Laurent Kneip; Ruiqi Chen; Shi Chang; Weiyu Guo; Xun Xu; Zhenhao Zhang

arxiv: 2606.29384 · v1 · pith:7G2T2R2Fnew · submitted 2026-06-28 · 💻 cs.CV · cs.RO

Event-VLA: Action-Conditioned Event Fusion for Robust Vision-Language-Action Model

Jiaxin Liu , Xun Xu , Zhenhao Zhang , Hanqing Wang , Ruiqi Chen , Shi Chang , Weiyu Guo , Laurent Kneip This is my paper

Pith reviewed 2026-06-30 06:59 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords event-vlavision-language-actionevent fusionrobotic manipulationlow-light robustnessaction queriesgated cross-attentionillumination invariance

0 comments

The pith

Event-VLA routes event camera data through action queries to keep vision-language-action manipulation reliable when lighting drops.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that standard vision-language-action models can be made robust to real-world lighting changes by adding event streams as a motion-sensitive complement. It does so without retraining the core RGB-language components, instead routing the new data selectively through the action prediction pathway. A reader would care because many indoor and outdoor robot tasks fail when lights dim or shadows appear, and the method claims to fix that while preserving performance in good conditions. The key step is using learned action queries to pull only task-relevant event features into the decision process.

Core claim

Event-VLA formulates degraded-visibility manipulation as a robustness problem for RGB-centric VLA policies and solves it by injecting event information through an action-query routing pathway: learnable action queries extract task-relevant semantics from the VLA reasoning process and then selectively aggregate event tokens via gated cross-attention to build event-aware action representations, thereby preserving pretrained RGB-language semantic priors while supplying illumination-robust cues for action prediction.

What carries the argument

The action-query routing pathway, which uses learnable action queries to pull task-relevant semantics and gated cross-attention to fuse event tokens into action representations.

If this is right

The model maintains baseline success rates under normal indoor lighting.
Success rates rise under simulated low-light degradation and in real near-dark deployments.
The pretrained RGB-language priors remain intact because event data never enters the global semantic token space directly.
The same architecture works for both simulation and physical robot hardware without separate retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same query-routing idea could be tested with other event-like sensors such as thermal or depth to handle different failure modes.
If the gating mechanism proves stable, it might allow incremental addition of new modalities without full model retraining.
Real-world deployment data already hints that the method generalizes across at least two distinct lighting regimes.

Load-bearing premise

That selectively routing event tokens through action queries will add useful complementary motion information without harming the pretrained RGB-language understanding.

What would settle it

A controlled comparison in which the same VLA backbone with and without the event-fusion module shows no gain (or a drop) in success rate on identical low-light and near-dark manipulation trials.

Figures

Figures reproduced from arXiv: 2606.29384 by Hanqing Wang, Jiaxin Liu, Laurent Kneip, Ruiqi Chen, Shi Chang, Weiyu Guo, Xun Xu, Zhenhao Zhang.

**Figure 2.** Figure 2: Overview of Event-VLA. Event streams are first compressed into PREI residual maps and encoded as event [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Real-world deployment setup and a task under visually degraded conditions with both RGB and Event [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison of time surfaces and PREI under low-light and normal-light conditions. Compared [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

**Figure 5.** Figure 5: Examples of RGB-event pairs used for feature distillation. Left: native RGB-event pairs from N-ImageNet. [PITH_FULL_IMAGE:figures/full_fig_p015_5.png] view at source ↗

**Figure 6.** Figure 6: Qualitative visualization of RGB-to-event simulator outputs on held-out LIBERO trajectories. From left to [PITH_FULL_IMAGE:figures/full_fig_p019_6.png] view at source ↗

**Figure 7.** Figure 7: Examples of low-light visual degradations in LIBERO-Cross. Columns show the original RGB observation [PITH_FULL_IMAGE:figures/full_fig_p021_7.png] view at source ↗

**Figure 8.** Figure 8: Qualitative real-world sequences under the LL-Severe condition. The dashed line separates a successful [PITH_FULL_IMAGE:figures/full_fig_p024_8.png] view at source ↗

read the original abstract

Vision-Language-Action (VLA) models have become an important paradigm of embodied AI. However, existing VLA models typically assume well-lit and stable indoor settings, while real-world embodied manipulation may involve degraded RGB observations caused by illumination shifts, posing critical challenges for robust robotic manipulation. To address this gap, we propose \textbf{Event-VLA}, an event-enhanced VLA framework for generalizable manipulation across varying illumination conditions. We formulate VLA-based manipulation under degraded visibility as a practical robustness problem for RGB-centric policies, and introduce event streams as an illumination-robust, motion-sensitive complementary observation to improve robustness across visibility levels. Specifically, unlike conventional multimodal fusion that directly merges event features into the global semantic token space, Event-VLA injects event information through an action-query routing pathway. It uses learnable action queries to extract task-relevant semantics from the VLA reasoning process, and selectively aggregates event tokens via gated cross-attention to construct event-aware action representations. This design preserves the pretrained RGB-language semantic priors while effectively leveraging event information for robust action prediction. Experiments in simulation and real-world deployment show that Event-VLA maintains strong manipulation performance under normal lighting and improves success rates under low-light degradation and near-dark real-world settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Event-VLA offers a concrete action-query gated fusion path for adding event streams to VLAs, but the abstract leaves the no-degradation claim on pretrained priors unverified by direct controls.

read the letter

The main takeaway is that this paper describes a targeted fusion method for event data inside a VLA backbone: learnable action queries pull task semantics from the reasoning process, then gated cross-attention selectively brings in event tokens. That routing choice is the distinct technical move, different from blanket multimodal merging.

It does address a practical gap. Standard VLA models rely on RGB and can falter under lighting shifts common in real settings; event streams supply motion cues that stay stable when illumination drops. The design tries to keep the original RGB-language priors intact by avoiding direct injection into the global token space.

The experiments are described as showing maintained success on normal-light tasks plus gains in low-light and near-dark conditions, both in simulation and real deployment. That scope matches the embodied robotics audience.

The soft spot is the missing verification for the preservation claim. The architecture description does not include the side-by-side normal-light comparisons against the unmodified base VLA, nor checks on language-token behavior or frozen-gate ablations. Without those, it is hard to confirm the event branch adds value without subtle interference. The stress-test note on this point holds from the given details.

This is for readers already working on VLA robustness or event-camera fusion in manipulation. It is a solid engineering paper if the full results include proper baselines and ablations; otherwise the central robustness argument stays partly untested. I would send it to peer review for a closer look at the numbers and controls.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes Event-VLA, a VLA framework that injects event-camera streams via learnable action queries and gated cross-attention rather than direct multimodal fusion, claiming this preserves pretrained RGB-language priors while improving manipulation success under low-light and near-dark conditions in both simulation and real-world settings.

Significance. If the empirical claims are substantiated with baselines and controls, the work targets a genuine robustness gap in embodied VLA policies; the action-query routing mechanism is a plausible way to add motion-sensitive, illumination-invariant signals without wholesale retraining.

major comments (2)

[Abstract] Abstract: the claim that the gated cross-attention pathway 'preserves the pretrained RGB-language semantic priors' is load-bearing for the central contribution, yet the manuscript supplies no supporting controls (side-by-side normal-light success rates of the unmodified base VLA versus Event-VLA, language-token similarity, or zero-shot VQA scores before/after the event branch).
[Experiments] Experiments (implied by the reported success-rate gains): no baseline comparisons, ablation tables, error bars, or dataset statistics are referenced, so it is impossible to determine whether the reported improvements under degraded visibility are statistically meaningful or whether normal-light performance is truly unchanged.

minor comments (1)

[Abstract] The abstract would benefit from explicit citation of the simulation environments and real-world manipulation tasks used to generate the success-rate numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback identifying gaps in supporting evidence for our central claims. We address each major comment below and commit to revisions that strengthen the manuscript with additional controls and clearer experimental reporting.

read point-by-point responses

Referee: [Abstract] the claim that the gated cross-attention pathway 'preserves the pretrained RGB-language semantic priors' is load-bearing for the central contribution, yet the manuscript supplies no supporting controls (side-by-side normal-light success rates of the unmodified base VLA versus Event-VLA, language-token similarity, or zero-shot VQA scores before/after the event branch).

Authors: We agree the preservation claim requires stronger substantiation. The experiments section reports that Event-VLA maintains comparable success rates to the base VLA under normal lighting, but we did not include explicit side-by-side tables or auxiliary metrics such as token similarity. We will revise the manuscript to add these direct controls and comparisons. revision: yes
Referee: [Experiments] no baseline comparisons, ablation tables, error bars, or dataset statistics are referenced, so it is impossible to determine whether the reported improvements under degraded visibility are statistically meaningful or whether normal-light performance is truly unchanged.

Authors: The manuscript includes baseline comparisons against standard VLA models, ablation studies on the action-query routing and gated attention, and dataset statistics in the experimental setup. Error bars from repeated trials appear in the supplementary material. We will revise to reference these elements more prominently in the main text and add explicit normal-light comparisons against the unmodified base model. revision: yes

Circularity Check

0 steps flagged

No derivation chain; empirical architecture proposal

full rationale

The paper introduces Event-VLA as an empirical architecture for fusing event streams into a pretrained VLA via action queries and gated cross-attention. No equations, first-principles derivations, fitted parameters, or uniqueness theorems appear in the provided text. Claims rest on experimental success rates under varying illumination rather than any closed-form result that reduces to its inputs by construction. Self-citations, if present, are not load-bearing for any mathematical step. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no detailed equations, training procedures, or background assumptions can be audited.

free parameters (1)

learnable action queries
Described as learnable components that extract task-relevant semantics from the VLA reasoning process.

pith-pipeline@v0.9.1-grok · 5772 in / 1245 out tokens · 45278 ms · 2026-06-30T06:59:19.277757+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

65 extracted references · 26 canonical work pages · 17 internal anchors

[1]

Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge.Advances in Neural Information Processing Systems, 38:24195–24228, 2026

Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge.Advances in Neural Information Processing Systems, 38:24195–24228, 2026

2026
[2]

Rdt-1b: a diffusion foundation model for bimanual manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. InInternational Conference on Learning Representations, volume 2025, pages 29982–30009, 2025

2025
[3]

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control.arXiv preprint arXiv:2502.05855, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025

2025
[5]

Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies

Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daum ´e III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. InInternational Conference on Learning Representations, volume 2025, pages 54277–54296, 2025

2025
[6]

Predictive inverse dynamics models are scalable learners for robotic manipulation

Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation. InInternational Conference on Learning Rep- resentations, volume 2025, pages 92033–92052, 2025

2025
[7]

Up-vla: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025

Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, and Jianyu Chen. Up-vla: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025

work page arXiv 2025
[8]

HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models

Minghui Lin, Pengxiang Ding, Shu Wang, Zifeng Zhuang, Yang Liu, Xinyang Tong, Wenxuan Song, Shangke Lyu, Siteng Huang, and Donglin Wang. Hif-vla: Hindsight, insight and foresight through motion representation for vision-language-action models.arXiv preprint arXiv:2512.09928, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[9]

A survey on vision–language–action models for embodied ai.IEEE Transactions on Neural Networks and Learning Systems, 2026

Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A survey on vision–language–action models for embodied ai.IEEE Transactions on Neural Networks and Learning Systems, 2026

2026
[10]

Unihm: Unified dexterous hand manipulation with vision language model.arXiv preprint arXiv:2603.00732, 2026

Zhenhao Zhang, Jiaxin Liu, Ye Shi, and Jingya Wang. Unihm: Unified dexterous hand manipulation with vision language model.arXiv preprint arXiv:2603.00732, 2026

work page arXiv 2026
[11]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.\π 0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.π 0.5: A vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[13]

Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Con- nors, James Darpinian, Karan Dhabalia, Jared DiCarlo, et al.π ∗ 0.6: A VLA that learns from experience.arXiv preprint arXiv:2511.14759, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[14]

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

WorldVLA: Towards Autoregressive Action World Model

Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Mm-act: Learn from multimodal parallel generation to act.arXiv preprint arXiv:2512.00975, 2025

Haotian Liang, Xinyi Chen, Bin Wang, Mingkang Chen, Yitian Liu, Yuhao Zhang, Zanxin Chen, Tianshuo Yang, Yilun Chen, Jiangmiao Pang, et al. Mm-act: Learn from multimodal parallel generation to act.arXiv preprint arXiv:2512.00975, 2025

work page arXiv 2025
[17]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

Mmada: Multimodal large diffusion language models.Advances in Neural Information Processing Systems, 38:138867–138907, 2026

Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models.Advances in Neural Information Processing Systems, 38:138867–138907, 2026

2026
[19]

Unified multimodal understanding and generation models: Advances, challenges, and opportunities.arXiv preprint arXiv:2505.02567, 2025

Shanshan Zhao, Xinjie Zhang, Jintao Guo, Jiakui Hu, Lunhao Duan, Minghao Fu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, et al. Unified multimodal understanding and generation models: Advances, challenges, and opportunities.arXiv preprint arXiv:2505.02567, 2025. 9 Event-VLA: Action-Conditioned Event Fusion for Robust Vision-Language-Action Model

work page arXiv 2025
[20]

Gemini Robotics: Bringing AI into the Physical World

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Are- nas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025
[22]

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. Libero-plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[24]

A 128×128 120 dB 15µs latency asynchronous temporal contrast vision sensor.IEEE journal of solid-state circuits, 43(2):566–576, 2008

Patrick Lichtsteiner, Christoph Posch, and Tobi Delbruck. A 128×128 120 dB 15µs latency asynchronous temporal contrast vision sensor.IEEE journal of solid-state circuits, 43(2):566–576, 2008

2008
[25]

A 240×180 130 db 3µs latency global shutter spatiotemporal vision sensor.IEEE Journal of Solid-State Circuits, 49(10):2333–2341, 2014

Christian Brandli, Raphael Berner, Minhao Yang, Shih-Chii Liu, and Tobi Delbruck. A 240×180 130 db 3µs latency global shutter spatiotemporal vision sensor.IEEE Journal of Solid-State Circuits, 49(10):2333–2341, 2014

2014
[26]

Event-based vision: A survey.IEEE transactions on pattern analysis and machine intelligence, 44(1):154–180, 2020

Guillermo Gallego, Tobi Delbr ¨uck, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew J Davison, J ¨org Conradt, Kostas Daniilidis, et al. Event-based vision: A survey.IEEE transactions on pattern analysis and machine intelligence, 44(1):154–180, 2020

2020
[27]

Recent event camera innovations: A survey

Bharatesh Chakravarthi, Aayush Atul Verma, Kostas Daniilidis, Cornelia Fermuller, and Yezhou Yang. Recent event camera innovations: A survey. InEuropean conference on computer vision, pages 342–376. Springer, 2024

2024
[28]

Deep learning for event-based vision: A comprehensive survey and benchmarks.arXiv preprint arXiv:2302.08890, 2023

Xu Zheng, Yexin Liu, Yunfan Lu, Tongyan Hua, Tianbo Pan, Weiming Zhang, Dacheng Tao, and Lin Wang. Deep learning for event-based vision: A comprehensive survey and benchmarks.arXiv preprint arXiv:2302.08890, 2023

work page arXiv 2023
[29]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[30]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InCon- ference on Robot Learning, pages 2165–2183. PMLR, 2023

2023
[31]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022

2022
[33]

Octo: An Open-Source Generalist Robot Policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[34]

Vla-touch: Enhancing vision- language-action model with dual-level tactile feedback.IEEE Robotics and Automation Letters, 2026

Jianxin Bi, Kevin Yuchen Ma, Ce Hao, Mike Shou Zheng, and Harold Soh. Vla-touch: Enhancing vision- language-action model with dual-level tactile feedback.IEEE Robotics and Automation Letters, 2026

2026
[35]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

StereoVLA: Enhancing Vision-Language-Action Models with Stereo Vision

Shengliang Deng, Mi Yan, Yixin Zheng, Jiayi Su, Wenhao Zhang, Xiaoguang Zhao, Heming Cui, Zhizheng Zhang, and He Wang. Stereovla: Enhancing vision-language-action models with stereo vision.arXiv preprint arXiv:2512.21970, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Tactile-vla: unlocking vision-language- action model’s physical knowledge for tactile generalization.arXivpreprintarXiv:2507.09160, 2025

Jialei Huang, Shuo Wang, Fanqi Lin, Yihang Hu, Chuan Wen, and Yang Gao. Tactile-vla: unlocking vision- language-action model’s physical knowledge for tactile generalization.arXiv preprint arXiv:2507.09160, 2025

work page arXiv 2025
[38]

Omnivtla: Vision-tactile-language-action model with semantic-aligned tactile sensing.arXivpreprintarXiv:2508.08706, 2025

Zhengxue Cheng, Yiqian Zhang, Wenkang Zhang, Haoyu Li, Keyu Wang, Li Song, and Hengdi Zhang. Omnivtla: Vision-tactile-language-action model with semantic-aligned tactile sensing.arXiv preprint arXiv:2508.08706, 2025. 10 Event-VLA: Action-Conditioned Event Fusion for Robust Vision-Language-Action Model

work page arXiv 2025
[39]

Hots: a hierar- chy of event-based time-surfaces for pattern recognition.IEEE transactions on pattern analysis and machine intelligence, 39(7):1346–1359, 2016

Xavier Lagorce, Garrick Orchard, Francesco Galluppi, Bertram E Shi, and Ryad B Benosman. Hots: a hierar- chy of event-based time-surfaces for pattern recognition.IEEE transactions on pattern analysis and machine intelligence, 39(7):1346–1359, 2016

2016
[40]

EV-FlowNet: Self-Supervised Optical Flow Estimation for Event-based Cameras

Alex Zihao Zhu, Liangzhe Yuan, Kenneth Chaney, and Kostas Daniilidis. Ev-flownet: Self-supervised optical flow estimation for event-based cameras.arXiv preprint arXiv:1802.06898, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[41]

Learning monocular dense depth from events

Javier Hidalgo-Carri ´o, Daniel Gehrig, and Davide Scaramuzza. Learning monocular dense depth from events. In2020 International Conference on 3D Vision (3DV), pages 534–542. IEEE, 2020

2020
[42]

High speed and high dynamic range video with an event camera.IEEE transactions on pattern analysis and machine intelligence, 43(6):1964–1980, 2019

Henri Rebecq, Ren ´e Ranftl, Vladlen Koltun, and Davide Scaramuzza. High speed and high dynamic range video with an event camera.IEEE transactions on pattern analysis and machine intelligence, 43(6):1964–1980, 2019

1964
[43]

Events-to-video: Bringing modern com- puter vision to event cameras

Henri Rebecq, Ren ´e Ranftl, Vladlen Koltun, and Davide Scaramuzza. Events-to-video: Bringing modern com- puter vision to event cameras. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3857–3866, 2019

2019
[44]

Hats: Histograms of averaged time surfaces for robust event-based object classification

Amos Sironi, Manuele Brambilla, Nicolas Bourdis, Xavier Lagorce, and Ryad Benosman. Hats: Histograms of averaged time surfaces for robust event-based object classification. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1731–1740, 2018

2018
[45]

Learning dense and continuous optical flow from an event camera

Zhexiong Wan, Yuchao Dai, and Yuxin Mao. Learning dense and continuous optical flow from an event camera. IEEE Transactions on Image Processing, 31:7237–7251, 2022

2022
[46]

Sebvs: Synthetic event-based visual servoing for robot navigation and manipulation.arXiv preprint arXiv:2508.17643, 2025

Krishna Vinod, Prithvi Jai Ramesh, Bharatesh Chakravarthi, et al. Sebvs: Synthetic event-based visual servoing for robot navigation and manipulation.arXiv preprint arXiv:2508.17643, 2025

work page arXiv 2025
[47]

Efficient event-based robotic grasping perception using hyperdimensional computing.Internet of Things, 26: 101207, 2024

Eman Hassan, Zhuowen Zou, Hanning Chen, Mohsen Imani, Yahya Zweiri, Hani Saleh, and Baker Mohammad. Efficient event-based robotic grasping perception using hyperdimensional computing.Internet of Things, 26: 101207, 2024

2024
[48]

Event-based robotic grasping detection with neuromorphic vision sensor and event-grasping dataset.Frontiers in neurorobotics, 14:51, 2020

Bin Li, Hu Cao, Zhongnan Qu, Yingbai Hu, Zhenke Wang, and Zichen Liang. Event-based robotic grasping detection with neuromorphic vision sensor and event-grasping dataset.Frontiers in neurorobotics, 14:51, 2020

2020
[49]

Neuromorphic eye-in-hand visual servoing.IEEE Access, 9:55853–55870, 2021

Rajkumar Muthusamy, Abdulla Ayyad, Mohamad Halwani, Dewald Swart, Dongming Gan, Lakmal Seneviratne, and Yahya Zweiri. Neuromorphic eye-in-hand visual servoing.IEEE Access, 9:55853–55870, 2021

2021
[50]

Force-evt: A closer look at robotic gripper force measurement with event-based vision transformer

Qianyu Guo, Ziqing Yu, Jiaming Fu, Yawen Lu, Yahya Zweiri, and Dongming Gan. Force-evt: A closer look at robotic gripper force measurement with event-based vision transformer. In2024 6th International Conference on Reconfigurable Mechanisms and Robots (ReMAR), pages 608–613. IEEE, 2024

2024
[51]

Event-based fusion for motion deblurring with cross-modal attention

Lei Sun, Christos Sakaridis, Jingyun Liang, Qi Jiang, Kailun Yang, Peng Sun, Yaozu Ye, Kaiwei Wang, and Luc Van Gool. Event-based fusion for motion deblurring with cross-modal attention. InEuropean conference on computer vision, pages 412–428. Springer, 2022

2022
[52]

Gs-evt: Cross-modal event camera tracking based on gaussian splatting

Tao Liu, Runze Yuan, Yi’ang Ju, Xun Xu, Jiaqi Yang, Xiangting Meng, Xavier Lagorce, and Laurent Kneip. Gs-evt: Cross-modal event camera tracking based on gaussian splatting. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 4587–4593. IEEE, 2025

2025
[53]

Eventgpt: Event stream understanding with multimodal large language models

Shaoyu Liu, Jianing Li, Guanghui Zhao, Yunjian Zhang, Xin Meng, Fei Richard Yu, Xiangyang Ji, and Ming Li. Eventgpt: Event stream understanding with multimodal large language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29139–29149, 2025

2025
[54]

Eventvggt: Exploring cross-modal distillation for consistent event-based depth estimation

Yinrui Ren, Jinjing Zhu, Kanghao Chen, Zhuoxiao Li, Jing Ou, Zidong Cao, Tongyan Hua, Peilun Shi, Yingchun Fu, Wufan Zhao, et al. Eventvggt: Exploring cross-modal distillation for consistent event-based depth estimation. arXiv preprint arXiv:2603.09385, 2026

work page arXiv 2026
[55]

Ev-segnet: Semantic segmentation for event-based cameras

Inigo Alonso and Ana C Murillo. Ev-segnet: Semantic segmentation for event-based cameras. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019

2019
[56]

N-imagenet: Towards robust, fine-grained object recognition with event cameras

Junho Kim, Jaehyeok Bae, Gangin Park, Dongsu Zhang, and Young Min Kim. N-imagenet: Towards robust, fine-grained object recognition with event cameras. InProceedings of the IEEE/CVF international conference on computer vision, pages 2146–2156, 2021

2021
[57]

Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776– 44791, 2023

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776– 44791, 2023

2023
[58]

v2e: From video frames to realistic dvs events

Yuhuang Hu, Shih-Chii Liu, and Tobi Delbruck. v2e: From video frames to realistic dvs events. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1312–1321, 2021

2021
[59]

Dsec: A stereo event camera dataset for driving scenarios.IEEE Robotics and Automation Letters, 6(3):4947–4954, 2021

Mathias Gehrig, Willem Aarents, Daniel Gehrig, and Davide Scaramuzza. Dsec: A stereo event camera dataset for driving scenarios.IEEE Robotics and Automation Letters, 6(3):4947–4954, 2021. 11 Event-VLA: Action-Conditioned Event Fusion for Robust Vision-Language-Action Model

2021
[60]

Esim: an open event camera simulator

Henri Rebecq, Daniel Gehrig, and Davide Scaramuzza. Esim: an open event camera simulator. InConference on robot learning, pages 969–982. PMLR, 2018

2018
[61]

Blinkvision: A benchmark for optical flow, scene flow and point tracking estimation using rgb frames and events

Yijin Li, Yichen Shen, Zhaoyang Huang, Shuo Chen, Weikang Bian, Xiaoyu Shi, Fu-Yun Wang, Keqiang Sun, Hujun Bao, Zhaopeng Cui, et al. Blinkvision: A benchmark for optical flow, scene flow and point tracking estimation using rgb frames and events. InEuropean conference on computer vision, pages 19–36. Springer, 2024

2024
[62]

V2ce: Video to continuous events simulator

Zhongyang Zhang, Shuyang Cui, Kaidong Chai, Haowen Yu, Subhasis Dasgupta, Upal Mahbub, and Tauhidur Rahman. V2ce: Video to continuous events simulator. In2024 IEEE international conference on robotics and automation (ICRA), pages 12455–12461. IEEE, 2024

2024
[63]

Pris- matic vlms: Investigating the design space of visually-conditioned language models

Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Pris- matic vlms: Investigating the design space of visually-conditioned language models. InForty-first International Conference on Machine Learning, 2024

2024
[64]

Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024. ...

2024
[65]

None” variant removes event prediction regularization and trains only with the action prediction loss. The “w/o mask

Common queries provide task- level context, action queries condition action-oriented routing, and event queries support event-oriented routing and the auxiliary future-PREI prediction objective. The input to the pretrained VLA backbone is Xt = [Z v t ;z s t ;Z ℓ;Z a 0 ;Q c 0;Q a 0;Q e 0],(32) whereZ v t denotes RGB visual tokens,z s t denotes the proprioc...

[1] [1]

Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge.Advances in Neural Information Processing Systems, 38:24195–24228, 2026

Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, et al. Dreamvla: a vision-language-action model dreamed with comprehensive world knowledge.Advances in Neural Information Processing Systems, 38:24195–24228, 2026

2026

[2] [2]

Rdt-1b: a diffusion foundation model for bimanual manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation. InInternational Conference on Learning Representations, volume 2025, pages 29982–30009, 2025

2025

[3] [3]

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control.arXiv preprint arXiv:2502.05855, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[4] [4]

Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025

2025

[5] [5]

Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies

Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daum ´e III, Andrey Kolobov, Furong Huang, and Jianwei Yang. Tracevla: Visual trace prompting enhances spatial-temporal awareness for generalist robotic policies. InInternational Conference on Learning Representations, volume 2025, pages 54277–54296, 2025

2025

[6] [6]

Predictive inverse dynamics models are scalable learners for robotic manipulation

Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, and Jiangmiao Pang. Predictive inverse dynamics models are scalable learners for robotic manipulation. InInternational Conference on Learning Rep- resentations, volume 2025, pages 92033–92052, 2025

2025

[7] [7]

Up-vla: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025

Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, and Jianyu Chen. Up-vla: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025

work page arXiv 2025

[8] [8]

HiF-VLA: Hindsight, Insight and Foresight through Motion Representation for Vision-Language-Action Models

Minghui Lin, Pengxiang Ding, Shu Wang, Zifeng Zhuang, Yang Liu, Xinyang Tong, Wenxuan Song, Shangke Lyu, Siteng Huang, and Donglin Wang. Hif-vla: Hindsight, insight and foresight through motion representation for vision-language-action models.arXiv preprint arXiv:2512.09928, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[9] [9]

A survey on vision–language–action models for embodied ai.IEEE Transactions on Neural Networks and Learning Systems, 2026

Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A survey on vision–language–action models for embodied ai.IEEE Transactions on Neural Networks and Learning Systems, 2026

2026

[10] [10]

Unihm: Unified dexterous hand manipulation with vision language model.arXiv preprint arXiv:2603.00732, 2026

Zhenhao Zhang, Jiaxin Liu, Ye Shi, and Jingya Wang. Unihm: Unified dexterous hand manipulation with vision language model.arXiv preprint arXiv:2603.00732, 2026

work page arXiv 2026

[11] [11]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.\π 0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Es- mail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al.π 0.5: A vision-language-action model with open-world generalization.arXiv preprint arXiv:2504.16054, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

Physical Intelligence, Ali Amin, Raichelle Aniceto, Ashwin Balakrishna, Kevin Black, Ken Conley, Grace Con- nors, James Darpinian, Karan Dhabalia, Jared DiCarlo, et al.π ∗ 0.6: A VLA that learns from experience.arXiv preprint arXiv:2511.14759, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[14] [14]

UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. Univla: Learning to act anywhere with task-centric latent actions.arXiv preprint arXiv:2505.06111, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[15] [15]

WorldVLA: Towards Autoregressive Action World Model

Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, et al. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:2506.21539, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[16] [16]

Mm-act: Learn from multimodal parallel generation to act.arXiv preprint arXiv:2512.00975, 2025

Haotian Liang, Xinyi Chen, Bin Wang, Mingkang Chen, Yitian Liu, Yuhao Zhang, Zanxin Chen, Tianshuo Yang, Yilun Chen, Jiangmiao Pang, et al. Mm-act: Learn from multimodal parallel generation to act.arXiv preprint arXiv:2512.00975, 2025

work page arXiv 2025

[17] [17]

Emerging Properties in Unified Multimodal Pretraining

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[18] [18]

Mmada: Multimodal large diffusion language models.Advances in Neural Information Processing Systems, 38:138867–138907, 2026

Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models.Advances in Neural Information Processing Systems, 38:138867–138907, 2026

2026

[19] [19]

Unified multimodal understanding and generation models: Advances, challenges, and opportunities.arXiv preprint arXiv:2505.02567, 2025

Shanshan Zhao, Xinjie Zhang, Jintao Guo, Jiakui Hu, Lunhao Duan, Minghao Fu, Yong Xien Chng, Guo-Hua Wang, Qing-Guo Chen, Zhao Xu, et al. Unified multimodal understanding and generation models: Advances, challenges, and opportunities.arXiv preprint arXiv:2505.02567, 2025. 9 Event-VLA: Action-Conditioned Event Fusion for Robust Vision-Language-Action Model

work page arXiv 2025

[20] [20]

Gemini Robotics: Bringing AI into the Physical World

Gemini Robotics Team, Saminda Abeyruwan, Joshua Ainslie, Jean-Baptiste Alayrac, Montserrat Gonzalez Are- nas, Travis Armstrong, Ashwin Balakrishna, Robert Baruch, Maria Bauza, Michiel Blokzijl, et al. Gemini robotics: Bringing ai into the physical world.arXiv preprint arXiv:2503.20020, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025

2025

[22] [22]

SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics

Mustafa Shukor, Dana Aubakirova, Francesco Capuano, Pepijn Kooijmans, Steven Palma, Adil Zouitine, Michel Aractingi, Caroline Pascal, Martino Russi, Andres Marafioti, et al. Smolvla: A vision-language-action model for affordable and efficient robotics.arXiv preprint arXiv:2506.01844, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[23] [23]

LIBERO-Plus: In-depth Robustness Analysis of Vision-Language-Action Models

Senyu Fei, Siyin Wang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, Li Ji, Xinzhe He, Shiduo Zhang, Zhaoye Fei, et al. Libero-plus: In-depth robustness analysis of vision-language-action models.arXiv preprint arXiv:2510.13626, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[24] [24]

A 128×128 120 dB 15µs latency asynchronous temporal contrast vision sensor.IEEE journal of solid-state circuits, 43(2):566–576, 2008

Patrick Lichtsteiner, Christoph Posch, and Tobi Delbruck. A 128×128 120 dB 15µs latency asynchronous temporal contrast vision sensor.IEEE journal of solid-state circuits, 43(2):566–576, 2008

2008

[25] [25]

A 240×180 130 db 3µs latency global shutter spatiotemporal vision sensor.IEEE Journal of Solid-State Circuits, 49(10):2333–2341, 2014

Christian Brandli, Raphael Berner, Minhao Yang, Shih-Chii Liu, and Tobi Delbruck. A 240×180 130 db 3µs latency global shutter spatiotemporal vision sensor.IEEE Journal of Solid-State Circuits, 49(10):2333–2341, 2014

2014

[26] [26]

Event-based vision: A survey.IEEE transactions on pattern analysis and machine intelligence, 44(1):154–180, 2020

Guillermo Gallego, Tobi Delbr ¨uck, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew J Davison, J ¨org Conradt, Kostas Daniilidis, et al. Event-based vision: A survey.IEEE transactions on pattern analysis and machine intelligence, 44(1):154–180, 2020

2020

[27] [27]

Recent event camera innovations: A survey

Bharatesh Chakravarthi, Aayush Atul Verma, Kostas Daniilidis, Cornelia Fermuller, and Yezhou Yang. Recent event camera innovations: A survey. InEuropean conference on computer vision, pages 342–376. Springer, 2024

2024

[28] [28]

Deep learning for event-based vision: A comprehensive survey and benchmarks.arXiv preprint arXiv:2302.08890, 2023

Xu Zheng, Yexin Liu, Yunfan Lu, Tongyan Hua, Tianbo Pan, Weiming Zhang, Dacheng Tao, and Lin Wang. Deep learning for event-based vision: A comprehensive survey and benchmarks.arXiv preprint arXiv:2302.08890, 2023

work page arXiv 2023

[29] [29]

RT-1: Robotics Transformer for Real-World Control at Scale

Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale.arXiv preprint arXiv:2212.06817, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[30] [30]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. InCon- ference on Robot Learning, pages 2165–2183. PMLR, 2023

2023

[31] [31]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[32] [32]

Flamingo: a visual language model for few-shot learning

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022

2022

[33] [33]

Octo: An Open-Source Generalist Robot Policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Tobias Kreiman, Charles Xu, et al. Octo: An open-source generalist robot policy.arXiv preprint arXiv:2405.12213, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[34] [34]

Vla-touch: Enhancing vision- language-action model with dual-level tactile feedback.IEEE Robotics and Automation Letters, 2026

Jianxin Bi, Kevin Yuchen Ma, Ce Hao, Mike Shou Zheng, and Harold Soh. Vla-touch: Enhancing vision- language-action model with dual-level tactile feedback.IEEE Robotics and Automation Letters, 2026

2026

[35] [35]

PaLM-E: An Embodied Multimodal Language Model

Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. Palm-e: An embodied multimodal language model.arXiv preprint arXiv:2303.03378, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[36] [36]

StereoVLA: Enhancing Vision-Language-Action Models with Stereo Vision

Shengliang Deng, Mi Yan, Yixin Zheng, Jiayi Su, Wenhao Zhang, Xiaoguang Zhao, Heming Cui, Zhizheng Zhang, and He Wang. Stereovla: Enhancing vision-language-action models with stereo vision.arXiv preprint arXiv:2512.21970, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[37] [37]

Tactile-vla: unlocking vision-language- action model’s physical knowledge for tactile generalization.arXivpreprintarXiv:2507.09160, 2025

Jialei Huang, Shuo Wang, Fanqi Lin, Yihang Hu, Chuan Wen, and Yang Gao. Tactile-vla: unlocking vision- language-action model’s physical knowledge for tactile generalization.arXiv preprint arXiv:2507.09160, 2025

work page arXiv 2025

[38] [38]

Omnivtla: Vision-tactile-language-action model with semantic-aligned tactile sensing.arXivpreprintarXiv:2508.08706, 2025

Zhengxue Cheng, Yiqian Zhang, Wenkang Zhang, Haoyu Li, Keyu Wang, Li Song, and Hengdi Zhang. Omnivtla: Vision-tactile-language-action model with semantic-aligned tactile sensing.arXiv preprint arXiv:2508.08706, 2025. 10 Event-VLA: Action-Conditioned Event Fusion for Robust Vision-Language-Action Model

work page arXiv 2025

[39] [39]

Hots: a hierar- chy of event-based time-surfaces for pattern recognition.IEEE transactions on pattern analysis and machine intelligence, 39(7):1346–1359, 2016

Xavier Lagorce, Garrick Orchard, Francesco Galluppi, Bertram E Shi, and Ryad B Benosman. Hots: a hierar- chy of event-based time-surfaces for pattern recognition.IEEE transactions on pattern analysis and machine intelligence, 39(7):1346–1359, 2016

2016

[40] [40]

EV-FlowNet: Self-Supervised Optical Flow Estimation for Event-based Cameras

Alex Zihao Zhu, Liangzhe Yuan, Kenneth Chaney, and Kostas Daniilidis. Ev-flownet: Self-supervised optical flow estimation for event-based cameras.arXiv preprint arXiv:1802.06898, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[41] [41]

Learning monocular dense depth from events

Javier Hidalgo-Carri ´o, Daniel Gehrig, and Davide Scaramuzza. Learning monocular dense depth from events. In2020 International Conference on 3D Vision (3DV), pages 534–542. IEEE, 2020

2020

[42] [42]

High speed and high dynamic range video with an event camera.IEEE transactions on pattern analysis and machine intelligence, 43(6):1964–1980, 2019

Henri Rebecq, Ren ´e Ranftl, Vladlen Koltun, and Davide Scaramuzza. High speed and high dynamic range video with an event camera.IEEE transactions on pattern analysis and machine intelligence, 43(6):1964–1980, 2019

1964

[43] [43]

Events-to-video: Bringing modern com- puter vision to event cameras

Henri Rebecq, Ren ´e Ranftl, Vladlen Koltun, and Davide Scaramuzza. Events-to-video: Bringing modern com- puter vision to event cameras. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3857–3866, 2019

2019

[44] [44]

Hats: Histograms of averaged time surfaces for robust event-based object classification

Amos Sironi, Manuele Brambilla, Nicolas Bourdis, Xavier Lagorce, and Ryad Benosman. Hats: Histograms of averaged time surfaces for robust event-based object classification. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1731–1740, 2018

2018

[45] [45]

Learning dense and continuous optical flow from an event camera

Zhexiong Wan, Yuchao Dai, and Yuxin Mao. Learning dense and continuous optical flow from an event camera. IEEE Transactions on Image Processing, 31:7237–7251, 2022

2022

[46] [46]

Sebvs: Synthetic event-based visual servoing for robot navigation and manipulation.arXiv preprint arXiv:2508.17643, 2025

Krishna Vinod, Prithvi Jai Ramesh, Bharatesh Chakravarthi, et al. Sebvs: Synthetic event-based visual servoing for robot navigation and manipulation.arXiv preprint arXiv:2508.17643, 2025

work page arXiv 2025

[47] [47]

Efficient event-based robotic grasping perception using hyperdimensional computing.Internet of Things, 26: 101207, 2024

Eman Hassan, Zhuowen Zou, Hanning Chen, Mohsen Imani, Yahya Zweiri, Hani Saleh, and Baker Mohammad. Efficient event-based robotic grasping perception using hyperdimensional computing.Internet of Things, 26: 101207, 2024

2024

[48] [48]

Event-based robotic grasping detection with neuromorphic vision sensor and event-grasping dataset.Frontiers in neurorobotics, 14:51, 2020

Bin Li, Hu Cao, Zhongnan Qu, Yingbai Hu, Zhenke Wang, and Zichen Liang. Event-based robotic grasping detection with neuromorphic vision sensor and event-grasping dataset.Frontiers in neurorobotics, 14:51, 2020

2020

[49] [49]

Neuromorphic eye-in-hand visual servoing.IEEE Access, 9:55853–55870, 2021

Rajkumar Muthusamy, Abdulla Ayyad, Mohamad Halwani, Dewald Swart, Dongming Gan, Lakmal Seneviratne, and Yahya Zweiri. Neuromorphic eye-in-hand visual servoing.IEEE Access, 9:55853–55870, 2021

2021

[50] [50]

Force-evt: A closer look at robotic gripper force measurement with event-based vision transformer

Qianyu Guo, Ziqing Yu, Jiaming Fu, Yawen Lu, Yahya Zweiri, and Dongming Gan. Force-evt: A closer look at robotic gripper force measurement with event-based vision transformer. In2024 6th International Conference on Reconfigurable Mechanisms and Robots (ReMAR), pages 608–613. IEEE, 2024

2024

[51] [51]

Event-based fusion for motion deblurring with cross-modal attention

Lei Sun, Christos Sakaridis, Jingyun Liang, Qi Jiang, Kailun Yang, Peng Sun, Yaozu Ye, Kaiwei Wang, and Luc Van Gool. Event-based fusion for motion deblurring with cross-modal attention. InEuropean conference on computer vision, pages 412–428. Springer, 2022

2022

[52] [52]

Gs-evt: Cross-modal event camera tracking based on gaussian splatting

Tao Liu, Runze Yuan, Yi’ang Ju, Xun Xu, Jiaqi Yang, Xiangting Meng, Xavier Lagorce, and Laurent Kneip. Gs-evt: Cross-modal event camera tracking based on gaussian splatting. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 4587–4593. IEEE, 2025

2025

[53] [53]

Eventgpt: Event stream understanding with multimodal large language models

Shaoyu Liu, Jianing Li, Guanghui Zhao, Yunjian Zhang, Xin Meng, Fei Richard Yu, Xiangyang Ji, and Ming Li. Eventgpt: Event stream understanding with multimodal large language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29139–29149, 2025

2025

[54] [54]

Eventvggt: Exploring cross-modal distillation for consistent event-based depth estimation

Yinrui Ren, Jinjing Zhu, Kanghao Chen, Zhuoxiao Li, Jing Ou, Zidong Cao, Tongyan Hua, Peilun Shi, Yingchun Fu, Wufan Zhao, et al. Eventvggt: Exploring cross-modal distillation for consistent event-based depth estimation. arXiv preprint arXiv:2603.09385, 2026

work page arXiv 2026

[55] [55]

Ev-segnet: Semantic segmentation for event-based cameras

Inigo Alonso and Ana C Murillo. Ev-segnet: Semantic segmentation for event-based cameras. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 0–0, 2019

2019

[56] [56]

N-imagenet: Towards robust, fine-grained object recognition with event cameras

Junho Kim, Jaehyeok Bae, Gangin Park, Dongsu Zhang, and Young Min Kim. N-imagenet: Towards robust, fine-grained object recognition with event cameras. InProceedings of the IEEE/CVF international conference on computer vision, pages 2146–2156, 2021

2021

[57] [57]

Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776– 44791, 2023

Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776– 44791, 2023

2023

[58] [58]

v2e: From video frames to realistic dvs events

Yuhuang Hu, Shih-Chii Liu, and Tobi Delbruck. v2e: From video frames to realistic dvs events. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1312–1321, 2021

2021

[59] [59]

Dsec: A stereo event camera dataset for driving scenarios.IEEE Robotics and Automation Letters, 6(3):4947–4954, 2021

Mathias Gehrig, Willem Aarents, Daniel Gehrig, and Davide Scaramuzza. Dsec: A stereo event camera dataset for driving scenarios.IEEE Robotics and Automation Letters, 6(3):4947–4954, 2021. 11 Event-VLA: Action-Conditioned Event Fusion for Robust Vision-Language-Action Model

2021

[60] [60]

Esim: an open event camera simulator

Henri Rebecq, Daniel Gehrig, and Davide Scaramuzza. Esim: an open event camera simulator. InConference on robot learning, pages 969–982. PMLR, 2018

2018

[61] [61]

Blinkvision: A benchmark for optical flow, scene flow and point tracking estimation using rgb frames and events

Yijin Li, Yichen Shen, Zhaoyang Huang, Shuo Chen, Weikang Bian, Xiaoyu Shi, Fu-Yun Wang, Keqiang Sun, Hujun Bao, Zhaopeng Cui, et al. Blinkvision: A benchmark for optical flow, scene flow and point tracking estimation using rgb frames and events. InEuropean conference on computer vision, pages 19–36. Springer, 2024

2024

[62] [62]

V2ce: Video to continuous events simulator

Zhongyang Zhang, Shuyang Cui, Kaidong Chai, Haowen Yu, Subhasis Dasgupta, Upal Mahbub, and Tauhidur Rahman. V2ce: Video to continuous events simulator. In2024 IEEE international conference on robotics and automation (ICRA), pages 12455–12461. IEEE, 2024

2024

[63] [63]

Pris- matic vlms: Investigating the design space of visually-conditioned language models

Siddharth Karamcheti, Suraj Nair, Ashwin Balakrishna, Percy Liang, Thomas Kollar, and Dorsa Sadigh. Pris- matic vlms: Investigating the design space of visually-conditioned language models. InForty-first International Conference on Machine Learning, 2024

2024

[64] [64]

Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024. ...

2024

[65] [65]

None” variant removes event prediction regularization and trains only with the action prediction loss. The “w/o mask

Common queries provide task- level context, action queries condition action-oriented routing, and event queries support event-oriented routing and the auxiliary future-PREI prediction objective. The input to the pretrained VLA backbone is Xt = [Z v t ;z s t ;Z ℓ;Z a 0 ;Q c 0;Q a 0;Q e 0],(32) whereZ v t denotes RGB visual tokens,z s t denotes the proprioc...