RE-VLM: Event-Augmented Vision-Language Model for Scene Understanding

Chuang Zhu; Donghong Jiang; Endian Lin; Hanqing Liu; Luoping Cui; Mingjie Liu

arxiv: 2605.19329 · v1 · pith:TBWQ7MEGnew · submitted 2026-05-19 · 💻 cs.CV · cs.AI

Hanqing Liu, Mingjie Liu, Luoping Cui, Endian Lin, Donghong Jiang, Chuang Zhu

Pith reviewed 2026-05-20 06:46 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords event cameras · vision-language models · RGB-event fusion · scene understanding · synthetic data generation · multimodal alignment · challenging environments

The pith

RE-VLM fuses RGB images with event camera streams to improve vision-language performance under poor lighting and fast motion.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard vision-language models rely on RGB images that lose detail in low light, high contrast, or rapid movement. Event cameras record brightness changes asynchronously and retain motion information where frames fail. RE-VLM runs parallel RGB and event encoders, aligns their features to language through staged training, and generates its own training captions and QA pairs by first building scene graphs from paired streams. Two new datasets support evaluation on illumination-challenged and general scenes. The resulting model exceeds prior RGB-only and event-only baselines on captioning and visual question answering, with the largest margins appearing precisely when conventional images degrade.
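To make the event modality concrete, the sketch below shows one common way to turn an asynchronous event stream into a frame an image-style encoder can consume: summing polarities per pixel over a time window. The `(x, y, timestamp, polarity)` tuple layout and the `events_to_frame` helper are assumptions for illustration. The paper's own event-frame reconstruction is not described in this summary and may differ.

```python
from typing import Iterable, List, Tuple

# Assumed event layout (not from the paper): (x, y, timestamp_us, polarity),
# where polarity is +1 for a brightness increase and -1 for a decrease.
Event = Tuple[int, int, int, int]


def events_to_frame(events: Iterable[Event], width: int, height: int,
                    t_start: int, t_end: int) -> List[List[int]]:
    """Sum event polarities per pixel inside the window [t_start, t_end)."""
    frame = [[0] * width for _ in range(height)]
    for x, y, t, polarity in events:
        # Skip events outside the time window or outside the sensor area.
        if t_start <= t < t_end and 0 <= x < width and 0 <= y < height:
            frame[y][x] += polarity
    return frame


events = [(0, 0, 10, 1), (0, 0, 20, 1), (2, 1, 30, -1), (1, 1, 999, 1)]
frame = events_to_frame(events, width=3, height=2, t_start=0, t_end=100)
print(frame)  # → [[2, 0, 0], [0, 0, -1]]
```

The last event falls outside the window and is dropped. Pixels with no brightness change stay at zero, which is why event frames remain informative in scenes where an RGB exposure saturates or blurs.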

Core claim

RE-VLM is the first dual-stream vision-language model that jointly processes synchronized RGB images and event streams through parallel encoders and progressive cross-modal alignment, while a graph-driven pipeline converts the paired visual input into scene graphs from which synthetic yet verifiable captions and QA pairs are generated, yielding consistent gains over RGB-only and event-only models on captioning and VQA benchmarks especially in challenging conditions.

What carries the argument

Parallel RGB and event encoders whose heterogeneous features are aligned to language via progressive training, plus a graph-driven pipeline that extracts scene graphs from RGB-Event streams to synthesize captions and QA pairs.
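As a rough illustration of the dual-stream idea, the sketch below projects tokens from two encoders to a shared embedding width and concatenates them into one visual prefix for a language model. This is a simplification under stated assumptions: the function names, toy weights, and plain concatenation are hypothetical, and the paper's actual alignment module (STAM) is not reproduced here.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]  # row-major: one row per output dimension


def project(tokens: List[Vector], weight: Matrix) -> List[Vector]:
    """Apply a linear map (no bias) to every token."""
    return [[sum(w * v for w, v in zip(row, tok)) for row in weight]
            for tok in tokens]


def fuse_streams(rgb_tokens: List[Vector], event_tokens: List[Vector],
                 w_rgb: Matrix, w_event: Matrix) -> List[Vector]:
    """Project both streams to a shared width and concatenate the tokens."""
    return project(rgb_tokens, w_rgb) + project(event_tokens, w_event)


# Toy shapes (assumed): RGB tokens are 3-d, event tokens 2-d, shared width 2.
rgb_tokens = [[1.0, 0.0, 2.0]]
event_tokens = [[0.5, -1.0], [2.0, 2.0]]
w_rgb = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # picks components 0 and 2
w_event = [[2.0, 0.0], [0.0, 1.0]]          # doubles component 0

visual_prefix = fuse_streams(rgb_tokens, event_tokens, w_rgb, w_event)
print(visual_prefix)  # → [[1.0, 2.0], [1.0, -1.0], [4.0, 2.0]]
```

Keeping a separate projector per stream lets each modality be aligned to language on its own schedule, which is consistent with the progressive training the paper describes.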

If this is right

Scene understanding remains reliable when RGB frames suffer from low light, high dynamic range, or fast motion.
Synthetic yet verifiable supervision can substitute for scarce human-annotated RGB-Event-Text data.
Event streams supply complementary motion cues that standard VLMs currently lack.
The dual-stream design scales to additional challenging environments beyond the two new datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same graph-to-text synthesis method could be applied to other sparse modalities such as lidar or thermal data to bootstrap multimodal models.
Performance gains in adverse conditions suggest the approach may reduce the need for specialized hardware in outdoor robotics or surveillance.
If the event branch can be optionally disabled at inference time, the model could serve as a drop-in upgrade for existing RGB-only VLMs.

Load-bearing premise

The graph-driven pipeline produces accurate scene graphs and high-quality synthetic captions or QA pairs that faithfully represent real scene content without introducing artifacts that would inflate measured performance.
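The premise is easier to examine with a toy version of the pipeline. The sketch below fuses two modality-specific sets of scene-graph triples using an assumed per-frame RGB quality score, then fills fixed templates to produce a caption and QA pairs. The scoring rule, threshold, and templates are all hypothetical stand-ins. The paper's degradation-aware fusion and text synthesis are not specified in this summary.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def fuse_graphs(rgb: Dict[Triple, float], event: Dict[Triple, float],
                rgb_quality: float, threshold: float = 0.5) -> List[Triple]:
    """Keep triples whose quality-weighted confidence reaches the threshold.

    rgb_quality in [0, 1] is an assumed degradation score: 1.0 means a clean
    RGB frame, 0.0 means the frame is unusable and events are trusted fully.
    """
    fused = []
    for triple in sorted(set(rgb) | set(event)):
        score = (rgb_quality * rgb.get(triple, 0.0)
                 + (1.0 - rgb_quality) * event.get(triple, 0.0))
        if score >= threshold:
            fused.append(triple)
    return fused


def synthesize(triples: List[Triple]):
    """Turn each triple into a caption sentence and one templated QA pair."""
    caption = " ".join(f"The {s} is {r} the {o}." for s, r, o in triples)
    qa = [(f"What is the {s} {r}?", f"The {o}.") for s, r, o in triples]
    return caption, qa


rgb_graph = {("car", "behind", "truck"): 0.9,
             ("pedestrian", "crossing", "road"): 0.2,
             ("sign", "above", "road"): 0.7}
event_graph = {("car", "behind", "truck"): 0.6,
               ("pedestrian", "crossing", "road"): 0.9}

fused = fuse_graphs(rgb_graph, event_graph, rgb_quality=0.2)  # dark frame
caption, qa = synthesize(fused)
print(caption)  # → The car is behind the truck. The pedestrian is crossing the road.
print(qa[0])    # → ('What is the car behind?', 'The truck.')
```

Every generated answer traces back to one graph triple, which is what makes the text checkable. It also means any error in the graph propagates directly into the supervision.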

What would settle it

Train an otherwise identical model on the same real paired RGB-Event streams but without the graph-synthesis step and compare its captioning and VQA scores directly against the full RE-VLM on the held-out portions of PEOD-Chat and RGBE-Chat.

Figures

Figures reproduced from arXiv: 2605.19329 by Chuang Zhu, Donghong Jiang, Endian Lin, Hanqing Liu, Luoping Cui, Mingjie Liu.

**Figure 2.** Construction of RE-VLM: data generation pipeline and model. Left: A graph-driven pipeline converts synchronized RGB frames and event streams into a graph, extracts verifiable scene facts, and synthesizes reliable caption and QA supervision. Center: Representative examples from the datasets yielded by the pipeline: PEOD-Chat (illumination-challenged scenes) and RGBE-Chat (general scenarios). Right: The RE-V…

**Figure 3.** Data generation pipeline overview. From reconstructed event frames and RGB images, two modality-specific graphs are constructed. A degradation-aware fusion then merges them into a single RGB-event graph (nodes: entities, edges: relations). Finally, captions and VQA items are synthesized from the fused graph. (S: subject, P: place, D: direction, T: target; H: hierarchical relation; A: attribute.) attributes…

**Figure 4.** RE-VLM model architecture. Synchronized RGB and event streams are encoded. During …

**Figure 5.** Training pipeline. Three compact stages: (1) initial event–language alignment, (2) alignment of the event and RGB modalities with STAM, (3) end-to-end instruction tuning. We adopt a concise three-stage curriculum that first aligns event representations with language, then aligns them with the RGB representation via STAM, and finally performs lightweight instruction tuning on the LLM. Stage 1: Event-Language alignm…

**Figure 6.** Qualitative VQA comparison in an overexposed traffic …
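The three-stage curriculum in the Figure 5 caption can be read as a schedule of which modules receive gradient updates at each stage. The sketch below encodes that reading as plain data. The module names and, in particular, the choice of trainable modules per stage are assumptions made for illustration. The caption only names the stages, not what is frozen.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

# Hypothetical module names; the summary does not list the real ones.
MODULES = ("rgb_encoder", "event_encoder", "event_projector", "stam", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: FrozenSet[str]  # assumed, not confirmed by the paper


CURRICULUM: Tuple[Stage, ...] = (
    Stage("event-language alignment", frozenset({"event_projector"})),
    Stage("event-RGB alignment with STAM", frozenset({"stam"})),
    Stage("instruction tuning", frozenset({"llm"})),
)


def trainable_flags(stage: Stage) -> Dict[str, bool]:
    """Map every module to whether its weights are updated in this stage."""
    unknown = stage.trainable - set(MODULES)
    if unknown:
        raise ValueError(f"unknown modules: {sorted(unknown)}")
    return {m: m in stage.trainable for m in MODULES}


for stage in CURRICULUM:
    print(stage.name, trainable_flags(stage))
```

In a real training loop these flags would typically drive each parameter's gradient setting before a stage starts. Expressing the schedule as data makes it easy to ablate a stage by removing one entry.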

Original abstract

Conventional vision-language models (VLMs) struggle to interpret scenes captured under adverse conditions (e.g., low light, high dynamic range, or fast motion) because standard RGB images degrade in such environments. Event cameras provide a complementary modality: they asynchronously record per-pixel brightness changes with high temporal resolution and wide dynamic range, preserving motion cues where frames fail. We propose RE-VLM, the first dual-stream vision-language model that jointly leverages RGB images and event streams for robust scene understanding across both normal and challenging conditions. RE-VLM employs parallel RGB and event encoders together with a progressive training strategy that aligns heterogeneous visual features with language. To address the scarcity of RGB-Event-Text supervision, we further propose a graph-driven pipeline that converts synchronized RGB-Event streams into verifiable scene graphs, from which we synthesize captions and question-answer (QA) pairs. To develop and evaluate RE-VLM, we construct two datasets: PEOD-Chat, targeting illumination-challenged scenes, and RGBE-Chat, covering diverse scenarios. On captioning and VQA benchmarks, RE-VLM consistently outperforms state-of-the-art RGB-only and event-only models with comparable parameter counts, with particularly large gains under challenging conditions. These results demonstrate the effectiveness of event-augmented VLMs in achieving robust vision-language understanding across a wide range of real-world environments. Code and datasets are available at https://github.com/bupt-ai-cz/RE-VLM.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper introduces a dual-stream RGB-event VLM with graph-synthesized training data and shows gains on new benchmarks, though the reliance on synthetic data calls for extra validation.

The key takeaway is that this paper presents the first dual-stream VLM for RGB and event data fusion, using a graph-based method to create training data from scene graphs, and reports better results on captioning and VQA, especially when conditions degrade conventional RGB frames.

What stands out as new is the specific architecture, with parallel encoders for the two visual streams and progressive alignment to text. The graph-driven pipeline, which turns synchronized RGB-event streams into scene graphs and then into captions and QA pairs, is a practical way around the lack of real paired supervision. The authors also release two datasets, PEOD-Chat for difficult lighting and RGBE-Chat for broader cases, along with code.

The work does well in framing a clear problem: standard VLMs fail in low light or high motion because RGB images lose information, while event cameras capture changes with high temporal resolution and dynamic range. The idea of augmenting with events makes sense for safety-critical uses like driving or robotics.

On the downside, the performance numbers come from benchmarks built with the same graph pipeline. The abstract does not include any checks on graph accuracy, human ratings of the synthetic text, or whether the test data was generated separately. This raises the possibility that some of the reported gains come from the model learning artifacts in the generated data rather than true multimodal understanding. It would help to see ablations that isolate the event stream's contribution, along with proper statistical reporting.

Readers working on robust perception for real-world environments or on event-based vision would get the most out of this. It is not a complete overhaul of VLMs but a targeted extension that could matter in specific domains. The paper shows clear thinking on the motivation and a workable pipeline, so it is worth sending to peer review for a closer look at the experiments and data construction.

Referee Report

1 major / 1 minor

Summary. The paper claims to introduce RE-VLM, a novel dual-stream vision-language model that integrates RGB and event data for improved scene understanding in normal and challenging conditions. It proposes a graph-driven pipeline to generate scene graphs from RGB-Event streams and synthesize captions and QA pairs to overcome data scarcity, resulting in two new datasets: PEOD-Chat for illumination-challenged scenes and RGBE-Chat for diverse scenarios. Experimental results indicate that RE-VLM outperforms state-of-the-art RGB-only and event-only models on captioning and VQA tasks, with larger gains in adverse conditions, supported by the release of code and datasets.

Significance. Should the empirical results prove robust upon validation of the data generation process, this work would be significant for advancing multimodal AI by incorporating high-temporal-resolution event sensing into VLMs. This could lead to more reliable vision-language systems for real-world applications involving motion, low light, or high dynamic range. The open-sourcing of code and datasets is a commendable strength that enhances the paper's impact and allows for independent verification.

major comments (1)

[Graph-driven pipeline and dataset construction] The description of the graph-driven pipeline for creating verifiable scene graphs and synthesizing captions/QA pairs does not include any quantitative evaluation of graph accuracy (such as node/edge precision or F1 scores) or human evaluation of the quality and faithfulness of the generated text. Since the performance gains are evaluated on PEOD-Chat and RGBE-Chat, which are derived from this pipeline, this omission is critical as it leaves open the possibility that reported improvements stem from artifacts or biases in the synthetic data rather than genuine multimodal advantages.
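The graph-accuracy metrics the referee asks for are simple to compute once a manually annotated subset exists. A minimal sketch, assuming scene-graph edges are stored as `(subject, relation, object)` triples and scored by exact match; the matching rule and the example triples are illustrative, not the paper's protocol:

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]


def precision_recall_f1(predicted: Iterable[Triple], gold: Iterable[Triple]):
    """Exact-match precision, recall, and F1 between two sets of triples."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


predicted = [("car", "behind", "truck"), ("sign", "above", "road")]
gold = [("car", "behind", "truck"), ("pedestrian", "crossing", "road")]

p, r, f1 = precision_recall_f1(predicted, gold)
print(p, r, f1)  # → 0.5 0.5 0.5
```

Exact matching is strict: a correct relation attached to a synonym ("automobile" versus "car") counts as an error. A real evaluation would need to state how such near-matches are handled.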

minor comments (1)

[Abstract] The abstract mentions 'verifiable scene graphs' but does not elaborate on the verification process; this could be clarified for readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and insightful comments on our manuscript. We address the major comment point by point below and outline the revisions we will make to strengthen the paper.

Referee: The description of the graph-driven pipeline for creating verifiable scene graphs and synthesizing captions/QA pairs does not include any quantitative evaluation of graph accuracy (such as node/edge precision or F1 scores) or human evaluation of the quality and faithfulness of the generated text. Since the performance gains are evaluated on PEOD-Chat and RGBE-Chat, which are derived from this pipeline, this omission is critical as it leaves open the possibility that reported improvements stem from artifacts or biases in the synthetic data rather than genuine multimodal advantages.

Authors: We agree that quantitative and human evaluations of the graph-driven pipeline would further strengthen the paper and help rule out potential data artifacts. The pipeline builds on established, off-the-shelf detectors and relation extractors applied to synchronized RGB-Event streams, with scene graphs constructed to be verifiable by design. However, we acknowledge the absence of explicit metrics in the current manuscript. In the revised version, we will add a dedicated subsection reporting node/edge precision, recall, and F1 scores on a manually annotated subset of 500 samples. We will also include results from a human evaluation study (with at least 3 annotators per sample) measuring faithfulness, grammatical correctness, and relevance of the synthesized captions and QA pairs, along with inter-annotator agreement (e.g., pairwise Cohen's kappa). These additions will directly address the concern that gains may arise from synthetic data biases rather than the dual-stream architecture.

Revision: yes
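Cohen's kappa is defined for exactly two raters, so with three or more annotators per sample one option is to average it over all annotator pairs (Fleiss' kappa is another). The sketch below implements the pairwise variant. The label values and rating data are made up for illustration and do not come from the paper.

```python
from collections import Counter
from itertools import combinations
from typing import List, Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("raters must label the same non-empty item list")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(ratings: List[Sequence[str]]) -> float:
    """Average Cohen's kappa over every pair of annotators."""
    pairs = list(combinations(ratings, 2))
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)


# Hypothetical faithfulness judgements from three annotators on four captions.
r1 = ["yes", "yes", "no", "no"]
r2 = ["yes", "no", "no", "no"]
r3 = ["yes", "yes", "no", "no"]

print(cohens_kappa(r1, r2))  # → 0.5
print(round(mean_pairwise_kappa([r1, r2, r3]), 4))  # → 0.6667
```

Kappa corrects raw agreement for chance: annotators 1 and 2 agree on 75% of items, but half of that is expected by chance given their label frequencies, which yields 0.5.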

Circularity Check

0 steps flagged

No circularity: model and pipeline are self-contained against external benchmarks

The paper introduces a dual-stream RGB-event VLM architecture with progressive training and a graph-driven synthesis pipeline that generates scene graphs, captions, and QA pairs to create the PEOD-Chat and RGBE-Chat datasets. No equations, derivations, or parameter-fitting steps are described that reduce by construction to the model's own inputs or outputs. Performance claims rest on comparisons to prior RGB-only and event-only models on the newly constructed datasets and standard benchmarks, without self-citation chains, uniqueness theorems, or ansatzes that bear the central result. The approach is externally falsifiable via the released code and datasets rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions from VLM training and event-camera literature plus the effectiveness of the proposed graph pipeline; no new physical entities or ad-hoc constants are introduced.

axioms (1)

Domain assumption: Event streams can be meaningfully aligned with language descriptions via progressive training on synthesized scene graphs.
Invoked in the description of the dual-stream encoders and training strategy.

pith-pipeline@v0.9.0 · 5802 in / 1245 out tokens · 39027 ms · 2026-05-20T06:46:34.063724+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: RE-VLM employs parallel RGB and event encoders together with a progressive training strategy that aligns heterogeneous visual features with language... graph-driven pipeline that converts synchronized RGB-Event streams into verifiable scene graphs

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: We introduce a Spatio-Temporal Alignment Module (STAM) ... relation loss ... L = L_LLM + λ L_CA-WTD

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 2, 7

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

DDD17: End-To-End DAVIS Driving Dataset

Jonathan Binas, Daniel Neil, Shih-Chii Liu, and Tobi Del- bruck. Ddd17: End-to-end davis driving dataset.arXiv preprint arXiv:1711.01458, 2017. 5

work page internal anchor Pith review Pith/arXiv arXiv 2017

[3] [3]

M3ed: Multi-robot, multi-sensor, multi-environment event dataset

Kenneth Chaney, Fernando Cladera, Ziyun Wang, Anthony Bisulco, M Ani Hsieh, Christopher Korpela, Vijay Kumar, Camillo J Taylor, and Kostas Daniilidis. M3ed: Multi-robot, multi-sensor, multi-environment event dataset. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4016–4023, 2023. 5

work page 2023

[4] [4]

Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24185–24198, 2024. 1, 2, 7

work page 2024

[5] [5]

Segment any event streams via weighted adaptation of pivotal tokens

Zhiwen Chen, Zhiyu Zhu, Yifan Zhang, Junhui Hou, Guang- ming Shi, and Jinjian Wu. Segment any event streams via weighted adaptation of pivotal tokens. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3890–3900, 2024. 5

work page 2024

[6] [6]

Peod: A pixel-aligned event-rgb benchmark for object detection under challenging conditions, 2025

Luoping Cui, Hanqing Liu, Mingjie Liu, Endian Lin, Donghong Jiang, Yuhao Wang, and Chuang Zhu. Peod: A pixel-aligned event-rgb benchmark for object detection under challenging conditions, 2025. 4, 5, 7

work page 2025

[7] [7]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 5

work page 2009

[8] [8]

Standard and event cameras fusion for feature tracking

Yan Dong and Tao Zhang. Standard and event cameras fusion for feature tracking. InProceedings of the 2021 International Conference on Machine Vision and Applications, pages 55–60,

work page 2021

[9] [9]

Event-based vision: A survey.IEEE transactions on pattern analysis and machine intelligence, 44(1):154–180, 2020

Guillermo Gallego, Tobi Delbr¨uck, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew J Davison, J ¨org Conradt, Kostas Daniilidis, et al. Event-based vision: A survey.IEEE transactions on pattern analysis and machine intelligence, 44(1):154–180, 2020. 1, 2

work page 2020

[10] [10]

Low-latency automo- tive vision with event cameras.Nature, 629(8014):1034–1040,

Daniel Gehrig and Davide Scaramuzza. Low-latency automo- tive vision with event cameras.Nature, 629(8014):1034–1040,

work page

[11] Daniel Gehrig, Henri Rebecq, Guillermo Gallego, and Davide Scaramuzza. Asynchronous, photometric feature tracking using events and frames. In Proceedings of the European Conference on Computer Vision (ECCV), pages 750–765,

[12] Mathias Gehrig, Willem Aarents, Daniel Gehrig, and Davide Scaramuzza. DSEC: A stereo event camera dataset for driving scenarios. IEEE Robotics and Automation Letters, 6(3):4947–4954, 2021.

[13] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024.

[14] Hanme Kim, Stefan Leutenegger, and Andrew J Davison. Real-time 3D reconstruction and 6-DoF tracking with an event camera. In European Conference on Computer Vision, pages 349–364. Springer, 2016.

[15] Byounghwa Lee, Hwa Jeon Song, Young-Jin Park, and Byung Ok Kang. Multimodal Alzheimer's disease recognition from image, text and audio. Scientific Reports, 15(1):29038,

[16] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.

[17] Haoyue Liu, Shihan Peng, Lin Zhu, Yi Chang, Hanyu Zhou, and Luxin Yan. Seeing motion at nighttime with an event camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 25648–25658,

[18] Mingjie Liu, Hanqing Liu, and Chuang Zhu. Beyond RGB and events: Enhancing object detection under adverse lighting with monocular normal maps. arXiv preprint arXiv:2508.02127, 2025.

[19] Shaoyu Liu, Jianing Li, Guanghui Zhao, Yunjian Zhang, Xin Meng, Fei Richard Yu, Xiangyang Ji, and Ming Li. EventGPT: Event stream understanding with multimodal large language models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 29139–29149, 2025.

[20] Zhanwen Liu, Nan Yang, Yang Wang, Yuke Li, Xiangmo Zhao, and Fei-Yue Wang. Enhancing traffic object detection in variable illumination with RGB-event fusion. IEEE Transactions on Intelligent Transportation Systems, 2024.

[21] Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. DeepSeek-VL: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.

[22] Tiange Luo, Justin Johnson, and Honglak Lee. View selection for 3D captioning via diffusion ranking. In European Conference on Computer Vision, pages 180–197. Springer, 2024.

[23] Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-ChatGPT: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024.

[24] Elias Mueggler, Chiara Bartolozzi, and Davide Scaramuzza. Fast event-based corner detection. 2017.

[25] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

[26] Henri Rebecq, Guillermo Gallego, Elias Mueggler, and Davide Scaramuzza. EMVS: Event-based multi-view stereo—3D reconstruction with an event camera in real-time. International Journal of Computer Vision, 126(12):1394–1414, 2018.

[27] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

[28] Antoni Rosinol Vidal, Henri Rebecq, Timo Horstschaefer, and Davide Scaramuzza. Ultimate SLAM? Combining events, images, and IMU for robust visual SLAM in HDR and high-speed scenarios. IEEE Robotics and Automation Letters, 3(2):994–1001, 2018.

[29] Ziyi Wu, Xudong Liu, and Igor Gilitschenski. EventCLIP: Adapting CLIP for event-based object recognition. arXiv preprint arXiv:2306.06354, 2023.

[30] Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-Omni technical report. arXiv preprint arXiv:2509.17765, 2025.

[31] Yan Yang, Liyuan Pan, Dongxu Li, and Liu Liu. EZSR: Event-based zero-shot recognition. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 4628–4638,

[32] Jiqing Zhang, Yuanchen Wang, Wenxi Liu, Meng Li, Jinpeng Bai, Baocai Yin, and Xin Yang. Frame-event alignment and fusion network for high frame rate tracking. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9781–9790, 2023.

[33] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623, 2023.

[34] Jiazhou Zhou, Xu Zheng, Yuanhuiyi Lyu, and Lin Wang. EventBind: Learning a unified representation to bind them all for event-based open-world understanding. In European Conference on Computer Vision, pages 477–494. Springer,

[35] Zhuyun Zhou, Zongwei Wu, Rémi Boutteau, Fan Yang, Cédric Demonceaux, and Dominique Ginhac. RGB-event fusion for moving object detection in autonomous driving. arXiv preprint arXiv:2209.08323, 2022.

[36] Alex Zihao Zhu, Dinesh Thakur, Tolga Özaslan, Bernd Pfrommer, Vijay Kumar, and Kostas Daniilidis. The multivehicle stereo event camera dataset: An event camera dataset for 3D perception. IEEE Robotics and Automation Letters, 3(3):2032–2039, 2018.