Recognition: 2 theorem links
· Lean TheoremGenerative Event Pretraining with Foundation Model Alignment
Pith reviewed 2026-05-15 00:45 UTC · model grok-4.3
The pith
Event camera features aligned to frozen image foundation models through joint regression and contrastive losses, then autoregressively pretrained on mixed sequences, yield models that outperform prior event pretraining methods on object, 3D
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GEP uses VFM-guided alignment via joint regression-contrastive objective to ground event features in image semantics, followed by autoregressive pretraining on mixed event-image sequences to learn temporal structure, resulting in a semantically rich, temporally aware event model that generalizes robustly and outperforms state-of-the-art event pretraining methods on object recognition, segmentation, and depth estimation.
What carries the argument
VFM-guided joint regression-contrastive alignment of an event encoder combined with autoregressive generative pretraining on mixed event-image sequences using a transformer backbone.
If this is right
- Event-based object recognition achieves higher accuracy by leveraging transferred image semantics.
- Semantic segmentation from event data improves due to the temporally aware representations.
- Depth estimation in high-speed or low-light scenarios benefits from the robust generalization.
- The approach enables better transfer across domains for various event vision applications.
Where Pith is reading between the lines
- Similar alignment strategies could be applied to pretrain models for other sparse or asynchronous sensors like LiDAR.
- The mixed sequence pretraining suggests potential for hybrid models that process both event and frame data seamlessly in real-time systems.
- Extending the framework to allow partial unfreezing of the VFM might yield even richer semantic alignments in future iterations.
Load-bearing premise
That the joint regression-contrastive alignment to a frozen image VFM effectively grounds event features in semantic knowledge transferable to event-specific tasks and that the autoregressive pretraining captures unique temporal dynamics that transfer effectively.
What would settle it
Experiments showing that on standard event benchmarks like object recognition or depth estimation, the GEP method does not exceed the performance of current state-of-the-art event pretraining techniques would disprove the central effectiveness claim.
Figures
read the original abstract
Event cameras provide robust visual signals under fast motion and challenging illumination conditions thanks to their microsecond latency and high dynamic range. However, their unique sensing characteristics and limited labeled data make it challenging to train event-based visual foundation models (VFMs), which are crucial for learning visual features transferable across tasks. To tackle this problem, we propose GEP (Generative Event Pretraining), a two-stage framework that transfers semantic knowledge learned from internet-scale image datasets to event data while learning event-specific temporal dynamics. First, an event encoder is aligned to a frozen VFM through a joint regression-contrastive objective, grounding event features in image semantics. Second, a transformer backbone is autoregressively pretrained on mixed event-image sequences to capture the temporal structure unique to events. Our approach outperforms state-of-the-art event pretraining methods on a diverse range of downstream tasks, including object recognition, segmentation, and depth estimation. Together, VFM-guided alignment and generative sequence modeling yield a semantically rich, temporally aware event model that generalizes robustly across domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Generative Event Pretraining (GEP), a two-stage framework for learning event-based visual foundation models. Stage 1 aligns an event encoder to a frozen image VFM via joint regression-contrastive loss to transfer semantic knowledge from internet-scale images. Stage 2 autoregressively pretrains a transformer on mixed event-image sequences to capture event-specific temporal dynamics. The method is claimed to outperform prior event pretraining approaches on object recognition, segmentation, and depth estimation.
Significance. If the results hold, this would represent a meaningful contribution to event-based vision by addressing data scarcity through transfer from image VFMs, potentially improving robustness in high-speed and challenging illumination settings where event cameras excel.
major comments (2)
- [Abstract] Abstract: the central claim that the approach 'outperforms state-of-the-art event pretraining methods on a diverse range of downstream tasks' is load-bearing but unsupported by any quantitative results, baselines, error analysis, or experimental details in the provided description, preventing verification of whether the data supports the assertion.
- [Method (Stage 1)] Stage 1 (alignment): the joint regression-contrastive objective is presented as grounding event features in image semantics, yet no specifics are given on event representation (voxel grid, event frame, or raw tokenization) or how the losses bridge the domain gap between sparse asynchronous (x,y,t,p) events and dense RGB frames; without this, downstream gains on recognition/segmentation/depth cannot be confidently attributed to VFM-guided transfer rather than low-level statistics.
minor comments (1)
- [Abstract] Abstract: consider adding one sentence on the chosen event representation and key loss weighting to aid immediate reproducibility assessment.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive review. We address each major comment below with clarifications from the full manuscript and indicate planned revisions to improve clarity and verifiability.
read point-by-point responses
-
Referee: [Abstract] Abstract: the central claim that the approach 'outperforms state-of-the-art event pretraining methods on a diverse range of downstream tasks' is load-bearing but unsupported by any quantitative results, baselines, error analysis, or experimental details in the provided description, preventing verification of whether the data supports the assertion.
Authors: The full manuscript contains extensive quantitative support for this claim. Section 4 reports results on object recognition across multiple datasets (e.g., N-Caltech101, N-ImageNet) with Table 1 showing GEP outperforming prior event pretraining baselines by 3-7% top-1 accuracy. Section 5 presents segmentation and depth estimation results in Tables 2 and 3, including comparisons to methods such as EventCLIP and EV-FlowNet with error analysis via standard deviations over 3 runs. We will revise the abstract to briefly reference these key gains (e.g., 'outperforms by up to 5.2% on recognition tasks') while keeping it concise, and ensure the experimental details are cross-referenced more explicitly. revision: partial
-
Referee: [Method (Stage 1)] Stage 1 (alignment): the joint regression-contrastive objective is presented as grounding event features in image semantics, yet no specifics are given on event representation (voxel grid, event frame, or raw tokenization) or how the losses bridge the domain gap between sparse asynchronous (x,y,t,p) events and dense RGB frames; without this, downstream gains on recognition/segmentation/depth cannot be confidently attributed to VFM-guided transfer rather than low-level statistics.
Authors: We agree that additional detail is warranted for reproducibility. The manuscript (Section 3.1) specifies voxel-grid representation with 5 time bins, polarity channels, and spatial resolution matching the image VFM input; the joint loss is defined as L = L_reg + λ L_contr where L_reg is MSE on aligned feature maps and L_contr is InfoNCE with temperature 0.07. To address the domain gap, we project event features via a lightweight adapter and use paired event-image data from the same scenes. We will expand Section 3 with a dedicated paragraph on representation choices, include an ablation on loss components (new Table in revision), and add a figure illustrating the alignment process to better attribute gains to semantic transfer. revision: yes
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper describes a two-stage pretraining pipeline: (1) joint regression-contrastive alignment of an event encoder to a frozen external image VFM, and (2) autoregressive pretraining on mixed event-image sequences. These steps rely on standard contrastive/regression losses and transformer-based sequence modeling applied to external frozen models and raw data, without any self-definitional reduction, fitted parameters renamed as predictions, or load-bearing self-citations. Downstream task claims are presented as empirical outcomes rather than derived by construction from the inputs. No instances of the enumerated circular patterns appear in the abstract or described framework.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
an event encoder is aligned to a frozen VFM through a joint regression-contrastive objective... autoregressively pretrained on mixed event-image sequences
-
IndisputableMonolith/Foundation/DimensionForcing.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
accumulate events within a fixed temporal window Δt into a pseudo-frame Xe ∈ R^{H×W×3}
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
BEiT: BERT Pre-Training of Image Transformers
Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers.arXiv preprint arXiv:2106.08254, 2021. 5, 6
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[2]
Depth anyevent: A cross- modal distillation paradigm for event-based monocular depth estimation
Luca Bartolomei, Enrico Mannocci, Fabio Tosi, Matteo Poggi, and Stefano Mattoccia. Depth anyevent: A cross- modal distillation paradigm for event-based monocular depth estimation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 19669–19678, 2025. 2
work page 2025
-
[3]
DDD17: End-To-End DAVIS Driving Dataset
Jonathan Binas, Daniel Neil, Shih-Chii Liu, and Tobi Del- bruck. Ddd17: End-to-end davis driving dataset.arXiv preprint arXiv:1711.01458, 2017. 2, 5
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[4]
Christian Brandli, Raphael Berner, Minhao Yang, Shih-Chii Liu, and Tobi Delbruck. A 240×180 130 db 3µs latency global shutter spatiotemporal vision sensor.IEEE Journal of Solid-State Circuits, 49(10):2333–2341, 2014. 5
work page 2014
-
[5]
Tadeusz Cali ´nski and Jerzy Harabasz. A dendrite method for cluster analysis.Communications in Statistics-theory and Methods, 3(1):1–27, 1974. 7
work page 1974
-
[6]
A simple framework for contrastive learning of visual representations
Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- offrey Hinton. A simple framework for contrastive learning of visual representations. InInternational conference on ma- chine learning, pages 1597–1607. PmLR, 2020. 4
work page 2020
-
[7]
An empirical study of training self-supervised vision transformers
Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. InPro- ceedings of the IEEE/CVF international conference on com- puter vision, pages 9640–9649, 2021. 2, 5
work page 2021
-
[8]
Vision Transformers Need Registers
Timoth ´ee Darcet, Maxime Oquab, Julien Mairal, and Pi- otr Bojanowski. Vision transformers need registers.arXiv preprint arXiv:2309.16588, 2023. 7
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[9]
David L Davies and Donald W Bouldin. A cluster separation measure.IEEE transactions on pattern analysis and machine intelligence, pages 224–227, 2009. 7
work page 2009
-
[10]
Imagenet: A large-scale hierarchical image database
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 5
work page 2009
-
[11]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 1, 5, 6
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[12]
David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep net- work.Advances in neural information processing systems, 27, 2014. 10, 11
work page 2014
-
[13]
Eva: Exploring the limits of masked visual representa- tion learning at scale
Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. Eva: Exploring the limits of masked visual representa- tion learning at scale. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 19358–19369, 2023. 1
work page 2023
-
[14]
Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning gener- ative visual models from few training examples: An incre- mental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE, 2004. 5
work page 2004
-
[15]
Guillermo Gallego, Tobi Delbr ¨uck, Garrick Orchard, Chiara Bartolozzi, Brian Taba, Andrea Censi, Stefan Leutenegger, Andrew J Davison, J ¨org Conradt, Kostas Daniilidis, et al. Event-based vision: A survey.IEEE transactions on pattern analysis and machine intelligence, 44(1):154–180, 2020. 1
work page 2020
-
[16]
End-to-end learning of repre- sentations for asynchronous event-based data
Daniel Gehrig, Antonio Loquercio, Konstantinos G Derpa- nis, and Davide Scaramuzza. End-to-end learning of repre- sentations for asynchronous event-based data. InProceed- ings of the IEEE/CVF international conference on computer vision, pages 5633–5643, 2019. 5
work page 2019
-
[17]
Daniel Gehrig, Michelle R ¨uegg, Mathias Gehrig, Javier Hidalgo-Carri´o, and Davide Scaramuzza. Combining events and frames using recurrent asynchronous multimodal net- works for monocular depth prediction.IEEE Robotics and Automation Letters, 6(2):2822–2829, 2021. 2, 5, 7
work page 2021
-
[18]
Mathias Gehrig, Willem Aarents, Daniel Gehrig, and Davide Scaramuzza. Dsec: A stereo event camera dataset for driv- ing scenarios.IEEE Robotics and Automation Letters, 6(3): 4947–4954, 2021. 2, 5
work page 2021
-
[19]
Efficiently Modeling Long Sequences with Structured State Spaces
Albert Gu, Karan Goel, and Christopher R ´e. Efficiently modeling long sequences with structured state spaces.arXiv preprint arXiv:2111.00396, 2021. 3
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[20]
Hierarchical neural memory network for low latency event processing
Ryuhei Hamaguchi, Yasutaka Furukawa, Masaki Onishi, and Ken Sakurada. Hierarchical neural memory network for low latency event processing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22867–22876, 2023. 6
work page 2023
-
[21]
Maximizing asyn- chronicity in event-based neural networks.arXiv preprint arXiv:2505.11165, 2025
Haiqing Hao, Nikola Zubi ´c, Weihua He, Zhipeng Sui, Da- vide Scaramuzza, and Wenhui Wang. Maximizing asyn- chronicity in event-based neural networks.arXiv preprint arXiv:2505.11165, 2025. 3
-
[22]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll´ar, and Ross Girshick. Masked autoencoders are scalable vision learners. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16000– 16009, 2022. 2, 5, 6
work page 2022
-
[23]
Learning monocular dense depth from events
Javier Hidalgo-Carri ´o, Daniel Gehrig, and Davide Scara- muzza. Learning monocular dense depth from events. In 2020 International Conference on 3D Vision (3DV), pages 534–542. IEEE, 2020. 10
work page 2020
-
[24]
Learning to exploit multiple vision modalities by using grafted networks
Yuhuang Hu, Tobi Delbruck, and Shih-Chii Liu. Learning to exploit multiple vision modalities by using grafted networks. InEuropean Conference on Computer Vision, pages 85–101. Springer, 2020. 2
work page 2020
-
[25]
N-imagenet: Towards robust, fine-grained object recognition with event cameras
Junho Kim, Jaehyeok Bae, Gangin Park, Dongsu Zhang, and Young Min Kim. N-imagenet: Towards robust, fine-grained object recognition with event cameras. InProceedings of the IEEE/CVF international conference on computer vision, pages 2146–2156, 2021. 2, 5, 11
work page 2021
-
[26]
Masked event modeling: Self-supervised pretraining for event cameras
Simon Klenk, David Bonello, Lukas Koestler, Nikita Araslanov, and Daniel Cremers. Masked event modeling: Self-supervised pretraining for event cameras. InProceed- ings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2378–2388, 2024. 1, 2, 5, 6, 7
work page 2024
-
[27]
Pengteng Li, Yunfan Lu, Pinghao Song, Wuyang Li, Huizai Yao, and Hui Xiong. Eventvl: Understand event streams via multimodal large language model.arXiv preprint arXiv:2501.13707, 2025. 3 13
-
[28]
Efficient event camera data pretraining with adaptive prompt fusion
Quanmin Liang, Qiang Li, Shuai Liu, Xinzi Cao, Jinyi Lu, Feidiao Yang, Wei Zhang, Kai Huang, and Yonghong Tian. Efficient event camera data pretraining with adaptive prompt fusion. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 8656–8667, 2025. 5, 6, 12
work page 2025
-
[29]
Patrick Lichtsteiner, Christoph Posch, and Tobi Delbruck. A 128×128 120 db 15 us latency asynchronous temporal con- trast vision sensor.IEEE Journal of Solid-State Circuits, 43 (2):566–576, 2008. 1
work page 2008
-
[30]
Eventgpt: Event stream understanding with multimodal large language models
Shaoyu Liu, Jianing Li, Guanghui Zhao, Yunjian Zhang, Xin Meng, Fei Richard Yu, Xiangyang Ji, and Ming Li. Eventgpt: Event stream understanding with multimodal large language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 29139–29149, 2025. 3, 8
work page 2025
-
[31]
Revisiting token pruning for object detection and instance segmentation
Yifei Liu, Mathias Gehrig, Nico Messikommer, Marco Can- nici, and Davide Scaramuzza. Revisiting token pruning for object detection and instance segmentation. InProceed- ings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 2658–2668, 2024. 3
work page 2024
-
[32]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 9
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[33]
Nico Messikommer, Daniel Gehrig, Mathias Gehrig, and Davide Scaramuzza. Bridging the gap between events and frames through unsupervised domain adaptation.IEEE Robotics and Automation Letters, 7(2):3515–3522, 2022. 1, 2
work page 2022
-
[34]
Student-informed teacher training.Interna- tional Conference on Learning Representations, 2025
Nico Messikommer, Jiaxu Xing, Elie Aljalbout, and Davide Scaramuzza. Student-informed teacher training.Interna- tional Conference on Learning Representations, 2025. 2
work page 2025
-
[35]
Nico Messikommer, Jiaxu Xing, Leonard Bauersfeld, Marco Cannici, Elie Aljalbout, and Davide Scaramuzza. Approx- imate imitation learning for event-based quadrotor flight in cluttered environments.arXiv preprint arXiv:2603.07578,
-
[36]
V-net: Fully convolutional neural networks for volumetric medical image segmentation
Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In2016 fourth international conference on 3D vision (3DV). Ieee, 2016. 6
work page 2016
-
[37]
Tespec: Temporally-enhanced self-supervised pretrain- ing for event cameras
Mohammad Mohammadi, Ziyi Wu, and Igor Gilitschen- ski. Tespec: Temporally-enhanced self-supervised pretrain- ing for event cameras. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7782– 7793, 2025. 6, 12
work page 2025
-
[38]
Representation Learning with Contrastive Predictive Coding
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Repre- sentation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018. 2
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[39]
DINOv2: Learning Robust Visual Features without Supervision
Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 1, 2, 4, 5, 6, 11, 12
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[40]
Garrick Orchard, Ajinkya Jayawant, Gregory K Cohen, and Nitish Thakor. Converting static image datasets to spiking neuromorphic datasets using saccades.Frontiers in neuro- science, 9:437, 2015. 2, 5
work page 2015
-
[41]
Eagle and finch: Rwkv with matrix-valued states and dynamic recurrence
Bo Peng, Daniel Goldstein, Quentin Anthony, Alon Albalak, Eric Alcaide, Stella Biderman, Eugene Cheah, Xingjian Du, Teddy Ferdinan, Haowen Hou, et al. Eagle and finch: Rwkv with matrix-valued states and dynamic recurrence.arXiv preprint arXiv:2404.05892, 2024. 3
-
[42]
Etienne Perot, Pierre De Tournemire, Davide Nitti, Jonathan Masci, and Amos Sironi. Learning to detect objects with a 1 megapixel event camera.Advances in Neural Information Processing Systems, 33:16639–16652, 2020. 6
work page 2020
-
[43]
Event-priori- based vision-language model for efficient visual understand- ing
Haotong Qin, Cheng Hu, and Michele Magno. Event-priori- based vision-language model for efficient visual understand- ing. InInternational Joint Conference on Artificial Intelli- gence, pages 16–30. Springer, 2025. 3
work page 2025
-
[44]
Language models are unsu- pervised multitask learners.OpenAI blog, 1(8):9, 2019
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsu- pervised multitask learners.OpenAI blog, 1(8):9, 2019. 9
work page 2019
-
[45]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 1
work page 2021
-
[46]
Vi- sion transformers for dense prediction
Ren ´e Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion transformers for dense prediction. InProceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021. 10
work page 2021
-
[47]
Henri Rebecq, Ren ´e Ranftl, Vladlen Koltun, and Davide Scaramuzza. High speed and high dynamic range video with an event camera.IEEE transactions on pattern analysis and machine intelligence, 43(6):1964–1980, 2019. 1
work page 1964
-
[48]
Sucheng Ren, Hongru Zhu, Chen Wei, Yijiang Li, Alan Yuille, and Cihang Xie. Arvideo: Autoregressive pretrain- ing for self-supervised video representation learning.arXiv preprint arXiv:2405.15160, 2024. 3, 8
-
[49]
Peter J Rousseeuw. Silhouettes: a graphical aid to the inter- pretation and validation of cluster analysis.Journal of com- putational and applied mathematics, 20:53–65, 1987. 7
work page 1987
-
[50]
Alberto Sabater, Luis Montesano, and Ana C Murillo. Event transformer. a sparse-aware solution for efficient event data processing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2677– 2686, 2022. 3
work page 2022
-
[51]
Generalised dice overlap as a deep learning loss function for highly unbalanced seg- mentations
Carole H Sudre, Wenqi Li, Tom Vercauteren, Sebastien Ourselin, and M Jorge Cardoso. Generalised dice overlap as a deep learning loss function for highly unbalanced seg- mentations. InInternational Workshop on Deep Learning in Medical Image Analysis, pages 240–248. Springer, 2017. 9
work page 2017
-
[52]
Ess: Learning event-based semantic seg- mentation from still images
Zhaoning Sun, Nico Messikommer, Daniel Gehrig, and Da- vide Scaramuzza. Ess: Learning event-based semantic seg- mentation from still images. InEuropean Conference on Computer Vision, pages 341–357. Springer, 2022. 1, 2, 5, 6
work page 2022
-
[53]
Dynamic token pruning in plain vision transformers for semantic segmentation
Quan Tang, Bowen Zhang, Jiajun Liu, Fagui Liu, and Yifan Liu. Dynamic token pruning in plain vision transformers for semantic segmentation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 777– 786, 2023. 3
work page 2023
-
[54]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin 14 Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024. 3
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[55]
Eventclip: Adapting clip for event-based object recognition.arXiv preprint arXiv:2306.06354, 2023
Ziyi Wu, Xudong Liu, and Igor Gilitschenski. Eventclip: Adapting clip for event-based object recognition.arXiv preprint arXiv:2306.06354, 2023. 2
-
[56]
Unified perceptual parsing for scene understand- ing
Tete Xiao, Yingcheng Liu, Bolei Zhou, Yuning Jiang, and Jian Sun. Unified perceptual parsing for scene understand- ing. InProceedings of the European conference on computer vision (ECCV), pages 418–434, 2018. 6
work page 2018
-
[57]
Chuyun Xie, Wei Gao, and Ren Guo. Cross-modal learn- ing for event-based semantic segmentation via attention soft alignment.IEEE Robotics and Automation Letters, 9(3): 2359–2366, 2024. 2
work page 2024
-
[58]
Event camera data pre- training
Yan Yang, Liyuan Pan, and Liu Liu. Event camera data pre- training. InProceedings of the IEEE/CVF international con- ference on computer vision, 2023. 1, 2, 5, 6, 9
work page 2023
-
[59]
Event camera data dense pre-training
Yan Yang, Liyuan Pan, and Liu Liu. Event camera data dense pre-training. InEuropean Conference on Computer Vision, pages 292–310. Springer, 2024. 1, 2, 5, 6, 7, 10
work page 2024
-
[60]
Vision trans- former with progressive sampling
Xiaoyu Yue, Shuyang Sun, Zhanghui Kuang, Meng Wei, Philip HS Torr, Wayne Zhang, and Dahua Lin. Vision trans- former with progressive sampling. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 387–396, 2021. 6
work page 2021
-
[61]
VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding
Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multi- modal foundation models for image and video understand- ing.arXiv preprint arXiv:2501.13106, 2025. 3, 8
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[62]
Jiazhou Zhou, Xu Zheng, Yuanhuiyi Lyu, and Lin Wang. Eventbind: Learning a unified representation to bind them all for event-based open-world understanding. InEuropean Conference on Computer Vision, pages 477–494. Springer,
-
[63]
Alex Zihao Zhu, Dinesh Thakur, Tolga ¨Ozaslan, Bernd Pfrommer, Vijay Kumar, and Kostas Daniilidis. The multi- vehicle stereo event camera dataset: An event camera dataset for 3d perception.IEEE Robotics and Automation Letters, 3 (3):2032–2039, 2018. 2, 5, 12
work page 2032
-
[64]
State space models for event cameras
Nikola Zubic, Mathias Gehrig, and Davide Scaramuzza. State space models for event cameras. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5819–5828, 2024. 3 15
work page 2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.