Match-Any-Events: Zero-Shot Motion-Robust Feature Matching Across Wide Baselines for Event Cameras
Pith reviewed 2026-05-10 04:23 UTC · model grok-4.3
The pith
A single trained model performs zero-shot wide-baseline feature matching on event camera data from unseen datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We therefore introduce the first event matching model that achieves cross-dataset wide-baseline correspondence in a zero-shot manner: a single model trained once is deployed on unseen datasets without any target-domain fine-tuning or adaptation. To enable this capability, we introduce a motion-robust and computationally efficient attention backbone that learns multi-timescale features from event streams, augmented with sparsity-aware event token selection, making large-scale training on diverse wide-baseline supervision computationally feasible. To provide the supervision needed for wide-baseline generalization, we develop a robust event motion synthesis framework to generate large-scale and
What carries the argument
Motion-robust attention backbone that extracts multi-timescale features from event streams together with sparsity-aware token selection, powered by a robust event motion synthesis framework that augments viewpoints, modalities, and motions for training data.
If this is right
- A model trained once on synthesized data can be deployed on any new event-camera dataset without fine-tuning or target-domain adaptation.
- The efficient attention backbone makes large-scale training on diverse wide-baseline event streams computationally practical.
- Matching accuracy rises 37.7 percent over the previous best event feature matching methods across multiple benchmarks.
- Cross-dataset generalization removes the need to collect and annotate per-scenario event data for each new deployment.
Where Pith is reading between the lines
- The synthesis approach could be reused to create training sets for related event-camera tasks such as optical flow or ego-motion estimation.
- Zero-shot transfer suggests the architecture might serve as a backbone for hybrid systems that fuse event data with conventional frames.
- If the synthetic distribution gap proves small on further real-world tests, the method could accelerate adoption of event cameras in low-power robotics without repeated calibration.
Load-bearing premise
The robust event motion synthesis framework generates training data whose distribution is sufficiently close to real-world wide-baseline event streams that the learned model generalizes without domain adaptation.
What would settle it
Run the trained model on a previously unseen event-camera dataset containing motion speeds, baselines, or scene types absent from the synthetic training distribution and check whether matching precision falls below the reported 37.7 percent gain relative to prior methods.
Figures
read the original abstract
Event cameras have recently shown promising capabilities in instantaneous motion estimation due to their robustness to low light and fast motions. However, computing wide-baseline correspondence between two arbitrary views remains a significant challenge, since event appearance changes substantially with motion, and learning-based approaches are constrained by both scalability and limited wide-baseline supervision. We therefore introduce the first event matching model that achieves cross-dataset wide-baseline correspondence in a zero-shot manner: a single model trained once is deployed on unseen datasets without any target-domain fine-tuning or adaptation. To enable this capability, we introduce a motion-robust and computationally efficient attention backbone that learns multi-timescale features from event streams, augmented with sparsity-aware event token selection, making large-scale training on diverse wide-baseline supervision computationally feasible. To provide the supervision needed for wide-baseline generalization, we develop a robust event motion synthesis framework to generate large-scale event-matching datasets with augmented viewpoints, modalities, and motions. Extensive experiments across multiple benchmarks show that our framework achieves a 37.7% improvement over the previous best event feature matching methods. Code and data are available at: https://github.com/spikelab-jhu/Match-Any-Events.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Match-Any-Events, a zero-shot feature matching model for event cameras enabling wide-baseline correspondence across unseen datasets without fine-tuning or adaptation. It proposes a motion-robust attention backbone that extracts multi-timescale features from event streams, augmented by sparsity-aware event token selection to enable scalable training. Supervision comes from a custom robust event motion synthesis framework that generates large-scale training data by augmenting viewpoints, modalities, and motions. Experiments across multiple benchmarks report a 37.7% improvement over prior event feature matching methods, with code and data released.
Significance. If the zero-shot cross-dataset claim holds after verification, the work would advance event-based vision by addressing the core difficulty of motion-induced appearance variation and the lack of wide-baseline real supervision. The open release of code and data is a clear strength that supports reproducibility and future extensions. The efficient backbone design could also benefit other event-camera tasks requiring large-scale training on sparse data.
major comments (2)
- [§3] §3 (Robust Event Motion Synthesis Framework): The zero-shot generalization claim rests on the unverified assumption that synthetic event streams sufficiently match the distribution of real wide-baseline event data from unseen test datasets. No quantitative evidence is supplied, such as event-rate histograms, Wasserstein distances, or failure-case comparisons between synthetic training pairs and real benchmark streams. Without this, the 37.7% gains cannot be confidently attributed to cross-dataset transfer rather than in-distribution performance.
- [Abstract and §4] Abstract and §4 (Experiments): The headline 37.7% improvement is stated without identifying the exact baseline methods, the primary metric (e.g., AUC or precision at a fixed threshold), statistical significance tests, or data-selection/exclusion criteria. These omissions are load-bearing because the central claim is cross-dataset zero-shot superiority; readers cannot assess whether the comparison is fair or whether results are robust across the reported benchmarks.
minor comments (2)
- [Introduction] The abstract and introduction use the phrase 'first event matching model' for zero-shot wide-baseline capability; a brief related-work paragraph clarifying the precise novelty relative to prior event matching and zero-shot vision works would improve context.
- Figure captions and the token-selection description would benefit from explicit notation for the multi-timescale feature aggregation (e.g., an equation defining the attention scales) to aid readers unfamiliar with event sparsity patterns.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the clarity and support for our zero-shot claims.
read point-by-point responses
-
Referee: [§3] §3 (Robust Event Motion Synthesis Framework): The zero-shot generalization claim rests on the unverified assumption that synthetic event streams sufficiently match the distribution of real wide-baseline event data from unseen test datasets. No quantitative evidence is supplied, such as event-rate histograms, Wasserstein distances, or failure-case comparisons between synthetic training pairs and real benchmark streams. Without this, the 37.7% gains cannot be confidently attributed to cross-dataset transfer rather than in-distribution performance.
Authors: We agree that direct quantitative validation of the synthetic-to-real distribution match would provide stronger support for attributing the gains to zero-shot generalization. While the strong performance on multiple real unseen benchmarks already serves as indirect evidence of transfer, we will add event-rate histograms, basic motion statistic comparisons, and representative failure-case visualizations between synthetic training pairs and real test streams in the revised §3 and supplementary material. revision: yes
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The headline 37.7% improvement is stated without identifying the exact baseline methods, the primary metric (e.g., AUC or precision at a fixed threshold), statistical significance tests, or data-selection/exclusion criteria. These omissions are load-bearing because the central claim is cross-dataset zero-shot superiority; readers cannot assess whether the comparison is fair or whether results are robust across the reported benchmarks.
Authors: We acknowledge the need for explicit specification. The 37.7% figure represents the average relative improvement in AUC@5° across the evaluated benchmarks relative to the strongest prior event-based matching methods. In the revised abstract and §4 we will name the exact baselines, confirm the primary metric, report statistical significance where applicable, and detail benchmark data selection/exclusion criteria to ensure the comparisons are fully transparent and reproducible. revision: yes
Circularity Check
No circularity: training on synthetic data and zero-shot evaluation on external benchmarks are independent
full rationale
The paper trains a model on data from its own robust event motion synthesis framework and evaluates zero-shot transfer on multiple real-world benchmarks, reporting a 37.7% improvement. No equations, parameters, or claims reduce by construction to the inputs (no self-definitional loops, no fitted quantities renamed as predictions, no load-bearing self-citations). The synthesis step generates training supervision; the generalization claim is tested against external datasets rather than being tautological with the generator. This is a standard supervised-learning setup with synthetic augmentation and is self-contained against the stated benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Alvar, S.R., Singh, G., Akbari, M., Zhang, Y.: Divprune: Diversity-based visual tokenpruningforlargemultimodalmodels.In:ProceedingsoftheComputerVision and Pattern Recognition Conference. pp. 9392–9401 (2025)
work page 2025
-
[2]
Alzugaray, I., Chli, M.: HASTE: multi-hypothesis asynchronous speeded-up track- ingofevents.In:31stBritishMachineVisionConference2020,BMVC2020.BMVA Press (2020),https://www.bmvc2020-conference.com/assets/papers/0744.pdf
work page 2020
-
[3]
IEEE Robotics and Automation Letters3(4), 3177–3184 (2018)
Alzugaray, I., Chli, M.: Asynchronous corner detection and tracking for event cam- eras in real time. IEEE Robotics and Automation Letters3(4), 3177–3184 (2018)
work page 2018
-
[4]
IEEE TPAMI45(2), 2519–2532 (2022)
Baldwin, R.W., Liu, R., Almatrafi, M., Asari, V., Hirakawa, K.: Time-ordered recent event (tore) volumes for event cameras. IEEE TPAMI45(2), 2519–2532 (2022)
work page 2022
- [5]
-
[6]
IEEE transactions on neural networks and learning systems25(2), 407–417 (2013)
Benosman, R., Clercq, C., Lagorce, X., Ieng, S.H., Bartolozzi, C.: Event-based visual flow. IEEE transactions on neural networks and learning systems25(2), 407–417 (2013)
work page 2013
-
[7]
Born, R.T., Bradley, D.C.: Structure and function of visual area mt. Annu. Rev. Neurosci.28(1), 157–189 (2005)
work page 2005
-
[8]
Brändli, C., Strubel, J., Keller, S., Scaramuzza, D., Delbruck, T.: Elised—an event- based line segment detector. In: 2016 Second International Conference on Event- based Control, Communication, and Signal Processing (EBCCSP). pp. 1–7. IEEE (2016)
work page 2016
-
[9]
arXiv preprint arXiv:2504.00139 (2025)
Burkhardt, Y., Schaefer, S., Leutenegger, S.: Superevent: Cross-modal learning of event-based keypoint detection. arXiv preprint arXiv:2504.00139 (2025)
-
[10]
IEEE transactions on robotics37(6), 1874–1890 (2021)
Campos, C., Elvira, R., Rodríguez, J.J.G., Montiel, J.M., Tardós, J.D.: Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam. IEEE transactions on robotics37(6), 1874–1890 (2021)
work page 2021
-
[11]
In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition
Chaney, K., Cladera, F., Wang, Z., Bisulco, A., Hsieh, M.A., Korpela, C., Ku- mar, V., Taylor, C.J., Daniilidis, K.: M3ed: Multi-robot, multi-sensor, multi- environment event dataset. In: Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition. pp. 4016–4023 (2023)
work page 2023
- [12]
-
[13]
arXiv preprint arXiv:2509.25146 (2025)
Das, R., Daniilidis, K., Chaudhari, P.: Fast feature field (F3): A predictive repre- sentation of events. arXiv preprint arXiv:2509.25146 (2025)
- [14]
-
[15]
DiCarlo, J.J., Zoccolan, D., Rust, N.C.: How does the brain solve visual object recognition? Neuron73(3), 415–434 (2012)
work page 2012
- [16]
- [17]
- [18]
-
[19]
Nature392(6676), 598–601 (1998) 16 Zhang et al
Epstein, R., Kanwisher, N.: A cortical representation of the local visual environ- ment. Nature392(6676), 598–601 (1998) 16 Zhang et al
work page 1998
- [20]
- [21]
- [22]
-
[23]
Gehrig, D., Rebecq, H., Gallego, G., Scaramuzza, D.: Eklt: Asynchronous photo- metric feature tracking using events and frames. IJCV128(3), 601–618 (2020)
work page 2020
-
[24]
Gibson, J.J.: The perception of the visual world. Houghton Mifflin (1950)
work page 1950
-
[25]
Adaptive Computation Time for Recurrent Neural Networks
Graves, A.: Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983 (2016)
work page internal anchor Pith review arXiv 2016
- [26]
-
[27]
In: Eu- ropean Conference on Computer Vision
Hamann,F.,Wang,Z.,Asmanis,I.,Chaney,K.,Gallego,G.,Daniilidis,K.:Motion- prior contrast maximization for dense continuous-time motion estimation. In: Eu- ropean Conference on Computer Vision. pp. 18–37. Springer (2024)
work page 2024
-
[28]
arXiv preprint arXiv:2501.07556 (2025)
He, X., Yu, H., Peng, S., Tan, D., Shen, Z., Bao, H., Zhou, X.: Matchanything: Uni- versal cross-modality image matching with large-scale pre-training. arXiv preprint arXiv:2501.07556 (2025)
- [29]
-
[30]
In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Hu, S., Kim, Y., Lim, H., Lee, A.J., Myung, H.: ecdt: Event clustering for simulta- neous feature detection and tracking. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 3808–3815. IEEE (2022)
work page 2022
-
[31]
In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision
Huang, Z., Sun, L., Zhao, C., Li, S., Su, S.: Eventpoint: Self-supervised interest point detection and description for event-based camera. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 5396– 5405 (2023)
work page 2023
- [32]
-
[33]
In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Ikura,M.,LeGentil,C.,Müller,M.G.,Schuler,F.,Yamashita,A.,Stürzl,W.:Rate: Real-time asynchronous feature tracking with event cameras. In: 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 11662– 11669. IEEE (2024)
work page 2024
- [34]
-
[35]
IEEE Robotics and Automation Letters8(1), 416–423 (2022)
Kim, H., Lee, S., Kim, J., Kim, H.J.: Real-time hetero-stereo matching for event and frame camera with aligned events using maximum shift distance. IEEE Robotics and Automation Letters8(1), 416–423 (2022)
work page 2022
-
[36]
In: 2024 International conference on 3D vision (3DV)
Klenk, S., Motzet, M., Koestler, L., Cremers, D.: Deep event visual odometry. In: 2024 International conference on 3D vision (3DV). pp. 739–749. IEEE (2024)
work page 2024
-
[37]
IEEE TPAMI39(7), 1346–1359 (2016)
Lagorce, X., Orchard, G., Galluppi, F., Shi, B.E., Benosman, R.B.: Hots: a hier- archy of event-based time-surfaces for pattern recognition. IEEE TPAMI39(7), 1346–1359 (2016)
work page 2016
-
[38]
arXiv preprint arXiv:2503.05122 (2025)
Li, X., Rao, T., Pan, C.: Edm: Efficient deep feature matching. arXiv preprint arXiv:2503.05122 (2025)
- [39]
- [40]
-
[41]
In: IEEE International Conference on Robotics and Automation (ICRA)
Liu, P., Chen, G., Li, Z., Tang, H., Knoll, A.: Learning local event-based descriptor for patch-based stereo matching. In: IEEE International Conference on Robotics and Automation (ICRA). pp. 412–418. IEEE (2022)
work page 2022
-
[42]
Annual review of neuroscience19, 577–621 (1996)
Logothetis, N.K., Sheinberg, D.L.: Visual object recognition. Annual review of neuroscience19, 577–621 (1996)
work page 1996
-
[43]
Advances in Neural Information Processing Systems37, 13274–13301 (2024)
Lou, H., Liang, J., Teng, M., Fan, B., Xu, Y., Shi, B.: Zero-shot event-intensity asymmetric stereo via visual prompting from image domain. Advances in Neural Information Processing Systems37, 13274–13301 (2024)
work page 2024
-
[44]
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV60(2), 91–110 (2004)
work page 2004
-
[45]
In: British Machine Vision Conference (BMVC) (2017)
Mueggler, E., Bartolozzi, C., Scaramuzza, D.: Fast event-based corner detection. In: British Machine Vision Conference (BMVC) (2017)
work page 2017
-
[46]
The International journal of robotics research36(2), 142–149 (2017)
Mueggler, E., Rebecq, H., Gallego, G., Delbruck, T., Scaramuzza, D.: The event- camera dataset and simulator: Event-based data for pose estimation, visual odome- try, and slam. The International journal of robotics research36(2), 142–149 (2017)
work page 2017
-
[47]
Nature neuroscience7(1), 70–74 (2004)
Murray, S.O., Wojciulik, E.: Attention increases neural selectivity in the human lateral occipital complex. Nature neuroscience7(1), 70–74 (2004)
work page 2004
- [48]
-
[49]
Niu, J., Zhong, S., Lu, X., Shen, S., Gallego, G., Zhou, Y.: Esvo2: Direct visual- inertialodometrywithstereoeventcameras.IEEETransactionsonRobotics(2025)
work page 2025
-
[50]
Frontiers in neuroscience 9, 437 (2015)
Orchard, G., Jayawant, A., Cohen, G.K., Thakor, N.: Converting static image datasets to spiking neuromorphic datasets using saccades. Frontiers in neuroscience 9, 437 (2015)
work page 2015
- [51]
- [52]
-
[53]
Qu, Q., Chen, X., Chung, Y.Y., Shen, Y.: Evrepsl: Event-stream representation via self-supervised learning for event-based vision. IEEE TIP (2024)
work page 2024
-
[54]
Advances in neural infor- mation processing systems34, 13937–13949 (2021)
Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., Hsieh, C.J.: Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural infor- mation processing systems34, 13937–13949 (2021)
work page 2021
-
[55]
IEEE TPAMI43(6), 1964–1980 (2019)
Rebecq, H., Ranftl, R., Koltun, V., Scaramuzza, D.: High speed and high dynamic range video with an event camera. IEEE TPAMI43(6), 1964–1980 (2019)
work page 1964
- [56]
-
[57]
Revaud, J., De Souza, C., Humenberger, M., Weinzaepfel, P.: R2d2: Reliable and repeatable detector and descriptor. Adv. Neural Inform. Process. Syst.32(2019)
work page 2019
- [58]
- [59]
- [60]
-
[61]
In: CVPR (2016) 18 Zhang et al
Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: CVPR (2016) 18 Zhang et al
work page 2016
-
[62]
Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khali- dov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
- [63]
-
[64]
Tyszkiewicz, M., Fua, P., Trulls, E.: Disk: Learning local features with policy gra- dient. Adv. Neural Inform. Process. Syst.33, 14254–14265 (2020)
work page 2020
- [65]
- [66]
-
[67]
IEEE Robotics and Automation Letters (2024)
Wang, X., Yu, H., Yu, L., Yang, W., Xia, G.S.: Towards robust keypoint detection and tracking: A fusion approach with event-aligned image features. IEEE Robotics and Automation Letters (2024)
work page 2024
- [68]
-
[69]
In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
Wang, Z., Pan, L., Ng, Y., Zhuang, Z., Mahony, R.: Stereo hybrid event-frame (shef) cameras for 3d perception. In: 2021 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 9758–9764. IEEE (2021)
work page 2021
- [70]
-
[71]
IEEE Robotics and Automation Letters7(4), 8737–8744 (2022)
Wang, Z., Cladera, F., Bisulco, A., Lee, D., Taylor, C.J., Daniilidis, K., Hsieh, M.A., Lee, D.D., Isler, V.: EV-Catcher: High-speed object catching using low- latency event-based neural networks. IEEE Robotics and Automation Letters7(4), 8737–8744 (2022)
work page 2022
-
[72]
arXiv preprint arXiv:2312.00113 (2023)
Wang, Z., Hamann, F., Chaney, K., Jiang, W., Gallego, G., Daniilidis, K.: Event- based continuous color video decompression from single frames. arXiv preprint arXiv:2312.00113 (2023)
-
[73]
In: Proceedings of the IEEE/CVF International Conference on Computer Vision
Wang, Z., Zhang, R., Liu, Z.Y., Wang, Y., Daniilidis, K.: Continuous-time human motion field from event cameras. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 11502–11512 (2025)
work page 2025
-
[74]
Wang, Z.C.: Beyond Frames: Learning to Perceive With Event-Based Vision. Ph.D. thesis, University of Pennsylvania (2025)
work page 2025
- [75]
-
[76]
In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
Yin, H., Vahdat, A., Alvarez, J.M., Mallya, A., Kautz, J., Molchanov, P.: A-vit: Adaptive tokens for efficient vision transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10809–10818 (2022)
work page 2022
- [77]
-
[78]
In: 2017 IEEE International Conference on Robotics and Automation (ICRA)
Zhu, A.Z., Atanasov, N., Daniilidis, K.: Event-based feature tracking with proba- bilistic data association. In: 2017 IEEE International Conference on Robotics and Automation (ICRA). pp. 4465–4470. IEEE (2017)
work page 2017
- [79]
-
[80]
IEEE Robotics and Automation Letters3(3), 2032–2039 (2018) Match-Any-Events 19
Zhu, A.Z., Thakur, D., Özaslan, T., Pfrommer, B., Kumar, V., Daniilidis, K.: The multivehicle stereo event camera dataset: An event camera dataset for 3d perception. IEEE Robotics and Automation Letters3(3), 2032–2039 (2018) Match-Any-Events 19
work page 2032
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.