OctoSense: Self-Supervised Learning for Multimodal Robot Perception

Anthony Bisulco; Jeremy Wang; Kostas Daniilidis; Pratik Chaudhari; Randall Balestriero

arxiv: 2606.27317 · v1 · pith:V2NXY2WEnew · submitted 2026-06-25 · 💻 cs.CV · cs.RO

OctoSense: Self-Supervised Learning for Multimodal Robot Perception

Anthony Bisulco , Jeremy Wang , Kostas Daniilidis , Randall Balestriero , Pratik Chaudhari This is my paper

Pith reviewed 2026-06-26 05:23 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords multimodal self-supervised learningmasked autoencodersensor fusionrobot perceptionevent cameraLiDARthermal cameraego-motion estimation

0 comments

The pith

A late-fusion masked autoencoder with modality-specific tokenizers produces fast multimodal representations that outperform image-only models on robot perception tasks and remain robust when sensors degrade.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a sensor platform and 59-hour driving dataset that records synchronized data from stereo RGB, event cameras, LiDAR, thermal imaging, IMU, GPS, and proprioception across varied conditions including night and sensor failure. It trains a masked autoencoder that tokenizes each sensor type separately before late fusion, then caches those tokens so new measurements can be encoded without reprocessing the full history. This yields representations that improve accuracy on optical flow, depth, semantic segmentation, and ego-motion estimation compared with image-only baselines while running in milliseconds on both desktop and embedded GPUs. The approach matters because real robots must integrate heterogeneous sensors without dense labels and must continue to function when individual modalities become unreliable.

Core claim

By applying a late-fusion masked autoencoder that uses separate tokenizers for each sensor modality to account for their distinct spatiotemporal properties, the model learns unified representations from the OctoSense dataset; these representations support faster inference through token caching and deliver higher performance than image-only foundation models on downstream tasks while maintaining robustness under nighttime conditions or sensor degradation.

What carries the argument

Late-fusion masked autoencoder with modality-specific tokenizers and cached token inference

If this is right

Representations can be computed in 6.68 ms on a high-end GPU and 112 ms on an embedded Orin NX board.
Performance exceeds image-only models on optical flow, depth, semantic segmentation, and ego-motion estimation.
Predictions remain reliable at night and when individual sensors are degraded.
New measurements can be incorporated by caching modality-specific tokens without recomputing the entire sequence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same token-caching mechanism could support online adaptation on a moving robot by updating only the newest modality tokens.
The 59-hour dataset spanning day, night, and degraded conditions provides a ready benchmark for testing whether other multimodal architectures also gain robustness from late fusion.
Because each tokenizer is trained independently before fusion, the method could be extended by adding new sensor types without retraining the entire model from scratch.

Load-bearing premise

That separate tokenizers per sensor plus late fusion inside a masked autoencoder will automatically produce representations that transfer to better performance on the listed downstream tasks than single-modality training.

What would settle it

An evaluation on the same test splits that shows the multimodal model achieving equal or lower accuracy than the best image-only baseline on optical flow, depth estimation, semantic segmentation, and ego-motion metrics.

Figures

Figures reproduced from arXiv: 2606.27317 by Anthony Bisulco, Jeremy Wang, Kostas Daniilidis, Pratik Chaudhari, Randall Balestriero.

**Figure 1.** Figure 1: A) OctoSense has been deployed across a car and a Unitree Go2-W. This initial dataset release focuses on car driving sequences with a very small amount of data collected using the Unitree. B) This platform is unique because it contains diverse sensors such as stereo RGB and event cameras, a thermal imager, LiDAR, IMU, GPS and CAN bus data. These sensors have very different data rates, frequencies, and info… view at source ↗

**Figure 2.** Figure 2: Ground-truth of different downstream tasks in OctoSense We next discuss a multi-modal MAE architecture for sensors in OctoSense. These sensors have different spatiotemporal characteristics: dense 2D arrays for RGB, a high-frequency point process for the event camera, an unordered and sparse point cloud for LiDAR and a rapidly varying multi-channel time series for the IMU. This heterogeneity makes it chal… view at source ↗

**Figure 3.** Figure 3: A schematic of the late-fusion MAE encoder and probe architecture. The text provides more details. Sec. F elaborates upon the architecture with detailed schematics in Fig. S10. The multi-modal MAE ( [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

read the original abstract

We present OctoSense, an open-source sensor platform with stereo RGB and event cameras, LiDAR, a thermal camera, an inertial measurement unit, RTK-corrected global positioning system, and proprioception (CAN bus data from a car, and joint angles for a quadruped robot). The eponymous OctoSense dataset contains 59 hours of time-synchronized driving data across different types of environments at different times of the day, including situations with highly degraded sensors. We demonstrate multi-modal self-supervised learning using such real-world robotics data, where sensors have different representations, frequencies, latencies and noise. Our approach, a "late-fusion" masked autoencoder, (i) uses modality-specific tokenizers to account for different spatiotemporal characteristics of these sensors, and (ii) caches modality-specific tokens at inference time to process new measurements as they come. This architecture (i) is fast (6.68 ms and 112 ms on NVIDIA 5090 and Orin NX respectively, to compute the representation), (ii) performs better than existing image-only foundation models on tasks such as estimation of optical flow, depth, semantic segmentation, and ego-motion (translation, rotation, and steering angle), and (iii) predicts robustly at nighttime or in situations where sensory data is degraded. See our project page for links to the dataset, code, and supplementary videos: https://abisulco.com/octosense/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OctoSense supplies a genuinely new eight-sensor platform and 59-hour synchronized multimodal dataset that robotics people can use, but the superiority claims for the late-fusion MAE rest on assertions without numbers or baselines.

read the letter

OctoSense stands out for releasing both the physical sensor rig (stereo RGB, events, LiDAR, thermal, IMU, RTK-GPS, proprioception) and 59 hours of time-synchronized driving data that includes nighttime and degraded conditions. That scale and coverage across environments is new and fills a real gap for people who need multimodal data under realistic noise.

The late-fusion masked autoencoder with modality-specific tokenizers and token caching is a practical engineering choice for handling mismatched frequencies and latencies. The concrete inference times (6.68 ms on 5090, 112 ms on Orin NX) show the method was built to run online, which is useful.

The soft spot is the performance story. The abstract states the model beats image-only foundation models on optical flow, depth, segmentation, and ego-motion while staying robust at night, yet supplies no losses, splits, metrics, error bars, or named baselines. Without those details the central claim that late fusion plus caching produces the gains cannot be checked. If the full paper contains tables and ablations they need to be prominent; otherwise the empirical contribution stays untestable.

The dataset and platform alone make this worth attention for anyone working on self-supervised multimodal perception in robotics. A reader who wants to train or benchmark on real heterogeneous sensor streams will find the release directly usable.

It deserves peer review because the data contribution is substantial and the architecture is described enough to be implemented, even if the results section will need quantitative support to hold up.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces OctoSense, an open-source multimodal sensor platform and 59-hour dataset of time-synchronized driving data (stereo RGB, event cameras, LiDAR, thermal, IMU, RTK-GPS, proprioception) collected across varied environments and times of day. It proposes a late-fusion masked autoencoder that employs modality-specific tokenizers to handle differing spatiotemporal characteristics and caches modality-specific tokens for efficient online inference. The central claims are that this architecture runs at 6.68 ms (NVIDIA 5090) / 112 ms (Orin NX), outperforms existing image-only foundation models on optical flow, depth, semantic segmentation, and ego-motion (translation/rotation/steering), and remains robust under nighttime or degraded-sensor conditions.

Significance. If the empirical superiority and robustness claims are substantiated with quantitative evidence, the work would supply a large-scale, real-world multimodal robotics dataset and a practical architecture for heterogeneous sensor fusion in self-supervised learning, with potential impact on robust perception for autonomous driving and legged robots.

major comments (1)

[Abstract] Abstract: the claims that the late-fusion MAE 'performs better than existing image-only foundation models' on optical flow, depth, semantic segmentation, and ego-motion and 'predicts robustly at nighttime or in situations where sensory data is degraded' are unsupported; no metrics, baselines, error bars, training losses, data splits, or ablation results are supplied, rendering the central performance assertions unverifiable from the manuscript.

minor comments (1)

[Abstract] The manuscript references a project page for dataset, code, and videos but does not include any quantitative results or experimental protocol in the provided text, which should be added to the main body or supplementary material.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their review and for highlighting the need for verifiable support of the abstract claims. We address the single major comment below.

read point-by-point responses

Referee: [Abstract] Abstract: the claims that the late-fusion MAE 'performs better than existing image-only foundation models' on optical flow, depth, semantic segmentation, and ego-motion and 'predicts robustly at nighttime or in situations where sensory data is degraded' are unsupported; no metrics, baselines, error bars, training losses, data splits, or ablation results are supplied, rendering the central performance assertions unverifiable from the manuscript.

Authors: We agree that the abstract claims require direct quantitative support to be verifiable. The manuscript contains an Experiments section with the requested elements (comparisons against image-only MAE/DINO baselines on optical flow EPE, depth RMSE, segmentation mIoU, and ego-motion errors; robustness ablations under nighttime/degraded conditions; data splits; and training details). However, these were not sufficiently cross-referenced from the abstract. In the revision we will (i) insert key numerical results and error bars into the abstract, (ii) add an explicit pointer to the Experiments section and supplementary tables, and (iii) ensure all baselines, splits, and ablation results are clearly tabulated. This addresses the verifiability concern without altering the underlying claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims are empirical.

full rationale

The paper presents an empirical architecture (late-fusion masked autoencoder with modality-specific tokenizers and caching) and reports performance gains on downstream tasks versus image-only models. No load-bearing derivation, prediction, or uniqueness result reduces by construction to fitted inputs, self-citations, or ansatzes. All central claims rest on external experimental comparisons on the 59-hour dataset rather than any self-referential equation or parameter renaming. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no free parameters, axioms, or invented entities are explicitly introduced or fitted; the work relies on standard assumptions of self-supervised masked autoencoders and the utility of multimodal fusion.

pith-pipeline@v0.9.1-grok · 5800 in / 1245 out tokens · 23466 ms · 2026-06-26T05:23:38.929924+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

101 extracted references · 9 linked inside Pith

[1]

Caron, H

M. Caron, H. Touvron, I. Misra, H. J´egou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. InIEEE/CVF International Conference on Computer Vision, 2021

2021
[2]

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In IEEE/CVF International Conference on Computer Vision, 2023

2023
[3]

Tschannen, A

M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. M. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa, O. H’enaff, J. Harmsen, A. Steiner, and X.-Q. Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint 2502.14786, 2025. 8Cross-dataset generaliz...

Pith/arXiv arXiv 2025
[4]

Ryali, Y.-T

C. Ryali, Y.-T. Hu, D. Bolya, C. Wei, H. Fan, P.-Y. Huang, V. Aggarwal, A. Chowdhury, O. Poursaeed, J. Hoffman, J. Malik, Y. Li, and C. Feichtenhofer. Hiera: A hierarchical vision transformer without the bells-and-whistles. InInternational Conference on Machine Learning, 2023

2023
[5]

Y. Liu, S. Wang, Y. Xie, T. Xiong, and M. Wu. A review of sensing technologies for indoor autonomous mobile robots.Sensors, 24, 2024

2024
[6]

H. I. Christensen. Global robotics technology roadmap 2025–2035: A multi-regional, cross-domain strategic perspective for europe, asia, and the united states. Technology roadmap, University of California San Diego, April 2026. Version 1.02

2025
[7]

Bachmann, D

R. Bachmann, D. Mizrahi, A. Atanov, and A. Zamir. Multimae: Multi-modal multi-task masked autoencoders. InEuropean Conference on Computer Vision, 2022

2022
[8]

Girdhar, A

R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, and I. Misra. ImageBind one embedding space to bind them all.IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2023
[9]

Oquab, T

M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, R. Howes, P.-Y. Huang, H. Xu, V. Sharma, S.-W. Li, W. Galuba, M. Rabbat, M. Assran, N. Ballas, G. Synnaeve, I. Misra, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. DINOv2: Learning robust visual features without superv...

2025
[10]

Sim´eoni, H

O. Sim´eoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. J´egou, P. Labatut, and P. Bojanowski. DINOv3.arXiv preprint 2508.10104, 2025

Pith/arXiv arXiv 2025
[11]

Bolya, P.-Y

D. Bolya, P.-Y. Huang, P. Sun, J. H. Cho, A. Madotto, C. Wei, T. Ma, J. Zhi, J. Rajasegaran, H. A. Rasheed, J. Wang, M. Monteiro, H. Xu, S. Dong, N. Ravi, S.-W. Li, P. Doll’ar, and C. Feichtenhofer. Perception encoder: The best visual embeddings are not at the output of the network. InAdvances in Neural Information Processing Systems, 2025

2025
[12]

N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. K. Ryali, T. Ma, H. Khedr, R. R¨adle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C. Wu, R. B. Girshick, P. Doll’ar, and C. Feichtenhofer. SAM 2: Segment anything in images and videos. InInternational Conference on Learning Representations, 2025

2025
[13]

K. He, X. Chen, S. Xie, Y. Li, P. Doll’ar, and R. B. Girshick. Masked autoencoders are scalable vision learners.IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2022
[14]

Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu. SimMIM: a simple framework for masked image modeling.IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2022
[15]

Devlin, M.-W

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InNorth American Chapter of the Association for Computational Linguistics, 2019

2019
[16]

van den Oord, Y

A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding.arXiv preprint 1807.03748, 2018

Pith/arXiv arXiv 2018
[17]

T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton. A simple framework for contrastive learning of visual representations. InInternational Conference on Machine Learning, 2020

2020
[18]

J. Cao, J. Xing, N. Messikommer, and D. Scaramuzza. Generative event pretraining with foundation model alignment.arXiv preprint 2603.23032, 2026

Pith/arXiv arXiv 2026
[19]

Klenk, D

S. Klenk, D. Bonello, L. Koestler, and D. Cremers. Masked Event Modeling: Self-supervised pretraining for event cameras.IEEE/CVF Winter Conference on Applications of Computer Vision, 2022

2022
[20]

Y. Yang, L. Pan, and L. Liu. Event camera data dense pre-training. InEuropean Conference on Computer Vision, 2024

2024
[21]

R. Das, K. Daniilidis, and P. Chaudhari. Fast feature field (F3): A predictive representation of events.arXiv preprint 2509.25146, 2025. 12

arXiv 2025
[22]

Patel, J

M. Patel, J. Frey, M. Mittal, F. Yang, A. Hansson, A. Bar, C. Cadena, and M. Hutter. DeFM: Learning foundation representations from depth for robotics.arXiv preprint 2601.18923, abs/2601.18923, 2026

arXiv 2026
[23]

Y. Pang, W. Wang, F. E. Tay, W. Liu, Y. Tian, and L. Yuan. Masked autoencoders for point cloud self-supervised learning. InEuropean Conference on Computer Vision, 2022

2022
[24]

X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu. Point-BERT: Pre-training 3d point cloud transformers with masked point modeling. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2022
[25]

H. Wang, Q. Liu, X. Yue, J. Lasenby, and M. J. Kusner. Unsupervised point cloud pre-training via occlusion completion. InIEEE/CVF International Conference on Computer Vision, 2021

2021
[26]

S. Xie, J. Gu, D. Guo, C. Qi, L. J. Guibas, and O. Litany. PointContrast: Unsupervised pre-training for 3d point cloud understanding. InEuropean Conference on Computer Vision, 2020

2020
[27]

Munir, S

F. Munir, S. Azam, and M. Jeon. Sstn: Self-supervised domain adaptation thermal object detection for autonomous driving.IEEE/RSJ International Conference on Intelligent Robots and Systems, 2021

2021
[28]

Z¨ urn.Self-supervised and Multi-modal Learning for Perception in Mobile Robots and Autonomous Driving

J. Z¨ urn.Self-supervised and Multi-modal Learning for Perception in Mobile Robots and Autonomous Driving. PhD thesis, University of Freiburg, 2024

2024
[29]

Narayanswamy, X

G. Narayanswamy, X. Liu, K. Ayush, Y. Yang, X. Xu, S. Liao, J. Garrison, S. Tailor, J. Sunshine, Y. Liu, T. Althoff, S. Narayanan, P. Kohli, J. Zhan, M. Malhotra, S. N. Patel, S. Abdel-Ghaffar, and D. McDuff. Scaling wearable foundation models. InInternational Conference on Learning Representations, 2025

2025
[30]

H. Xu, P. Zhou, R. Tan, M. Li, and G. Shen. LIMU-BERT: Unleashing the potential of unlabeled data for imu sensing applications. InACM Conference on Embedded Networked Sensor Systems, 2021

2021
[31]

Y. Zong, O. M. Aodha, and T. M. Hospedales. Self-supervised multimodal learning: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47:5299–5318, 2023

2023
[32]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021

2021
[33]

X. Wang, R. Zhang, C. Shen, T. Kong, and L. Li. Dense contrastive learning for self-supervised visual pre-training. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

2021
[34]

Mizrahi, R

D. Mizrahi, R. Bachmann, O. F. Kar, T. Yeo, M. Gao, A. Dehghan, and A. Zamir. 4M: Massively multimodal masked modeling. InAdvances in Neural Information Processing Systems, 2023

2023
[35]

H. Bao, L. Dong, S. Piao, and F. Wei. BEit: BERT pre-training of image transformers. InInternational Conference on Learning Representations, 2022

2022
[36]

J. Lu, C. Clark, R. Zellers, R. Mottaghi, and A. Kembhavi. Unified-IO: A unified model for vision, language, and multi-modal tasks. InInternational Conference on Learning Representations, 2023

2023
[37]

J. Lu, C. Clark, S. Lee, Z. Zhang, S. Khosla, R. Marten, D. Hoiem, and A. Kembhavi. Unified-IO 2: Scaling autoregressive multimodal models with vision, language, audio, and action. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024
[38]

J. Zou, T. Huang, G. Yang, Z. Guo, and W. Zuo. UniM2AE: Multi-modal masked autoencoders with unified 3d representation for 3d perception in autonomous driving. InEuropean Conference on Computer Vision, 2024

2024
[39]

J. Sun, H. Zheng, Q. Zhang, A. Prakash, Z. M. Mao, and C. Xiao. CALICO: Self-supervised camera-lidar contrastive pre-training for bev perception. InInternational Conference on Learning Representations, 2024

2024
[40]

Geiger, P

A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361, 2012

2012
[41]

Caesar, V

H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. nuScenes: A multimodal dataset for autonomous driving.2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11618–11628, 2019. 13

2020
[42]

P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, V. Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. M. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y. Zhang, J. Shlens, Z. Chen, and D. Anguelov. Scalability in perception for autonomous driving: Waymo open dataset.2020 IEEE/CVF Conference on Compu...

2020
[43]

Wilson, W

B. Wilson, W. Qi, T. Agarwal, J. Lambert, J. Singh, S. Khandelwal, B. Pan, R. Kumar, A. Hartnett, J. K. Pontes, D. Ramanan, and J. Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting.ArXiv, abs/2301.00493, 2023

Pith/arXiv arXiv 2023
[44]

W. P. Maddern, G. Pascoe, C. Linegar, and P. Newman. 1 year, 1000 km: The oxford robotcar dataset.The International Journal of Robotics Research, 36:15 – 3, 2017

2017
[45]

Lisus, K

D. Lisus, K. M. Papais, C. L. Gentil, E. Preston-Krebs, A. Lambert, K. Y. Leung, and T. D. Barfoot. Boreas Road Trip: A multi-sensor autonomous driving dataset on challenging roads.ArXiv, abs/2602.16870, 2026

arXiv 2026
[46]

Carlevaris-Bianco, A

N. Carlevaris-Bianco, A. K. Ushani, and R. M. Eustice. University of Michigan North Campus long-term vision and lidar dataset.The International Journal of Robotics Research, 35:1023 – 1035, 2016

2016
[47]

Triest, M

S. Triest, M. Sivaprakasam, S. J. Wang, W. Wang, A. M. Johnson, and S. A. Scherer. TartanDrive: A large-scale dataset for learning off-road dynamics models.IEEE International Conference on Robotics and Automation, 2022

2022
[48]

Sivaprakasam, P

M. Sivaprakasam, P. Maheshwari, M. G. Castro, S. Triest, M. Nye, S. Willits, A. Saba, W. Wang, and S. A. Scherer. TartanDrive 2.0: More modalities and better infrastructure to further self-supervised learning research in off-road driving tasks.2024 IEEE International Conference on Robotics and Automation (ICRA), pages 12606–12606, 2024

2024
[49]

Diaz-Ruiz, Y

C. Diaz-Ruiz, Y. Xia, Y. You, J. Nino, J. Chen, J. Monica, X. Chen, K. Luo, Y. Wang, M. Emond, W.-L. Chao, B. Hariharan, K. Q. Weinberger, and M. E. Campbell. Ithaca365: Dataset and driving perception under repeated and challenging weather conditions.IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2022
[50]

Schafer, E

H. Schafer, E. Santana, A. Haden, and R. Biasini. A commute in data: The comma2k19 dataset.ArXiv, abs/1812.05752, 2018

Pith/arXiv arXiv 2018
[51]

PhysicalAI-Autonomous-Vehicles dataset

NVIDIA Corporation. PhysicalAI-Autonomous-Vehicles dataset. https://huggingface.co/datasets/nvidia/ PhysicalAI-Autonomous-Vehicles, 2025

2025
[52]

Gehrig, W

M. Gehrig, W. Aarents, D. Gehrig, and D. Scaramuzza. DSEC: A stereo event camera dataset for driving scenarios.IEEE Robot. and Autom. Lett., March 2021

2021
[53]

A. Z. Zhu, D. Thakur, T. ¨Ozaslan, B. Pfrommer, V. Kumar, and K. Daniilidis. The multivehicle stereo event camera dataset: An event camera dataset for 3d perception.IEEE Robt. and Autom. Lett., 3:2032–2039, Feb. 2018

2032
[54]

L. Gao, Y. Liang, J. Yang, S. Wu, C. Wang, J. Chen, and L. Kneip. VECtor: A versatile event-centric benchmark for multi-sensor slam.IEEE Robot. and Autom. Lett., 7(3):8217–8224, June 2022

2022
[55]

P. Chen, W. Guan, F. Huang, Y. Zhong, W. W. Wen, L.-T. Hsu, and P. Lu. ECMD: An event-centric multisensory driving dataset for slam.IEEE Transactions on Intelligent Vehicles, 9:407–416, 2023. URL https://api.semanticscholar.org/CorpusID:265033288

2023
[56]

Chaney, F

K. Chaney, F. Cladera, Z. Wang, A. Bisulco, M. A. Hsieh, C. Korpela, V. Kumar, C. J. Taylor, and K. Daniilidis. M3ED: Multi-robot, multi-sensor, multi-environment event dataset. InIEEE Conf. Comput. Vis. Pattern Recog. Workshop
[57]

A. J. Lee, Y. Cho, Y. sik Shin, A. Kim, and H. Myung. ViViD++ : Vision for visibility dataset.IEEE Robotics and Automation Letters, 7:6282–6289, 2022

2022
[58]

Perot, P

E. Perot, P. de Tournemire, D. O. Nitti, J. Masci, and A. Sironi. Learning to detect objects with a 1 megapixel event camera.Neural Information Processing Systems, 2020

2020
[59]

Binas, D

J. Binas, D. Neil, S.-C. Liu, and T. Delbruck. DDD17: End-to-end davis driving dataset.ArXiv, abs/1711.01458, 2017. 14

Pith/arXiv arXiv 2017
[60]

Y. Hu, J. Binas, D. Neil, S.-C. Liu, and T. Delbruck. DDD20 End-to-End Event Camera Driving Dataset: Fusing frames and events with deep learning for improved steering prediction.2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), pages 1–6, 2020

2020
[61]

Series h: Audiovisual and multimedia systems: Infrastructure of audiovisual services - cod- ing of moving video: High efficiency video coding

ITU-T. Series h: Audiovisual and multimedia systems: Infrastructure of audiovisual services - cod- ing of moving video: High efficiency video coding. Technical Report ITU-T H.265, International Telecommunication Union, 2026. Version 01/2026

2026
[62]

E. Olson. AprilTag: A robust and flexible visual fiducial system.IEEE International Conference on Robotics and Automation, 2011

2011
[63]

Pfrommer

B. Pfrommer. Frequency cam: Imaging periodic signals in real-time.arXiv preprint 2211.00198, 2022

arXiv 2022
[64]

Rehder, J

J. Rehder, J. Nikolic, T. Schneider, T. Hinzmann, and R. Siegwart. Extending kalibr: Calibrating the extrinsics of multiple IMUs and of individual axes. InIEEE International Conference on Robotics and Automation
[65]

Furgale, J

P. Furgale, J. Rehder, and R. Siegwart. Unified temporal and spatial calibration for multi-sensor systems. InIEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1280–1286, 2013

2013
[66]

W. Kabsch. A solution for the best rotation to relate two sets of vectors.Acta Crystallographica Section A, 32:922–923, 1976

1976
[67]

S. Umeyama. Least-squares estimation of transformation parameters between two point patterns.IEEE Transactions on Pattern Analysis and Machine Intelligence, 13:376–380, 1991

1991
[68]

Levenberg

K. Levenberg. A method for the solution of certain non – linear problems in least squares.Quarterly of Applied Mathematics, 2:164–168, 1944

1944
[69]

Google DeepMind. Gemma 4. https://deepmind.google/models/gemma/gemma-4/, 2026. Open model release

2026
[70]

Zhang, M

Y. Zhang, M. Li, D. Long, X. Zhang, H. Lin, B. Yang, P. Xie, A. Yang, D. Liu, J. Lin, F. Huang, and J. Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint 2506.05176, 2025

Pith/arXiv arXiv 2025
[71]

Douze, A

M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazar ´e, M. Lomeli, L. Hosseini, and H. J´egou. The Faiss library. 2024

2024
[72]

Malladi, T

M. Malladi, T. Guadagnino, L. Lobefaro, and C. Stachniss. A robust approach for lidar-inertial odometry without sensor-specific modeling.IEEE Robotics and Automation Letters, 11(6):7420–7427, 2026

2026
[73]

Sapkota, R

R. Sapkota, R. H. Cheppally, A. Sharda, and M. Karkee. YOLO26: Key architectural enhancements and performance benchmarking for real-time object detection.arXiv preprint 2509.25164, 2025

arXiv 2025
[74]

Kerssies, N

T. Kerssies, N. Cavagnero, A. Hermans, N. Norouzi, G. Averta, B. Leibe, G. Dubbelman, and D. de Geus. Your ViT is Secretly an Image Segmentation Model. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025
[75]

Dosovitskiy, L

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale.International Conference on Learning Representations, 2021

2021
[76]

Hawkes and P

T. Hawkes and P. Simonpieri. Signal coding using asynchronous delta modulation.IEEE Trans. on Comm., 22(5):729–731, March 1974

1974
[77]

Gallego, T

G. Gallego, T. Delbr¨ uck, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidis, and D. Scaramuzza. Event-based vision: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):154–180, 2022

2022
[78]

Delbruck

T. Delbruck. Frame-free dynamic digital vision. InProceedings of the International Symposium on Secure-Life Electronics, Advanced Electronics for Quality Life and Society, pages 21–26, 2008

2008
[79]

Gerstner, W

W. Gerstner, W. M. Kistler, R. Naud, and L. Paninski.Neuronal Dynamics: From Single Neurons to Networks and Models of Cognition. Cambridge University Press, 2014. 15

2014
[80]

Lagorce, G

X. Lagorce, G. Orchard, F. Galluppi, B. E. Shi, and R. B. Benosman. HOTS: A hierarchy of event-based time-surfaces for pattern recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence, 39: 1346–1359, 2017

2017

Showing first 80 references.

[1] [1]

Caron, H

M. Caron, H. Touvron, I. Misra, H. J´egou, J. Mairal, P. Bojanowski, and A. Joulin. Emerging properties in self-supervised vision transformers. InIEEE/CVF International Conference on Computer Vision, 2021

2021

[2] [2]

X. Zhai, B. Mustafa, A. Kolesnikov, and L. Beyer. Sigmoid loss for language image pre-training. In IEEE/CVF International Conference on Computer Vision, 2023

2023

[3] [3]

Tschannen, A

M. Tschannen, A. Gritsenko, X. Wang, M. F. Naeem, I. M. Alabdulmohsin, N. Parthasarathy, T. Evans, L. Beyer, Y. Xia, B. Mustafa, O. H’enaff, J. Harmsen, A. Steiner, and X.-Q. Zhai. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint 2502.14786, 2025. 8Cross-dataset generaliz...

Pith/arXiv arXiv 2025

[4] [4]

Ryali, Y.-T

C. Ryali, Y.-T. Hu, D. Bolya, C. Wei, H. Fan, P.-Y. Huang, V. Aggarwal, A. Chowdhury, O. Poursaeed, J. Hoffman, J. Malik, Y. Li, and C. Feichtenhofer. Hiera: A hierarchical vision transformer without the bells-and-whistles. InInternational Conference on Machine Learning, 2023

2023

[5] [5]

Y. Liu, S. Wang, Y. Xie, T. Xiong, and M. Wu. A review of sensing technologies for indoor autonomous mobile robots.Sensors, 24, 2024

2024

[6] [6]

H. I. Christensen. Global robotics technology roadmap 2025–2035: A multi-regional, cross-domain strategic perspective for europe, asia, and the united states. Technology roadmap, University of California San Diego, April 2026. Version 1.02

2025

[7] [7]

Bachmann, D

R. Bachmann, D. Mizrahi, A. Atanov, and A. Zamir. Multimae: Multi-modal multi-task masked autoencoders. InEuropean Conference on Computer Vision, 2022

2022

[8] [8]

Girdhar, A

R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, and I. Misra. ImageBind one embedding space to bind them all.IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

2023

[9] [9]

Oquab, T

M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, R. Howes, P.-Y. Huang, H. Xu, V. Sharma, S.-W. Li, W. Galuba, M. Rabbat, M. Assran, N. Ballas, G. Synnaeve, I. Misra, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski. DINOv2: Learning robust visual features without superv...

2025

[10] [10]

Sim´eoni, H

O. Sim´eoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, F. Massa, D. Haziza, L. Wehrstedt, J. Wang, T. Darcet, T. Moutakanni, L. Sentana, C. Roberts, A. Vedaldi, J. Tolan, J. Brandt, C. Couprie, J. Mairal, H. J´egou, P. Labatut, and P. Bojanowski. DINOv3.arXiv preprint 2508.10104, 2025

Pith/arXiv arXiv 2025

[11] [11]

Bolya, P.-Y

D. Bolya, P.-Y. Huang, P. Sun, J. H. Cho, A. Madotto, C. Wei, T. Ma, J. Zhi, J. Rajasegaran, H. A. Rasheed, J. Wang, M. Monteiro, H. Xu, S. Dong, N. Ravi, S.-W. Li, P. Doll’ar, and C. Feichtenhofer. Perception encoder: The best visual embeddings are not at the output of the network. InAdvances in Neural Information Processing Systems, 2025

2025

[12] [12]

N. Ravi, V. Gabeur, Y.-T. Hu, R. Hu, C. K. Ryali, T. Ma, H. Khedr, R. R¨adle, C. Rolland, L. Gustafson, E. Mintun, J. Pan, K. V. Alwala, N. Carion, C. Wu, R. B. Girshick, P. Doll’ar, and C. Feichtenhofer. SAM 2: Segment anything in images and videos. InInternational Conference on Learning Representations, 2025

2025

[13] [13]

K. He, X. Chen, S. Xie, Y. Li, P. Doll’ar, and R. B. Girshick. Masked autoencoders are scalable vision learners.IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2022

[14] [14]

Z. Xie, Z. Zhang, Y. Cao, Y. Lin, J. Bao, Z. Yao, Q. Dai, and H. Hu. SimMIM: a simple framework for masked image modeling.IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2022

[15] [15]

Devlin, M.-W

J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. InNorth American Chapter of the Association for Computational Linguistics, 2019

2019

[16] [16]

van den Oord, Y

A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive coding.arXiv preprint 1807.03748, 2018

Pith/arXiv arXiv 2018

[17] [17]

T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton. A simple framework for contrastive learning of visual representations. InInternational Conference on Machine Learning, 2020

2020

[18] [18]

J. Cao, J. Xing, N. Messikommer, and D. Scaramuzza. Generative event pretraining with foundation model alignment.arXiv preprint 2603.23032, 2026

Pith/arXiv arXiv 2026

[19] [19]

Klenk, D

S. Klenk, D. Bonello, L. Koestler, and D. Cremers. Masked Event Modeling: Self-supervised pretraining for event cameras.IEEE/CVF Winter Conference on Applications of Computer Vision, 2022

2022

[20] [20]

Y. Yang, L. Pan, and L. Liu. Event camera data dense pre-training. InEuropean Conference on Computer Vision, 2024

2024

[21] [21]

R. Das, K. Daniilidis, and P. Chaudhari. Fast feature field (F3): A predictive representation of events.arXiv preprint 2509.25146, 2025. 12

arXiv 2025

[22] [22]

Patel, J

M. Patel, J. Frey, M. Mittal, F. Yang, A. Hansson, A. Bar, C. Cadena, and M. Hutter. DeFM: Learning foundation representations from depth for robotics.arXiv preprint 2601.18923, abs/2601.18923, 2026

arXiv 2026

[23] [23]

Y. Pang, W. Wang, F. E. Tay, W. Liu, Y. Tian, and L. Yuan. Masked autoencoders for point cloud self-supervised learning. InEuropean Conference on Computer Vision, 2022

2022

[24] [24]

X. Yu, L. Tang, Y. Rao, T. Huang, J. Zhou, and J. Lu. Point-BERT: Pre-training 3d point cloud transformers with masked point modeling. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2022

[25] [25]

H. Wang, Q. Liu, X. Yue, J. Lasenby, and M. J. Kusner. Unsupervised point cloud pre-training via occlusion completion. InIEEE/CVF International Conference on Computer Vision, 2021

2021

[26] [26]

S. Xie, J. Gu, D. Guo, C. Qi, L. J. Guibas, and O. Litany. PointContrast: Unsupervised pre-training for 3d point cloud understanding. InEuropean Conference on Computer Vision, 2020

2020

[27] [27]

Munir, S

F. Munir, S. Azam, and M. Jeon. Sstn: Self-supervised domain adaptation thermal object detection for autonomous driving.IEEE/RSJ International Conference on Intelligent Robots and Systems, 2021

2021

[28] [28]

Z¨ urn.Self-supervised and Multi-modal Learning for Perception in Mobile Robots and Autonomous Driving

J. Z¨ urn.Self-supervised and Multi-modal Learning for Perception in Mobile Robots and Autonomous Driving. PhD thesis, University of Freiburg, 2024

2024

[29] [29]

Narayanswamy, X

G. Narayanswamy, X. Liu, K. Ayush, Y. Yang, X. Xu, S. Liao, J. Garrison, S. Tailor, J. Sunshine, Y. Liu, T. Althoff, S. Narayanan, P. Kohli, J. Zhan, M. Malhotra, S. N. Patel, S. Abdel-Ghaffar, and D. McDuff. Scaling wearable foundation models. InInternational Conference on Learning Representations, 2025

2025

[30] [30]

H. Xu, P. Zhou, R. Tan, M. Li, and G. Shen. LIMU-BERT: Unleashing the potential of unlabeled data for imu sensing applications. InACM Conference on Embedded Networked Sensor Systems, 2021

2021

[31] [31]

Y. Zong, O. M. Aodha, and T. M. Hospedales. Self-supervised multimodal learning: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 47:5299–5318, 2023

2023

[32] [32]

Radford, J

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021

2021

[33] [33]

X. Wang, R. Zhang, C. Shen, T. Kong, and L. Li. Dense contrastive learning for self-supervised visual pre-training. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021

2021

[34] [34]

Mizrahi, R

D. Mizrahi, R. Bachmann, O. F. Kar, T. Yeo, M. Gao, A. Dehghan, and A. Zamir. 4M: Massively multimodal masked modeling. InAdvances in Neural Information Processing Systems, 2023

2023

[35] [35]

H. Bao, L. Dong, S. Piao, and F. Wei. BEit: BERT pre-training of image transformers. InInternational Conference on Learning Representations, 2022

2022

[36] [36]

J. Lu, C. Clark, R. Zellers, R. Mottaghi, and A. Kembhavi. Unified-IO: A unified model for vision, language, and multi-modal tasks. InInternational Conference on Learning Representations, 2023

2023

[37] [37]

J. Lu, C. Clark, S. Lee, Z. Zhang, S. Khosla, R. Marten, D. Hoiem, and A. Kembhavi. Unified-IO 2: Scaling autoregressive multimodal models with vision, language, audio, and action. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

2024

[38] [38]

J. Zou, T. Huang, G. Yang, Z. Guo, and W. Zuo. UniM2AE: Multi-modal masked autoencoders with unified 3d representation for 3d perception in autonomous driving. InEuropean Conference on Computer Vision, 2024

2024

[39] [39]

J. Sun, H. Zheng, Q. Zhang, A. Prakash, Z. M. Mao, and C. Xiao. CALICO: Self-supervised camera-lidar contrastive pre-training for bev perception. InInternational Conference on Learning Representations, 2024

2024

[40] [40]

Geiger, P

A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? the kitti vision benchmark suite. 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 3354–3361, 2012

2012

[41] [41]

Caesar, V

H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. nuScenes: A multimodal dataset for autonomous driving.2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11618–11628, 2019. 13

2020

[42] [42]

P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine, V. Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. M. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y. Zhang, J. Shlens, Z. Chen, and D. Anguelov. Scalability in perception for autonomous driving: Waymo open dataset.2020 IEEE/CVF Conference on Compu...

2020

[43] [43]

Wilson, W

B. Wilson, W. Qi, T. Agarwal, J. Lambert, J. Singh, S. Khandelwal, B. Pan, R. Kumar, A. Hartnett, J. K. Pontes, D. Ramanan, and J. Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting.ArXiv, abs/2301.00493, 2023

Pith/arXiv arXiv 2023

[44] [44]

W. P. Maddern, G. Pascoe, C. Linegar, and P. Newman. 1 year, 1000 km: The oxford robotcar dataset.The International Journal of Robotics Research, 36:15 – 3, 2017

2017

[45] [45]

Lisus, K

D. Lisus, K. M. Papais, C. L. Gentil, E. Preston-Krebs, A. Lambert, K. Y. Leung, and T. D. Barfoot. Boreas Road Trip: A multi-sensor autonomous driving dataset on challenging roads.ArXiv, abs/2602.16870, 2026

arXiv 2026

[46] [46]

Carlevaris-Bianco, A

N. Carlevaris-Bianco, A. K. Ushani, and R. M. Eustice. University of Michigan North Campus long-term vision and lidar dataset.The International Journal of Robotics Research, 35:1023 – 1035, 2016

2016

[47] [47]

Triest, M

S. Triest, M. Sivaprakasam, S. J. Wang, W. Wang, A. M. Johnson, and S. A. Scherer. TartanDrive: A large-scale dataset for learning off-road dynamics models.IEEE International Conference on Robotics and Automation, 2022

2022

[48] [48]

Sivaprakasam, P

M. Sivaprakasam, P. Maheshwari, M. G. Castro, S. Triest, M. Nye, S. Willits, A. Saba, W. Wang, and S. A. Scherer. TartanDrive 2.0: More modalities and better infrastructure to further self-supervised learning research in off-road driving tasks.2024 IEEE International Conference on Robotics and Automation (ICRA), pages 12606–12606, 2024

2024

[49] [49]

Diaz-Ruiz, Y

C. Diaz-Ruiz, Y. Xia, Y. You, J. Nino, J. Chen, J. Monica, X. Chen, K. Luo, Y. Wang, M. Emond, W.-L. Chao, B. Hariharan, K. Q. Weinberger, and M. E. Campbell. Ithaca365: Dataset and driving perception under repeated and challenging weather conditions.IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022

2022

[50] [50]

Schafer, E

H. Schafer, E. Santana, A. Haden, and R. Biasini. A commute in data: The comma2k19 dataset.ArXiv, abs/1812.05752, 2018

Pith/arXiv arXiv 2018

[51] [51]

PhysicalAI-Autonomous-Vehicles dataset

NVIDIA Corporation. PhysicalAI-Autonomous-Vehicles dataset. https://huggingface.co/datasets/nvidia/ PhysicalAI-Autonomous-Vehicles, 2025

2025

[52] [52]

Gehrig, W

M. Gehrig, W. Aarents, D. Gehrig, and D. Scaramuzza. DSEC: A stereo event camera dataset for driving scenarios.IEEE Robot. and Autom. Lett., March 2021

2021

[53] [53]

A. Z. Zhu, D. Thakur, T. ¨Ozaslan, B. Pfrommer, V. Kumar, and K. Daniilidis. The multivehicle stereo event camera dataset: An event camera dataset for 3d perception.IEEE Robt. and Autom. Lett., 3:2032–2039, Feb. 2018

2032

[54] [54]

L. Gao, Y. Liang, J. Yang, S. Wu, C. Wang, J. Chen, and L. Kneip. VECtor: A versatile event-centric benchmark for multi-sensor slam.IEEE Robot. and Autom. Lett., 7(3):8217–8224, June 2022

2022

[55] [55]

P. Chen, W. Guan, F. Huang, Y. Zhong, W. W. Wen, L.-T. Hsu, and P. Lu. ECMD: An event-centric multisensory driving dataset for slam.IEEE Transactions on Intelligent Vehicles, 9:407–416, 2023. URL https://api.semanticscholar.org/CorpusID:265033288

2023

[56] [56]

Chaney, F

K. Chaney, F. Cladera, Z. Wang, A. Bisulco, M. A. Hsieh, C. Korpela, V. Kumar, C. J. Taylor, and K. Daniilidis. M3ED: Multi-robot, multi-sensor, multi-environment event dataset. InIEEE Conf. Comput. Vis. Pattern Recog. Workshop

[57] [57]

A. J. Lee, Y. Cho, Y. sik Shin, A. Kim, and H. Myung. ViViD++ : Vision for visibility dataset.IEEE Robotics and Automation Letters, 7:6282–6289, 2022

2022

[58] [58]

Perot, P

E. Perot, P. de Tournemire, D. O. Nitti, J. Masci, and A. Sironi. Learning to detect objects with a 1 megapixel event camera.Neural Information Processing Systems, 2020

2020

[59] [59]

Binas, D

J. Binas, D. Neil, S.-C. Liu, and T. Delbruck. DDD17: End-to-end davis driving dataset.ArXiv, abs/1711.01458, 2017. 14

Pith/arXiv arXiv 2017

[60] [60]

Y. Hu, J. Binas, D. Neil, S.-C. Liu, and T. Delbruck. DDD20 End-to-End Event Camera Driving Dataset: Fusing frames and events with deep learning for improved steering prediction.2020 IEEE 23rd International Conference on Intelligent Transportation Systems (ITSC), pages 1–6, 2020

2020

[61] [61]

Series h: Audiovisual and multimedia systems: Infrastructure of audiovisual services - cod- ing of moving video: High efficiency video coding

ITU-T. Series h: Audiovisual and multimedia systems: Infrastructure of audiovisual services - cod- ing of moving video: High efficiency video coding. Technical Report ITU-T H.265, International Telecommunication Union, 2026. Version 01/2026

2026

[62] [62]

E. Olson. AprilTag: A robust and flexible visual fiducial system.IEEE International Conference on Robotics and Automation, 2011

2011

[63] [63]

Pfrommer

B. Pfrommer. Frequency cam: Imaging periodic signals in real-time.arXiv preprint 2211.00198, 2022

arXiv 2022

[64] [64]

Rehder, J

J. Rehder, J. Nikolic, T. Schneider, T. Hinzmann, and R. Siegwart. Extending kalibr: Calibrating the extrinsics of multiple IMUs and of individual axes. InIEEE International Conference on Robotics and Automation

[65] [65]

Furgale, J

P. Furgale, J. Rehder, and R. Siegwart. Unified temporal and spatial calibration for multi-sensor systems. InIEEE/RSJ International Conference on Intelligent Robots and Systems, pages 1280–1286, 2013

2013

[66] [66]

W. Kabsch. A solution for the best rotation to relate two sets of vectors.Acta Crystallographica Section A, 32:922–923, 1976

1976

[67] [67]

S. Umeyama. Least-squares estimation of transformation parameters between two point patterns.IEEE Transactions on Pattern Analysis and Machine Intelligence, 13:376–380, 1991

1991

[68] [68]

Levenberg

K. Levenberg. A method for the solution of certain non – linear problems in least squares.Quarterly of Applied Mathematics, 2:164–168, 1944

1944

[69] [69]

Google DeepMind. Gemma 4. https://deepmind.google/models/gemma/gemma-4/, 2026. Open model release

2026

[70] [70]

Zhang, M

Y. Zhang, M. Li, D. Long, X. Zhang, H. Lin, B. Yang, P. Xie, A. Yang, D. Liu, J. Lin, F. Huang, and J. Zhou. Qwen3 embedding: Advancing text embedding and reranking through foundation models.arXiv preprint 2506.05176, 2025

Pith/arXiv arXiv 2025

[71] [71]

Douze, A

M. Douze, A. Guzhva, C. Deng, J. Johnson, G. Szilvasy, P.-E. Mazar ´e, M. Lomeli, L. Hosseini, and H. J´egou. The Faiss library. 2024

2024

[72] [72]

Malladi, T

M. Malladi, T. Guadagnino, L. Lobefaro, and C. Stachniss. A robust approach for lidar-inertial odometry without sensor-specific modeling.IEEE Robotics and Automation Letters, 11(6):7420–7427, 2026

2026

[73] [73]

Sapkota, R

R. Sapkota, R. H. Cheppally, A. Sharda, and M. Karkee. YOLO26: Key architectural enhancements and performance benchmarking for real-time object detection.arXiv preprint 2509.25164, 2025

arXiv 2025

[74] [74]

Kerssies, N

T. Kerssies, N. Cavagnero, A. Hermans, N. Norouzi, G. Averta, B. Leibe, G. Dubbelman, and D. de Geus. Your ViT is Secretly an Image Segmentation Model. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

2025

[75] [75]

Dosovitskiy, L

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale.International Conference on Learning Representations, 2021

2021

[76] [76]

Hawkes and P

T. Hawkes and P. Simonpieri. Signal coding using asynchronous delta modulation.IEEE Trans. on Comm., 22(5):729–731, March 1974

1974

[77] [77]

Gallego, T

G. Gallego, T. Delbr¨ uck, G. Orchard, C. Bartolozzi, B. Taba, A. Censi, S. Leutenegger, A. J. Davison, J. Conradt, K. Daniilidis, and D. Scaramuzza. Event-based vision: A survey.IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(1):154–180, 2022

2022

[78] [78]

Delbruck

T. Delbruck. Frame-free dynamic digital vision. InProceedings of the International Symposium on Secure-Life Electronics, Advanced Electronics for Quality Life and Society, pages 21–26, 2008

2008

[79] [79]

Gerstner, W

W. Gerstner, W. M. Kistler, R. Naud, and L. Paninski.Neuronal Dynamics: From Single Neurons to Networks and Models of Cognition. Cambridge University Press, 2014. 15

2014

[80] [80]

Lagorce, G

X. Lagorce, G. Orchard, F. Galluppi, B. E. Shi, and R. B. Benosman. HOTS: A hierarchy of event-based time-surfaces for pattern recognition.IEEE Transactions on Pattern Analysis and Machine Intelligence, 39: 1346–1359, 2017

2017