arxiv: 2606.27317 · v1 · pith:V2NXY2WEnew · submitted 2026-06-25 · 💻 cs.CV · cs.RO

OctoSense: Self-Supervised Learning for Multimodal Robot Perception

Pith reviewed 2026-06-26 05:23 UTC · model grok-4.3

classification 💻 cs.CV cs.RO
keywords multimodal self-supervised learning · masked autoencoder · sensor fusion · robot perception · event camera · LiDAR · thermal camera · ego-motion estimation

The pith

A late-fusion masked autoencoder with modality-specific tokenizers produces fast multimodal representations that outperform image-only models on robot perception tasks and remain robust when sensors degrade.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a sensor platform and 59-hour driving dataset that records synchronized data from stereo RGB, event cameras, LiDAR, thermal imaging, IMU, GPS, and proprioception across varied conditions including night and sensor failure. It trains a masked autoencoder that tokenizes each sensor type separately before late fusion, then caches those tokens so new measurements can be encoded without reprocessing the full history. This yields representations that improve accuracy on optical flow, depth, semantic segmentation, and ego-motion estimation compared with image-only baselines while running in milliseconds on both desktop and embedded GPUs. The approach matters because real robots must integrate heterogeneous sensors without dense labels and must continue to function when individual modalities become unreliable.

Core claim

By applying a late-fusion masked autoencoder that uses separate tokenizers for each sensor modality to account for their distinct spatiotemporal properties, the model learns unified representations from the OctoSense dataset; these representations support faster inference through token caching and deliver higher performance than image-only foundation models on downstream tasks while maintaining robustness under nighttime conditions or sensor degradation.
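
The idea of separate tokenizers feeding a late-fusion step can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the function names (`tokenize_image`, `tokenize_imu`, `late_fusion`), the patch and window sizes, and the use of simple averaging are all hypothetical, and a real model would project every token to a common width and pass the fused sequence through a transformer encoder.

```python
from typing import Dict, List, Tuple

Token = List[float]  # one token is a small feature vector


def tokenize_image(pixels: List[List[float]], patch: int = 2) -> List[Token]:
    """Dense 2D array -> one token per non-overlapping patch (mean intensity)."""
    tokens = []
    rows, cols = len(pixels), len(pixels[0])
    for r in range(0, rows, patch):
        for c in range(0, cols, patch):
            block = [pixels[i][j]
                     for i in range(r, min(r + patch, rows))
                     for j in range(c, min(c + patch, cols))]
            tokens.append([sum(block) / len(block)])
    return tokens


def tokenize_imu(samples: List[List[float]], window: int = 4) -> List[Token]:
    """Multi-channel time series -> one token per time window (per-channel mean)."""
    tokens = []
    for start in range(0, len(samples), window):
        chunk = samples[start:start + window]
        channels = len(chunk[0])
        tokens.append([sum(s[ch] for s in chunk) / len(chunk)
                       for ch in range(channels)])
    return tokens


def late_fusion(token_sets: Dict[str, List[Token]]) -> List[Tuple[str, Token]]:
    """Concatenate per-modality tokens, tagging each with its modality name."""
    fused = []
    for name in sorted(token_sets):
        fused.extend((name, tok) for tok in token_sets[name])
    return fused


# Hypothetical inputs: a 2x4 "image" and eight 3-channel IMU samples.
image = [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
imu = [[0.1, 0.0, 9.8]] * 8
fused = late_fusion({"rgb": tokenize_image(image), "imu": tokenize_imu(imu)})
```

Each sensor is reduced to tokens by a routine matched to its structure (spatial patches for the image, time windows for the IMU), and the modalities only meet at the fusion step, which is what "late" refers to here.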

What carries the argument

Late-fusion masked autoencoder with modality-specific tokenizers and cached token inference
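
The "masked" part of a masked autoencoder can be sketched as follows. This is a minimal illustration of the general technique, assuming random token masking and a reconstruction error measured on masked positions only, as in the original image MAE formulation; the text here does not state the mask ratio or loss that OctoSense uses, so `random_mask`, `masked_mse`, and the 0.75 ratio are illustrative assumptions.

```python
import random
from typing import List, Sequence, Tuple


def random_mask(num_tokens: int, mask_ratio: float,
                seed: int = 0) -> Tuple[List[int], List[int]]:
    """Split token indices into (visible, masked) sets for MAE-style training."""
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    rng = random.Random(seed)
    indices = list(range(num_tokens))
    rng.shuffle(indices)
    num_masked = int(num_tokens * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked


def masked_mse(target: Sequence[float], prediction: Sequence[float],
               masked: Sequence[int]) -> float:
    """Mean squared error computed on the masked positions only."""
    if not masked:
        return 0.0
    return sum((target[i] - prediction[i]) ** 2 for i in masked) / len(masked)


# Assumed mask ratio of 0.75, chosen for illustration only.
visible, masked = random_mask(num_tokens=8, mask_ratio=0.75)
```

The encoder would see only the `visible` tokens, and a decoder would be trained to reconstruct the `masked` ones.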

If this is right

  • Representations can be computed in 6.68 ms on an NVIDIA 5090 GPU and 112 ms on an embedded Orin NX board.
  • Performance exceeds image-only models on optical flow, depth, semantic segmentation, and ego-motion estimation.
  • Predictions remain reliable at night and when individual sensors are degraded.
  • New measurements can be incorporated by caching modality-specific tokens without recomputing the entire sequence.
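
The token caching mentioned in the last point can be illustrated with a small sketch. It assumes a simple "latest tokens per modality" policy; the class name `TokenCache`, its methods, and the toy tokenizers are hypothetical, and the paper's actual cache layout is not described in the text here.

```python
from typing import Callable, Dict, List, Tuple


class TokenCache:
    """Keeps the latest tokens per modality so only fresh data is re-tokenized."""

    def __init__(self, tokenizers: Dict[str, Callable[[list], List[list]]]):
        self.tokenizers = tokenizers
        self.tokens: Dict[str, List[list]] = {}
        self.tokenize_calls: Dict[str, int] = {name: 0 for name in tokenizers}

    def update(self, modality: str, measurement: list) -> None:
        """Tokenize one new measurement and overwrite that modality's entry."""
        self.tokens[modality] = self.tokenizers[modality](measurement)
        self.tokenize_calls[modality] += 1

    def fused(self) -> List[Tuple[str, list]]:
        """Return all cached tokens, tagged by modality, for a fusion encoder."""
        return [(name, tok)
                for name in sorted(self.tokens)
                for tok in self.tokens[name]]


# Toy tokenizers: each scalar reading becomes a one-element token.
cache = TokenCache({"lidar": lambda m: [[x] for x in m],
                    "imu": lambda m: [[x] for x in m]})
cache.update("lidar", [5.0, 6.0])  # slow sensor arrives once
cache.update("imu", [0.1])         # fast sensor arrives
cache.update("imu", [0.2])         # and again; the LiDAR tokens are reused
snapshot = cache.fused()
```

In this sketch the second IMU reading triggers one IMU tokenizer call and no LiDAR work, which is the saving that caching is meant to provide when sensors report at different rates.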

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same token-caching mechanism could support online adaptation on a moving robot by updating only the newest modality tokens.
  • The 59-hour dataset spanning day, night, and degraded conditions provides a ready benchmark for testing whether other multimodal architectures also gain robustness from late fusion.
  • Because each modality has its own tokenizer ahead of fusion, the method could plausibly be extended with new sensor types without retraining the entire model from scratch.

Load-bearing premise

That separate tokenizers per sensor plus late fusion inside a masked autoencoder will automatically produce representations that transfer to better performance on the listed downstream tasks than single-modality training.

What would settle it

An evaluation on the same test splits that shows the multimodal model achieving equal or lower accuracy than the best image-only baseline on optical flow, depth estimation, semantic segmentation, and ego-motion metrics.
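
The test described above can be written down as a small check. This is a sketch under stated assumptions: the metric names and their directions (lower is better for errors, higher is better for mIoU) are illustrative choices, and the numbers are made-up placeholders, not results from the paper.

```python
from typing import Dict

# True means a larger value is better; False means a smaller value is better.
HIGHER_IS_BETTER: Dict[str, bool] = {
    "optical_flow_epe": False,
    "depth_rmse": False,
    "segmentation_miou": True,
    "ego_motion_error": False,
}


def strictly_better(task: str, multimodal: float, baseline: float) -> bool:
    """True if the multimodal score beats the baseline in the task's direction."""
    if HIGHER_IS_BETTER[task]:
        return multimodal > baseline
    return multimodal < baseline


def claim_refuted(multimodal: Dict[str, float],
                  baseline: Dict[str, float]) -> bool:
    """True if the multimodal model is no better than the baseline on every task."""
    return not any(strictly_better(task, multimodal[task], baseline[task])
                   for task in HIGHER_IS_BETTER)


# Placeholder scores for illustration only (NOT from the paper).
placeholder_baseline = {"optical_flow_epe": 2.0, "depth_rmse": 5.0,
                        "segmentation_miou": 0.60, "ego_motion_error": 0.30}
placeholder_multimodal = {"optical_flow_epe": 1.5, "depth_rmse": 4.0,
                          "segmentation_miou": 0.65, "ego_motion_error": 0.20}
refuted = claim_refuted(placeholder_multimodal, placeholder_baseline)
```

Both models would need to be scored on the same test splits, with the baseline taken as the best image-only model per task, for the comparison to mean anything.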

Figures

Figures reproduced from arXiv: 2606.27317 by Anthony Bisulco, Jeremy Wang, Kostas Daniilidis, Pratik Chaudhari, Randall Balestriero.

Figure 1: A) OctoSense has been deployed across a car and a Unitree Go2-W. This initial dataset release focuses on car driving sequences with a very small amount of data collected using the Unitree. B) This platform is unique because it contains diverse sensors such as stereo RGB and event cameras, a thermal imager, LiDAR, IMU, GPS and CAN bus data. These sensors have very different data rates, frequencies, and info…
Figure 2: Ground-truth of different downstream tasks in OctoSense. We next discuss a multi-modal MAE architecture for sensors in OctoSense. These sensors have different spatiotemporal characteristics: dense 2D arrays for RGB, a high-frequency point process for the event camera, an unordered and sparse point cloud for LiDAR and a rapidly varying multi-channel time series for the IMU. This heterogeneity makes it chal…
Figure 3: A schematic of the late-fusion MAE encoder and probe architecture. The text provides more details. Sec. F elaborates upon the architecture with detailed schematics in Fig. S10. The multi-modal MAE (…
Original abstract

We present OctoSense, an open-source sensor platform with stereo RGB and event cameras, LiDAR, a thermal camera, an inertial measurement unit, RTK-corrected global positioning system, and proprioception (CAN bus data from a car, and joint angles for a quadruped robot). The eponymous OctoSense dataset contains 59 hours of time-synchronized driving data across different types of environments at different times of the day, including situations with highly degraded sensors. We demonstrate multi-modal self-supervised learning using such real-world robotics data, where sensors have different representations, frequencies, latencies and noise. Our approach, a "late-fusion" masked autoencoder, (i) uses modality-specific tokenizers to account for different spatiotemporal characteristics of these sensors, and (ii) caches modality-specific tokens at inference time to process new measurements as they come. This architecture (i) is fast (6.68 ms and 112 ms on NVIDIA 5090 and Orin NX respectively, to compute the representation), (ii) performs better than existing image-only foundation models on tasks such as estimation of optical flow, depth, semantic segmentation, and ego-motion (translation, rotation, and steering angle), and (iii) predicts robustly at nighttime or in situations where sensory data is degraded. See our project page for links to the dataset, code, and supplementary videos: https://abisulco.com/octosense/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces OctoSense, an open-source multimodal sensor platform and 59-hour dataset of time-synchronized driving data (stereo RGB, event cameras, LiDAR, thermal, IMU, RTK-GPS, proprioception) collected across varied environments and times of day. It proposes a late-fusion masked autoencoder that employs modality-specific tokenizers to handle differing spatiotemporal characteristics and caches modality-specific tokens for efficient online inference. The central claims are that this architecture runs at 6.68 ms (NVIDIA 5090) / 112 ms (Orin NX), outperforms existing image-only foundation models on optical flow, depth, semantic segmentation, and ego-motion (translation/rotation/steering), and remains robust under nighttime or degraded-sensor conditions.

Significance. If the empirical superiority and robustness claims are substantiated with quantitative evidence, the work would supply a large-scale, real-world multimodal robotics dataset and a practical architecture for heterogeneous sensor fusion in self-supervised learning, with potential impact on robust perception for autonomous driving and legged robots.

major comments (1)
  1. [Abstract] The claims that the late-fusion MAE 'performs better than existing image-only foundation models' on optical flow, depth, semantic segmentation, and ego-motion and 'predicts robustly at nighttime or in situations where sensory data is degraded' are unsupported; no metrics, baselines, error bars, training losses, data splits, or ablation results are supplied, rendering the central performance assertions unverifiable from the manuscript.
minor comments (1)
  1. [Abstract] The manuscript references a project page for dataset, code, and videos but does not include any quantitative results or experimental protocol in the provided text, which should be added to the main body or supplementary material.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their review and for highlighting the need for verifiable support of the abstract claims. We address the single major comment below.

Point-by-point responses
  1. Referee: [Abstract] The claims that the late-fusion MAE 'performs better than existing image-only foundation models' on optical flow, depth, semantic segmentation, and ego-motion and 'predicts robustly at nighttime or in situations where sensory data is degraded' are unsupported; no metrics, baselines, error bars, training losses, data splits, or ablation results are supplied, rendering the central performance assertions unverifiable from the manuscript.

    Authors: We agree that the abstract claims require direct quantitative support to be verifiable. The manuscript contains an Experiments section with the requested elements (comparisons against image-only MAE/DINO baselines on optical flow EPE, depth RMSE, segmentation mIoU, and ego-motion errors; robustness ablations under nighttime/degraded conditions; data splits; and training details). However, these were not sufficiently cross-referenced from the abstract. In the revision we will (i) insert key numerical results and error bars into the abstract, (ii) add an explicit pointer to the Experiments section and supplementary tables, and (iii) ensure all baselines, splits, and ablation results are clearly tabulated. This addresses the verifiability concern without altering the underlying claims. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims are empirical.

Full rationale

The paper presents an empirical architecture (late-fusion masked autoencoder with modality-specific tokenizers and caching) and reports performance gains on downstream tasks versus image-only models. No load-bearing derivation, prediction, or uniqueness result reduces by construction to fitted inputs, self-citations, or ansatzes. All central claims rest on external experimental comparisons on the 59-hour dataset rather than any self-referential equation or parameter renaming. This matches the default expectation for non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no free parameters, axioms, or invented entities are explicitly introduced or fitted; the work relies on standard assumptions of self-supervised masked autoencoders and the utility of multimodal fusion.
