pith. machine review for the scientific record.

arxiv: 2605.08084 · v1 · submitted 2026-05-08 · 💻 cs.RO · cs.CV

Recognition: 1 theorem link · Lean Theorem

123D: Unifying Multi-Modal Autonomous Driving Data at Scale

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 01:51 UTC · model grok-4.3

classification 💻 cs.RO · cs.CV
keywords autonomous driving · multi-modal data · dataset unification · event streams · 3D object detection · reinforcement learning · sensor synchronization · data consolidation

The pith

Treating each sensor modality as an independent timestamped event stream lets one API handle eight incompatible autonomous driving datasets at once.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents 123D as a framework that stores every modality from driving sensors as its own independent stream of timestamped events rather than enforcing fixed rates or synchronization schemes. This design removes format barriers that have kept datasets separate, allowing eight real-world collections spanning 3,300 hours and 90,000 kilometers plus one synthetic set to be loaded and queried through identical calls. The unified access supports direct statistical comparisons of annotations, pose accuracy, and calibration across sources. It also enables new experiments such as training 3D object detectors on data from multiple collections and running reinforcement learning for planning on the combined corpus.
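
A minimal sketch of what one such per-modality log could look like in the Apache Arrow format the toolkit adopts (see Figure 2); the schema and field names below are illustrative assumptions, not 123D's actual layout.

    # Sketch only: schema and file layout are assumptions, not 123D's format.
    import pyarrow as pa
    import pyarrow.ipc as ipc

    # Each modality is its own stream of timestamped events, at its native rate.
    camera = pa.table({
        "timestamp_us": pa.array([33_366, 66_733, 100_100], type=pa.int64()),
        "payload": pa.array([b"<jpeg bytes>"] * 3, type=pa.binary()),
    })
    lidar = pa.table({  # a different, slower rate is fine: streams are independent
        "timestamp_us": pa.array([50_000, 150_000], type=pa.int64()),
        "payload": pa.array([b"<packed points>"] * 2, type=pa.binary()),
    })

    # One log file per stream; consumers align streams at query time, not at write time.
    for name, table in [("camera", camera), ("lidar", lidar)]:
        with ipc.new_file(f"{name}.arrow", table.schema) as writer:
            writer.write_table(table)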

Core claim

By representing all modalities as independent timestamped event streams, 123D unifies multi-modal data from fragmented datasets into a single API that supports both synchronous and asynchronous access, enabling the consolidation of over 3,300 hours of real-world driving data and demonstrating applications in cross-dataset detection and reinforcement learning for planning.

What carries the argument

The independent timestamped event stream representation for each modality, which decouples timing from any prescribed rate and permits flexible synchronous or asynchronous querying across arbitrary datasets without custom loaders.
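
Concretely, the abstraction can be pictured as below: a minimal sketch assuming a stream is nothing more than a time-sorted list of (timestamp, payload) events, with class and method names that are illustrative rather than py123d's actual API.

    # Minimal sketch of the event-stream abstraction; names are illustrative.
    from bisect import bisect_left, bisect_right
    from dataclasses import dataclass
    from typing import Any, List, Optional

    @dataclass
    class Event:
        timestamp_us: int  # native sensor timestamp; no prescribed rate
        payload: Any       # raw modality content (image, point cloud, boxes, ...)

    class EventStream:
        """One modality from one log, sorted by its own native timestamps."""

        def __init__(self, events: List[Event]):
            self.events = sorted(events, key=lambda e: e.timestamp_us)
            self._ts = [e.timestamp_us for e in self.events]

        def latest_at(self, t_us: int) -> Optional[Event]:
            """Asynchronous access: most recent event at or before t_us."""
            i = bisect_right(self._ts, t_us)
            return self.events[i - 1] if i else None

        def window(self, t0_us: int, t1_us: int) -> List[Event]:
            """Synchronous access: every event inside [t0_us, t1_us]."""
            return self.events[bisect_left(self._ts, t0_us):bisect_right(self._ts, t1_us)]

Synchronizing across modalities then reduces to calling latest_at with one reference timestamp on several streams, while asynchronous consumers simply iterate each stream at its native rate; no per-dataset loader logic is involved.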

If this is right

  • Detectors trained on the combined data can be evaluated for generalization across different collection conditions and annotation conventions.
  • Reinforcement learning agents for driving policies gain access to a much larger and more diverse set of experiences drawn from the full 3,300-hour corpus.
  • Researchers can perform systematic audits of pose and calibration accuracy that were previously difficult to compare across sources.
  • Analysis and visualization tools become available for the entire collection without writing custom code for each original format.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same event-stream abstraction could extend to other robotics domains such as robotic manipulation or aerial navigation where sensor rates also vary widely.
  • If adopted as a release format, future datasets could avoid the fragmentation problem from the start by providing data directly in this structure.
  • The unified real and synthetic data opens concrete paths for controlled sim-to-real transfer experiments in both perception and planning.
  • Large-scale pretraining of driving policies on the full 90,000 km collection becomes feasible in the same way language models pretrain on text corpora.

Load-bearing premise

That storing each modality as an independent timestamped event stream preserves all necessary information and allows accurate synchronization or asynchronous access without introducing errors or losing fidelity from the original datasets' different rates and annotation conventions.

What would settle it

A side-by-side comparison showing that 123D-loaded synchronized frames from two datasets produce object labels or timing offsets that differ from the native loaders of those datasets would falsify the claim of lossless unification.
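
One way to run that test, sketched under the assumption that both the native loader and the 123D loader can be reduced to aligned (timestamp, label set) pairs for the same log; compare_loaders is a hypothetical helper, not part of either toolkit.

    # Hypothetical falsification harness; both inputs are assumed to be lists of
    # (timestamp_us, labels) pairs for the same synchronized frames of one log.
    def compare_loaders(native, unified, tol_us=0):
        """Collect timing offsets and label mismatches between two loaders."""
        issues = []
        for (t_nat, labels_nat), (t_uni, labels_uni) in zip(native, unified):
            if abs(t_nat - t_uni) > tol_us:
                issues.append(("timing", t_nat, t_uni))
            if set(labels_nat) != set(labels_uni):
                issues.append(("labels", t_nat, set(labels_nat) ^ set(labels_uni)))
        return issues  # any entry here would falsify lossless unification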

Figures

Figures reproduced from arXiv: 2605.08084 by Andreas Geiger, Bastian Berle, Boris Ivanovic, Changhui Jing, Daniel Dauner, Holger Caesar, Jiabao Wang, Kashyap Chitta, Long Nguyen, Maximilian Igl, Tianyu Li, Valentin Charraut, Yiyi Liao.

Figure 1: 123D. An open-source toolkit to consolidate fragmented driving data through a unified format for modalities such as annotations, sensors, and HD maps. By overcoming this fragmentation, 123D enables a wide range of cross-dataset applications and research directions, including scene reconstruction, cross-vehicle learning, and reinforcement-learning-based planning. view at source ↗
Figure 2: Architecture. We parse existing datasets from cloud/local storage, or collect data in simulation that we write to our unified Apache Arrow [24] log format (Sec. 3.1). The scene and map API enable access to logs, and can be passed to a dataloader, viewer, or other application (Sec. 3.2). view at source ↗
Figure 3: 3D Viewer. Analyzing driving recordings requires frequent visual inspections. We show visualizations of supported datasets in 3a-3i from our interactive 3D viewer based on Viser [71]. view at source ↗
Figure 4: Annotation of bounding boxes. We compare ego distance, speed, and acceleration (rows) over different semantic categories, grouped into vehicle, person, two-wheeler, obstacles, and other miscellaneous classes (columns). The histograms show frequencies in the range of 0-1 on a log scale. view at source ↗
Figure 5: Multi-view 3D Object Detection. Per-dataset nuScenes detection score (NDS) for PETR [49] and BEVFormer-S [44] for vehicle detection. We evaluate on held-out validation splits of each dataset and train on nuScenes, WOD-Perc., Av2-Sens., nuPlan, CARLA, or a uniform mixture of these five (Mixed-5, dashed). PandaSet, KITTI-360, and PAI-AV are never seen during training. view at source ↗
Figure 6: PufferDrive Planning [16]. Results. We summarize the results on held-out test scenes. view at source ↗
read the original abstract

The pursuit of autonomous driving has produced one of the richest sensor data collections in all of robotics. However, its scale and diversity remain largely untapped. Each dataset adopts different 2D and 3D modalities, such as cameras, lidar, ego states, annotations, traffic lights, and HD maps, with different rates and synchronization schemes. They come in fragmented formats requiring complex dependencies that cannot natively coexist in the same development environment. Further, major inconsistencies in annotation conventions prevent training or measuring generalization across multiple datasets. We present 123D, an open-source framework that unifies such multi-modal driving data through a single API. To handle synchronization, we store each modality as an independent timestamped event stream with no prescribed rate, enabling synchronous or asynchronous access across arbitrary datasets. Using 123D, we consolidate eight real-world driving datasets spanning 3,300 hours and 90,000 kilometers, together with a synthetic dataset with configurable collection scripts, and provide tools for data analysis and visualization. We conduct a systematic study comparing annotation statistics and assessing each dataset's pose and calibration accuracy. Further, we showcase two applications 123D enables: cross-dataset 3D object detection transfer and reinforcement learning for planning, and offer recommendations for future directions. Code and documentation are available at https://github.com/kesai-labs/py123d.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents 123D, an open-source framework that unifies multi-modal autonomous driving data from eight real-world datasets (3,300 hours, 90,000 km) and one synthetic dataset via a single API. Each modality is stored as an independent timestamped event stream with no prescribed rate to support synchronous or asynchronous access despite differing original rates, formats, and annotation conventions. The work includes tools for analysis and visualization, a systematic comparison of annotation statistics and pose/calibration accuracy, and two applications: cross-dataset 3D object detection transfer and reinforcement learning for planning.

Significance. If the unification preserves fidelity, 123D would be a substantial contribution by lowering barriers to large-scale cross-dataset training and evaluation in autonomous driving. The open-source release, scale of consolidation, and demonstrated applications add practical value; the systematic accuracy study is a positive step toward reproducibility.

major comments (2)
  1. [§3] §3 (Data Unification and Event Streams): The central claim that representing every modality as an independent timestamped event stream preserves all necessary information and permits exact reconstruction of original synchronous tuples without interpolation artifacts or loss of fidelity is load-bearing for the cross-dataset applications, yet the manuscript provides no quantitative validation (e.g., timestamp round-trip error, rate-mismatch reconstruction error, or annotation IoU before/after schema mapping) across the eight heterogeneous datasets.
  2. [§4] §4 (Annotation Statistics and Accuracy Study): The systematic comparison of annotation conventions and pose/calibration accuracy is useful, but the paper does not report how inconsistent 3D box conventions or traffic-light taxonomies were mapped into the common schema, nor any drift metrics; this directly affects the reliability of the cross-dataset 3D detection transfer results shown later.
minor comments (2)
  1. [§5] The abstract and §5 mention 'configurable collection scripts' for the synthetic dataset, but the manuscript does not specify the exact parameters or randomization ranges used, which would aid reproducibility.
  2. Figure captions for the visualization tools could more explicitly state which modalities are overlaid in each panel to improve clarity for readers unfamiliar with the original dataset formats.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the practical value of 123D. We address the major comments point by point below, indicating where revisions will be made to improve rigor and transparency.

read point-by-point responses
  1. Referee: [§3] §3 (Data Unification and Event Streams): The central claim that representing every modality as an independent timestamped event stream preserves all necessary information and permits exact reconstruction of original synchronous tuples without interpolation artifacts or loss of fidelity is load-bearing for the cross-dataset applications, yet the manuscript provides no quantitative validation (e.g., timestamp round-trip error, rate-mismatch reconstruction error, or annotation IoU before/after schema mapping) across the eight heterogeneous datasets.

    Authors: We agree that explicit quantitative validation would strengthen the central claim. The event-stream design stores each modality with its native timestamps and raw content, performing no interpolation, resampling, or data alteration; reconstruction of original tuples is achieved by time-window queries on the independent streams (sketched after these responses). However, the submitted manuscript indeed lacks the requested metrics. In revision we will add a dedicated validation subsection to §3 that reports timestamp round-trip errors and synchronous reconstruction fidelity on representative subsets of the eight datasets, plus before/after annotation IoU statistics for the schema mappings. revision: partial

  2. Referee: [§4] §4 (Annotation Statistics and Accuracy Study): The systematic comparison of annotation conventions and pose/calibration accuracy is useful, but the paper does not report how inconsistent 3D box conventions or traffic-light taxonomies were mapped into the common schema, nor any drift metrics; this directly affects the reliability of the cross-dataset 3D detection transfer results shown later.

    Authors: We concur that the mapping procedures and any drift metrics must be documented for reproducibility. The current text describes the target schema but omits the concrete rules used for 3D box coordinate-frame unification, orientation conventions, and traffic-light category harmonization. In the revised §4 we will insert a table and textual description of these mappings together with any available quantitative drift or accuracy metrics drawn from the original dataset releases and our own analysis. This addition will directly support interpretation of the cross-dataset transfer experiments. revision: yes
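
To make the mechanism in response 1 concrete, the sketch below rebuilds a synchronous tuple purely by nearest-event lookup over independent streams, with no interpolation or resampling; the stream format and tolerance are assumptions for illustration.

    # Sketch of synchronous-tuple reconstruction via time-window queries only.
    # Each stream is assumed to be a sorted list of (timestamp_us, payload) pairs.
    def reconstruct_tuple(streams, t_us, tol_us=50_000):
        """For each modality, pick the event nearest t_us within +/- tol_us."""
        frame = {}
        for name, events in streams.items():
            best = min(events, key=lambda e: abs(e[0] - t_us), default=None)
            if best is not None and abs(best[0] - t_us) <= tol_us:
                frame[name] = best
        return frame  # compare against the native loader's synchronized sample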

Circularity Check

0 steps flagged

No circularity: software framework with no derivations or self-referential reductions

full rationale

The paper describes a data unification framework that stores modalities as independent timestamped event streams to enable cross-dataset access. No mathematical derivations, fitted parameters, predictions, uniqueness theorems, or ansatzes are present. Claims concern the existence of the released tool, its coverage of eight datasets, and two downstream applications; these are externally verifiable via the open-source code and data rather than reducing to self-citations or inputs by construction. The central design choice is presented as an engineering decision, not derived from prior results in a load-bearing way.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Because 123D is a data unification framework, its central claim rests on the domain assumption that multi-modal sensor streams can be losslessly represented as independent timestamped events and that the provided tools faithfully expose original dataset properties.

axioms (1)
  • domain assumption Multi-modal driving data from heterogeneous datasets can be represented as independent timestamped event streams without loss of synchronization or annotation fidelity.
    Invoked to justify the single-API design and cross-dataset access.

pith-pipeline@v0.9.0 · 5587 in / 1424 out tokens · 47991 ms · 2026-05-11T01:51:58.627724+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages

  1. [1]

    High performance i/o for large scale deep learning

    Alex Aizman, Gavin Maltby, and Thomas Breuel. High performance i/o for large scale deep learning. In 2019 IEEE International Conference on Big Data (Big Data), pages 5965–5967. IEEE, 2019

  2. [2]

    π0: A vision-language-action flow model for general robot control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv.org, 2024

  3. [3]

    G. Bradski. The OpenCV Library. Dr. Dobb’s Journal of Software Tools, 2000

  4. [4]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  5. [5]

    Lerobot: State-of-the-art machine learning for real-world robotics in pytorch

    Remi Cadene, Simon Alibert, Alexander Soare, Quentin Gallouedec, Adil Zouitine, Steven Palma, Pepijn Kooijmans, Michel Aractingi, Mustafa Shukor, Dana Aubakirova, Martino Russi, Francesco Capuano, Caroline Pascal, Jade Choghari, Jess Moss, and Thomas Wolf. Lerobot: State-of-the-art machine learning for real-world robotics in pytorch. https://github.com/h...

  6. [6]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020

  7. [7]

    Pseudo-simulation for autonomous driving

    Wei Cao, Marcel Hallgarten, Tianyu Li, Daniel Dauner, Xunjiang Gu, Caojun Wang, Yakov Miron, Marco Aiello, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. Pseudo-simulation for autonomous driving. In Proc. Conf. on Robot Learning (CoRL), 2025

  8. [8]

    Sam 3: Segment anything with concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts. Proc. of the International Conf. on Learning Representations (ICLR), 2026

  9. [9]

    Unified domain generalization and adaptation for multi-view 3d object detection

    Gyusam Chang, Jiwon Lee, Donghyun Kim, Jinkyu Kim, Dongwook Lee, Daehyun Ji, Sujin Jang, and Sangpil Kim. Unified domain generalization and adaptation for multi-view 3d object detection. Advances in Neural Information Processing Systems (NeurIPS), 2024

  10. [10]

    Argoverse: 3d tracking and forecasting with rich maps

    Ming-Fang Chang, John Lambert, Patsorn Sangkloy, Jagjeet Singh, Slawomir Bak, Andrew Hartnett, De Wang, Peter Carr, Simon Lucey, Deva Ramanan, et al. Argoverse: 3d tracking and forecasting with rich maps. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019

  11. [11]

    Olmix: A framework for data mixing throughout lm development

    Mayee F Chen, Tyler Murray, David Heineman, Matt Jordan, Hannaneh Hajishirzi, Christopher Ré, Luca Soldaini, and Kyle Lo. Olmix: A framework for data mixing throughout lm development. arXiv.org, 2026

  12. [12]

    Omnire: Omni urban scene reconstruction

    Ziyu Chen, Jiawei Yang, Jiahui Huang, Riccardo de Lutio, Janick Martinez Esturo, Boris Ivanovic, Or Litany, Zan Gojcic, Sanja Fidler, Marco Pavone, et al. Omnire: Omni urban scene reconstruction. Proc. of the International Conf. on Learning Representations (ICLR), 2025

  13. [13]

    Masked-attention mask transformer for universal image segmentation

    Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2022

  14. [14]

    Open x-embodiment: Robotic learning datasets and rt-x models

    OX-Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, et al. Open x-embodiment: Robotic learning datasets and rt-x models. In Proc. IEEE International Conf. on Robotics and Automation (ICRA), 2023

  15. [15]

    MMDetection3D: OpenMMLab next-generation platform for general 3D object detection

    MMDetection3D Contributors. MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. https://github.com/open-mmlab/mmdetection3d, 2020

  16. [16]

    PufferDrive: A fast and friendly driving simulator for training and evaluating RL agents, 2026

    Daphne Cornelisse*, Spencer Cheng*, Pragnay Mandavilli, Julian Hunt, Kevin Joseph, Waël Doulazmi, Valentin Charraut, Aditya Gupta, Joseph Suarez, and Eugene Vinitsky. PufferDrive: A fast and friendly driving simulator for training and evaluating RL agents, 2026. URL https://github.com/Emerge-Lab/PufferDrive

  17. [17]

    Robust autonomy emerges from self-play

    Marco Cusumano-Towner, David Hafner, Alex Hertzberg, Brody Huval, Aleksei Petrenko, Eugene Vinitsky, Erik Wijmans, Taylor Killian, Stuart Bowers, Ozan Sener, et al. Robust autonomy emerges from self-play. Proc. of the International Conf. on Machine Learning (ICML), 2025

  18. [18]

    Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking

    Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. Advances in Neural Information Processing Systems (NeurIPS), 2024

  19. [19]

    Refav: Towards planning-centric scenario mining

    Cainan Davidson, Deva Ramanan, and Neehar Peri. Refav: Towards planning-centric scenario mining. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2026

  20. [20]

    CARLA: An open urban driving simulator

    Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In Proc. Conf. on Robot Learning (CoRL), 2017

  21. [21]

    Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset

    Scott Ettinger, Shuyang Cheng, Benjamin Caine, Chenxi Liu, Hang Zhao, Sabeek Pradhan, Yuning Chai, Ben Sapp, Charles R Qi, Yin Zhou, et al. Large scale interactive motion forecasting for autonomous driving: The waymo open motion dataset. In Proc. of the IEEE International Conf. on Computer Vision (ICCV), 2021

  22. [22]

    Unitraj: A unified framework for scalable vehicle trajectory prediction

    Lan Feng, Mohammadhossein Bahari, Kaouther Messaoud Ben Amor, Éloi Zablocki, Matthieu Cord, and Alexandre Alahi. Unitraj: A unified framework for scalable vehicle trajectory prediction. In Proc. of the European Conf. on Computer Vision (ECCV), 2024

  23. [23]

    Common crawl

    Common Crawl Foundation. Common crawl. https://commoncrawl.org, 2026

  24. [24]

    Apache arrow

    The Apache Software Foundation. Apache arrow. https://github.com/apache/arrow, 2026

  25. [25]

    Are we ready for autonomous driving? The KITTI vision benchmark suite

    Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2012

  26. [26]

    The llama 3 herd of models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv.org, 2024

  27. [27]

    Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research

    Cole Gulino, Justin Fu, Wenjie Luo, George Tucker, Eli Bronstein, Yiren Lu, Jean Harb, Xinlei Pan, Yan Wang, Xiangyu Chen, et al. Waymax: An accelerated, data-driven simulator for large-scale autonomous driving research. Advances in Neural Information Processing Systems (NeurIPS), 2023

  28. [28]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016

  29. [29]

    One thousand and one hours: Self-driving motion prediction dataset

    John Houston, Guido Zuidhof, Luca Bergamini, Yawei Ye, Long Chen, Ashesh Jain, Sammy Omari, Vladimir Iglovikov, and Peter Ondruska. One thousand and one hours: Self-driving motion prediction dataset. In Proc. Conf. on Robot Learning (CoRL), 2021

  30. [30]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2023

  31. [31]

    Opengis® implementation standard for geographic information - simple feature access - part 1: Common architecture

    Open Geospatial Consortium Inc. Opengis® implementation standard for geographic information - simple feature access - part 1: Common architecture. https://www.ogc.org/standards/sfa, 2011

  32. [32]

    ISO 8855:2011(en) Road vehicles — Vehicle dynamics and road-holding ability — Vocabulary

    International Organization for Standardization. ISO 8855:2011(en) Road vehicles — Vehicle dynamics and road-holding ability — Vocabulary. https://www.iso.org/obp/ui/en/#iso:std:iso:8855:, 2011

  33. [33]

    trajdata: A unified interface to multiple human trajectory datasets

    Boris Ivanovic, Guanyu Song, Igor Gilitschenski, and Marco Pavone. trajdata: A unified interface to multiple human trajectory datasets. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  34. [34]

    Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving

    Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. Advances in Neural Information Processing Systems (NeurIPS), 2024

  35. [35]

    Towards learning-based planning: The nuplan benchmark for real-world autonomous driving

    Napat Karnchanachari, Dimitris Geromichalos, Kok Seang Tan, Nanxiang Li, Christopher Eriksen, Shakiba Yaghoubi, Noushin Mehdipour, Gianmarco Bernasconi, Whye Kit Fong, Yiluan Guo, et al. Towards learning-based planning: The nuplan benchmark for real-world autonomous driving. In Proc. IEEE International Conf. on Robotics and Automation (ICRA), 2024

  36. [36]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4), 2023

  37. [37]

    Coin3d: Revisiting configuration-invariant multi-camera 3d object detection

    Zhaonian Kuang, Rui Ding, Haotian Wang, Xinhu Zheng, Meng Yang, and Gang Hua. Coin3d: Revisiting configuration-invariant multi-camera 3d object detection. Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2026

  38. [38]

    Terraseg: Self-supervised ground segmentation for any lidar

    Ted Lentsch, Santiago Montiel-Marín, Holger Caesar, and Dariu M Gavrila. Terraseg: Self-supervised ground segmentation for any lidar. Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2026

  39. [39]

    Str: A simple and efficient algorithm for r-tree packing

    Scott T Leutenegger, Mario A Lopez, and Jeffrey Edgington. Str: A simple and efficient algorithm for r-tree packing. In Proceedings 13th International Conference on Data Engineering, pages 497–506. IEEE, 1997

  40. [40]

    Datasets: A community library for natural language processing

    Quentin Lhoest, Albert Villanova Del Moral, Yacine Jernite, Abhishek Thakur, Patrick Von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, et al. Datasets: A community library for natural language processing. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021

  41. [41]

    Scenarionet: Open-source platform for large-scale traffic scenario simulation and modeling

    Quanyi Li, Zhenghao Mark Peng, Lan Feng, Zhizheng Liu, Chenda Duan, Wenjie Mo, and Bolei Zhou. Scenarionet: Open-source platform for large-scale traffic scenario simulation and modeling. Advances in Neural Information Processing Systems (NeurIPS), 2023

  42. [42]

    Mtgs: Multi-traversal gaussian splatting

    Tianyu Li, Yihang Qiu, Zhenhua Wu, Carl Lindström, Peng Su, Matthias Nießner, and Hongyang Li. Mtgs: Multi-traversal gaussian splatting. arXiv.org, 2025

  43. [43]

    Tactics2d: A highly modular and extensible simulator for driving decision-making

    Yueyuan Li, Songan Zhang, Mingyang Jiang, Xingyuan Chen, Jing Yang, Yeqiang Qian, Chunxiang Wang, and Ming Yang. Tactics2d: A highly modular and extensible simulator for driving decision-making. IEEE Transactions on Intelligent Vehicles (T-IV), 2024

  44. [44]

    Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers

    Zhiqi Li, Wenhai Wang, Hongyang Li, Enze Xie, Chonghao Sima, Tong Lu, Qiao Yu, and Jifeng Dai. Bevformer: learning bird’s-eye-view representation from lidar-camera via spatiotemporal transformers. Proc. of the European Conf. on Computer Vision (ECCV), 2022

  45. [45]

    Is ego status all you need for open-loop end-to-end autonomous driving?

    Zhiqi Li, Zhiding Yu, Shiyi Lan, Jiahan Li, Jan Kautz, Tong Lu, and Jose M Alvarez. Is ego status all you need for open-loop end-to-end autonomous driving? In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2024

  46. [46]

    Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d

    Yiyi Liao, Jun Xie, and Andreas Geiger. Kitti-360: A novel dataset and benchmarks for urban scene understanding in 2d and 3d. IEEE Trans. on Pattern Analysis and Machine Intelligence (PAMI), 2022

  47. [47]

    Depth anything 3: Recovering the visual space from any views

    Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. arXiv.org, 2025

  48. [48]

    A survey on autonomous driving datasets: Statistics, annotation quality, and a future outlook

    Mingyu Liu, Ekim Yurtsever, Jonathan Fossaert, Xingcheng Zhou, Walter Zimmer, Yuning Cui, Bare Luka Zagar, and Alois C Knoll. A survey on autonomous driving datasets: Statistics, annotation quality, and a future outlook. IEEE Transactions on Intelligent Vehicles (T-IV), 2024

  49. [49]

    Petr: Position embedding transformation for multi-view 3d object detection

    Yingfei Liu, Tiancai Wang, Xiangyu Zhang, and Jian Sun. Petr: Position embedding transformation for multi-view 3d object detection. In Proc. of the European Conf. on Computer Vision (ECCV), 2022

  50. [50]

    Lead: Minimizing learner-expert asymmetry in end-to-end driving

    Long Nguyen, Micha Fauth, Bernhard Jaeger, Daniel Dauner, Maximilian Igl, Andreas Geiger, and Kashyap Chitta. Lead: Minimizing learner-expert asymmetry in end-to-end driving. Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2026

  51. [51]

    PhysicalAI-Autonomous-Vehicles

    NVIDIA. PhysicalAI-Autonomous-Vehicles. https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles, 2025. Hugging Face dataset

  52. [52]

    NVIDIA DRIVE Hyperion: L4-Ready autonomous vehicle platform

    NVIDIA. NVIDIA DRIVE Hyperion: L4-Ready autonomous vehicle platform. https://www.nvidia.com/en-us/solutions/autonomous-vehicles/drive-hyperion/, 2026. NVIDIA product page

  53. [53]

    PhysicalAI-Autonomous-Vehicles-NCore

    NVIDIA. PhysicalAI-Autonomous-Vehicles-NCore. https://huggingface.co/datasets/nvidia/PhysicalAI-Autonomous-Vehicles-NCore, 2026. Hugging Face dataset

  54. [54]

    Fastgs: Training 3d gaussian splatting in 100 seconds

    Shiwei Ren, Tianci Wen, Yongchun Fang, and Biao Lu. Fastgs: Training 3d gaussian splatting in 100 seconds. Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2026

  55. [55]

    Drivelm: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. In Proc. of the European Conf. on Computer Vision (ECCV), 2024

  56. [56]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020

  57. [57]

    Openpcdet: An open-source toolbox for 3d object detection from point clouds

    OpenPCDet Development Team. Openpcdet: An open-source toolbox for 3d object detection from point clouds. https://github.com/open-mmlab/OpenPCDet, 2020

  58. [58]

    Neurad: Neural rendering for autonomous driving

    Adam Tonderski, Carl Lindström, Georg Hess, William Ljungbergh, Lennart Svensson, and Christoffer Petersson. Neurad: Neural rendering for autonomous driving. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2024

  59. [59]

    Kiss-icp: In defense of point-to-point icp–simple, accurate, and robust registration if done the right way

    Ignacio Vizzo, Tiziano Guadagnino, Benedikt Mersch, Louis Wiesmann, Jens Behley, and Cyrill Stachniss. Kiss-icp: In defense of point-to-point icp–simple, accurate, and robust registration if done the right way. IEEE Robotics and Automation Letters (RA-L), 8(2):1029–1036, 2023

  60. [60]

    Towards domain generalization for multi-view 3d object detection in bird-eye-view

    Shuo Wang, Xinhai Zhao, Hai-Ming Xu, Zehui Chen, Dameng Yu, Jiahao Chang, Zhen Yang, and Feng Zhao. Towards domain generalization for multi-view 3d object detection in bird-eye-view. Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2023

  61. [61]

    Train in germany, test in the usa: Making 3d object detectors generalize

    Yan Wang, Xiangyu Chen, Yurong You, Li Erran Li, Bharath Hariharan, Mark Campbell, Kilian Q Weinberger, and Wei-Lun Chao. Train in germany, test in the usa: Making 3d object detectors generalize. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2020

  62. [62]

    Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail

    Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail. arXiv.org, 2025

  63. [63]

    Safe, routine, ready: Autonomous driving in five new cities

    Waymo. Safe, routine, ready: Autonomous driving in five new cities. https://waymo.com/blog/2025/11/safe-routine-ready-autonomous-driving-in-new-cities/, 2025. Waymo blog post

  64. [64]

    Beginning fully autonomous operations with the 6th-generation Waymo driver

    Waymo. Beginning fully autonomous operations with the 6th-generation Waymo driver. https://waymo.com/blog/2026/02/ro-on-6th-gen-waymo-driver/, 2026. Waymo blog post

  65. [65]

    Crossing the pond and beyond: Generalizable AI driving for global deployment

    Wayve. Crossing the pond and beyond: Generalizable AI driving for global deployment. https://wayve.ai/thinking/multi-country-generalization/, 2025. Wayve blog post

  66. [66]

    Argoverse 2: Next generation datasets for self-driving perception and forecasting

    Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, et al. Argoverse 2: Next generation datasets for self-driving perception and forecasting. Advances in Neural Information Processing Systems (NeurIPS), 2021

  67. [67]

    Pandaset: Advanced sensor suite dataset for autonomous driving

    Pengchuan Xiao, Zhenlei Shao, Steven Hao, Zishuo Zhang, Xiaolin Chai, Judy Jiao, Zesong Li, Jian Wu, Kai Sun, Kun Jiang, et al. Pandaset: Advanced sensor suite dataset for autonomous driving. In Proc. IEEE Conf. on Intelligent Transportation Systems (ITSC). IEEE, 2021

  68. [68]

    Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios

    Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Ekaterina Tolstaya, Sarah Tang, Brandyn White, et al. Wod-e2e: Waymo open dataset for end-to-end driving in challenging long-tail scenarios. arXiv.org, 2025

  69. [69]

    Improving traffic signal data quality for the waymo open motion dataset

    Xintao Yan, Erdao Liang, Jiawei Wang, Haojie Zhu, and Henry X Liu. Improving traffic signal data quality for the waymo open motion dataset. Transportation Research Part C: Emerging Technologies, 183:105476, 2026

  70. [70]

    Qwen2.5 technical report

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, et al. Qwen2.5 technical report. arXiv.org, 2024

  71. [71]

    Viser: Imperative, web-based 3d visualization in python

    Brent Yi, Chung Min Kim, Justin Kerr, Gina Wu, Rebecca Feng, Anthony Zhang, Jonas Kulhanek, Hongsuk Choi, Yi Ma, Matthew Tancik, et al. Viser: Imperative, web-based 3d visualization in python. arXiv.org, 2025

  72. [72]

    Object detection with a unified label space from multiple datasets

    Xiangyun Zhao, Samuel Schulter, Gaurav Sharma, Yi-Hsuan Tsai, Manmohan Chandraker, and Ying Wu. Object detection with a unified label space from multiple datasets. In Proc. of the European Conf. on Computer Vision (ECCV), 2020

  73. [73]

    Simple multi-dataset detection

    Xingyi Zhou, Vladlen Koltun, and Philipp Krähenbühl. Simple multi-dataset detection. In Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2022