arxiv: 2605.18074 · v1 · pith:62QDNEW6new · submitted 2026-05-18 · 💻 cs.RO

4DLidarOpen: An Open 4D FMCW Lidar Dataset for Motion-Aware Autonomous Driving

Pith reviewed 2026-05-20 10:02 UTC · model grok-4.3

classification 💻 cs.RO
keywords 4D FMCW Lidar · autonomous driving · radial velocity · motion forecasting · 3D object detection · BEV segmentation · urban dataset · multi-sensor fusion

The pith

Direct velocity measurements from 4D FMCW Lidar improve motion-related perception and planning over geometry alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an open multi-modal dataset collected in Beijing that pairs conventional geometric Lidar with a forward-facing 4D FMCW Lidar delivering point-wise radial velocity. Benchmarks on 3D detection, bird's-eye-view (BEV) segmentation, flow prediction, and motion forecasting show that models incorporating the velocity channel outperform geometric-only baselines, with the largest gains appearing around pedestrians and fast-moving vehicles. A hybrid auto-labeling plus human-refinement pipeline supplies large-scale 3D bounding boxes with persistent track IDs across five categories. The work positions the velocity channel as a practical addition that supplies explicit motion cues missing from time-of-flight sensors. Public release of the data and evaluation toolkit is intended to support further research on velocity-aware scene understanding and planning.

Core claim

The central claim is that point-wise radial velocity measurements supplied by 4D FMCW Lidar act as complementary motion cues that measurably improve dynamic-scene tasks when added to geometric sensing, with the improvement most evident for vulnerable road users and fast-moving objects in the Beijing urban recordings.

What carries the argument

The forward-facing 4D FMCW Lidar that records radial velocity at each point in addition to range and intensity.
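
To make the geometry concrete, the sketch below projects an object's 3D velocity onto the line of sight from the sensor, which is the quantity a radial-velocity channel reports. It is a minimal illustration, not the paper's processing code: the function name `radial_velocity`, the sensor-at-origin frame, and the sign convention (positive means moving away from the sensor) are assumptions made here, and ego-motion is ignored.

```python
import math


def radial_velocity(point, velocity):
    """Project a 3D velocity onto the sensor-to-point line of sight.

    Assumptions (not from the paper): the sensor sits at the origin,
    units are metres and metres per second, and positive values mean
    the point is moving away from the sensor.
    """
    x, y, z = point
    rng = math.sqrt(x * x + y * y + z * z)
    if rng == 0.0:
        raise ValueError("point coincides with the sensor origin")
    vx, vy, vz = velocity
    # Dot product of the velocity with the unit line-of-sight vector.
    return (x * vx + y * vy + z * vz) / rng


# Object 20 m straight ahead, closing at 5 m/s.
approaching = radial_velocity((20.0, 0.0, 0.0), (-5.0, 0.0, 0.0))
# Same position, but moving sideways across the line of sight.
crossing = radial_velocity((20.0, 0.0, 0.0), (0.0, 1.5, 0.0))
```

Motion purely across the line of sight yields zero radial velocity, so the channel complements multi-frame geometric cues rather than replacing them.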

If this is right

  • Velocity-aware models achieve higher precision on 3D detection of pedestrians and cyclists than geometry-only baselines.
  • Motion forecasting and planning modules trained with the velocity channel reduce error in congested traffic and unprotected turns.
  • The dataset's persistent track IDs across frames enable consistent evaluation of multi-frame flow and trajectory tasks.
  • Multi-Lidar fusion pipelines can incorporate the radial-velocity channel from the 4D sensor to improve surround coverage.
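
One way to picture the difference between velocity-aware and geometry-only inputs is as an extra per-point channel. The sketch below is hypothetical: the point layout (x, y, z, intensity), the function name `build_point_features`, and the validity flag used for Lidars that report no velocity are all assumptions for illustration, not the dataset's actual format.

```python
from typing import List, Optional, Sequence, Tuple

Point = Tuple[float, float, float, float]  # assumed layout: x, y, z, intensity


def build_point_features(
    points: Sequence[Point],
    radial_velocity: Optional[Sequence[float]] = None,
) -> List[List[float]]:
    """Return per-point features [x, y, z, intensity, v_r, has_v_r]."""
    if radial_velocity is not None and len(radial_velocity) != len(points):
        raise ValueError("need one radial velocity per point")
    features = []
    for i, (x, y, z, intensity) in enumerate(points):
        if radial_velocity is None:
            # Sensor without a velocity channel: zero-fill and flag as missing.
            features.append([x, y, z, intensity, 0.0, 0.0])
        else:
            features.append([x, y, z, intensity, radial_velocity[i], 1.0])
    return features


# Made-up points: one from a time-of-flight Lidar, one from the FMCW Lidar.
tof_points = [(10.0, 1.0, 0.2, 0.5)]
fmcw_points = [(12.0, -0.5, 0.1, 0.7)]
fused = build_point_features(tof_points) + build_point_features(fmcw_points, [-3.2])
```

Keeping a flag next to the zero-filled value lets a model distinguish "no measurement" from "measured as stationary".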

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same velocity channel could be tested for robustness gains in adverse weather or nighttime conditions not covered in the Beijing collection.
  • Persistent track annotations open the door to longer-horizon trajectory prediction models that build directly on the provided labels.
  • Combining the velocity measurements with camera semantics might further reduce false positives on vulnerable road users.

Load-bearing premise

The hybrid auto-labeling plus human refinement process produces sufficiently accurate 3D bounding-box annotations with consistent track IDs, and the chosen Beijing urban scenes are representative of the conditions where velocity cues provide the claimed gains.

What would settle it

A re-run of the motion-forecasting benchmark on a held-out set of fast-moving objects or pedestrians that shows no accuracy gain when velocity channels are added would falsify the complementary-cue claim.

Figures

Figures reproduced from arXiv: 2605.18074 by Diange Yang, Kai Sun, Kane Qian, Kaojin Zhu, Kun Jiang, Mengmeng Yang, Rujun Yan, Xin Zhao, Yining Shi, Zhengqing Pan.

Figure 1: 4D FMCW sample showing raw point cloud data with point-wise …
Figure 2: 4DLidarOpen sensor configuration: five Lidars and five surround …
Figure 3: 4DLidarOpen data processing pipeline, including raw data collection, sensor synchronization, automatic labeling, human verification, and final dataset …
Figure 4: 4DLidarOpen sample showing 4D FMCW Lidar data with velocity information: (a) raw point cloud with radial velocity coloring, (b) semantic …
Figure 5: 4DLidarOpen class and richness statistics. (a) Instance counts across five categories (Car, Van, Cyclist, Pedestrian, Traffic Cone) comparing auto-labeled …
Figure 6: 4DLidarOpen spatial and density statistics. (a) Object distance distribution showing peak concentration within 50 meters. (b) Distribution of 3D cuboid …
Figure 7: 4DLidarOpen speed statistics. (a) Category-wise box plots showing speed distributions for Car, Van, Cyclist, Pedestrian, and Traffic Cone. (b) Overall …
Figure 8: 4DLidarOpen campus ablation experiment. Top row: rolling cone scenario; bottom row: darting pedestrian scenario. (a)-(c) 4D FMCW Lidar results …
Figure 9: 4DLidarOpen Tianjin crossing test. Top row: pedestrian crossing scenario; bottom row: e-bike crossing scenario. (a) 4D FMCW Lidar + our model; …
Original abstract

We present 4DLidarOpen, a large-scale open multi-modal dataset for autonomous driving, centered on 4D frequency-modulated continuous-wave (FMCW) Lidar sensing. Unlike conventional time-of-flight Lidar datasets that mainly provide geometric measurements, 4DLidarOpen includes point-wise radial velocity measurements from a forward-facing 4D FMCW Lidar, together with multiple Lidars of different types, including rotating, solid-state, and blind-spot variants, surround-view cameras, and 6-DOF ego-vehicle poses. The dataset was collected in complex urban environments in Beijing and covers dense pedestrian interactions, congested traffic, high-speed driving, and unprotected maneuvers. 4DLidarOpen provides synchronized multi-sensor data and 3D bounding-box annotations with persistent track IDs across five object categories. A hybrid annotation strategy is adopted, where large-scale auto-labeled data support scalable training and human experts refine annotations for the human-annotated training and validation sets. Based on this dataset, we establish benchmarks for 3D object detection, birds-eye view (BEV) segmentation and flow prediction, and motion forecasting with planning. Extensive experiments show that direct velocity measurements from 4D FMCW Lidar provide complementary motion cues for dynamic-scene understanding. Compared with geometric-only sensing, the velocity-aware representation improves motion-related perception and downstream forecasting and planning, especially in scenarios involving vulnerable road users and fast-moving objects. These results indicate that 4D FMCW Lidar is a promising sensing modality for motion-aware autonomous driving. The dataset and evaluation toolkit are publicly released to support research on 4D scene understanding, multi-Lidar fusion, and velocity-aware perception and planning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces 4DLidarOpen, a large-scale open multi-modal dataset for autonomous driving centered on 4D FMCW Lidar that provides point-wise radial velocity measurements in addition to geometric data. Collected in complex Beijing urban scenes, it includes synchronized data from multiple Lidar types, surround-view cameras, and 6-DOF ego poses, along with 3D bounding-box annotations and persistent track IDs for five object categories generated via a hybrid auto-labeling and human-refinement pipeline. Benchmarks are established for 3D object detection, BEV segmentation and flow prediction, and motion forecasting with planning; experiments indicate that incorporating direct velocity measurements yields complementary motion cues that improve performance on motion-related tasks relative to geometric-only sensing, particularly for vulnerable road users and fast-moving objects.

Significance. If the central results hold after addressing validation gaps, the work is significant as the first public release of an open 4D FMCW Lidar dataset with native velocity data, directly supporting research on velocity-aware perception, multi-Lidar fusion, and motion forecasting in autonomous driving. The public release of the dataset and evaluation toolkit is a clear strength that promotes reproducibility and community follow-on work. The experiments provide initial evidence that velocity measurements offer benefits beyond geometry in dynamic scenes, which could influence sensor selection for future AV systems if the quantitative gains are robustly demonstrated.

major comments (2)
  1. §3.2 (Annotation Pipeline): The hybrid auto-labeling plus human-refinement process is described at a high level but reports no quantitative metrics on track-ID consistency across frames, label error rates for fast-moving or occluded objects, or inter-annotator agreement. This is load-bearing for the central claim because all motion-forecasting and planning benchmarks rely on accurate persistent track IDs; without these validation statistics, observed gains from adding radial velocity could be confounded by annotation noise or drift rather than true complementary sensing cues.
  2. §5 (Experiments and Benchmarks): The abstract and results sections claim that velocity-aware representations improve downstream forecasting and planning, yet provide no specific quantitative deltas, error bars, ablation tables, or statistical significance tests comparing velocity-inclusive versus geometric-only inputs. Concrete numbers from the relevant tables or figures are required to evaluate effect sizes and rule out post-hoc scenario selection effects.
minor comments (2)
  1. Abstract: The summary of experimental findings would be strengthened by including one or two key quantitative results (e.g., mAP or forecasting error reductions) rather than qualitative statements alone.
  2. §2 (Related Work): Explicit side-by-side comparison with prior datasets (KITTI, nuScenes, Waymo) regarding availability of native velocity channels would clarify the novelty of the 4D FMCW contribution.
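
The track-ID consistency statistic requested in major comment 1 could be computed along the following lines. This is a sketch under stated assumptions, not the authors' validation procedure: it presumes an audited subset in which each object has a trusted reference identity, and the function name `track_id_consistency` and the per-frame dictionary layout are hypothetical.

```python
from collections import defaultdict


def track_id_consistency(frames):
    """Summarise how stable annotated track IDs are across frames.

    `frames` is a list (one entry per frame, in time order) of dicts that
    map a trusted reference object id to the annotated track id. This
    layout is an assumption made for the sketch.
    """
    history = defaultdict(list)
    for frame in frames:
        for ref_id, track_id in frame.items():
            history[ref_id].append(track_id)
    if not history:
        raise ValueError("no annotated objects to evaluate")
    # An ID switch is any change of track id between consecutive sightings.
    switches = sum(
        1
        for ids in history.values()
        for prev, cur in zip(ids, ids[1:])
        if prev != cur
    )
    # An object is consistent if it keeps a single track id throughout.
    consistent = sum(1 for ids in history.values() if len(set(ids)) == 1)
    return {
        "id_switches": switches,
        "consistent_fraction": consistent / len(history),
    }


# Made-up example: object "b" changes track id once, object "a" never does.
audit = [{"a": 1, "b": 2}, {"a": 1, "b": 3}, {"a": 1, "b": 3}]
summary = track_id_consistency(audit)
```

Reporting both numbers separates a few objects with many switches from many objects with one switch each, which matter differently for forecasting labels.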

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the thoughtful comments, which have helped us identify areas for improvement in the manuscript. We provide point-by-point responses to the major comments below and indicate the revisions we plan to make.

Point-by-point responses
  1. Referee: §3.2 (Annotation Pipeline): The hybrid auto-labeling plus human-refinement process is described at a high level but reports no quantitative metrics on track-ID consistency across frames, label error rates for fast-moving or occluded objects, or inter-annotator agreement. This is load-bearing for the central claim because all motion-forecasting and planning benchmarks rely on accurate persistent track IDs; without these validation statistics, observed gains from adding radial velocity could be confounded by annotation noise or drift rather than true complementary sensing cues.

    Authors: We thank the referee for highlighting this important aspect. The annotation pipeline is indeed critical for the validity of the motion-related benchmarks. While the manuscript provides a high-level description, we acknowledge the lack of quantitative metrics. In the revised version, we will include a dedicated subsection in §3.2 reporting track-ID consistency across frames (e.g., percentage of tracks maintained over sequences), estimated label error rates for challenging cases like fast-moving and occluded objects, and inter-annotator agreement scores from the human refinement process. This will help confirm that the observed benefits from velocity measurements are not confounded by annotation issues.

    revision: yes

  2. Referee: §5 (Experiments and Benchmarks): The abstract and results sections claim that velocity-aware representations improve downstream forecasting and planning, yet provide no specific quantitative deltas, error bars, ablation tables, or statistical significance tests comparing velocity-inclusive versus geometric-only inputs. Concrete numbers from the relevant tables or figures are required to evaluate effect sizes and rule out post-hoc scenario selection effects.

    Authors: We agree that the presentation of the experimental results can be strengthened by providing more explicit quantitative comparisons. In the revised manuscript, we will update §5 to include specific numerical deltas between velocity-inclusive and geometric-only models, add error bars where appropriate, present detailed ablation tables, and report statistical significance tests for the observed improvements. This will allow readers to better assess the effect sizes and robustness of the findings.

    revision: yes
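
The deltas and error bars discussed in the second exchange could, for instance, come from a paired bootstrap over per-scenario errors. The sketch below is one generic way to do this and is not taken from the paper: the function name `paired_bootstrap_delta`, the choice of a percentile interval, and the example numbers are all illustrative assumptions, and the numbers are made up rather than reported results.

```python
import random
from statistics import mean


def paired_bootstrap_delta(baseline_err, velocity_err,
                           n_resamples=2000, alpha=0.05, seed=0):
    """Mean error reduction (baseline minus velocity-aware) with a percentile CI.

    Both lists must hold errors for the same scenarios in the same order,
    so that each difference compares the two models on identical data.
    """
    if len(baseline_err) != len(velocity_err) or not baseline_err:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [b - v for b, v in zip(baseline_err, velocity_err)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample scenarios with replacement and record the mean difference.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return mean(diffs), lo, hi


# Made-up per-scenario forecasting errors in metres, for illustration only.
baseline_ade = [1.2, 0.9, 1.5, 1.1, 1.3]
velocity_ade = [1.0, 0.8, 1.2, 1.0, 1.1]
delta, ci_low, ci_high = paired_bootstrap_delta(baseline_ade, velocity_ade)
```

An interval that excludes zero on a held-out split would support the complementary-cue claim; an interval that straddles zero would leave it unresolved at that sample size.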

Circularity Check

0 steps flagged

Dataset release with empirical benchmarks; no circular derivation chain

full rationale

The paper releases raw multi-modal sensor data, 3D bounding-box annotations with track IDs, and runs standard benchmarks for detection, segmentation, flow prediction, and motion forecasting on fixed splits. The central claim—that radial velocity measurements provide complementary cues—is supported by direct experimental comparisons of velocity-aware versus geometric-only inputs on the collected Beijing urban scenes. No equations, fitted parameters, or self-citations are used to define or force the reported improvements; the results are falsifiable against the public dataset and external validation. This is a standard empirical dataset contribution with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on standard assumptions about sensor synchronization and calibration accuracy plus the representativeness of the collected Beijing scenes; no new physical entities or free parameters are introduced beyond typical dataset annotation thresholds.

axioms (2)
  • domain assumption: Multi-sensor data streams are accurately time-synchronized and spatially calibrated.
    Required for all multi-modal fusion and velocity-aware benchmarks described in the abstract.
  • domain assumption: Hybrid auto-labeling followed by human refinement yields sufficiently accurate 3D bounding boxes and persistent track IDs.
    Central to the claim that the dataset supports reliable training and evaluation of motion-aware tasks.
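
The first assumption can be probed with a simple timestamp check. The sketch below measures the worst-case gap between each reference timestamp and the nearest timestamp from another sensor; the function name `max_sync_gap`, the use of seconds, and the sample timestamps are assumptions for illustration, since the dataset's actual timing format is not described here.

```python
from bisect import bisect_left


def max_sync_gap(reference_ts, other_ts):
    """Largest distance from any reference timestamp to its nearest neighbour.

    Timestamps are assumed to be floats in seconds on a shared clock.
    """
    if not reference_ts or not other_ts:
        raise ValueError("both timestamp lists must be non-empty")
    other = sorted(other_ts)
    worst = 0.0
    for t in reference_ts:
        i = bisect_left(other, t)
        # The nearest neighbour is either just before or just after index i.
        candidates = other[max(0, i - 1):i + 1]
        gap = min(abs(t - c) for c in candidates)
        worst = max(worst, gap)
    return worst


# Made-up timestamps: the last camera frame lags its Lidar sweep by 40 ms.
lidar_ts = [0.00, 0.10, 0.20]
camera_ts = [0.01, 0.11, 0.24]
worst_gap = max_sync_gap(lidar_ts, camera_ts)
```

A gap budget would have to be chosen per task, because a fixed time offset displaces fast-moving objects more than slow ones.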

pith-pipeline@v0.9.0 · 5875 in / 1486 out tokens · 40491 ms · 2026-05-20T10:02:54.476054+00:00 · methodology

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · 12 internal anchors

  1. [1]

    End-to-end autonomous driving: Challenges and frontiers,

    L. Chen, P. Wu, K. Chitta, B. Jaeger, A. Geiger, and H. Li, “End-to-end autonomous driving: Challenges and frontiers,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  2. [2]

    A survey on vision-language- action models for autonomous driving,

    S. Jiang, Z. Huang, K. Qian, Z. Luo, T. Zhu, Y . Zhong, Y . Tang, M. Kong, Y . Wang, S. Jiaoet al., “A survey on vision-language- action models for autonomous driving,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 4524–4536

  3. [3]

    Agentthink: A unified framework for tool-augmented chain-of-thought reasoning in vision-language models for autonomous driving,

    K. Qian, S. Jiang, Y . Zhong, Z. Luo, Z. Huang, T. Zhu, K. Jiang, M. Yang, Z. Fu, J. Miaoet al., “Agentthink: A unified framework for tool-augmented chain-of-thought reasoning in vision-language models for autonomous driving,”arXiv preprint arXiv:2505.15298, vol. 1, no. 2, p. 3, 2025

  4. [4]

    4d-are: Bridging the attribution gap in llm agent requirements engineering,

    B. Yu and L. Zhao, “4d-are: Bridging the attribution gap in llm agent requirements engineering,” 2026. [Online]. Available: https://arxiv.org/abs/2601.04556

  5. [5]

    Streamingflow: Streaming occupancy forecasting with asynchronous multi-modal data streams via neural ordinary differential equation,

    Y . Shi, K. Jiang, K. Wang, J. Li, Y . Wang, M. Yang, and D. Yang, “Streamingflow: Streaming occupancy forecasting with asynchronous multi-modal data streams via neural ordinary differential equation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 14 833–14 842

  6. [6]

    VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

    S. Chen, B. Jiang, H. Gao, B. Liao, Q. Xu, S. Zhang, C. Huang, C. Liu, and X. Wang, “Vadv2: End-to-end vectorized autonomous driving via probabilistic planning,”arXiv preprint arXiv:2402.13243, 2024

  7. [7]

    Driveworld: 4d pre-trained scene understanding via world models for autonomous driving,

    C. Min, D. Zhao, L. Xiao, J. Zhao, X. Xu, Z. Zhu, L. Jin, J. Li, Y . Guo, J. Xinget al., “Driveworld: 4d pre-trained scene understanding via world models for autonomous driving,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 15 522–15 533

  8. [8]

    World4drive: End-to-end autonomous driving via intention-aware physical latent world model,

    Y . Zheng, P. Yang, Z. Xing, Q. Zhang, Y . Zheng, Y . Gao, P. Li, T. Zhang, Z. Xia, P. Jiaet al., “World4drive: End-to-end autonomous driving via intention-aware physical latent world model,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 28 632–28 642

  9. [9]

    Vision meets robotics: The KITTI dataset,

    A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,”The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, Sep. 2013

  10. [10]

    nuscenes: A multi- modal dataset for autonomous driving,

    H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom, “nuscenes: A multi- modal dataset for autonomous driving,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020

  11. [11]

    Are we ready for autonomous driving? The KITTI vision benchmark suite,

    A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Jun. 2012

  12. [12]

    PointPillars: Fast encoders for object detection from point clouds,

    A. H. Lang, S. V ora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom, “PointPillars: Fast encoders for object detection from point clouds,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019

  13. [13]

    Center-based 3d object detection and tracking,

    T. Yin, X. Zhou, and P. Krahenbuhl, “Center-based 3d object detection and tracking,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2021. SUBMITTED TO IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT 14

  14. [14]

    Semantickitti: A dataset for semantic scene understanding of lidar sequences,

    J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, “Semantickitti: A dataset for semantic scene understanding of lidar sequences,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019

  15. [15]

    Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation.arXiv preprint arXiv:2205.13542, 2022

    Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. Rus, and S. Han, “Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation,”arXiv preprint arXiv:2205.13542, 2022

  16. [16]

    Learning lane graph representations for motion forecasting,

    M. Liang, B. Yang, R. Hu, Y . Chen, R. Liao, S. Feng, and R. Urtasun, “Learning lane graph representations for motion forecasting,” inECCV, 2020

  17. [17]

    Multi-head attention for multi-modal joint vehicle motion forecasting,

    J. Mercat, T. Gilles, N. El Zoghby, G. Sandou, D. Beauvois, and G. P. Gil, “Multi-head attention for multi-modal joint vehicle motion forecasting,” inICRA. IEEE, 2020

  18. [18]

    The Cityscapes Dataset for Semantic Urban Scene Understanding,

    M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benen- son, U. Franke, S. Roth, and B. Schiele, “The Cityscapes Dataset for Semantic Urban Scene Understanding,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3213–3223

  19. [19]

    TrafficPredict: Trajectory Prediction for Heterogeneous Traffic-Agents

    Y . Ma, X. Zhu, S. Zhang, R. Yang, W. Wang, and D. Manocha, “Trafficpredict: Trajectory prediction for heterogeneous traffic-agents,” CoRR, vol. abs/1811.02146, 2018

  20. [20]

    Scalability in Percep- tion for Autonomous Driving: Waymo Open Dataset,

    P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V . Patnaik, P. Tsui, J. Guo, Y . Zhou, Y . Chai, B. Caine, V . Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y . Zhang, J. Shlens, Z. Chen, and D. Anguelov, “Scalability in Percep- tion for Autonomous Driving: Waymo Open Dataset,” inProceedings of the IEEE/...

  21. [21]

    Coda: A real-world road corner case dataset for object detection in autonomous driving,

    K. Li, K. Chen, H. Wang, L. Hong, C. Ye, J. Han, Y . Chen, W. Zhang, C. Xu, D.-Y . Yeunget al., “Coda: A real-world road corner case dataset for object detection in autonomous driving,” inEuropean Conference on Computer Vision. Springer, 2022, pp. 406–423

  22. [22]

    Pandaset: Advanced sensor suite dataset for autonomous driving,

    P. Xiao, Z. Shao, S. Hao, Z. Zhang, X. Chai, J. Jiao, Z. Li, J. Wu, K. Sun, K. Jianget al., “Pandaset: Advanced sensor suite dataset for autonomous driving,” in2021 IEEE international intelligent transportation systems conference (ITSC). IEEE, 2021, pp. 3095–3101

  23. [23]

    Carla: An open urban driving simulator,

    A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V . Koltun, “Carla: An open urban driving simulator,” inConference on robot learning. PMLR, 2017, pp. 1–16

  24. [24]

    Lego-motion: Learning-enhanced grids with occupancy instance modeling for class-agnostic motion prediction,

    K. Qian, J. Miao, Z. Luo, Z. Fu, J. Li, Y . Shi, Y . Wang, K. Jiang, M. Yang, and D. Yang, “Lego-motion: Learning-enhanced grids with occupancy instance modeling for class-agnostic motion prediction,” in 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2025, pp. 14 178–14 185

  25. [25]

    A survey of motion planning and control techniques for self-driving urban vehicles,

    B. Paden, M. ˇC´ap, S. Z. Yong, D. Yershov, and E. Frazzoli, “A survey of motion planning and control techniques for self-driving urban vehicles,” IEEE Transactions on intelligent vehicles, vol. 1, no. 1, pp. 33–55, 2016

  26. [26]

    End to End Learning for Self-Driving Cars

    M. Bojarski, D. Del Testa, D. Dworakowski, B. Firner, B. Flepp, P. Goyal, L. D. Jackel, M. Monfort, U. Muller, J. Zhanget al., “End to end learning for self-driving cars,”arXiv preprint arXiv:1604.07316, 2016

  27. [27]

    End-to-end driving via conditional imitation learning,

    F. Codevilla, M. M ¨uller, A. L ´opez, V . Koltun, and A. Dosovitskiy, “End-to-end driving via conditional imitation learning,” in2018 IEEE international conference on robotics and automation (ICRA). IEEE, 2018, pp. 4693–4700

  28. [28]

    End-to-end learning of driving models from large-scale video datasets,

    H. Xu, Y . Gao, F. Yu, and T. Darrell, “End-to-end learning of driving models from large-scale video datasets,” inProceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2174– 2182

  29. [29]

    Learning to drive in a day,

    A. Kendall, J. Hawke, D. Janz, P. Mazur, D. Reda, J.-M. Allen, V .-D. Lam, A. Bewley, and A. Shah, “Learning to drive in a day,” in2019 international conference on robotics and automation (ICRA). IEEE, 2019, pp. 8248–8254

  30. [30]

    Proximal Policy Optimization Algorithms

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Prox- imal policy optimization algorithms,”arXiv preprint arXiv:1707.06347, 2017

  31. [31]

    Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation,

    Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. L. Rus, and S. Han, “Bevfusion: Multi-task multi-sensor fusion with unified bird’s-eye view representation,” in2023 IEEE international conference on robotics and automation (ICRA). IEEE, 2023, pp. 2774–2781

  32. [32]

    Multi-modal fusion transformer for end-to-end autonomous driving,

    A. Prakash, K. Chitta, and A. Geiger, “Multi-modal fusion transformer for end-to-end autonomous driving,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 7077–7087

  33. [33]

    Transfuser: Imitation with transformer-based sensor fusion for au- tonomous driving,

    K. Chitta, A. Prakash, B. Jaeger, Z. Yu, K. Renz, and A. Geiger, “Transfuser: Imitation with transformer-based sensor fusion for au- tonomous driving,”IEEE transactions on pattern analysis and machine intelligence, vol. 45, no. 11, pp. 12 878–12 895, 2022

  34. [34]

    Vad: Vectorized scene representation for efficient autonomous driving,

    B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, S. Zhang, C. Liu, C. Huang, X. Wanget al., “Vad: Vectorized scene representation for efficient autonomous driving,” inProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 8340– 8350

  35. [35]

    An lstm network for highway trajectory prediction,

    F. Altch ´e and A. de La Fortelle, “An lstm network for highway trajectory prediction,” in2017 IEEE 20th international conference on intelligent transportation systems (ITSC). IEEE, 2017, pp. 353–359

  36. [36]

    Multi-task learning with deep neural networks: A survey,

    M. Crawshaw, “Multi-task learning with deep neural networks: A survey,”arXiv preprint arXiv:2009.09796, 2020

  37. [37]

    Multi-task learning with attention for end-to-end autonomous driving,

    K. Ishihara, A. Kanervisto, J. Miura, and V . Hautamaki, “Multi-task learning with attention for end-to-end autonomous driving,” inPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 2902–2911

  38. [38]

    Planning-oriented autonomous driving,

    Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wanget al., “Planning-oriented autonomous driving,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2023, pp. 17 853–17 862

  39. [39]

    Vavim and vavam: Autonomous driving through video generative modeling

    F. Bartoccioni, E. Ramzi, V . Besnier, S. Venkataramanan, T.-H. Vu, Y . Xu, L. Chambon, S. Gidaris, S. Odabas, D. Hurychet al., “Vavim and vavam: Autonomous driving through video generative modeling,” arXiv preprint arXiv:2502.15672, 2025

  40. [40]

    Scalable diffusion models with transformers,

    W. Peebles and S. Xie, “Scalable diffusion models with transformers,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4195–4205

  41. [41]

    Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving,

    B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y . Zhang, Q. Zhang, and X. Wang, “Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving,”arXiv preprint arXiv:2411.15139, 2024

  42. [42]

    Hidden biases of end-to-end driving models,

    B. Jaeger, K. Chitta, and A. Geiger, “Hidden biases of end-to-end driving models,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 8240–8249

  43. [43]

    End-to-end interpretable neural motion planner,

    W. Zeng, W. Luo, S. Suo, A. Sadat, B. Yang, S. Casas, and R. Urtasun, “End-to-end interpretable neural motion planner,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8660–8669

  44. [44]

    Perceive, predict, and plan: Safe motion planning through interpretable semantic representations,

    A. Sadat, S. Casas, M. Ren, X. Wu, P. Dhawan, and R. Urtasun, “Perceive, predict, and plan: Safe motion planning through interpretable semantic representations,” inComputer Vision–ECCV 2020: 16th Euro- pean Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16. Springer, 2020, pp. 414–430

  45. [45]

    Direct preference optimization: Your language model is secretly a reward model,

    R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,”Advances in Neural Information Processing Systems, vol. 36, pp. 53 728–53 741, 2023

  46. [46]

    LLaMA: Open and Efficient Foundation Language Models

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “Llama: Open and efficient foundation language models,”arXiv preprint arXiv:2302.13971, 2023

  47. [47]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y . Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosaleet al., “Llama 2: Open foundation and fine-tuned chat models,”arXiv preprint arXiv:2307.09288, 2023

  48. [48]

    Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763

  49. [49]

    GPT-Driver: Learning to Drive with GPT

    J. Mao, Y . Qian, J. Ye, H. Zhao, and Y . Wang, “Gpt-driver: Learning to drive with gpt,”arXiv preprint arXiv:2310.01415, 2023

  50. [50]

    Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving

    W. Wang, J. Xie, C. Hu, H. Zou, J. Fan, W. Tong, Y. Wen, S. Wu, H. Deng, Z. Li et al., “Drivemlm: Aligning multi-modal large language models with behavioral planning states for autonomous driving,” arXiv preprint arXiv:2312.09245, 2023

  51. [51]

    Lmdrive: Closed-loop end-to-end driving with large language models,

    H. Shao, Y. Hu, L. Wang, G. Song, S. L. Waslander, Y. Liu, and H. Li, “Lmdrive: Closed-loop end-to-end driving with large language models,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15 120–15 130

  52. [52]

    A language agent for autonomous driving,

    J. Mao, J. Ye, Y. Qian, M. Pavone, and Y. Wang, “A language agent for autonomous driving,” arXiv preprint arXiv:2311.10813, 2023

  53. [53]

    Driveagent: Multi-agent structured reasoning with llm and multimodal sensor fusion for autonomous driving,

    X. Hou, W. Wang, L. Yang, H. Lin, J. Feng, H. Min, and X. Zhao, “Driveagent: Multi-agent structured reasoning with llm and multimodal sensor fusion for autonomous driving,” arXiv preprint arXiv:2505.02123, 2025

  54. [54]

    Dilu: A knowledge-driven approach to autonomous driving with large language models

    L. Wen, D. Fu, X. Li, X. Cai, T. Ma, P. Cai, M. Dou, B. Shi, L. He, and Y. Qiao, “Dilu: A knowledge-driven approach to autonomous driving with large language models,” arXiv preprint arXiv:2309.16292, 2023

  55. [55]

    Koma: Knowledge-driven multi-agent framework for autonomous driving with large language models,

    K. Jiang, X. Cai, Z. Cui, A. Li, Y. Ren, H. Yu, H. Yang, D. Fu, L. Wen, and P. Cai, “Koma: Knowledge-driven multi-agent framework for autonomous driving with large language models,” IEEE Transactions on Intelligent Vehicles, 2024

  56. [56]

    Instruct large language models to drive like humans,

    R. Zhang, X. Guo, W. Zheng, C. Zhang, K. Keutzer, and L. Chen, “Instruct large language models to drive like humans,” arXiv preprint arXiv:2406.07296, 2024

  57. [57]

    Drivelmm-o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding,

    A. Ishaq, J. Lahoud, K. More, O. Thawakar, R. Thawkar, D. Dissanayake, N. Ahsan, Y. Li, F. S. Khan, H. Cholakkal et al., “Drivelmm-o1: A step-by-step reasoning dataset and large multimodal model for driving scenario understanding,” arXiv preprint arXiv:2503.10621, 2025

  58. [58]

    Drivemm: All-in-one large multimodal model for autonomous driving,

    Z. Huang, C. Feng, F. Yan, B. Xiao, Z. Jie, Y. Zhong, X. Liang, and L. Ma, “Drivemm: All-in-one large multimodal model for autonomous driving,” arXiv preprint arXiv:2412.07689, 2024

  59. [59]

    Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving

    B. Jiang, S. Chen, B. Liao, X. Zhang, W. Yin, Q. Zhang, C. Huang, W. Liu, and X. Wang, “Senna: Bridging large vision-language models and end-to-end autonomous driving,” arXiv preprint arXiv:2410.22313, 2024

  60. [60]

    Otter: A vision-language-action model with text-aware visual feature extraction

    H. Huang, F. Liu, L. Fu, T. Wu, M. Mukadam, J. Malik, K. Goldberg, and P. Abbeel, “Otter: A vision-language-action model with text-aware visual feature extraction,” 2025. [Online]. Available: https://arxiv.org/abs/2503.03734

  61. [61]

    AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

    Z. Zhou, T. Cai, S. Z. Zhao, Y. Zhang, Z. Huang, B. Zhou, and J. Ma, “Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning,” 2025. [Online]. Available: https://arxiv.org/abs/2506.13757

  62. [62]

    Vision-language-action models: Concepts, progress, applications and challenges

    R. Sapkota, Y. Cao, K. I. Roumeliotis, and M. Karkee, “Vision-language-action models: Concepts, progress, applications and challenges,” arXiv preprint arXiv:2505.04769, 2025

  63. [63]

    A Survey on Vision-Language-Action Models for Embodied AI

    Y. Ma, Z. Song, Y. Zhuang, J. Hao, and I. King, “A survey on vision-language-action models for embodied ai,” arXiv preprint arXiv:2405.14093, 2024

  64. [64]

    ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation

    H. Fu, D. Zhang, Z. Zhao, J. Cui, D. Liang, C. Zhang, D. Zhang, H. Xie, B. Wang, and X. Bai, “Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation,” arXiv preprint arXiv:2503.19755, 2025

  65. [65]

    Covla: Comprehensive vision-language-action dataset for autonomous driving,

    H. Arai, K. Miwa, K. Sasaki, K. Watanabe, Y. Yamaguchi, S. Aoki, and I. Yamamoto, “Covla: Comprehensive vision-language-action dataset for autonomous driving,” in 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). IEEE, 2025, pp. 1933–1943

  66. [66]

    Ppad: Iterative interactions of prediction and planning for end-to-end autonomous driving,

    Z. Chen, M. Ye, S. Xu, T. Cao, and Q. Chen, “Ppad: Iterative interactions of prediction and planning for end-to-end autonomous driving,” in European Conference on Computer Vision. Springer, 2024, pp. 239–256

  67. [67]

    Para-drive: Parallelized architecture for real-time autonomous driving,

    X. Weng, B. Ivanovic, Y. Wang, Y. Wang, and M. Pavone, “Para-drive: Parallelized architecture for real-time autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15 449–15 458

  68. [68]

    Chatmpc: Natural language based mpc personalization,

    Y. Miyaoka, M. Inoue, and T. Nii, “Chatmpc: Natural language based mpc personalization,” in 2024 American Control Conference (ACC). IEEE, 2024, pp. 3598–3603

  69. [69]

    Vlm-mpc: Vision language foundation model (vlm)-guided model predictive controller (mpc) for autonomous driving,

    K. Long, H. Shi, J. Liu, and X. Li, “Vlm-mpc: Vision language foundation model (vlm)-guided model predictive controller (mpc) for autonomous driving,” arXiv preprint arXiv:2408.04821, 2024

  70. [70]

    Data scaling laws for end-to-end autonomous driving,

    A. Naumann, X. Gu, T. Dimlioglu, M. Bojarski, A. Degirmenci, A. Popov, D. Bisla, M. Pavone, U. Müller, and B. Ivanovic, “Data scaling laws for end-to-end autonomous driving,” arXiv preprint arXiv:2504.04338, 2025

  71. [71]

    Data scaling laws for imitation learning-based end-to-end autonomous driving

    Y. Zheng, Z. Xia, Q. Zhang, T. Zhang, B. Lu, X. Huo, C. Han, Y. Li, M. Yu, B. Jin et al., “Preliminary investigation into data scaling laws for imitation learning-based end-to-end autonomous driving,” arXiv preprint arXiv:2412.02689, 2024

  72. [72]

    nuscenes: A multimodal dataset for autonomous driving,

    H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 11 621–11 631

  73. [73]

    Scalability in perception for autonomous driving: Waymo open dataset,

    P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V. Patnaik, P. Tsui, J. Guo, Y. Zhou, Y. Chai, B. Caine et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 2446–2454

  74. [74]

    Argoverse: 3d tracking and forecasting with rich maps,

    M.-F. Chang, J. Lambert, P. Sangkloy, J. Singh, S. Bak, A. Hartnett, D. Wang, P. Carr, S. Lucey, D. Ramanan et al., “Argoverse: 3d tracking and forecasting with rich maps,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8748–8757

  75. [75]

    Argoverse 2: Next Generation Datasets for Self-Driving Perception and Forecasting

    B. Wilson, W. Qi, T. Agarwal, J. Lambert, J. Singh, S. Khandelwal, B. Pan, R. Kumar, A. Hartnett, J. K. Pontes et al., “Argoverse 2: Next generation datasets for self-driving perception and forecasting,” arXiv preprint arXiv:2301.00493, 2023

  76. [76]

    Lidar for autonomous driving: The principles, challenges, and trends for automotive lidar and perception systems,

    Y. Li and J. Ibanez-Guzman, “Lidar for autonomous driving: The principles, challenges, and trends for automotive lidar and perception systems,” IEEE Signal Processing Magazine, vol. 37, no. 4, pp. 50–61, 2020

  77. [77]

    Helipr: Heterogeneous lidar dataset for inter-lidar place recognition under spatiotemporal variations,

    M. Jung, W. Yang, D. Lee, H. Gil, G. Kim, and A. Kim, “Helipr: Heterogeneous lidar dataset for inter-lidar place recognition under spatiotemporal variations,” The International Journal of Robotics Research, vol. 43, no. 12, pp. 1867–1883, 2024

  78. [78]

    DICP: Doppler Iterative Closest Point Algorithm,

    B. Hexsel, H. Vhavle, and Y. Chen, “DICP: Doppler Iterative Closest Point Algorithm,” in Proceedings of Robotics: Science and Systems, New York City, NY, USA, June 2022

  79. [79]

    Tracking 3d moving objects as centroids using fmcw lidar,

    Y. Zeng, Y. Yu, S. Qi, and T. Wu, “Tracking 3d moving objects as centroids using fmcw lidar,” in Proceedings of 4th 2024 International Conference on Autonomous Unmanned Systems (4th ICAUS 2024), L. Liu, Y. Niu, W. Fu, and Y. Qu, Eds. Singapore: Springer Nature Singapore, 2025, pp. 536–545

  80. [80]

    Towards fast correspondence-free odometry using multiple fmcw lidars,

    D. J. Yoon, Y. Chen, H. Vhavle, J. Reuther, and T. D. Barfoot, “Towards fast correspondence-free odometry using multiple fmcw lidars,” IEEE Robotics and Automation Letters, vol. 10, no. 9, pp. 9088–9095, 2025
