
arxiv: 2408.16322 · v4 · submitted 2024-08-29 · 💻 cs.CV · cs.RO

BEVal: A Cross-dataset Evaluation Study of BEV Segmentation Models for Autonomous Driving

Pith reviewed 2026-05-23 21:23 UTC · model grok-4.3

keywords BEV segmentation, autonomous driving, cross-dataset evaluation, domain shift, multi-dataset training, generalization, semantic segmentation

The pith

BEV segmentation models generalize better when trained on multiple datasets rather than one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper performs a cross-dataset evaluation of BEV segmentation models for autonomous driving to expose the effects of domain shift. It tests models trained on one dataset, such as nuScenes, against other datasets with different sensors and scenes. The study also shows that training on combined datasets improves performance over single-dataset training. This matters for building reliable systems that work in varied real-world conditions instead of overfitting to one dataset.

Core claim

State-of-the-art BEV segmentation models exhibit reduced performance under cross-dataset validation due to domain shift from varying environments and sensors, but multi-dataset training experiments demonstrate improved segmentation accuracy compared to single-dataset training.

What carries the argument

Cross-dataset training and testing protocols that vary datasets, sensor inputs such as cameras and LiDAR, and semantic categories to measure generalization.
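
A protocol of this kind can be pictured as a train-by-test score matrix. The following is a minimal sketch under stated assumptions, not the authors' implementation: `cross_dataset_matrix`, `generalization_gap`, `toy_train`, and `toy_eval` are hypothetical names, and the scores returned by `toy_eval` are placeholders chosen for illustration, not results from the paper.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Tuple

Scores = Dict[Tuple[str, str], float]


def cross_dataset_matrix(
    datasets: List[str],
    train_fn: Callable[[str], Any],
    eval_fn: Callable[[Any, str], float],
) -> Scores:
    """Train one model per dataset, then score each model on every dataset."""
    models = {name: train_fn(name) for name in datasets}
    return {
        (train, test): eval_fn(models[train], test)
        for train, test in product(datasets, repeat=2)
    }


def generalization_gap(scores: Scores, train: str, test: str) -> float:
    """In-domain score minus cross-domain score for the model trained on `train`."""
    return scores[(train, train)] - scores[(train, test)]


# Hypothetical stand-ins: a real study would train and evaluate a BEV model here.
def toy_train(name: str) -> str:
    return name  # the "model" only remembers which dataset it was trained on


def toy_eval(model: str, test_name: str) -> float:
    # Placeholder scores (not from the paper): higher when train and test match.
    return 0.5 if model == test_name else 0.25


scores = cross_dataset_matrix(["nuScenes", "Woven Planet"], toy_train, toy_eval)
gap = generalization_gap(scores, "nuScenes", "Woven Planet")
```

The diagonal entries of the matrix are the usual in-domain numbers; the off-diagonal entries are the cross-dataset measurements the study is concerned with.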

If this is right

  • Models show better results on test datasets when trained on data from multiple sources.
  • Performance differences appear based on whether models use camera, LiDAR, or both.
  • Some semantic classes are more affected by dataset changes than others.
  • Generalization to new setups becomes essential for practical use in autonomous driving.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Autonomous driving systems may require ongoing dataset updates to maintain performance in new areas.
  • Evaluation protocols for these models should routinely include tests on unseen datasets.
  • Combining datasets could reduce the need for frequent retraining in deployment scenarios.

Load-bearing premise

The datasets used represent the range of variations autonomous driving systems encounter in real deployment.

What would settle it

An experiment showing that models trained on multiple datasets do not outperform single-dataset models on cross-dataset tests.

Figures

Figures reproduced from arXiv: 2408.16322 by Christian Laugier (CHROMA), Manuel Alejandro Diaz-Zapata (CHROMA), Robin Baruffa (CHROMA), Wenqian Liu (CHROMA, UGA).

Figure 1: Cross-dataset validation using the BEV semantic …
Figure 2: Point cloud sample illustration (top) and histogram of the number of points per sample (bottom) for (a) nuScenes, …
Figure 3: Example of map annotations provided by (a) nuScenes …
Figure 4: Drivable Area ground truth generation for the Woven …
Figure 5: Qualitative BEV semantic segmentation results for LAPT-PP on nuScenes dataset.
Figure 6: Qualitative BEV semantic segmentation results for LAPT-PP on Woven Planet dataset.

Original abstract

Current research in semantic bird's-eye view segmentation for autonomous driving focuses solely on optimizing neural network models using a single dataset, typically nuScenes. This practice leads to the development of highly specialized models that may fail when faced with different environments or sensor setups, a problem known as domain shift. In this paper, we conduct a comprehensive cross-dataset evaluation of state-of-the-art BEV segmentation models to assess their performance across different training and testing datasets and setups, as well as different semantic categories. We investigate the influence of different sensors, such as cameras and LiDAR, on the models' ability to generalize to diverse conditions and scenarios. Additionally, we conduct multi-dataset training experiments that improve models' BEV segmentation performance compared to single-dataset training. Our work addresses the gap in evaluating BEV segmentation models under cross-dataset validation. And our findings underscore the importance of enhancing model generalizability and adaptability to ensure more robust and reliable BEV segmentation approaches for autonomous driving applications. The code for this paper available at https://github.com/manueldiaz96/beval .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper conducts a comprehensive cross-dataset evaluation of state-of-the-art BEV segmentation models for autonomous driving. It assesses performance across multiple training and test datasets, varying sensor configurations (cameras and LiDAR), and semantic categories, while also reporting that multi-dataset training improves segmentation performance relative to single-dataset baselines. The work includes a public code release.

Significance. If the empirical results hold, the study is significant because it directly addresses the domain-shift problem in BEV segmentation, an area that has been dominated by single-dataset (primarily nuScenes) optimization. The demonstration that multi-dataset training yields measurable gains, together with the released code, supplies concrete, reproducible evidence that can guide the development of more robust models for real-world autonomous driving.

minor comments (2)
  1. [Abstract] The sentence beginning 'And our findings' is grammatically awkward; rephrasing to 'Our findings underscore...' would improve readability.
  2. [§3 (Datasets and Evaluation Protocol)] The manuscript would benefit from an explicit statement of the label taxonomy alignment procedure used when merging semantic categories across datasets.
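
One way the alignment requested in comment 2 could be made explicit is a per-dataset lookup table onto a shared taxonomy. The sketch below is illustrative only: the dataset keys, label names, shared classes, and the helper `to_shared_index` are all hypothetical and are not taken from the paper or its released code.

```python
from typing import Dict, List, Optional

# Hypothetical shared taxonomy; the paper's actual merged classes may differ.
SHARED_CLASSES: List[str] = ["vehicle", "human", "drivable_area"]

# Hypothetical dataset-specific label names mapped onto the shared taxonomy.
LABEL_MAP: Dict[str, Dict[str, str]] = {
    "dataset_a": {
        "car": "vehicle",
        "truck": "vehicle",
        "pedestrian": "human",
        "road": "drivable_area",
    },
    "dataset_b": {
        "automobile": "vehicle",
        "person": "human",
        "lane": "drivable_area",
    },
}


def to_shared_index(dataset: str, label: str) -> Optional[int]:
    """Return the shared class index for a dataset label, or None if unmapped."""
    shared = LABEL_MAP[dataset].get(label)
    return SHARED_CLASSES.index(shared) if shared is not None else None


truck_idx = to_shared_index("dataset_a", "truck")     # maps to "vehicle"
person_idx = to_shared_index("dataset_b", "person")   # maps to "human"
unmapped = to_shared_index("dataset_b", "animal")     # no shared class
```

Returning `None` for unmapped labels forces the caller to decide whether such labels are ignored or treated as background, which is the kind of decision the comment asks the manuscript to state.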

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the thorough review and positive recommendation to accept the manuscript. The summary accurately captures the contributions of our cross-dataset evaluation of BEV segmentation models.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper reports empirical cross-dataset evaluations and multi-dataset training experiments on BEV segmentation models, with results compared directly to single-dataset baselines. No derivations, equations, fitted predictions, uniqueness theorems, or ansatzes are invoked; claims rest on experimental measurements and released code. The work is self-contained against external benchmarks with no load-bearing self-citation chains or self-definitional reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical evaluation study that relies on existing public datasets and standard deep-learning training practices without introducing new free parameters, axioms beyond domain norms, or invented entities.

axioms (1)
  • domain assumption: Standard supervised training and evaluation protocols for semantic segmentation (cross-entropy loss, IoU metrics) apply without modification to BEV tasks.
    The evaluation implicitly uses these conventions common to the computer-vision literature.
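
For reference, the IoU convention named in this axiom can be sketched for a single class on a binary BEV grid. This is a minimal illustration assuming grids are given as nested lists of 0/1 cells; the function name `binary_iou` and the toy grids are hypothetical, and the paper's own evaluation code may handle empty classes or masks differently.

```python
from typing import List

Grid = List[List[int]]


def binary_iou(pred: Grid, target: Grid) -> float:
    """Intersection over union of two binary grids with the same shape."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            intersection += int(bool(p) and bool(t))
            union += int(bool(p) or bool(t))
    # Assumption: IoU is undefined (NaN) when the class is absent from both grids.
    return intersection / union if union else float("nan")


# Toy 2x3 grids, for illustration only.
pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = binary_iou(pred, target)  # 2 overlapping cells out of 4 occupied cells
```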


