BEVal: A Cross-dataset Evaluation Study of BEV Segmentation Models for Autonomous Driving

Manuel Alejandro Diaz-Zapata (CHROMA); Wenqian Liu (CHROMA, UGA); Robin Baruffa (CHROMA); Christian Laugier (CHROMA)

arxiv: 2408.16322 · v4 · submitted 2024-08-29 · 💻 cs.CV · cs.RO

Pith reviewed 2026-05-23 21:23 UTC · model grok-4.3

classification 💻 cs.CV cs.RO

keywords: BEV segmentation · autonomous driving · cross-dataset evaluation · domain shift · multi-dataset training · generalization · semantic segmentation

The pith

BEV segmentation models generalize better when trained on multiple datasets rather than one.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper performs a cross-dataset evaluation of BEV segmentation models for autonomous driving to highlight issues with domain shift. It tests models trained on one dataset like nuScenes against others with different sensors and scenes. The study also shows that training on combined datasets boosts performance over single-dataset approaches. This matters for building reliable systems that work in varied real conditions instead of overfitting to specific data.

Core claim

State-of-the-art BEV segmentation models exhibit reduced performance under cross-dataset validation due to domain shift from varying environments and sensors, but multi-dataset training experiments demonstrate improved segmentation accuracy compared to single-dataset training.

What carries the argument

Cross-dataset training and testing protocols that vary datasets, sensor inputs such as cameras and LiDAR, and semantic categories to measure generalization.
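A cross-dataset protocol of this kind can be pictured as a train/test grid: one model per training dataset, scored on every test dataset. The sketch below is a minimal illustration under stated assumptions, not the authors' code. The names `cross_dataset_grid`, `generalization_gap`, `fake_train`, and `fake_eval` are hypothetical, and the scores are made-up placeholders rather than results from the paper.

```python
from itertools import product
from typing import Callable, Dict, Iterable, Tuple

Grid = Dict[Tuple[str, str], float]


def cross_dataset_grid(
    datasets: Iterable[str],
    train_fn: Callable[[str], object],
    eval_fn: Callable[[object, str], float],
) -> Grid:
    """Train one model per dataset, then score each model on every dataset."""
    names = list(datasets)
    models = {name: train_fn(name) for name in names}
    return {
        (train, test): eval_fn(models[train], test)
        for train, test in product(names, names)
    }


def generalization_gap(grid: Grid, train: str, test: str) -> float:
    """In-domain score minus cross-dataset score for the model trained on `train`."""
    return grid[(train, train)] - grid[(train, test)]


# Hypothetical stand-ins: placeholder scores, NOT numbers from the paper.
_FAKE_SCORES = {
    ("nuscenes", "nuscenes"): 0.75,
    ("nuscenes", "woven"): 0.5,
    ("woven", "nuscenes"): 0.25,
    ("woven", "woven"): 0.75,
}


def fake_train(name: str) -> str:
    # A real train_fn would return a trained network; here the name is enough.
    return name


def fake_eval(model: str, test: str) -> float:
    return _FAKE_SCORES[(model, test)]


grid = cross_dataset_grid(["nuscenes", "woven"], fake_train, fake_eval)
gap = generalization_gap(grid, "nuscenes", "woven")
print(gap)
```

The off-diagonal cells of the grid are the cross-dataset results. The gap between a diagonal cell and an off-diagonal cell in the same row is one simple way to express the drop attributed to domain shift.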

If this is right

Models show better results on test datasets when trained on data from multiple sources.
Performance differences appear based on whether models use camera, LiDAR, or both.
Some semantic classes are more affected by dataset changes than others.
Generalization to new setups becomes essential for practical use in autonomous driving.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Autonomous driving systems may require ongoing dataset updates to maintain performance in new areas.
Evaluation protocols for these models should routinely include tests on unseen datasets.
Combining datasets could reduce the need for frequent retraining in deployment scenarios.

Load-bearing premise

The datasets used represent the range of variations autonomous driving systems encounter in real deployment.

What would settle it

An experiment showing that models trained on multiple datasets do not outperform single-dataset models on cross-dataset tests.
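One way to make this check concrete is sketched below, as an illustration only. It assumes a dictionary of single-dataset scores keyed by (train, test) and a dictionary of multi-dataset scores keyed by test set. The function name `falsified_on` and all numbers are hypothetical. The comparison used here is a strict one: the multi-dataset model must beat the best single-dataset model on each test set, including the in-domain one, which is a choice of this sketch and not necessarily the paper's criterion.

```python
from typing import Dict, List, Tuple


def falsified_on(
    single: Dict[Tuple[str, str], float],
    multi: Dict[str, float],
) -> List[str]:
    """Return the test sets where the multi-dataset model does not beat
    the best single-dataset model evaluated on that same test set."""
    failures = []
    for test, multi_score in multi.items():
        best_single = max(
            score for (_, t), score in single.items() if t == test
        )
        if multi_score <= best_single:
            failures.append(test)
    return sorted(failures)


# Placeholder scores for two hypothetical datasets "a" and "b".
single_scores = {
    ("a", "a"): 0.75,
    ("a", "b"): 0.5,
    ("b", "a"): 0.25,
    ("b", "b"): 0.75,
}
multi_scores = {"a": 0.8, "b": 0.7}

failures = falsified_on(single_scores, multi_scores)
print(failures)
```

An empty list would be consistent with the paper's claim on the tested datasets; a non-empty list names the test sets where the claim would not hold under this criterion.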

Figures

Figures reproduced from arXiv: 2408.16322 by Manuel Alejandro Diaz-Zapata (CHROMA), Wenqian Liu (CHROMA, UGA), Robin Baruffa (CHROMA), Christian Laugier (CHROMA).

**Figure 2.** Point cloud sample illustration (top) and histogram of the number of points per sample (bottom) for (a) nuScenes,

**Figure 3.** Example of map annotations provided by (a) nuScenes

**Figure 4.** Drivable Area ground truth generation for the Woven

**Figure 5.** Qualitative BEV semantic segmentation results for LAPT-PP on nuScenes dataset.

**Figure 6.** Qualitative BEV semantic segmentation results for LAPT-PP on Woven Planet dataset.

Original abstract

Current research in semantic bird's-eye view segmentation for autonomous driving focuses solely on optimizing neural network models using a single dataset, typically nuScenes. This practice leads to the development of highly specialized models that may fail when faced with different environments or sensor setups, a problem known as domain shift. In this paper, we conduct a comprehensive cross-dataset evaluation of state-of-the-art BEV segmentation models to assess their performance across different training and testing datasets and setups, as well as different semantic categories. We investigate the influence of different sensors, such as cameras and LiDAR, on the models' ability to generalize to diverse conditions and scenarios. Additionally, we conduct multi-dataset training experiments that improve models' BEV segmentation performance compared to single-dataset training. Our work addresses the gap in evaluating BEV segmentation models under cross-dataset validation. And our findings underscore the importance of enhancing model generalizability and adaptability to ensure more robust and reliable BEV segmentation approaches for autonomous driving applications. The code for this paper is available at https://github.com/manueldiaz96/beval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows multi-dataset training lifts BEV segmentation cross-dataset scores over single-dataset baselines, with code released to check the numbers.

The main thing to know is that this work measures how much BEV segmentation models drop when tested on a different dataset than they were trained on, then shows that training on multiple datasets together reduces that drop for the models they tested. They also vary camera and LiDAR inputs and report per-category results. The protocol is described at a high level in the abstract and the code is public, so the empirical comparison can be reproduced or extended directly. That is the concrete addition: specific numbers on domain shift for this task rather than another single-dataset leaderboard entry. The sensor ablation is a practical detail that matters for real setups.

No derivation or fitted claim reduces to its own inputs, and the central result is a direct measurement, not a circular one. The datasets chosen are the usual suspects in the area, so the findings are at least comparable to existing work.

One soft spot is that the paper does not yet show whether the multi-dataset gains hold when the test distribution is farther from the training mix; the representativeness assumption is reasonable for the chosen sets but remains an assumption. Label alignment across datasets is not discussed in the abstract, which could affect how much credit the joint training gets. Overall the evidence is empirical and checkable rather than overstated.

This paper is for groups already running BEV segmentation experiments who want data on training strategies before committing to a single benchmark. It is not a foundational method paper but supplies useful negative results on specialization. I would send it to peer review because the experimental design is clear enough to referee and the code lowers the barrier to verification.

Referee Report

0 major / 2 minor

Summary. The paper conducts a comprehensive cross-dataset evaluation of state-of-the-art BEV segmentation models for autonomous driving. It assesses performance across multiple training and test datasets, varying sensor configurations (cameras and LiDAR), and semantic categories, while also reporting that multi-dataset training improves segmentation performance relative to single-dataset baselines. The work includes a public code release.

Significance. If the empirical results hold, the study is significant because it directly addresses the domain-shift problem in BEV segmentation, an area that has been dominated by single-dataset (primarily nuScenes) optimization. The demonstration that multi-dataset training yields measurable gains, together with the released code, supplies concrete, reproducible evidence that can guide the development of more robust models for real-world autonomous driving.

minor comments (2)

[Abstract] Abstract: the sentence beginning 'And our findings' is grammatically awkward; rephrasing to 'Our findings underscore...' would improve readability.
[§3 (Datasets and Evaluation Protocol)] The manuscript would benefit from an explicit statement of the label taxonomy alignment procedure used when merging semantic categories across datasets.
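To illustrate what an explicit label alignment statement might cover, the sketch below maps dataset-specific labels onto a shared taxonomy and drops labels with no shared counterpart. Everything in it is hypothetical: the dataset names `dataset_a` and `dataset_b`, the raw label names, the shared class list, and the helper names `to_shared` and `remap` are placeholders for illustration, not the paper's actual mapping.

```python
from typing import Dict, List, Optional

# Hypothetical shared taxonomy used for joint training and evaluation.
SHARED_CLASSES = ["drivable_area", "vehicle", "human"]

# Hypothetical per-dataset mappings from raw labels to shared classes.
LABEL_MAPS: Dict[str, Dict[str, str]] = {
    "dataset_a": {
        "road": "drivable_area",
        "car": "vehicle",
        "truck": "vehicle",
        "pedestrian": "human",
    },
    "dataset_b": {
        "lane_surface": "drivable_area",
        "automobile": "vehicle",
        "person": "human",
    },
}


def to_shared(dataset: str, raw_label: str) -> Optional[str]:
    """Map a raw label to its shared class, or None if it has no counterpart."""
    return LABEL_MAPS[dataset].get(raw_label)


def remap(dataset: str, raw_labels: List[str]) -> List[str]:
    """Remap a list of raw labels, silently dropping unmapped ones."""
    mapped = (to_shared(dataset, label) for label in raw_labels)
    return [m for m in mapped if m is not None]


remapped = remap("dataset_a", ["car", "truck", "traffic_cone"])
print(remapped)
```

Whether unmapped labels are dropped, merged into a background class, or handled some other way is exactly the kind of decision the referee comment asks the manuscript to state.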

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the thorough review and positive recommendation to accept the manuscript. The summary accurately captures the contributions of our cross-dataset evaluation of BEV segmentation models.

Circularity Check

0 steps flagged

No significant circularity identified

The paper reports empirical cross-dataset evaluations and multi-dataset training experiments on BEV segmentation models, with results compared directly to single-dataset baselines. No derivations, equations, fitted predictions, uniqueness theorems, or ansatzes are invoked; claims rest on experimental measurements and released code. The work is self-contained against external benchmarks with no load-bearing self-citation chains or self-definitional reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical evaluation study that relies on existing public datasets and standard deep-learning training practices without introducing new free parameters, axioms beyond domain norms, or invented entities.

axioms (1)

Domain assumption: Standard supervised training and evaluation protocols for semantic segmentation (cross-entropy loss, IoU metrics) apply without modification to BEV tasks.
The evaluation implicitly uses these conventions common to the computer-vision literature.
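For readers unfamiliar with the metric, intersection over union (IoU) on a binary BEV grid is the number of cells marked in both prediction and ground truth divided by the number of cells marked in either. The sketch below is a dependency-free illustration; the function name `binary_iou` and the toy grids are assumptions, and returning NaN for an empty union is one convention among several, not necessarily the one used in the paper's code.

```python
from typing import Sequence


def binary_iou(
    pred: Sequence[Sequence[int]],
    target: Sequence[Sequence[int]],
) -> float:
    """IoU between two binary grids of the same shape (1 = class present)."""
    intersection = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                intersection += 1
            if p or t:
                union += 1
    # Assumption: report NaN when neither grid contains the class.
    return intersection / union if union else float("nan")


# Toy 2x3 BEV grids: 2 cells overlap, 4 cells are marked in at least one grid.
pred_grid = [[1, 1, 0],
             [0, 1, 0]]
target_grid = [[1, 0, 0],
               [0, 1, 1]]

iou = binary_iou(pred_grid, target_grid)
print(iou)
```

In a multi-class setting this computation would typically be repeated per semantic category, which matches the per-category reporting described above.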


Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages

[1] B. Siciliano, O. Khatib, and T. Kröger, Springer handbook of robotics. Springer, 2008, vol. 200.
[2] A. Elfes, “Using occupancy grids for mobile robot perception and navigation,” Computer, vol. 22, no. 6, pp. 46–57, 1989.
[3] M. Diaz-Zapata, D. Sierra-Gonzalez, Ö. Erkent, C. Laugier, and J. Dibangoye, “Laptnet-fpn: Multi-scale lidar-aided projective transform network for real time semantic grid prediction,” in 2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023, pp. 712–718.
[4] J. Philion and S. Fidler, “Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d,” in European Conference on Computer Vision. Springer, 2020, pp. 194–210.
[5] N. Gosala and A. Valada, “Bird’s-eye-view panoptic segmentation using monocular frontal view images,” IEEE Robotics and Automation Letters, 2022.
[6] A. Saha, O. Mendez Maldonado, C. Russell, and R. Bowden, “Translating images into maps,” 2022 IEEE International Conference on Robotics and Automation (ICRA), 2022.
[7] J. Fei, K. Peng, P. Heidenreich, F. Bieder, and C. Stiller, “Pillarsegnet: Pillar-based semantic grid map estimation using sparse lidar data,” in 2021 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2021, pp. 838–844.
[8] A. W. Harley, Z. Fang, J. Li, R. Ambrus, and K. Fragkiadaki, “A simple baseline for bev perception without lidar,” arXiv preprint arXiv:2206.07959, 2022.
[9] G. Salazar-Gomez, D. S. González, M. A. Diaz-Zapata, A. Paigwar, W. Liu, Ö. Erkent, and C. Laugier, “Transfusegrid: Transformer-based lidar-rgb fusion for semantic grid prediction,” in ICARCV 2022-17th International Conference on Control, Automation, Robotics and Vision, 2022.
[10] J. Dequaire, P. Ondrúška, D. Rao, D. Wang, and I. Posner, “Deep tracking in the wild: End-to-end tracking using recurrent neural networks,” The International Journal of Robotics Research, p. 0278364917710543, 2017.
[11] S. Srivastava, F. Jurie, and G. Sharma, “Learning 2d to 3d lifting for object detection in 3d for autonomous vehicles,” in 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2019, pp. 4504–4511.
[12] T. Roddick and R. Cipolla, “Predicting semantic map representations from images using pyramid occupancy networks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11138–11147.
[13] H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 11621–11631.
[14] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nadhamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. Ondruska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao, L. Platinsky, W. Jiang, and V. Shet, “Woven planet perception dataset 2020,” https://woven.toyota/en/perception-dataset, 2019.
[15] A. Hu, Z. Murez, N. Mohan, S. Dudas, J. Hawke, V. Badrinarayanan, R. Cipolla, and A. Kendall, “Fiery: Future instance prediction in bird’s-eye view from surround monocular cameras,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 15273–15282.
[16] B. Zhou and P. Krähenbühl, “Cross-view transformers for real-time map-view semantic segmentation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13760–13769.
[17] Z. Li, W. Wang, H. Li, E. Xie, C. Sima, T. Lu, Q. Yu, and J. Dai, “Bevformer: Learning bird’s-eye-view representation from multi-camera images via spatiotemporal transformers,” arXiv preprint arXiv:2203.17270, 2022.
[18] T. Gilles, S. Sabatini, D. Tsishkou, B. Stanciulescu, and F. Moutarde, “Uncertainty estimation for cross-dataset performance in trajectory prediction,” CoRR, vol. abs/2205.07310, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2205.07310
[19] J. Gesnouin, S. Pechberti, B. Stanciulescu, and F. Moutarde, “Assessing cross-dataset generalization of pedestrian crossing predictors,” in 2022 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2022, pp. 419–426.
[20] L. Stäcker, P. Heidenreich, J. Rambach, and D. Stricker, “Cross-dataset experimental study of radar-camera fusion in bird’s-eye view,” in 2023 31st European Signal Processing Conference (EUSIPCO). IEEE, 2023, pp. 810–814.
[21] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009, pp. 248–255.
[22] F. Bartoccioni, E. Zablocki, A. Bursuc, P. Perez, M. Cord, and K. Alahari, “Lara: Latents and rays for multi-camera bird’s-eye-view semantic segmentation,” in 6th Annual Conference on Robot Learning, 2022. [Online]. Available: https://openreview.net/forum?id=abd_D-iVjk0
