Camera and LiDAR BEV Fusion for Cooperative 3D Object Detection on TUMTraf V2X

Muhammad Shahbaz; Shaurya Agarwal

arxiv: 2606.12981 · v1 · pith:UFP6WPMCnew · submitted 2026-06-11 · 💻 cs.CV

Camera and LiDAR BEV Fusion for Cooperative 3D Object Detection on TUMTraf V2X

Muhammad Shahbaz , Shaurya Agarwal This is my paper

Pith reviewed 2026-06-27 07:27 UTC · model grok-4.3

classification 💻 cs.CV

keywords cooperative perception3D object detectionBEV fusioncamera-LiDAR fusionV2XTUMTrafdata leakage

0 comments

The pith

Camera and LiDAR BEV fusion for cooperative 3D detection reaches 0.85 mAP on TUMTraf V2X test split.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a detector that fuses three roadside cameras with a combined infrastructure and vehicle point cloud in a shared bird's-eye-view representation. Predictions are made with a CenterPoint-style head using generalized IoU loss and an additional IoU quality re-ranking head. On the public Codabench test split the system scores 0.85 3D mAP after training on the released train and validation data. The authors also identify that 44 of the 50 test frames appear in the train or validation splits and quantify the effect of this overlap through targeted experiments.

Core claim

A BEV fusion model combining roadside cameras and infrastructure-plus-vehicle LiDAR achieves 0.85 3D mAP on the public test split; oversampling the 44 overlapping frames raises the score to 0.89 while replacing predictions on those frames with ground truth raises it to 0.99.

What carries the argument

Bird's-eye-view fusion of multi-view camera features and a merged point cloud, followed by a CenterPoint detection head with generalized IoU regression and IoU-based re-ranking.

If this is right

The fusion approach can be applied to roadside cooperative perception tasks.
Higher scores are obtained by including more overlapping frames in training.
Near-perfect scores result from using ground truth on overlapping test frames.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The 0.85 mAP likely overestimates generalization performance due to data leakage.
V2X benchmarks require stricter train-test separation to avoid such overlaps.
Methods should be evaluated on non-overlapping test subsets for fair comparison.

Load-bearing premise

The public test split contains frames that do not appear in the train or validation splits.

What would settle it

Running the model on a test set consisting only of the 6 non-overlapping frames would reveal the performance without leakage.

Figures

Figures reproduced from arXiv: 2606.12981 by Muhammad Shahbaz, Shaurya Agarwal.

**Figure 2.** Figure 2: Qualitative bevfusion detections on four representative test sequences. Each panel shows the three infrastructure cameras and the vehicle ego camera (top) alongside the fused-LiDAR bird’s-eye view with the predicted 3D boxes (bottom), coloured by class. Top row: a busy intersection (31 detected objects) and an oncoming car-carrier trailer. Bottom row: a low-light dusk scene, where the camera branch helps r… view at source ↗

**Figure 3.** Figure 3: Cooperative perception sees through occlusion (ego camera view). [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Per-class AP on the leak44 frames for the three configurations. Legend pairs are (Codabench [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

read the original abstract

We describe a Camera and LiDAR fusion detector developed for the TUMTraf V2X cooperative 3D object detection track of the DriveX 2026 challenge. The detector fuses three roadside cameras with a fused infrastructure-plus-vehicle point cloud in a shared bird's-eye-view space and predicts boxes through a CenterPoint-style head with a generalized IoU regression loss and an IoU quality re-ranking head. Trained on the provided train and validation splits, the model reaches a 3D mAP of 0.85 on the public Codabench test split. While iterating on the system, we observed that 44 of the 50 test frames are also present in the released train (40) and validation (4) splits with their labels. We therefore conducted two additional studies to quantify how this overlap affects the final score: (1) a finetuning run that oversamples the 44 overlapping frames, reaching 0.89 mAP, and (2) a post-processing run that replaces predictions on those frames with the released ground truth, reaching 0.99 mAP (uploaded to our Codabench account for testing but not published on the leaderboard). All three configurations and their per-class results are reported.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper flags significant train-test overlap in the TUMTraf V2X benchmark and quantifies its impact on reported scores.

read the letter

Hey,

The key point on this paper is that it identifies and measures train-test frame overlap in the TUMTraf V2X cooperative detection benchmark, showing that 44 out of 50 test frames appear in the released train and validation sets. They back this up with two experiments: one oversampling the overlapping frames to reach 0.89 mAP, and another substituting ground truth to hit 0.99 mAP. This directly challenges the assumption that the public test split measures true generalization.

The detector itself is a BEV fusion of three roadside cameras and a combined infrastructure-vehicle point cloud, using a CenterPoint-style head with generalized IoU loss and an IoU quality re-ranker. Trained on the given splits, it scores 0.85 mAP on the Codabench test. The architecture follows common practices in the field, so the novelty sits in the overlap analysis rather than the model design.

They do a good job by being transparent about the leakage instead of presenting the 0.85 as clean. The controlled experiments give a sense of how much the overlap inflates the score. That said, the method section in the abstract is thin on details like exact fusion implementation, loss formulations, or training settings, which makes it hard to replicate without the full text. If the full paper has those, it strengthens the work; otherwise the contribution stays mostly diagnostic.

No major circularity or invented results here—the numbers come from running on the public split and the additional tests. The citation pattern seems focused on the benchmark and standard detectors.

This is for researchers building or evaluating on V2X datasets, especially those running challenges or checking leaderboard reliability. Someone working on data quality in perception benchmarks would find the overlap count and experiments useful. It is worth a serious referee because the evidence on the leakage is empirical and falsifiable, even if the fusion model is incremental.

I would send it for peer review. The honest handling of the data issue makes it a solid note on benchmark practices.

Referee Report

0 major / 2 minor

Summary. The manuscript describes a Camera and LiDAR BEV fusion detector for the TUMTraf V2X cooperative 3D object detection track. It fuses three roadside cameras with a combined infrastructure-plus-vehicle point cloud in shared BEV space and predicts boxes via a CenterPoint-style head using generalized IoU regression loss and an IoU quality re-ranking head. Trained on the released train and validation splits, the model reports 0.85 3D mAP on the public Codabench test split. The authors disclose that 44 of 50 test frames overlap with the train (40) and validation (4) splits and quantify the effect via an oversampling experiment (0.89 mAP) and a ground-truth substitution experiment (0.99 mAP).

Significance. If the empirical results hold, the work supplies a practical baseline for multi-modal cooperative perception and, more importantly, provides transparent quantification of test-set overlap effects through controlled experiments. This disclosure and the accompanying ablation-style runs strengthen the credibility of the reported numbers and offer useful guidance for interpreting leaderboard scores in V2X challenges.

minor comments (2)

[Abstract] Abstract: the description of the architecture, loss, and head is high-level only; no equations, network dimensions, training hyperparameters, or implementation details are supplied, which prevents independent verification or reproduction of the 0.85 mAP figure.
The manuscript does not indicate whether the per-class mAP breakdowns or the two controlled experiments are presented in tables or figures; adding such structured results would improve clarity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript, the accurate summary of our contributions including the test-set overlap disclosure, and the recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The manuscript reports empirical 3D mAP results obtained by training a BEV fusion detector on the provided train/validation splits and evaluating on the public Codabench test split. It explicitly discloses the 44-frame overlap between test and train/validation data, then quantifies the effect through two controlled experiments (oversampling overlap and GT substitution). No derivation chain, fitted parameter presented as a prediction, self-citation load-bearing premise, or ansatz is present; the reported numbers are direct experimental outcomes rather than quantities defined in terms of the model's own outputs or prior self-referential results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; the paper describes an applied detector without mathematical derivations, free parameters, or new postulated entities.

pith-pipeline@v0.9.1-grok · 5756 in / 1029 out tokens · 16159 ms · 2026-06-27T07:27:05.891188+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 2 linked inside Pith

[1]

Zimmer, G

W. Zimmer, G. A. Wardana, S. Sritharan, X. Zhou, R. Song, and A. C. Knoll,TUMTraf V2X Cooperative Perception Dataset, CVPR 2024

2024
[2]

Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. Rus, and S. Han,BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation, ICRA 2023

2023
[3]

Philion and S

J. Philion and S. Fidler,Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D, ECCV 2020

2020
[4]

T. Yin, X. Zhou, and P. Kr¨ ahenb¨ uhl,Center-based 3D Object Detection and Tracking, CVPR 2021

2021
[5]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,ImageNet: A Large-Scale Hierarchical Image Database, CVPR 2009

2009
[6]

Wightman,PyTorch Image Models (timm), GitHub repository, 2019

R. Wightman,PyTorch Image Models (timm), GitHub repository, 2019. https://github.com/huggingface/ pytorch-image-models

2019
[7]

Zhou and O

Y. Zhou and O. Tuzel,VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection, CVPR 2018

2018
[8]

A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom,PointPillars: Fast Encoders for Object Detection from Point Clouds, CVPR 2019

2019
[9]

C. R. Qi, H. Su, K. Mo, and L. J. Guibas,PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, CVPR 2017

2017
[10]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin,Attention Is All You Need, NeurIPS 2017

2017
[11]

Y. Chen, J. Liu, X. Zhang, X. Qi, and J. Jia,LargeKernel3D: Scaling Up Kernels in 3D Sparse CNNs, CVPR 2023

2023
[12]

T.-Y. Lin, P. Doll´ ar, R. Girshick, K. He, B. Hariharan, and S. Belongie,Feature Pyramid Networks for Object Detection, CVPR 2017

2017
[13]

Tan and Q

M. Tan and Q. V. Le,EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, ICML 2019

2019
[14]

J. Hu, L. Shen, and G. Sun,Squeeze-and-Excitation Networks, CVPR 2018

2018
[15]

T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Doll´ ar,Focal Loss for Dense Object Detection, ICCV 2017

2017
[16]

X. Zhou, D. Wang, and P. Kr¨ ahenb¨ uhl,Objects as Points, arXiv:1904.07850, 2019

Pith/arXiv arXiv 1904
[17]

Rezatofighi, N

H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese,Generalized Intersection over Union: A Metric and a Loss for Bounding Box Regression, CVPR 2019

2019
[18]

Zheng, W

W. Zheng, W. Tang, S. Chen, L. Jiang, and C.-W. Fu,CIA-SSD: Confident IoU-Aware Single-Stage Object Detector from Point Cloud, AAAI 2021

2021
[19]

Loshchilov and F

I. Loshchilov and F. Hutter,Decoupled Weight Decay Regularization, ICLR 2019

2019
[20]

L. N. Smith and N. Topin,Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates, arXiv:1708.07120, 2017

Pith/arXiv arXiv 2017
[21]

Micikevicius, S

P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu,Mixed Precision Training, ICLR 2018

2018
[22]

B. Zhu, Z. Jiang, X. Zhou, Z. Li, and G. Yu,Class-Balanced Grouping and Sampling for Point Cloud 3D Object Detection, arXiv:1908.09492, 2019

arXiv 1908
[23]

Y. Yan, Y. Mao, and B. Li,SECOND: Sparsely Embedded Convolutional Detection, Sensors, 18(10):3337, 2018. 8

2018

[1] [1]

Zimmer, G

W. Zimmer, G. A. Wardana, S. Sritharan, X. Zhou, R. Song, and A. C. Knoll,TUMTraf V2X Cooperative Perception Dataset, CVPR 2024

2024

[2] [2]

Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. Rus, and S. Han,BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation, ICRA 2023

2023

[3] [3]

Philion and S

J. Philion and S. Fidler,Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D, ECCV 2020

2020

[4] [4]

T. Yin, X. Zhou, and P. Kr¨ ahenb¨ uhl,Center-based 3D Object Detection and Tracking, CVPR 2021

2021

[5] [5]

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,ImageNet: A Large-Scale Hierarchical Image Database, CVPR 2009

2009

[6] [6]

Wightman,PyTorch Image Models (timm), GitHub repository, 2019

R. Wightman,PyTorch Image Models (timm), GitHub repository, 2019. https://github.com/huggingface/ pytorch-image-models

2019

[7] [7]

Zhou and O

Y. Zhou and O. Tuzel,VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection, CVPR 2018

2018

[8] [8]

A. H. Lang, S. Vora, H. Caesar, L. Zhou, J. Yang, and O. Beijbom,PointPillars: Fast Encoders for Object Detection from Point Clouds, CVPR 2019

2019

[9] [9]

C. R. Qi, H. Su, K. Mo, and L. J. Guibas,PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation, CVPR 2017

2017

[10] [10]

Vaswani, N

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin,Attention Is All You Need, NeurIPS 2017

2017

[11] [11]

Y. Chen, J. Liu, X. Zhang, X. Qi, and J. Jia,LargeKernel3D: Scaling Up Kernels in 3D Sparse CNNs, CVPR 2023

2023

[12] [12]

T.-Y. Lin, P. Doll´ ar, R. Girshick, K. He, B. Hariharan, and S. Belongie,Feature Pyramid Networks for Object Detection, CVPR 2017

2017

[13] [13]

Tan and Q

M. Tan and Q. V. Le,EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks, ICML 2019

2019

[14] [14]

J. Hu, L. Shen, and G. Sun,Squeeze-and-Excitation Networks, CVPR 2018

2018

[15] [15]

T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Doll´ ar,Focal Loss for Dense Object Detection, ICCV 2017

2017

[16] [16]

X. Zhou, D. Wang, and P. Kr¨ ahenb¨ uhl,Objects as Points, arXiv:1904.07850, 2019

Pith/arXiv arXiv 1904

[17] [17]

Rezatofighi, N

H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese,Generalized Intersection over Union: A Metric and a Loss for Bounding Box Regression, CVPR 2019

2019

[18] [18]

Zheng, W

W. Zheng, W. Tang, S. Chen, L. Jiang, and C.-W. Fu,CIA-SSD: Confident IoU-Aware Single-Stage Object Detector from Point Cloud, AAAI 2021

2021

[19] [19]

Loshchilov and F

I. Loshchilov and F. Hutter,Decoupled Weight Decay Regularization, ICLR 2019

2019

[20] [20]

L. N. Smith and N. Topin,Super-Convergence: Very Fast Training of Neural Networks Using Large Learning Rates, arXiv:1708.07120, 2017

Pith/arXiv arXiv 2017

[21] [21]

Micikevicius, S

P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu,Mixed Precision Training, ICLR 2018

2018

[22] [22]

B. Zhu, Z. Jiang, X. Zhou, Z. Li, and G. Yu,Class-Balanced Grouping and Sampling for Point Cloud 3D Object Detection, arXiv:1908.09492, 2019

arXiv 1908

[23] [23]

Y. Yan, Y. Mao, and B. Li,SECOND: Sparsely Embedded Convolutional Detection, Sensors, 18(10):3337, 2018. 8

2018