arxiv: 2606.12981 · v1 · pith:UFP6WPMCnew · submitted 2026-06-11 · 💻 cs.CV

Camera and LiDAR BEV Fusion for Cooperative 3D Object Detection on TUMTraf V2X

Pith reviewed 2026-06-27 07:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords cooperative perception · 3D object detection · BEV fusion · camera-LiDAR fusion · V2X · TUMTraf · data leakage

The pith

Camera and LiDAR BEV fusion for cooperative 3D detection reaches 0.85 mAP on the public TUMTraf V2X test split.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper describes a detector that fuses three roadside cameras with a combined infrastructure and vehicle point cloud in a shared bird's-eye-view representation. Predictions are made with a CenterPoint-style head using generalized IoU loss and an additional IoU quality re-ranking head. On the public Codabench test split the system scores 0.85 3D mAP after training on the released train and validation data. The authors also identify that 44 of the 50 test frames appear in the train or validation splits and quantify the effect of this overlap through targeted experiments.
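The IoU quality re-ranking step can be pictured as blending each detection's classification confidence with its predicted IoU before sorting. The sketch below assumes a geometric blend, a form used by some IoU-aware detectors; the function name `rerank_score` and the exponent `alpha` are illustrative, and the abstract does not say which formula or weighting the authors use.

```python
def rerank_score(cls_score: float, iou_pred: float, alpha: float = 0.5) -> float:
    """Blend classification confidence with predicted IoU.

    Hypothetical form: score = cls^(1 - alpha) * iou^alpha.
    alpha = 0 ignores the IoU head; alpha = 1 ranks by predicted IoU alone.
    """
    iou = min(max(iou_pred, 0.0), 1.0)  # clamp the IoU prediction to [0, 1]
    return (cls_score ** (1.0 - alpha)) * (iou ** alpha)


# Toy detections: "a" is confident but poorly localized, "b" is the reverse.
dets = [
    {"id": "a", "cls_score": 0.90, "iou_pred": 0.40},
    {"id": "b", "cls_score": 0.80, "iou_pred": 0.90},
]

ranked = sorted(
    dets,
    key=lambda d: rerank_score(d["cls_score"], d["iou_pred"]),
    reverse=True,
)
ranked_ids = [d["id"] for d in ranked]
print(ranked_ids)  # "b" now outranks "a"
```

Under this assumed blend, a box with high class confidence but low predicted overlap drops below a slightly less confident, better-localized box, which is the behavior a re-ranking head is meant to produce.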

Core claim

A BEV fusion model combining roadside cameras and infrastructure-plus-vehicle LiDAR achieves 0.85 3D mAP on the public test split; oversampling the 44 overlapping frames raises the score to 0.89 while replacing predictions on those frames with ground truth raises it to 0.99.

What carries the argument

Bird's-eye-view fusion of multi-view camera features and a merged point cloud, followed by a CenterPoint detection head with generalized IoU regression and IoU-based re-ranking.
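For readers unfamiliar with the generalized IoU term, a minimal sketch follows. It assumes axis-aligned 2D boxes given as `(x1, y1, x2, y2)` with positive area, which is a simplification: the detector regresses 3D boxes, and the abstract does not specify how the authors compute the overlap. The function names are illustrative.

```python
def giou(a, b):
    """Generalized IoU of two axis-aligned boxes (x1, y1, x2, y2).

    GIoU = IoU - (hull_area - union_area) / hull_area, where the hull is the
    smallest axis-aligned box enclosing both inputs. Range is (-1, 1].
    Assumes both boxes have positive area.
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection rectangle (zero if the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest enclosing box.
    hull_w = max(ax2, bx2) - min(ax1, bx1)
    hull_h = max(ay2, by2) - min(ay1, by1)
    hull = hull_w * hull_h

    return inter / union - (hull - union) / hull


def giou_loss(pred, target):
    """Loss is 1 - GIoU, so a perfect match gives 0 and the maximum approaches 2."""
    return 1.0 - giou(pred, target)


print(giou_loss((0, 0, 2, 2), (0, 0, 2, 2)))  # identical boxes
print(giou_loss((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes still give a finite, informative loss
```

Unlike plain IoU, the enclosing-box penalty keeps the loss informative when predicted and target boxes do not overlap at all.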

If this is right

  • The fusion approach can be applied to roadside cooperative perception tasks.
  • Oversampling the 44 overlapping frames during finetuning raises the test score from 0.85 to 0.89 mAP.
  • Replacing predictions on those frames with the released ground truth raises it to 0.99 mAP.
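Oversampling of this kind can be as simple as repeating the overlapping frames in the training list. The sketch below is one way to do that, not the authors' procedure: the frame identifiers, the function name `oversample`, and the repeat factor of 3 are all hypothetical, and the abstract does not report the factor actually used.

```python
def oversample(train_ids, overlap_ids, repeat=3):
    """Return a training list in which overlapping frames appear `repeat` times.

    Frames outside `overlap_ids` appear once. Order is preserved so the
    result is deterministic; shuffling is left to the data loader.
    """
    overlap = set(overlap_ids)
    out = []
    for frame_id in train_ids:
        copies = repeat if frame_id in overlap else 1
        out.extend([frame_id] * copies)
    return out


# Hypothetical identifiers, for illustration only.
train_ids = ["f0", "f1", "f2", "f3"]
overlap_ids = {"f1", "f3"}

epoch_list = oversample(train_ids, overlap_ids, repeat=3)
print(epoch_list)
```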

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The 0.85 mAP likely overestimates generalization performance due to data leakage.
  • V2X benchmarks require stricter train-test separation to avoid such overlaps.
  • Methods should be evaluated on non-overlapping test subsets for fair comparison.

Load-bearing premise

The 0.85 mAP measures generalization only to the extent that the public test split is disjoint from the train and validation splits; the paper reports that just 6 of its 50 frames are.

What would settle it

Running the model on a test set consisting only of the 6 non-overlapping frames would reveal the performance without leakage.
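Selecting that leak-free subset is a set-membership check once frames can be matched across splits. The sketch below assumes frames carry a stable identifier shared between splits (a filename or timestamp, for instance); matching by content hash would work the same way. The identifiers are synthetic and only arranged to mirror the 40 + 4 overlap the paper reports.

```python
def split_test_by_overlap(test_ids, train_ids, val_ids):
    """Partition test frames into those seen in train/val and those not seen."""
    seen = set(train_ids) | set(val_ids)
    leaked = [f for f in test_ids if f in seen]
    clean = [f for f in test_ids if f not in seen]
    return leaked, clean


# Synthetic identifiers: 50 test frames, 40 reused in train, 4 reused in val.
test_ids = [f"frame_{i:03d}" for i in range(50)]
train_ids = test_ids[:40] + [f"train_only_{i:03d}" for i in range(10)]
val_ids = test_ids[40:44] + [f"val_only_{i:03d}" for i in range(5)]

leaked, clean = split_test_by_overlap(test_ids, train_ids, val_ids)
print(len(leaked), len(clean))  # 44 6
```

Evaluation would then be restricted to the frames in `clean`.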

Figures

Figures reproduced from arXiv: 2606.12981 by Muhammad Shahbaz, Shaurya Agarwal.

Figure 1: Detector architecture. The LiDAR branch (top) and camera branch (bottom) produce BEV …
Figure 2: Qualitative bevfusion detections on four representative test sequences. Each panel shows the three infrastructure cameras and the vehicle ego camera (top) alongside the fused-LiDAR bird’s-eye view with the predicted 3D boxes (bottom), coloured by class. Top row: a busy intersection (31 detected objects) and an oncoming car-carrier trailer. Bottom row: a low-light dusk scene, where the camera branch helps r…
Figure 3: Cooperative perception sees through occlusion (ego camera view).
Figure 4: Per-class AP on the leak44 frames for the three configurations. Legend pairs are (Codabench …

Original abstract

We describe a Camera and LiDAR fusion detector developed for the TUMTraf V2X cooperative 3D object detection track of the DriveX 2026 challenge. The detector fuses three roadside cameras with a fused infrastructure-plus-vehicle point cloud in a shared bird's-eye-view space and predicts boxes through a CenterPoint-style head with a generalized IoU regression loss and an IoU quality re-ranking head. Trained on the provided train and validation splits, the model reaches a 3D mAP of 0.85 on the public Codabench test split. While iterating on the system, we observed that 44 of the 50 test frames are also present in the released train (40) and validation (4) splits with their labels. We therefore conducted two additional studies to quantify how this overlap affects the final score: (1) a finetuning run that oversamples the 44 overlapping frames, reaching 0.89 mAP, and (2) a post-processing run that replaces predictions on those frames with the released ground truth, reaching 0.99 mAP (uploaded to our Codabench account for testing but not published on the leaderboard). All three configurations and their per-class results are reported.
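The ground-truth substitution run in study (2) works as a leakage diagnostic: it shows how much of the score the released labels alone can buy. A minimal sketch of such a post-processing step follows, assuming predictions and labels are stored as per-frame lists of box dictionaries; that layout, the function name `substitute_ground_truth`, and the fixed confidence of 1.0 are assumptions, not details taken from the paper.

```python
def substitute_ground_truth(predictions, ground_truth, score=1.0):
    """Replace predictions with released labels wherever labels exist.

    predictions:  {frame_id: [box_dict, ...]} from the detector
    ground_truth: {frame_id: [box_dict, ...]} for frames whose labels were
                  released in the train or validation splits
    Frames without released labels keep the detector's output unchanged.
    """
    merged = {}
    for frame_id, pred_boxes in predictions.items():
        if frame_id in ground_truth:
            # Copy each label and attach a confidence so it can be scored.
            merged[frame_id] = [dict(box, score=score) for box in ground_truth[frame_id]]
        else:
            merged[frame_id] = list(pred_boxes)
    return merged


# Toy data with hypothetical frame names and a reduced box format.
predictions = {
    "frame_a": [{"cls": "car", "x": 1.2, "score": 0.7}],
    "frame_b": [{"cls": "bus", "x": 9.0, "score": 0.6}],
}
ground_truth = {"frame_a": [{"cls": "car", "x": 1.0}]}

submission = substitute_ground_truth(predictions, ground_truth)
print(submission)
```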

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript describes a Camera and LiDAR BEV fusion detector for the TUMTraf V2X cooperative 3D object detection track. It fuses three roadside cameras with a combined infrastructure-plus-vehicle point cloud in shared BEV space and predicts boxes via a CenterPoint-style head using generalized IoU regression loss and an IoU quality re-ranking head. Trained on the released train and validation splits, the model reports 0.85 3D mAP on the public Codabench test split. The authors disclose that 44 of 50 test frames overlap with the train (40) and validation (4) splits and quantify the effect via an oversampling experiment (0.89 mAP) and a ground-truth substitution experiment (0.99 mAP).

Significance. If the empirical results hold, the work supplies a practical baseline for multi-modal cooperative perception and, more importantly, provides transparent quantification of test-set overlap effects through controlled experiments. This disclosure and the accompanying ablation-style runs strengthen the credibility of the reported numbers and offer useful guidance for interpreting leaderboard scores in V2X challenges.

minor comments (2)
  1. [Abstract] The description of the architecture, loss, and head is high-level only; no equations, network dimensions, training hyperparameters, or implementation details are supplied, which prevents independent verification or reproduction of the 0.85 mAP figure.
  2. The manuscript does not indicate whether the per-class AP breakdowns or the two controlled experiments are presented in tables or figures; adding such structured results would improve clarity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript, the accurate summary of our contributions including the test-set overlap disclosure, and the recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The manuscript reports empirical 3D mAP results obtained by training a BEV fusion detector on the provided train/validation splits and evaluating on the public Codabench test split. It explicitly discloses the 44-frame overlap between test and train/validation data, then quantifies the effect through two controlled experiments (oversampling overlap and GT substitution). No derivation chain, fitted parameter presented as a prediction, self-citation load-bearing premise, or ansatz is present; the reported numbers are direct experimental outcomes rather than quantities defined in terms of the model's own outputs or prior self-referential results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; the paper describes an applied detector without mathematical derivations, free parameters, or new postulated entities.

