Recognition: 2 theorem links
BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View
Pith reviewed 2026-05-15 14:30 UTC · model grok-4.3
The pith
BEVDet detects 3D objects in bird-eye-view by reusing standard modules plus custom data augmentation and upgraded NMS.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
BEVDet performs 3D object detection in bird-eye-view by reusing existing modules, with performance substantially improved by an exclusive data augmentation strategy and an upgraded non-maximum suppression strategy. On the nuScenes validation set, BEVDet-Base reaches 39.3 percent mAP and 47.2 percent NDS, exceeding all prior published results, while BEVDet-Tiny matches FCOS3D accuracy at 11 percent of the compute and 9.2 times the speed.
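For scale, the ratios implied by those numbers (a back-of-envelope reading, not figures the paper states directly) put FCOS3D at roughly:

```latex
% Implied FCOS3D budget and speed from the abstract's ratios:
\[
\text{GFLOPs}_{\text{FCOS3D}} \approx \frac{215.3}{0.11} \approx 1957,
\qquad
\text{FPS}_{\text{FCOS3D}} \approx \frac{15.6}{9.2} \approx 1.7 .
\]
```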
What carries the argument
The bird-eye-view (BEV) detection framework, which transforms multi-camera images into a unified top-down representation and applies standard detection heads enhanced by custom augmentation and NMS upgrades.
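To make that shape flow concrete, here is a minimal sketch of a lift-splat-style view transformation in the spirit of reference [33] below, which BEVDet's view transformer builds on. The tensor layout, grid size, and the `pix2ego` camera-geometry helper are illustrative assumptions, not the released implementation:

```python
# Minimal sketch of the BEV pipeline's view transform. NOT BEVDet's code:
# shapes, depth-bin handling, and helper names are illustrative assumptions.
import numpy as np

def lift_splat(feats, depth_probs, pix2ego, grid=(128, 128), cell=0.8):
    """Project per-camera image features into a single BEV grid.

    feats:       (N_cam, H, W, C)  image-encoder features
    depth_probs: (N_cam, H, W, D)  per-pixel categorical depth distribution
    pix2ego:     hypothetical callable (cam, u, v, d_bin) -> (x, y) ego-frame
                 ground coords; stands in for intrinsics/extrinsics
    """
    n_cam, h, w, c = feats.shape
    d_bins = depth_probs.shape[-1]
    bev = np.zeros((grid[0], grid[1], c))
    for cam in range(n_cam):
        for v in range(h):
            for u in range(w):
                for d in range(d_bins):
                    x, y = pix2ego(cam, u, v, d)          # lift pixel to 3D
                    i = int(x / cell) + grid[0] // 2
                    j = int(y / cell) + grid[1] // 2
                    if 0 <= i < grid[0] and 0 <= j < grid[1]:
                        # splat: weight the feature by its depth confidence
                        bev[i, j] += depth_probs[cam, v, u, d] * feats[cam, v, u]
    return bev  # a BEV encoder and standard detection head consume this
```

Each camera feature is spread along its predicted depth distribution and accumulated into the shared ego-frame grid; everything downstream is a conventional detector.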
If this is right
- BEVDet-Tiny achieves 31.2 percent mAP and 39.2 percent NDS at 15.6 FPS using only 215.3 GFLOPs.
- BEVDet-Base surpasses FCOS3D by 9.8 percent mAP and 10.0 percent NDS at similar inference speed.
- Detection in BEV space simplifies integration with downstream route planning, since most planning targets are defined in the same top-down coordinate frame as the ego vehicle's motion.
- Performance lifts derive mainly from the augmentation pipeline and NMS changes rather than architectural novelty; a sketch of the NMS idea follows this list.
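As a concrete reading of that last point, here is a hedged sketch of a class-wise "Scale-NMS"-style upgrade in BEV: boxes are inflated by a per-class factor before the overlap test, so small classes such as pedestrians and traffic cones produce usable IoU values. The factors, the axis-aligned BEV IoU, and all names are illustrative assumptions, not the paper's exact procedure:

```python
# Hedged sketch of class-wise scale-then-suppress NMS in BEV.
import numpy as np

def scale_nms(centers, sizes, scores, classes, factors, iou_thr=0.5):
    """centers (N,2), sizes (N,2) BEV extents, scores (N,), classes (N,) ints.
    factors: per-class size multipliers used only for the overlap test."""
    sizes = sizes * np.asarray(factors)[classes][:, None]  # inflate per class
    order = np.argsort(-scores)
    keep = []
    while order.size:
        i, order = order[0], order[1:]
        keep.append(int(i))
        if not order.size:
            break
        # axis-aligned BEV IoU between box i and the remaining boxes
        lo = np.maximum(centers[i] - sizes[i] / 2, centers[order] - sizes[order] / 2)
        hi = np.minimum(centers[i] + sizes[i] / 2, centers[order] + sizes[order] / 2)
        inter = np.prod(np.clip(hi - lo, 0, None), axis=1)
        union = sizes[i].prod() + sizes[order].prod(axis=1) - inter
        order = order[inter / union <= iou_thr]
    return keep
```

For example, `factors=[1.0, 3.0]` would triple the footprint of class-1 boxes for suppression purposes only; the reported box sizes are unchanged.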
Where Pith is reading between the lines
- The same augmentation and NMS adjustments could be transferred to other view-transformation methods to improve accuracy without new network designs.
- If the gains hold across varied sensor rigs, the paradigm may reduce reliance on bespoke 3D architectures in favor of careful data and post-processing choices.
- Validation on additional benchmarks with different lighting or traffic patterns would test whether the reported margins persist outside nuScenes.
Load-bearing premise
The custom data augmentation strategy and upgraded non-maximum suppression produce reliable gains on new datasets and environments without introducing biases or requiring extensive per-dataset retuning.
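A minimal sketch of the consistency this premise demands: any transform applied to the BEV feature map during augmentation must also be applied to the targets, otherwise the augmentation itself injects label bias. Restricting to exact 90-degree rotations, and the function name below, are assumptions for brevity, not the paper's augmentation schedule:

```python
# Illustrative BEV-space augmentation: rotate features and targets together.
import numpy as np

def augment_bev(bev, boxes, angle_deg):
    """Rotate a square BEV feature map and its boxes by the same multiple of
    90 degrees. bev: (H, H, C); boxes: (N, 3) = (x, y, yaw), with (x, y)
    centred on the grid middle."""
    assert bev.shape[0] == bev.shape[1] and angle_deg % 90 == 0
    k = (angle_deg // 90) % 4
    theta = np.deg2rad(90 * k)
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    bev_aug = np.rot90(bev, k=k, axes=(0, 1))   # counter-clockwise grid rotation
    boxes_aug = boxes.copy()
    boxes_aug[:, :2] = boxes[:, :2] @ rot.T     # rotate box centres identically
    boxes_aug[:, 2] = boxes[:, 2] + theta       # keep headings consistent
    return bev_aug, boxes_aug
```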
What would settle it
A drop in BEVDet mAP below the level of FCOS3D when evaluated on a different multi-camera dataset with altered camera configurations or weather conditions would show the gains do not generalize.
read the original abstract
Autonomous driving perceives its surroundings for decision making, which is one of the most complex scenarios in visual perception. The success of paradigm innovation in solving the 2D object detection task inspires us to seek an elegant, feasible, and scalable paradigm for fundamentally pushing the performance boundary in this area. To this end, we contribute the BEVDet paradigm in this paper. BEVDet performs 3D object detection in Bird-Eye-View (BEV), where most target values are defined and route planning can be handily performed. We merely reuse existing modules to build its framework but substantially develop its performance by constructing an exclusive data augmentation strategy and upgrading the Non-Maximum Suppression strategy. In the experiment, BEVDet offers an excellent trade-off between accuracy and time-efficiency. As a fast version, BEVDet-Tiny scores 31.2% mAP and 39.2% NDS on the nuScenes val set. It is comparable with FCOS3D, but requires just 11% computational budget of 215.3 GFLOPs and runs 9.2 times faster at 15.6 FPS. Another high-precision version dubbed BEVDet-Base scores 39.3% mAP and 47.2% NDS, significantly exceeding all published results. With a comparable inference speed, it surpasses FCOS3D by a large margin of +9.8% mAP and +10.0% NDS. The source code is publicly available for further research at https://github.com/HuangJunJie2017/BEVDet .
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes the BEVDet paradigm for multi-camera 3D object detection performed directly in Bird-Eye-View (BEV). It reuses standard modules for the core pipeline but introduces an exclusive data augmentation strategy and an upgraded Non-Maximum Suppression procedure. On the nuScenes validation set, BEVDet-Tiny reports 31.2% mAP / 39.2% NDS at 15.6 FPS while BEVDet-Base reaches 39.3% mAP / 47.2% NDS, exceeding FCOS3D by +9.8% mAP and +10.0% NDS at comparable speed.
Significance. If the performance margins are reproducible and attributable to the proposed components, the work shows that large gains in multi-view 3D detection are obtainable without architectural novelty, simply by refining training and post-processing. This would be valuable for practical autonomous-driving stacks that already rely on BEV representations and need strong accuracy–latency trade-offs.
major comments (2)
- [Section 3] The central performance claims rest on an 'exclusive data augmentation strategy' and an 'upgraded NMS'. No ablation tables isolate the contribution of either component (e.g., standard augmentations + vanilla NMS vs. the proposed versions) while holding the rest of the pipeline fixed. Without these controlled experiments the reported +9.8% mAP margin cannot be confidently attributed to the new elements rather than to hyper-parameter tuning or other unreported factors; a sketch of the requested grid follows this list.
- [Section 4, Experiments] All quantitative results are reported solely on the nuScenes validation split. No cross-dataset evaluation (e.g., on KITTI or Waymo) or additional nuScenes splits is provided to test whether the custom augmentation and NMS strategies generalize or introduce dataset-specific biases.
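A sketch of the 2x2 grid the first comment asks for, varying only the two contested components while freezing everything else; the config keys and the surrounding train/evaluate harness are hypothetical placeholders:

```python
# Controlled 2x2 ablation: only augmentation and NMS vary across runs.
from itertools import product

FIXED = dict(backbone="swin-tiny", bev_encoder="resnet", head="centerpoint",
             schedule="20ep")  # held constant across all four runs

def ablation_grid():
    for aug, nms in product(["standard", "exclusive"], ["vanilla", "upgraded"]):
        yield {**FIXED, "augmentation": aug, "nms": nms}

for cfg in ablation_grid():
    print(cfg)  # each cfg would be trained and scored on the same val split
```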
minor comments (2)
- [Abstract and Section 4] The abstract and Section 4 state that source code is released, yet the manuscript does not specify which exact hyper-parameters, augmentation schedules, or NMS thresholds are used in the released implementation; this should be clarified for reproducibility.
- [Figure 2] The figure and its accompanying text would benefit from an explicit diagram or table contrasting the proposed augmentation pipeline with the standard one used by prior BEV methods.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which helps strengthen the attribution of our results and the evaluation of generalization. We address each major comment below and will revise the manuscript to incorporate the suggested improvements where feasible.
read point-by-point responses
-
Referee: [Section 3] The central performance claims rest on an 'exclusive data augmentation strategy' and an 'upgraded NMS'. No ablation tables isolate the contribution of either component (e.g., standard augmentations + vanilla NMS vs. the proposed versions) while holding the rest of the pipeline fixed. Without these controlled experiments the reported +9.8% mAP margin cannot be confidently attributed to the new elements rather than to hyper-parameter tuning or other unreported factors.
Authors: We agree that explicit, controlled ablations are required to isolate the contributions of the exclusive data augmentation strategy and upgraded NMS. In the revised manuscript we will add dedicated ablation tables that compare (i) standard augmentations versus our proposed strategy and (ii) vanilla NMS versus the upgraded procedure, while keeping the remainder of the pipeline (backbone, BEV encoder, detection head, training schedule) fixed. These experiments will be run on the same nuScenes validation split to directly quantify the gains attributable to each component. revision: yes
-
Referee: [Section 4, Experiments] All quantitative results are reported solely on the nuScenes validation split. No cross-dataset evaluation (e.g., on KITTI or Waymo) or additional nuScenes splits is provided to test whether the custom augmentation and NMS strategies generalize or introduce dataset-specific biases.
Authors: We acknowledge that reporting only on the nuScenes validation split limits the assessment of generalization. In the revision we will add results on the official nuScenes test set (via the evaluation server) to provide an additional held-out evaluation. We will also include a short discussion of potential dataset-specific effects of the augmentation and NMS choices. Full cross-dataset experiments on KITTI or Waymo would require substantial additional engineering and compute; if they cannot be completed within the revision window we will explicitly note this as a limitation and list it as future work. revision: partial
Circularity Check
No circularity: BEVDet is an empirical reuse of modules plus custom augmentation/NMS, evaluated directly on nuScenes.
full rationale
The paper presents BEVDet as a paradigm that reuses existing modules for BEV-based 3D detection and improves results via an exclusive data augmentation strategy and upgraded NMS. All reported metrics (mAP, NDS on nuScenes val) are direct empirical measurements on a standard split using conventional evaluation protocols. No equations, derivations, or predictions are defined that reduce by construction to fitted parameters, self-citations, or renamed inputs. The central claims rest on implementation details and benchmark scores rather than any self-referential mathematical chain.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Camera images can be reliably transformed into a consistent bird's-eye-view feature map using existing geometric projection techniques (a worked form of this geometry follows below).
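A worked form of the geometry behind this assumption, using standard pinhole notation rather than the paper's own: with intrinsics K and camera-to-ego extrinsics (R, t), a pixel (u, v) at hypothesized depth d lifts to an ego-frame point, which is then binned into a BEV cell of side Δ.

```latex
% Standard pinhole lifting and BEV binning; notation is ours, not the paper's.
\[
\mathbf{p}_{\mathrm{ego}} \;=\; R \,\bigl( d \, K^{-1} [u,\; v,\; 1]^{\mathsf{T}} \bigr) \;+\; \mathbf{t},
\qquad
(i,\, j) \;=\; \Bigl( \Bigl\lfloor \tfrac{x_{\mathrm{ego}}}{\Delta} \Bigr\rfloor,\;
                     \Bigl\lfloor \tfrac{y_{\mathrm{ego}}}{\Delta} \Bigr\rfloor \Bigr).
\]
```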
Lean theorems connected to this paper
-
Foundation.DAlembert.Inevitability.bilinear_family_forced · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
We merely reuse existing modules to build its framework but substantially develop its performance by constructing an exclusive data augmentation strategy and upgrading the Non-Maximum Suppression strategy.
-
Foundation.DimensionForcing.dimension_forced · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
BEVDet performs 3D object detection in Bird-Eye-View (BEV), where most target values are defined and route planning can be handily performed.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 20 Pith papers
-
Still Camouflage, Moving Illusion: View-Induced Trajectory Manipulation in Autonomous Driving
Static adversarial camouflage exploits natural view-angle changes during relative motion to induce consistent feature drift in AV perception, leading to incorrect trajectory predictions and unnecessary braking.
-
EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras
EgoEV-HandPose uses stereo event cameras and a bird's-eye-view fusion module to achieve 30.54 mm MPJPE and 86.87% gesture accuracy on a new large-scale egocentric dataset, outperforming prior RGB and event methods esp...
-
PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations
PointForward uses sparse world-space 3D queries and scene graphs to deliver consistent single-pass reconstruction of dynamic driving scenes via point-aligned representations.
-
SoK: The Next Frontier in AV Security: Systematizing Perception Attacks and the Emerging Threat of Multi-Sensor Fusion
The paper organizes perception attacks on AVs into a new taxonomy, identifies gaps in fusion-aware defenses, and validates one cross-sensor vulnerability with a proof-of-concept simulation.
-
Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning
Dynamic token selection and training only 1.6 million parameters instead of over 300 million reduces computation by 48-55% and improves accuracy over prior state-of-the-art on the NuScenes dataset.
-
DinoRADE: Full Spectral Radar-Camera Fusion with Vision Foundation Model Features for Multi-class Object Detection in Adverse Weather
DinoRADE reports a radar-centered multi-class detection pipeline that fuses dense radar tensors with DINOv3 features via deformable attention and outperforms prior radar-camera methods by 12.1% on the K-Radar dataset ...
-
TopoMaskV3: 3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding
TopoMaskV3 adds dense offset and height heads to produce standalone 3D road centerlines from masks and reports 28.5 OLS on a new geographically disjoint long-range benchmark.
-
Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy
HiPR improves 3D occupancy prediction by adaptively reparameterizing projection sampling ranges using LiDAR height priors instead of fixed uniform pillars.
-
SimPB++: Simultaneously Detecting 2D and 3D Objects from Multiple Cameras
SimPB++ unifies multi-view 2D perspective and 3D BEV object detection in one model via an interactive hybrid decoder, reporting state-of-the-art results on nuScenes and long-range detection up to 150 m on Argoverse2.
-
OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.
-
CAM3DNet: Comprehensively mining the multi-scale features for 3D Object Detection with Multi-View Cameras
CAM3DNet outperforms prior camera-based 3D detectors on nuScenes, Waymo and Argoverse by using three new modules to better mine multi-scale spatiotemporal features from 2D queries and pyramid maps.
-
ESCAPE: Episodic Spatial Memory and Adaptive Execution Policy for Long-Horizon Mobile Manipulation
ESCAPE combines spatio-temporal fusion mapping for depth-free 3D memory with a memory-driven grounding module and adaptive execution policy to reach 65.09% success on ALFRED test-seen long-horizon mobile manipulation tasks.
-
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to pla...
-
R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection
R4Det fuses 4D radar and camera inputs via panoramic depth fusion, deformable gated temporal fusion without ego pose, and instance-guided refinement to reach state-of-the-art 3D detection on TJ4DRadSet and VoD.
-
InterFuserDVS: Event-Enhanced Sensor Fusion for Safe RL-Based Decision Making
Integrating DVS event data into InterFuser through token fusion yields a driving score of 77.2 and 100% route completion on CARLA benchmarks, indicating improved robustness in dynamic conditions.
-
SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection
SemLT3D introduces semantic-guided expert distillation with a language MoE module and CLIP projection to enrich features for long-tailed classes in camera-only 3D detection.
-
Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation
CTAB exchanges features between detection and segmentation via multi-scale deformable attention in BEV space, yielding segmentation gains on 7 nuScenes classes at neutral detection cost.
-
Not All Agents Matter: From Global Attention Dilution to Risk-Prioritized Game Planning
GameAD models autonomous driving as a risk-prioritized game among agents via Risk-Aware Topology Anchoring, Minimax Risk-Aware Sparse Attention and related components, yielding safer trajectories than prior end-to-end...
-
Multi-Modal Sensor Fusion using Hybrid Attention for Autonomous Driving
MMF-BEV fuses camera and radar branches with deformable self- and cross-attention, outperforming unimodal baselines on the VoD 4D radar dataset through a two-stage training process.
-
BEVPredFormer: Spatio-temporal Attention for BEV Instance Prediction in Autonomous Driving
BEVPredFormer uses attention-based temporal processing and 3D camera projection to match or exceed prior methods on nuScenes for BEV instance prediction.
Reference graph
Works this paper leans on
- [1] Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuScenes: A multimodal dataset for autonomous driving. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 11621–11631 (2020)
- [2] Cai, Z., Vasconcelos, N.: Cascade R-CNN: High Quality Object Detection and Instance Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019)
- [3] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-End Object Detection with Transformers. In: Proceedings of the European Conference on Computer Vision. pp. 213–229. Springer (2020)
- [4] Chen, K., Pang, J., Wang, J., Xiong, Y., Li, X., Sun, S., Feng, W., Liu, Z., Shi, J., Ouyang, W., et al.: Hybrid Task Cascade for Instance Segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4974–4983 (2019)
- [5] Chen, Y., Liu, S., Shen, X., Jia, J.: DSGN: Deep Stereo Geometry Network for 3D Object Detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 12536–12545 (2020)
- [6] Contributors, M.: MMDetection3D: OpenMMLab next-generation platform for general 3D object detection. https://github.com/open-mmlab/mmdetection3d (2020)
- [7] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al.: An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In: Proceedings of the International Conference on Learning Representations (2020)
- [8] Gao, S., Cheng, M.M., Zhao, K., Zhang, X.Y., Yang, M.H., Torr, P.H.: Res2Net: A New Multi-scale Backbone Architecture. IEEE Transactions on Pattern Analysis and Machine Intelligence (2019)
- [9] Geiger, A., Lenz, P., Urtasun, R.: Are we ready for Autonomous Driving? The KITTI Vision Benchmark Suite. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2012)
- [10] Ghiasi, G., Lin, T.Y., Le, Q.V.: NAS-FPN: Learning Scalable Feature Pyramid Architecture for Object Detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7036–7045 (2019)
- [11] Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 580–587 (2014)
- [12] Guizilini, V., Ambrus, R., Pillai, S., Raventos, A., Gaidon, A.: 3D Packing for Self-Supervised Monocular Depth Estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2485–2494 (2020)
- [13] He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the International Conference on Computer Vision. pp. 2961–2969 (2017)
- [14] He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 770–778 (2016)
- [15] Howard, A.G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., Adam, H.: MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv preprint arXiv:1704.04861 (2017)
- [16] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely Connected Convolutional Networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4700–4708 (2017)
- [17] Huang, J., Zhu, Z., Guo, F., Huang, G.: The Devil is in the Details: Delving into Unbiased Data Processing for Human Pose Estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5700–5709 (2020)
- [18] Kirillov, A., Wu, Y., He, K., Girshick, R.: PointRend: Image Segmentation as Rendering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9799–9808 (2020)
- [19] Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems 25, 1097–1105 (2012)
- [20] Kumar, A., Brazil, G., Liu, X.: GrooMeD-NMS: Grouped Mathematically Differentiable NMS for Monocular 3D Object Detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8973–8983 (2021)
- [21] Lang, A.H., Vora, S., Caesar, H., Zhou, L., Yang, J., Beijbom, O.: PointPillars: Fast Encoders for Object Detection from Point Clouds. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 12697–12705 (2019)
- [22] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature Pyramid Networks for Object Detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2117–2125 (2017)
- [23] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal Loss for Dense Object Detection. In: Proceedings of the International Conference on Computer Vision. pp. 2980–2988 (2017)
- [24] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common Objects in Context. In: Proceedings of the European Conference on Computer Vision. pp. 740–755. Springer (2014)
- [25] Liu, S., Qi, L., Qin, H., Shi, J., Jia, J.: Path Aggregation Network for Instance Segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8759–8768 (2018)
- [26] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In: Proceedings of the International Conference on Computer Vision. pp. 10012–10022 (2021)
- [27] Liu, Z., Zhou, D., Lu, F., Fang, J., Zhang, L.: AutoShape: Real-Time Shape-Aware Monocular 3D Object Detection. In: Proceedings of the International Conference on Computer Vision. pp. 15641–15650 (2021)
- [28] Loshchilov, I., Hutter, F.: Decoupled Weight Decay Regularization. In: Proceedings of the International Conference on Learning Representations (2019)
- [29] Lu, Y., Ma, X., Yang, L., Zhang, T., Liu, Y., Chu, Q., Yan, J., Ouyang, W.: Geometry Uncertainty Projection Network for Monocular 3D Object Detection. In: Proceedings of the International Conference on Computer Vision. pp. 3111–3121 (2021)
- [30] Nabati, R., Qi, H.: CenterFusion: Center-based Radar and Camera Fusion for 3D Object Detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. pp. 1527–1536 (2021)
- [31] Pan, B., Sun, J., Leung, H.Y.T., Andonian, A., Zhou, B.: Cross-View Semantic Segmentation for Sensing Surroundings. IEEE Robotics and Automation Letters 5(3), 4867–4873 (2020)
- [32] Park, D., Ambrus, R., Guizilini, V., Li, J., Gaidon, A.: Is Pseudo-Lidar needed for Monocular 3D Object detection? In: Proceedings of the International Conference on Computer Vision. pp. 3142–3152 (2021)
- [33] Philion, J., Fidler, S.: Lift, Splat, Shoot: Encoding Images from Arbitrary Camera Rigs by Implicitly Unprojecting to 3D. In: Proceedings of the European Conference on Computer Vision. pp. 194–210. Springer (2020)
- [34] Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space. In: Proceedings of the 31st International Conference on Neural Information Processing Systems. pp. 5105–5114 (2017)
- [35] Radosavovic, I., Kosaraju, R.P., Girshick, R., He, K., Dollár, P.: Designing Network Design Spaces. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 10428–10436 (2020)
- [36] Reading, C., Harakeh, A., Chae, J., Waslander, S.L.: Categorical Depth Distribution Network for Monocular 3D Object Detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8555–8564 (2021)
- [37] Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You Only Look Once: Unified, Real-Time Object Detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 779–788 (2016)
- [38] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Advances in Neural Information Processing Systems 28, 91–99 (2015)
- [39] Roddick, T., Cipolla, R.: Predicting Semantic Map Representations from Images using Pyramid Occupancy Networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 11138–11147 (2020)
- [40] Rosenfeld, A., Thurston, M.: Edge and Curve Detection for Visual Scene Analysis. IEEE Transactions on Computers 100(5), 562–569 (1971)
- [41] Shao, S., Li, Z., Zhang, T., Peng, C., Yu, G., Zhang, X., Li, J., Sun, J.: Objects365: A Large-scale, High-quality Dataset for Object Detection. In: Proceedings of the International Conference on Computer Vision. pp. 8430–8439 (2019)
- [42] Simonelli, A., Bulo, S.R., Porzi, L., López-Antequera, M., Kontschieder, P.: Disentangling Monocular 3D Object Detection. In: Proceedings of the International Conference on Computer Vision. pp. 1991–1999 (2019)
- [43] Sun, K., Xiao, B., Liu, D., Wang, J.: Deep High-Resolution Representation Learning for Human Pose Estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5693–5703 (2019)
- [44] Sun, P., Kretzschmar, H., Dotiwalla, X., Chouard, A., Patnaik, V., Tsui, P., Guo, J., Zhou, Y., Chai, Y., Caine, B., et al.: Scalability in Perception for Autonomous Driving: Waymo Open Dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2446–2454 (2020)
- [45] Tan, M., Le, Q.: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks. In: Proceedings of the International Conference on Machine Learning. pp. 6105–6114. PMLR (2019)
- [46] Tian, Z., Shen, C., Chen, H., He, T.: FCOS: Fully Convolutional One-Stage Object Detection. In: Proceedings of the International Conference on Computer Vision. pp. 9627–9636 (2019)
- [47] Wang, L., Du, L., Ye, X., Fu, Y., Guo, G., Xue, X., Feng, J., Zhang, L.: Depth-conditioned Dynamic Message Propagation for Monocular 3D Object Detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 454–463 (2021)
- [48] Wang, L., Zhang, L., Zhu, Y., Zhang, Z., He, T., Li, M., Xue, X.: Progressive Coordinate Transforms for Monocular 3D Object Detection. In: Advances in Neural Information Processing Systems (2021)
- [49] Wang, T., Zhu, X., Pang, J., Lin, D.: FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection. arXiv preprint arXiv:2104.10956 (2021)
- [50] Wang, T., Zhu, X., Pang, J., Lin, D.: Probabilistic and Geometric Depth: Detecting Objects in Perspective. arXiv preprint arXiv:2107.14160 (2021)
- [51] Wang, Y., Guizilini, V., Zhang, T., Wang, Y., Zhao, H., Solomon, J.: DETR3D: 3D Object Detection from Multi-view Images via 3D-to-2D Queries. arXiv preprint arXiv:2110.06922 (2021)
- [52] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified Perceptual Parsing for Scene Understanding. In: Proceedings of the European Conference on Computer Vision. pp. 418–434 (2018)
- [53] Yan, Y., Mao, Y., Li, B.: SECOND: Sparsely Embedded Convolutional Detection. Sensors 18(10), 3337 (2018)
- [54] Yang, W., Li, Q., Liu, W., Yu, Y., Ma, Y., He, S., Pan, J.: Projecting Your View Attentively: Monocular Road Scene Layout Estimation via Cross-View Transformation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 15536–15545 (2021)
- [55] Yang, Z., Sun, Y., Liu, S., Jia, J.: 3DSSD: Point-based 3D Single Stage Object Detector. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 11040–11048 (2020)
- [56] Yin, T., Zhou, X., Krahenbuhl, P.: Center-based 3D Object Detection and Tracking. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 11784–11793 (2021)
- [57] Zhang, S., Chi, C., Yao, Y., Lei, Z., Li, S.Z.: Bridging the Gap Between Anchor-based and Anchor-free Detection via Adaptive Training Sample Selection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9759–9768 (2020)
- [58] Zhang, Y., Lu, J., Zhou, J.: Objects are Different: Flexible Monocular 3D Object Detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 3289–3298 (2021)
- [59] Zhou, X., Wang, D., Krähenbühl, P.: Objects as Points. arXiv preprint arXiv:1904.07850 (2019)
- [60] Zhou, Y., Tuzel, O.: VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4490–4499 (2018)
- [61] Zhou, Y., He, Y., Zhu, H., Wang, C., Li, H., Jiang, Q.: Monocular 3D Object Detection: An Extrinsic Parameter Free Approach. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7556–7566 (2021)
- [62] Zhu, B., Jiang, Z., Zhou, X., Li, Z., Yu, G.: Class-balanced Grouping and Sampling for Point Cloud 3D Object Detection. arXiv preprint arXiv:1908.09492 (2019)
- [63] Zhu, X., Ma, Y., Wang, T., Xu, Y., Shi, J., Lin, D.: SSN: Shape Signature Networks for Multi-class Object Detection from Point Clouds. In: European Conference on Computer Vision. pp. 581–597. Springer (2020)
- [64] Zou, Z., Ye, X., Du, L., Cheng, X., Tan, X., Zhang, L., Feng, J., Xue, X., Ding, E.: The Devil Is in the Task: Exploiting Reciprocal Appearance-Localization Features for Monocular 3D Object Detection. In: Proceedings of the International Conference on Computer Vision. pp. 2713–2722 (2021)
discussion (0)