BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View
21 papers indexed by Pith cite this work. Polarity classification is still in progress.
abstract
Autonomous driving perceives its surroundings for decision making, and it is one of the most complex scenarios in visual perception. The success of paradigm innovation in solving the 2D object detection task inspires us to seek an elegant, feasible, and scalable paradigm for fundamentally pushing the performance boundary in this area. To this end, we contribute the BEVDet paradigm in this paper. BEVDet performs 3D object detection in Bird-Eye-View (BEV), where most target values are defined and route planning can be handily performed. We merely reuse existing modules to build its framework, but substantially develop its performance by constructing an exclusive data augmentation strategy and upgrading the Non-Maximum Suppression strategy. In the experiments, BEVDet offers an excellent trade-off between accuracy and time-efficiency. As a fast version, BEVDet-Tiny scores 31.2% mAP and 39.2% NDS on the nuScenes val set. It is comparable with FCOS3D, but requires just 11% of its computational budget (215.3 GFLOPs) and runs 9.2 times faster at 15.6 FPS. Another high-precision version, dubbed BEVDet-Base, scores 39.3% mAP and 47.2% NDS, significantly exceeding all published results. At a comparable inference speed, it surpasses FCOS3D by a large margin of +9.8% mAP and +10.0% NDS. The source code is publicly available for further research at https://github.com/HuangJunJie2017/BEVDet.
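For readers unfamiliar with the layout the abstract describes, below is a minimal sketch of the four-stage BEVDet pipeline (image-view encoder, view transformer, BEV encoder, task head). All shapes, module sizes, and the pooling that stands in for the real calibration-based point splatting are illustrative assumptions, not the repository's actual code.

```python
# Toy sketch of the BEVDet four-stage layout:
# image-view encoder -> view transformer -> BEV encoder -> task head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBEVDet(nn.Module):
    def __init__(self, feat_ch=64, n_depth=16, bev_size=32, n_out=10):
        super().__init__()
        self.bev_size = bev_size
        # 1) Image-view encoder: BEVDet reuses standard 2D backbones.
        self.img_encoder = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, stride=4, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=4, padding=1), nn.ReLU())
        # 2) View transformer: per-pixel depth distribution + context.
        self.depth_head = nn.Conv2d(feat_ch, n_depth, 1)
        self.context_head = nn.Conv2d(feat_ch, feat_ch, 1)
        # 3) BEV encoder over the splatted grid.
        self.bev_encoder = nn.Sequential(
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU())
        # 4) Task head (here a toy per-cell classifier).
        self.head = nn.Conv2d(feat_ch, n_out, 1)

    def forward(self, imgs):                       # imgs: (B, N, 3, H, W)
        B, N = imgs.shape[:2]
        feats = self.img_encoder(imgs.flatten(0, 1))     # (B*N, F, h, w)
        depth = self.depth_head(feats).softmax(dim=1)    # (B*N, D, h, w)
        ctx = self.context_head(feats)                   # (B*N, F, h, w)
        # "Lift": spread each pixel's feature over depth bins, weighted
        # by its predicted depth distribution (an outer product).
        frustum = depth.unsqueeze(1) * ctx.unsqueeze(2)  # (B*N, F, D, h, w)
        # "Splat": the real model scatters frustum points into BEV cells
        # using camera intrinsics/extrinsics; we fake a BEV grid by reading
        # depth bins as range and image columns as lateral offset.
        bev = frustum.mean(dim=3)                        # (B*N, F, D, w)
        bev = bev.view(B, N, *bev.shape[1:]).mean(dim=1) # fuse N cameras
        bev = F.interpolate(bev, size=(self.bev_size, self.bev_size))
        return self.head(self.bev_encoder(bev))          # (B, n_out, S, S)

preds = ToyBEVDet()(torch.randn(2, 6, 3, 128, 128))      # 6-camera rig
print(preds.shape)  # torch.Size([2, 10, 32, 32])
```

The outer product of the depth distribution with the context features is the "lift" step; the actual implementation then scatters the resulting frustum points into BEV cells with the camera calibration before the BEV encoder runs.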
citing papers explorer
-
Still Camouflage, Moving Illusion: View-Induced Trajectory Manipulation in Autonomous Driving
Static adversarial camouflage exploits natural view-angle changes during relative motion to induce consistent feature drift in AV perception, leading to incorrect trajectory predictions and unnecessary braking.
-
EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras
EgoEV-HandPose uses stereo event cameras and a bird's-eye-view fusion module to achieve 30.54 mm MPJPE and 86.87% gesture accuracy on a new large-scale egocentric dataset, outperforming prior RGB and event methods especially in low light and occlusion.
-
PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations
PointForward uses sparse world-space 3D queries and scene graphs to deliver consistent single-pass reconstruction of dynamic driving scenes via point-aligned representations.
-
SoK: The Next Frontier in AV Security: Systematizing Perception Attacks and the Emerging Threat of Multi-Sensor Fusion
The paper organizes perception attacks on AVs into a new taxonomy, identifies gaps in fusion-aware defenses, and validates one cross-sensor vulnerability with a proof-of-concept simulation.
-
Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning
Dynamic token selection and fine-tuning of only 1.6 million parameters (instead of over 300 million) reduce computation by 48-55% and improve accuracy over the prior state of the art on the nuScenes dataset.
-
DinoRADE: Full Spectral Radar-Camera Fusion with Vision Foundation Model Features for Multi-class Object Detection in Adverse Weather
DinoRADE reports a radar-centered multi-class detection pipeline that fuses dense radar tensors with DINOv3 features via deformable attention and outperforms prior radar-camera methods by 12.1% on the K-Radar dataset across weather conditions.
-
TopoMaskV3: 3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding
TopoMaskV3 adds dense offset and height heads to produce standalone 3D road centerlines from masks and reports 28.5 OLS on a new geographically disjoint long-range benchmark.
-
Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy
HiPR improves 3D occupancy prediction by reparameterizing image-to-voxel projections using LiDAR-derived height priors to adapt sampling ranges to scene sparsity and height variations.
-
SimPB++: Simultaneously Detecting 2D and 3D Objects from Multiple Cameras
SimPB++ unifies multi-view 2D perspective and 3D BEV object detection in one model via an interactive hybrid decoder, reporting state-of-the-art results on nuScenes and long-range detection up to 150 m on Argoverse2.
-
OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.
-
CAM3DNet: Comprehensively mining the multi-scale features for 3D Object Detection with Multi-View Cameras
CAM3DNet outperforms prior camera-based 3D detectors on nuScenes, Waymo and Argoverse by using three new modules to better mine multi-scale spatiotemporal features from 2D queries and pyramid maps.
-
ESCAPE: Episodic Spatial Memory and Adaptive Execution Policy for Long-Horizon Mobile Manipulation
ESCAPE combines spatio-temporal fusion mapping for depth-free 3D memory with a memory-driven grounding module and adaptive execution policy to reach 65.09% success on ALFRED test-seen long-horizon mobile manipulation tasks.
-
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to planning benchmarks without fine-tuning.
-
R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection
R4Det fuses 4D radar and camera inputs via panoramic depth fusion, deformable gated temporal fusion without ego pose, and instance-guided refinement to reach state-of-the-art 3D detection on TJ4DRadSet and VoD.
-
TFusionOcc: T-Primitive Based Object-Centric Multi-Sensor Fusion Framework for 3D Occupancy Prediction
TFusionOcc uses a family of Student's t-distribution T-primitives and a T-mixture model for multi-sensor 3D occupancy prediction, reporting state-of-the-art results on nuScenes.
-
InterFuserDVS: Event-Enhanced Sensor Fusion for Safe RL-Based Decision Making
Integrating DVS event data into InterFuser through token fusion yields a driving score of 77.2 and 100% route completion on CARLA benchmarks, indicating improved robustness in dynamic conditions.
-
SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection
SemLT3D introduces semantic-guided expert distillation with a language MoE module and CLIP projection to enrich features for long-tailed classes in camera-only 3D detection.
-
Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation
CTAB exchanges features between detection and segmentation via multi-scale deformable attention in BEV space, yielding segmentation gains on 7 nuScenes classes with no loss in detection performance (a minimal sketch of this deformable-attention pattern follows the list).
-
Not All Agents Matter: From Global Attention Dilution to Risk-Prioritized Game Planning
GameAD models autonomous driving as a risk-prioritized game among agents via Risk-Aware Topology Anchoring, Minimax Risk-Aware Sparse Attention and related components, yielding safer trajectories than prior end-to-end methods on nuScenes and Bench2Drive.
-
Multi-Modal Sensor Fusion using Hybrid Attention for Autonomous Driving
MMF-BEV fuses camera and radar branches with deformable self- and cross-attention, outperforming unimodal baselines on the VoD 4D radar dataset through a two-stage training process.
-
BEVPredFormer: Spatio-temporal Attention for BEV Instance Prediction in Autonomous Driving
BEVPredFormer uses attention-based temporal processing and 3D camera projection to match or exceed prior methods on nuScenes for BEV instance prediction.
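Several of the fusion papers above (DinoRADE, CTAB, MMF-BEV) rely on deformable attention over BEV or sensor feature maps. As noted in the CTAB entry, here is a minimal single-scale, single-head sketch of that general pattern; the sizes, offset scaling, and single-scale simplification are illustrative assumptions, not any one paper's implementation.

```python
# Hedged sketch: queries predict sampling offsets around a reference point
# on a BEV feature map, gather features there, and mix them with learned
# attention weights (the core of deformable cross-attention).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableBEVAttention(nn.Module):
    def __init__(self, dim=64, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offsets = nn.Linear(dim, n_points * 2)   # per-query (dx, dy)
        self.weights = nn.Linear(dim, n_points)       # per-point weights
        self.proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_xy, value_map):
        # queries:   (B, Q, C)   e.g. detection queries or BEV cells
        # ref_xy:    (B, Q, 2)   reference points in [-1, 1] grid coords
        # value_map: (B, C, H, W) the BEV feature map being attended to
        B, Q, _ = queries.shape
        off = self.offsets(queries).view(B, Q, self.n_points, 2)
        w = self.weights(queries).softmax(dim=-1)            # (B, Q, P)
        # Sample features at reference point + bounded learned offsets.
        grid = ref_xy.unsqueeze(2) + 0.1 * off.tanh()        # (B, Q, P, 2)
        sampled = F.grid_sample(value_map, grid, align_corners=False)
        # sampled: (B, C, Q, P) -> weighted sum over sampling points.
        out = (sampled * w.unsqueeze(1)).sum(dim=-1)         # (B, C, Q)
        return self.proj(out.transpose(1, 2))                # (B, Q, C)

attn = DeformableBEVAttention()
q = torch.randn(2, 100, 64)             # 100 queries, 64-dim features
ref = torch.rand(2, 100, 2) * 2 - 1     # reference points in [-1, 1]
bev = torch.randn(2, 64, 32, 32)        # toy BEV feature map
print(attn(q, ref, bev).shape)          # torch.Size([2, 100, 64])
```

Because each query samples only a handful of learned locations instead of attending densely to every BEV cell, this pattern keeps cross-task or cross-modal feature exchange cheap, which is why it recurs across the fusion papers listed above.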