BEVDet: High-performance Multi-camera 3D Object Detection in Bird-Eye-View
21 papers indexed by Pith cite this work. Polarity classification is still in progress.
abstract
Autonomous driving perceives its surroundings for decision making, and it is one of the most complex scenarios in visual perception. The success of paradigm innovation in solving the 2D object detection task inspires us to seek an elegant, feasible, and scalable paradigm for fundamentally pushing the performance boundary in this area. To this end, we contribute the BEVDet paradigm in this paper. BEVDet performs 3D object detection in Bird-Eye-View (BEV), where most target values are defined and route planning can be handily performed. We merely reuse existing modules to build its framework, but substantially develop its performance by constructing an exclusive data augmentation strategy and upgrading the Non-Maximum Suppression strategy. In the experiments, BEVDet offers an excellent trade-off between accuracy and time-efficiency. As a fast version, BEVDet-Tiny scores 31.2% mAP and 39.2% NDS on the nuScenes val set. It is comparable with FCOS3D, but requires just 11% of its computational budget (215.3 GFLOPs) and runs 9.2 times faster at 15.6 FPS. Another high-precision version, dubbed BEVDet-Base, scores 39.3% mAP and 47.2% NDS, significantly exceeding all published results. At a comparable inference speed, it surpasses FCOS3D by a large margin of +9.8% mAP and +10.0% NDS. The source code is publicly available for further research at https://github.com/HuangJunJie2017/BEVDet.
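For readers unfamiliar with the layout the abstract describes, below is a minimal sketch of the four-stage BEVDet pipeline (image-view encoder, view transformer, BEV encoder, task head). All shapes, module sizes, and the pooling that stands in for the real calibration-based point splatting are illustrative assumptions, not the repository's actual code.

```python
# Toy sketch of the BEVDet four-stage layout:
# image-view encoder -> view transformer -> BEV encoder -> task head.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyBEVDet(nn.Module):
    def __init__(self, feat_ch=64, n_depth=16, bev_size=32, n_out=10):
        super().__init__()
        self.bev_size = bev_size
        # 1) Image-view encoder: BEVDet reuses standard 2D backbones.
        self.img_encoder = nn.Sequential(
            nn.Conv2d(3, feat_ch, 3, stride=4, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=4, padding=1), nn.ReLU())
        # 2) View transformer: per-pixel depth distribution + context.
        self.depth_head = nn.Conv2d(feat_ch, n_depth, 1)
        self.context_head = nn.Conv2d(feat_ch, feat_ch, 1)
        # 3) BEV encoder over the splatted grid.
        self.bev_encoder = nn.Sequential(
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU())
        # 4) Task head (here a toy per-cell classifier).
        self.head = nn.Conv2d(feat_ch, n_out, 1)

    def forward(self, imgs):                       # imgs: (B, N, 3, H, W)
        B, N = imgs.shape[:2]
        feats = self.img_encoder(imgs.flatten(0, 1))     # (B*N, F, h, w)
        depth = self.depth_head(feats).softmax(dim=1)    # (B*N, D, h, w)
        ctx = self.context_head(feats)                   # (B*N, F, h, w)
        # "Lift": spread each pixel's feature over depth bins, weighted
        # by its predicted depth distribution (an outer product).
        frustum = depth.unsqueeze(1) * ctx.unsqueeze(2)  # (B*N, F, D, h, w)
        # "Splat": the real model scatters frustum points into BEV cells
        # using camera intrinsics/extrinsics; we fake a BEV grid by reading
        # depth bins as range and image columns as lateral offset.
        bev = frustum.mean(dim=3)                        # (B*N, F, D, w)
        bev = bev.view(B, N, *bev.shape[1:]).mean(dim=1) # fuse N cameras
        bev = F.interpolate(bev, size=(self.bev_size, self.bev_size))
        return self.head(self.bev_encoder(bev))          # (B, n_out, S, S)

preds = ToyBEVDet()(torch.randn(2, 6, 3, 128, 128))      # 6-camera rig
print(preds.shape)  # torch.Size([2, 10, 32, 32])
```

The outer product of the depth distribution with the context features is the "lift" step; the actual implementation then scatters the resulting frustum points into BEV cells with the camera calibration before the BEV encoder runs.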
citing papers explorer
-
Still Camouflage, Moving Illusion: View-Induced Trajectory Manipulation in Autonomous Driving
Static adversarial camouflage exploits natural view-angle changes during relative motion to induce consistent feature drift in AV perception, leading to incorrect trajectory predictions and unnecessary braking.
-
EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras
EgoEV-HandPose uses stereo event cameras and a bird's-eye-view fusion module to achieve 30.54 mm MPJPE and 86.87% gesture accuracy on a new large-scale egocentric dataset, outperforming prior RGB and event methods especially in low light and occlusion.
-
PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations
PointForward uses sparse world-space 3D queries and scene graphs to deliver consistent single-pass reconstruction of dynamic driving scenes via point-aligned representations.
-
SoK: The Next Frontier in AV Security: Systematizing Perception Attacks and the Emerging Threat of Multi-Sensor Fusion
The paper organizes perception attacks on AVs into a new taxonomy, identifies gaps in fusion-aware defenses, and validates one cross-sensor vulnerability with a proof-of-concept simulation.
-
Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning
Dynamic token selection and fine-tuning of only 1.6 million parameters (instead of over 300 million) reduce computation by 48-55% and improve accuracy over the prior state of the art on the nuScenes dataset.
-
DinoRADE: Full Spectral Radar-Camera Fusion with Vision Foundation Model Features for Multi-class Object Detection in Adverse Weather
DinoRADE reports a radar-centered multi-class detection pipeline that fuses dense radar tensors with DINOv3 features via deformable attention and outperforms prior radar-camera methods by 12.1% on the K-Radar dataset across weather conditions.
-
TopoMaskV3: 3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding
TopoMaskV3 adds dense offset and height heads to produce standalone 3D road centerlines from masks and reports 28.5 OLS on a new geographically disjoint long-range benchmark.
-
Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy
HiPR improves 3D occupancy prediction by reparameterizing image-to-voxel projections using LiDAR-derived height priors to adapt sampling ranges to scene sparsity and height variations.
-
SimPB++: Simultaneously Detecting 2D and 3D Objects from Multiple Cameras
SimPB++ unifies multi-view 2D perspective and 3D BEV object detection in one model via an interactive hybrid decoder, reporting state-of-the-art results on nuScenes and long-range detection up to 150 m on Argoverse2.
-
OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models
OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.
-
CAM3DNet: Comprehensively mining the multi-scale features for 3D Object Detection with Multi-View Cameras
CAM3DNet outperforms prior camera-based 3D detectors on nuScenes, Waymo and Argoverse by using three new modules to better mine multi-scale spatiotemporal features from 2D queries and pyramid maps.
-
ESCAPE: Episodic Spatial Memory and Adaptive Execution Policy for Long-Horizon Mobile Manipulation
ESCAPE combines spatio-temporal fusion mapping for depth-free 3D memory with a memory-driven grounding module and adaptive execution policy to reach 65.09% success on ALFRED test-seen long-horizon mobile manipulation tasks.
-
DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale
DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to planning benchmarks without fine-tuning.
-
R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection
R4Det fuses 4D radar and camera inputs via panoramic depth fusion, deformable gated temporal fusion without ego pose, and instance-guided refinement to reach state-of-the-art 3D detection on TJ4DRadSet and VoD.
-
TFusionOcc: T-Primitive Based Object-Centric Multi-Sensor Fusion Framework for 3D Occupancy Prediction
TFusionOcc uses a family of Student's t-distribution T-primitives and a T-mixture model for multi-sensor 3D occupancy prediction, reporting state-of-the-art results on nuScenes.
-
InterFuserDVS: Event-Enhanced Sensor Fusion for Safe RL-Based Decision Making
Integrating DVS event data into InterFuser through token fusion yields a driving score of 77.2 and 100% route completion on CARLA benchmarks, indicating improved robustness in dynamic conditions.
-
SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection
SemLT3D introduces semantic-guided expert distillation with a language MoE module and CLIP projection to enrich features for long-tailed classes in camera-only 3D detection.
-
Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation
CTAB exchanges features between detection and segmentation via multi-scale deformable attention in BEV space, yielding segmentation gains on 7 nuScenes classes with no loss in detection performance (a minimal sketch of this deformable-attention pattern follows the list).
-
Not All Agents Matter: From Global Attention Dilution to Risk-Prioritized Game Planning
GameAD models autonomous driving as a risk-prioritized game among agents via Risk-Aware Topology Anchoring, Minimax Risk-Aware Sparse Attention and related components, yielding safer trajectories than prior end-to-end methods on nuScenes and Bench2Drive.
-
Multi-Modal Sensor Fusion using Hybrid Attention for Autonomous Driving
MMF-BEV fuses camera and radar branches with deformable self- and cross-attention, outperforming unimodal baselines on the VoD 4D radar dataset through a two-stage training process.
-
BEVPredFormer: Spatio-temporal Attention for BEV Instance Prediction in Autonomous Driving
BEVPredFormer uses attention-based temporal processing and 3D camera projection to match or exceed prior methods on nuScenes for BEV instance prediction.
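Several of the fusion papers above (DinoRADE, CTAB, MMF-BEV) rely on deformable attention over BEV or sensor feature maps. As noted in the CTAB entry, here is a minimal single-scale, single-head sketch of that general pattern; the sizes, offset scaling, and single-scale simplification are illustrative assumptions, not any one paper's implementation.

```python
# Hedged sketch: queries predict sampling offsets around a reference point
# on a BEV feature map, gather features there, and mix them with learned
# attention weights (the core of deformable cross-attention).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableBEVAttention(nn.Module):
    def __init__(self, dim=64, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offsets = nn.Linear(dim, n_points * 2)   # per-query (dx, dy)
        self.weights = nn.Linear(dim, n_points)       # per-point weights
        self.proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_xy, value_map):
        # queries:   (B, Q, C)   e.g. detection queries or BEV cells
        # ref_xy:    (B, Q, 2)   reference points in [-1, 1] grid coords
        # value_map: (B, C, H, W) the BEV feature map being attended to
        B, Q, _ = queries.shape
        off = self.offsets(queries).view(B, Q, self.n_points, 2)
        w = self.weights(queries).softmax(dim=-1)            # (B, Q, P)
        # Sample features at reference point + bounded learned offsets.
        grid = ref_xy.unsqueeze(2) + 0.1 * off.tanh()        # (B, Q, P, 2)
        sampled = F.grid_sample(value_map, grid, align_corners=False)
        # sampled: (B, C, Q, P) -> weighted sum over sampling points.
        out = (sampled * w.unsqueeze(1)).sum(dim=-1)         # (B, C, Q)
        return self.proj(out.transpose(1, 2))                # (B, Q, C)

attn = DeformableBEVAttention()
q = torch.randn(2, 100, 64)             # 100 queries, 64-dim features
ref = torch.rand(2, 100, 2) * 2 - 1     # reference points in [-1, 1]
bev = torch.randn(2, 64, 32, 32)        # toy BEV feature map
print(attn(q, ref, bev).shape)          # torch.Size([2, 100, 64])
```

Because each query samples only a handful of learned locations instead of attending densely to every BEV cell, this pattern keeps cross-task or cross-modal feature exchange cheap, which is why it recurs across the fusion papers listed above.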