{"total":29,"items":[{"citing_arxiv_id":"2605.18059","ref_index":26,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Bench2Drive-Robust: Benchmarking Closed-Loop Autonomous Driving under Deployment Perturbations","primary_cat":"cs.RO","submitted_at":"2026-05-18T08:45:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Bench2Drive-Robust is a new closed-loop benchmark that evaluates end-to-end autonomous driving models under deployment perturbations from camera failures, ego-state errors, and compute delays, showing substantial performance degradation beyond image-level tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16911","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"VGGT-Occ: Geometry-Grounded and Density-Aware Gated Fusion for 3D Occupancy Prediction","primary_cat":"cs.CV","submitted_at":"2026-05-16T09:51:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"VGGT-Occ embeds geometric tokens via PA-DA and uses sequential coarse-to-fine gated fusion to reach 33.00% IoU and 21.08% mIoU on SurroundOcc-nuScenes while using only ~41M parameters in the occupancy head.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12743","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Still Camouflage, Moving Illusion: View-Induced Trajectory Manipulation in Autonomous Driving","primary_cat":"cs.CR","submitted_at":"2026-05-12T20:47:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"Static adversarial camouflage exploits natural view-angle changes during relative motion to induce consistent feature drift in AV 
perception, leading to incorrect trajectory predictions and unnecessary braking.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12297","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"EgoEV-HandPose: Egocentric 3D Hand Pose Estimation and Gesture Recognition with Stereo Event Cameras","primary_cat":"cs.CV","submitted_at":"2026-05-12T15:51:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EgoEV-HandPose uses stereo event cameras and a bird's-eye-view fusion module to achieve 30.54 mm MPJPE and 86.87% gesture accuracy on a new large-scale egocentric dataset, outperforming prior RGB and event methods especially in low light and occlusion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11594","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations","primary_cat":"cs.CV","submitted_at":"2026-05-12T06:20:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PointForward uses sparse world-space 3D queries and scene graphs to deliver consistent single-pass reconstruction of dynamic driving scenes via point-aligned representations.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"However, these methods are mainly validated on small indoor scenes and have not yet demonstrated strong performance in large-scale driving environments. Multi-View 3D Object Detection.Multi-view 3D object detection provides an important source of inspiration for sparse query-based scene representation. Early methods are mainly based on dense bird's-eye-view (BEV) representations. 
BEVDet [10] adopts lift-splat projection for view transformation, while BEVFormer [17] introduces deformable attention for BEV feature generation and spatial-temporal fusion. BEVDepth [16] further improves detection accuracy with explicit depth supervision, and BEVStereo [15] enhances geometric reasoning by incorporating temporal stereo cues. Another line of work performs detection with sparse object queries."},{"citing_arxiv_id":"2605.05072","ref_index":15,"ref_count":2,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Height-Guided Projection Reparameterization for Camera-LiDAR Occupancy","primary_cat":"cs.CV","submitted_at":"2026-05-06T16:09:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HiPR improves 3D occupancy prediction by reparameterizing image-to-voxel projections using LiDAR-derived height priors to adapt sampling ranges to scene sparsity and height variations.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"Table 1: Performance on Occ3D. The camera visible mask is used during training. Inputs include Camera (C), Radar (R) and LIDAR (L). R denotes ResNet, and Swin denotes Swin Transformer. MethodInput BackbonemIoU ■others■barrier■bicycle■bus■car■cons. veh.■motorcycle■pedestrian■traffic cone■trailer■truck■drive. 
surf.■other flat■sidewalk■terrain■manmade■vegetationBEVDetOcc [15]C Swin-B42.012.2 49.6 25.1 52.0 54.5 27.9 28.0 28.9 27.2 36.4 42.2 82.3 43.3 54.6 57.9 48.6 43.6FB-Occ [23]C R5039.813.8 44.5 27.1 46.2 49.7 24.6 27.4 28.5 28.2 33.7 36.5 81.7 44.1 52.6 56.9 42.6 38.1FlashOcc [54]C R5032.06.2 39.6 11.3 36.3 44.0 16.3 14.7 16.9 15.8 28.6 30.9 78.2 37.5 47.4 51.4 36.8 31.4COTR [31]C R5044.513.3 52.1 32.0 46.0 55.6 32."},{"citing_arxiv_id":"2605.04355","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"InterFuserDVS: Event-Enhanced Sensor Fusion for Safe RL-Based Decision Making","primary_cat":"cs.CV","submitted_at":"2026-05-05T23:24:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Integrating DVS event data into InterFuser through token fusion yields a driving score of 77.2 and 100% route completion on CARLA benchmarks, indicating improved robustness in dynamic conditions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"CoRR abs/2006.07722. URL: https://arxiv.org/abs/2006.07722,arXiv:2006.07722. [25] Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang, W., Lu, L., Jia, X., Liu, Q., Dai, J., Qiao, Y., Li, H., 2023. Planning-oriented autonomous driving. arXiv preprint arXiv:2212.10156 URL:https://arxiv.org/abs/2212.10156, arXiv:2212.10156. [26] Huang, J., Huang, G., Zhu, Z., Du, D., 2021. Bevdet: High-performance multi-camera 3d object detection in bird-eye-view. CoRR abs/2112.11790. URL:https://arxiv.org/abs/2112.11790,arXiv:2112.11790. [27] Janai, J., Güney, F., Behl, A., Geiger, A., 2017. Computer vision for autonomous vehicles: Problems, datasets and state-of-the-art. 
CoRR abs/1704.05519."},{"citing_arxiv_id":"2605.01924","ref_index":33,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SimPB++: Simultaneously Detecting 2D and 3D Objects from Multiple Cameras","primary_cat":"cs.CV","submitted_at":"2026-05-03T15:10:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SimPB++ unifies multi-view 2D perspective and 3D BEV object detection in one model via an interactive hybrid decoder, reporting state-of-the-art results on nuScenes and long-range detection up to 150 m on Argoverse2.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.20621","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SoK: The Next Frontier in AV Security: Systematizing Perception Attacks and the Emerging Threat of Multi-Sensor Fusion","primary_cat":"cs.CR","submitted_at":"2026-04-22T14:37:57+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"The paper organizes perception attacks on AVs into a new taxonomy, identifies gaps in fusion-aware defenses, and validates one cross-sensor vulnerability with a proof-of-concept simulation.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ing features from multiple camera images and projecting them onto a 2D ground plane. This technique improves spatial understanding of the vehicle's surroundings [33]. These systems are cost-effective and provide high-resolution semantic understanding, enabling critical ADAS features like forward collision warning [34], traffic sign recognition [35], [36], pedestrian recognition [37], [38], lane detection [39], [40], [41], blind spot detection [42], [43] and automated parking assistance [44]. 
Cameras are inexpensive compared to LiDAR and radar and thus are the preferred sensing modality for many AVs [16], [45], [46]. However, they are vulnerable to adversarial perturbations [12] and perform poorly in low-light conditions. LiDAR Perception."},{"citing_arxiv_id":"2604.18476","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"SemLT3D: Semantic-Guided Expert Distillation for Camera-only Long-Tailed 3D Object Detection","primary_cat":"cs.CV","submitted_at":"2026-04-20T16:28:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"SemLT3D introduces semantic-guided expert distillation with a language MoE module and CLIP projection to enrich features for long-tailed classes in camera-only 3D detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.17915","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"OneDrive: Unified Multi-Paradigm Driving with Vision-Language-Action Models","primary_cat":"cs.CV","submitted_at":"2026-04-20T07:50:00+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"OneDrive unifies heterogeneous decoding in a single VLM transformer decoder for end-to-end driving, achieving 0.28 L2 error and 0.18 collision rate on nuScenes plus 86.8 PDMS on NAVSIM.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Existing end-to-end multi-task autonomous driving models typically organize heterogeneous decoders either in a cascaded manner or in parallel. 
Unified Architectures for End-to-end Autonomous Driving. Autonomous driving requires the integration of multiple interdependent tasks, including perception, prediction, and planning, such as 3D object detection [21, 30, 36, 52], lane detection [5,32,34], BEV segmentation [36, 56], occupancy prediction [58,59], and trajectory planning. Early systems decomposed these tasks into separate modules with predefined interfaces, which limited information flow and hindered global optimization. Recent unified frameworks [6, 19, 23, 25, 48], often referred to as conventional end-to-end driving models [53], advocate for"},{"citing_arxiv_id":"2604.17024","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"CAM3DNet: Comprehensively mining the multi-scale features for 3D Object Detection with Multi-View Cameras","primary_cat":"cs.CV","submitted_at":"2026-04-18T15:14:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CAM3DNet outperforms prior camera-based 3D detectors on nuScenes, Waymo and Argoverse by using three new modules to better mine multi-scale spatiotemporal features from 2D queries and pyramid maps.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13633","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"ESCAPE: Episodic Spatial Memory and Adaptive Execution Policy for Long-Horizon Mobile Manipulation","primary_cat":"cs.CV","submitted_at":"2026-04-15T09:01:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ESCAPE combines spatio-temporal fusion mapping for depth-free 3D memory with a memory-driven grounding module and adaptive execution policy to reach 65.09% success on ALFRED test-seen long-horizon mobile manipulation 
tasks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13586","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Efficient Multi-View 3D Object Detection by Dynamic Token Selection and Fine-Tuning","primary_cat":"cs.CV","submitted_at":"2026-04-15T07:44:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Dynamic token selection and training only 1.6 million parameters instead of over 300 million reduces computation by 48-55% and improves accuracy over prior state-of-the-art on the NuScenes dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12918","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Radar-Camera BEV Multi-Task Learning with Cross-Task Attention Bridge for Joint 3D Detection and Segmentation","primary_cat":"cs.CV","submitted_at":"2026-04-14T16:00:53+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.08074","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DinoRADE: Full Spectral Radar-Camera Fusion with Vision Foundation Model Features for Multi-class Object Detection in Adverse Weather","primary_cat":"cs.CV","submitted_at":"2026-04-09T10:46:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DinoRADE reports a radar-centered multi-class detection pipeline that fuses dense radar tensors with DINOv3 features via deformable attention and outperforms prior radar-camera methods by 12.1% on the K-Radar dataset across weather 
conditions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.05449","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Not All Agents Matter: From Global Attention Dilution to Risk-Prioritized Game Planning","primary_cat":"cs.CV","submitted_at":"2026-04-07T05:33:54+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GameAD models autonomous driving as a risk-prioritized game among agents via Risk-Aware Topology Anchoring, Minimax Risk-Aware Sparse Attention and related components, yielding safer trajectories than prior end-to-end methods on nuScenes and Bench2Drive.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.04797","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"Multi-Modal Sensor Fusion using Hybrid Attention for Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-04-06T16:03:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"MMF-BEV fuses camera and radar branches with deformable self- and cross-attention, outperforming unimodal baselines on the VoD 4D radar dataset through a two-stage training process.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.02930","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"BEVPredFormer: Spatio-temporal Attention for BEV Instance Prediction in Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2026-04-03T09:58:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"BEVPredFormer uses attention-based temporal processing and 3D camera projection to match or exceed prior 
methods on nuScenes for BEV instance prediction.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"attention-based or Backwards Projection. Depth-based methods rely on estimating depth maps for each camera to perform the corresponding feature projection. This methodology was introduced in Lift-Splat-Shoot (LSS) [2], which performs a categorical depth distribution for each feature position and projects them into a frustrum-shaped region in the final BEV space. BEVDet [10] builds on this concept, aiming to build a simpler modular BEV segmentation architecture. The authors also focus on addressing the overfitting problem present by introducing data augmentation techniques in the BEV space rather than only on the image plane. BEVDepth [11] focuses on reducing depth estimation errors to enhance the camera BEV representation, uti-"},{"citing_arxiv_id":"2604.00813","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"DVGT-2: Vision-Geometry-Action Model for Autonomous Driving at Scale","primary_cat":"cs.CV","submitted_at":"2026-04-01T12:21:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DVGT-2 is a streaming vision-geometry-action model that jointly reconstructs dense 3D geometry and plans trajectories online, achieving better reconstruction than prior batch methods while transferring directly to planning benchmarks without fine-tuning.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"probabilistic planning by sampling from a fixed trajectory codebook. DiffusionDrive [41] proposed an anchor-based truncated diffusion model to capture the multi-modal distribution of trajectories, which is adopted by other methods [28,32,76]. 
Despite these advances, existing methods [3,86,87] fundamentally depend on manually defined perception tasks, such as detection [18,36,39,43], tracking [45,48,69,91], or occupancy [19-21,94,95] to understand driving scenes, which is extremely inefficient with world information. In contrast, our model explicitly reconstructs fine-grained dense geometry. By comprehensively modeling the geometric details, our approach facilitates robust trajectory planning. VLA for Autonomous Driving."},{"citing_arxiv_id":"2603.11566","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":true,"paper_title":"R4Det: 4D Radar-Camera Fusion for High-Performance 3D Object Detection","primary_cat":"cs.CV","submitted_at":"2026-03-12T05:41:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"R4Det fuses 4D radar and camera inputs via panoramic depth fusion, deformable gated temporal fusion without ego pose, and instance-guided refinement to reach state-of-the-art 3D detection on TJ4DRadSet and VoD.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.01558","ref_index":10,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TopoMaskV3: 3D Mask Head with Dense Offset and Height Predictions for Road Topology Understanding","primary_cat":"cs.CV","submitted_at":"2026-03-02T07:33:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TopoMaskV3 adds dense offset and height heads to produce standalone 3D road centerlines from masks and reports 28.5 OLS on a new geographically disjoint long-range benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2602.06400","ref_index":38,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"TFusionOcc: T-Primitive Based 
Object-Centric Multi-Sensor Fusion Framework for 3D Occupancy Prediction","primary_cat":"cs.CV","submitted_at":"2026-02-06T05:43:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TFusionOcc uses a family of Student's t-distribution T-primitives and a T-mixture model for multi-sensor 3D occupancy prediction, reporting state-of-the-art results on nuScenes.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.23421","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DriveLaW:Unifying Planning and Video Generation in a Latent Driving World","primary_cat":"cs.CV","submitted_at":"2025-12-29T12:32:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DriveLaW unifies video world modeling and trajectory planning by injecting video-generator latents into a diffusion planner, achieving SOTA video prediction and a new record on the NAVSIM planning benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.08237","ref_index":3,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Fast-BEV++: Fast by Algorithm, Deployable by Design","primary_cat":"cs.CV","submitted_at":"2025-12-09T04:37:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Fast-BEV++ achieves at least 3x speedup over Fast-BEV, a new SOTA of 0.488 NDS on nuScenes 3D detection, and over 134 FPS inference by redesigning the core transformation pipeline and adding a learnable depth 
module.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.12796","ref_index":18,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"DriveVLA-W0: World Models Amplify Data Scaling Law in Autonomous Driving","primary_cat":"cs.CV","submitted_at":"2025-10-14T17:59:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"DriveVLA-W0 adds world modeling to predict future images in VLA models, overcoming sparse action supervision and amplifying data scaling laws on NAVSIM benchmarks and a large in-house dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.04503","ref_index":16,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"U-ViLAR: Uncertainty-Aware Visual Localization for Autonomous Driving via Differentiable Association and Registration","primary_cat":"cs.CV","submitted_at":"2025-07-06T18:40:42+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"U-ViLAR achieves robust visual localization for autonomous driving by mapping features to BEV and applying perceptual uncertainty-guided association plus localization uncertainty-guided registration.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2506.07002","ref_index":15,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"BePo: Dual Representation for 3D Occupancy Prediction","primary_cat":"cs.CV","submitted_at":"2025-06-08T05:19:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BePo proposes a dual BEV and sparse-points representation with cross-attention fusion for more accurate and efficient 3D occupancy prediction on autonomous 
driving benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.17732","ref_index":13,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"RQR3D: Reparametrizing the regression targets for BEV-based 3D object detection","primary_cat":"cs.CV","submitted_at":"2025-05-23T10:52:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RQR3D reparametrizes oriented bounding box regression in BEV 3D detection as regressing a horizontal box plus corner offsets and achieves SOTA camera-radar performance on nuScenes with 67.5 NDS and 59.7 mAP.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}