PillarDETR: YOLO-Backbone and RT-DETR Head for Real-Time 3D Object Detection

Harsh Dave; Kriti Faujdar; Shriya Gumber; Smit Kadvani

arxiv: 2606.01757 · v1 · pith:NCFLOEHSnew · submitted 2026-06-01 · 💻 cs.CV

Pith reviewed 2026-06-28 15:37 UTC · model grok-4.3

classification 💻 cs.CV

keywords: 3D object detection, LiDAR, pillar-based encoding, YOLOv8, RT-DETR, real-time detection, autonomous driving, transformer

The pith

PillarDETR replaces standard backbones with a YOLOv8 CSP network and anchor-based heads with an RT-DETR decoder to balance accuracy and speed in 3D LiDAR detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PillarDETR to address the challenge of real-time 3D object detection from LiDAR point clouds in autonomous systems. It encodes points into pillars and uses a 2D backbone for feature extraction before applying a transformer decoder. By adopting the CSP network from YOLOv8, the model extracts richer features from the resulting pseudoimages. Switching to the RT-DETR head enables direct prediction of 3D boxes while capturing global context without NMS. Tests on KITTI and nuScenes confirm better mAP and latency than the PointPillars baseline, with ablations validating the component changes.

Core claim

PillarDETR achieves its performance by integrating a Cross Stage Partial network from YOLOv8 as the backbone for pseudoimage feature extraction and an RT-DETR decoder as the head for direct 3D bounding box prediction, resulting in improved mean average precision and reduced inference latency on the KITTI and nuScenes benchmarks compared to PointPillars.

What carries the argument

The combination of the YOLOv8 CSP backbone for efficient feature extraction from pillar-encoded pseudoimages and the RT-DETR decoder for global context-aware direct box regression without NMS.

If this is right

The model supports real-time 3D perception suitable for autonomous driving and robotics.
Detection proceeds end-to-end without non-maximum suppression post-processing.
Ablation studies show each modification contributes to the accuracy-speed trade-off.
The approach is validated across two standard LiDAR benchmarks.
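The NMS-free claim above can be illustrated with a minimal sketch of how a query-based (set-prediction) decoder's outputs are typically turned into detections: a score threshold followed by a top-k cut, with no pairwise overlap suppression. This is not the authors' code. The `QueryPrediction` type, the 7-value box layout, and the threshold and `max_detections` defaults are all hypothetical choices made for illustration.

```python
from typing import List, NamedTuple, Tuple


class QueryPrediction(NamedTuple):
    """One decoder query's output (hypothetical layout, not from the paper)."""
    score: float                      # class confidence for this query
    label: int                        # predicted class index
    box: Tuple[float, ...]            # assumed (x, y, z, l, w, h, yaw)


def decode_queries(
    preds: List[QueryPrediction],
    score_threshold: float = 0.3,     # illustrative value
    max_detections: int = 100,        # illustrative value
) -> List[QueryPrediction]:
    """Keep confident queries, highest score first. No NMS step is applied."""
    kept = [p for p in preds if p.score >= score_threshold]
    kept.sort(key=lambda p: p.score, reverse=True)
    return kept[:max_detections]
```

The sketch relies on the usual assumption behind DETR-style heads: one-to-one matching during training discourages several queries from claiming the same object, so duplicates do not have to be removed after the fact.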

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar backbone and head swaps could be tested on other 3D detection baselines like VoxelNet.
The global context from the transformer might help in crowded scenes where local features fail.
This design opens possibilities for fully differentiable pipelines in multi-object tracking.

Load-bearing premise

The replacement of the backbone with the YOLOv8 CSP network and the head with the RT-DETR decoder produces the claimed gains in mAP and latency on the given benchmarks without additional processing.

What would settle it

Measure mAP and inference time on KITTI or nuScenes with the original PointPillars components in place of the proposed ones. If PillarDETR shows no improvement over that baseline, or does worse, the claimed gains do not hold.
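The latency half of that comparison could be run with a small timing harness like the one below. It is a sketch under stated assumptions: `model_fn` is a hypothetical callable wrapping either detector, `samples` stands in for preprocessed point clouds, and the warm-up count is arbitrary. On a GPU, the device would also need to be synchronized before each clock read, which this CPU-only sketch does not do.

```python
import statistics
import time
from typing import Any, Callable, Iterable


def measure_latency_ms(
    model_fn: Callable[[Any], Any],   # hypothetical wrapper around a detector
    samples: Iterable[Any],
    warmup: int = 3,                  # illustrative warm-up count
) -> float:
    """Return the median per-sample wall-clock latency in milliseconds."""
    samples = list(samples)
    if not samples:
        raise ValueError("at least one sample is required")
    # Warm-up runs are executed but not timed (caches, lazy initialization).
    for sample in samples[:warmup]:
        model_fn(sample)
    timings = []
    for sample in samples:
        start = time.perf_counter()
        model_fn(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)
```

Running the same harness on both detectors with identical samples and hardware gives comparable numbers. The median is used here because it is less sensitive to occasional slow runs than the mean.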

Figures

Figures reproduced from arXiv: 2606.01757 by Harsh Dave, Kriti Faujdar, Shriya Gumber, Smit Kadvani.

**Figure 1.** A. Architecture Overview; B. Pillar Feature Net (PFN). The first stage of our pipeline follows the pillarization process introduced in [5]. Given a 3D point cloud, we discretize the space in the x-y plane into a grid of evenly spaced pillars, ignoring the z (height) dimension. Each point p in a …

**Figure 1.** Overall architecture of PillarDETR. The raw LiDAR point cloud is converted into a BEV pseudo-image using the Pillar Feature Net (PFN). A …
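The pillarization step described in the figure captions (binning points on an x-y grid while ignoring z) can be sketched as follows. This is a minimal pure-Python illustration, not the paper's implementation. The function name `pillarize`, the `(x, y, z, intensity)` point layout, and the default ranges, pillar size, and per-pillar cap are assumptions chosen for the example.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float, float, float]  # assumed layout: (x, y, z, intensity)


def pillarize(
    points: List[Point],
    x_range: Tuple[float, float] = (0.0, 69.12),     # illustrative extent (m)
    y_range: Tuple[float, float] = (-39.68, 39.68),  # illustrative extent (m)
    pillar_size: float = 0.16,                       # illustrative cell size (m)
    max_points_per_pillar: int = 32,                 # illustrative cap
) -> Dict[Tuple[int, int], List[Point]]:
    """Group points into x-y grid cells (pillars); z plays no role in binning."""
    pillars: Dict[Tuple[int, int], List[Point]] = defaultdict(list)
    for p in points:
        x, y = p[0], p[1]
        # Drop points that fall outside the grid.
        if not (x_range[0] <= x < x_range[1] and y_range[0] <= y < y_range[1]):
            continue
        ix = int((x - x_range[0]) // pillar_size)
        iy = int((y - y_range[0]) // pillar_size)
        # Points beyond the cap are discarded in arrival order.
        if len(pillars[(ix, iy)]) < max_points_per_pillar:
            pillars[(ix, iy)].append(p)
    return dict(pillars)
```

Only non-empty pillars appear in the returned dictionary, which reflects the sparsity of LiDAR data. A learned per-pillar encoder would then reduce each point list to one feature vector and scatter it to its `(ix, iy)` cell to form the BEV pseudo-image.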

Original abstract

Real-time 3D object detection is a critical component for the safe operation of autonomous driving systems and robotics. While LiDAR point clouds provide accurate spatial information, processing them efficiently remains a significant challenge. Traditional methods rely on complex 3D convolutions or anchor-based paradigms that struggle to balance detection accuracy with inference speed. In this paper, we propose PillarDETR, a novel end-to-end 3D object detection architecture that combines the efficiency of pillar-based LiDAR encoding with the representational power of modern 2D vision models. Specifically, PillarDETR replaces standard convolutional backbones with a Cross Stage Partial (CSP) network derived from YOLOv8, enabling richer feature extraction from pseudoimages. Furthermore, we discard conventional anchor-based or center-based detection heads in favor of a Real-Time Detection Transformer (RT-DETR) decoder. This hybrid design allows the network to capture global context and directly predict 3D bounding boxes without relying on non-maximum suppression (NMS). Extensive experiments on the KITTI and nuScenes benchmarks demonstrate that PillarDETR achieves a compelling trade-off between mean Average Precision (mAP) and inference latency. Our ablation studies confirm that integrating the YOLOv8 backbone and RT-DETR head yields substantial improvements over the PointPillars baseline, establishing PillarDETR as a highly effective solution for real-time 3D perception.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PillarDETR is a straightforward swap of YOLOv8 CSP backbone and RT-DETR decoder onto PointPillars pillars with no new ideas and zero numbers shown to support the accuracy-speed claims.

PillarDETR takes the pillar pseudo-image from PointPillars, replaces the usual backbone with the CSP network from YOLOv8, and puts an RT-DETR decoder on top instead of an anchor head. That is the entire contribution. The abstract says this lets the model pull richer features and predict boxes directly without NMS, which in principle could cut latency while keeping mAP reasonable on KITTI and nuScenes.

The paper does a clear job laying out the motivation and the component choices. Using a modern 2D backbone on the pillar view and a transformer decoder to avoid post-processing are both reasonable engineering moves for real-time settings. If the full experiments actually show solid gains with ablations that isolate each change, the work could be useful as a practical recipe.

The main problem is that none of the claimed gains are quantified anywhere in the abstract. There are no mAP numbers, no latency figures, no dataset splits, no error bars, and no implementation details. The central assertion that the YOLOv8 plus RT-DETR combination produces substantial improvements over the PointPillars baseline therefore cannot be checked. The stress-test note correctly flags that the description itself has no internal contradictions, but that does not substitute for missing evidence. The assumption that these specific replacements will deliver the stated trade-off without extra tuning is left untested in what we have.

Citation choices are standard and unremarkable. No equations or derivations appear, so there is no circularity issue. The work is aimed at practitioners who want a fast 3D detector for driving and are ready to re-implement the hybrid. Researchers looking for new mechanisms or first-principles advances will not find them here.

I would not bring this to reading group and would not cite it. It does not deserve peer review in its current form because the empirical claims are unsupported; if the full manuscript supplies reproducible tables, code, and clear ablations that hold up, then a referee could evaluate whether the engineering combo is worth publishing.

Referee Report

1 major / 0 minor

Summary. The paper proposes PillarDETR, a hybrid 3D object detection architecture for LiDAR point clouds that encodes pillars into pseudo-images, replaces standard convolutional backbones with a CSP network from YOLOv8, and substitutes anchor- or center-based heads with an RT-DETR decoder. It claims this yields a superior mAP versus inference latency trade-off on the KITTI and nuScenes benchmarks relative to the PointPillars baseline, with end-to-end box prediction that eliminates NMS, and ablation studies attributing gains to the backbone and decoder choices.

Significance. If the empirical claims hold with concrete metrics, the design offers a plausible route to real-time 3D perception by repurposing mature 2D vision components for pillar features and adopting a transformer decoder for global context; the absence of NMS is a practical advantage for deployment.

major comments (1)

[Abstract] The central claim of 'substantial improvements' and a 'compelling trade-off' between mAP and latency is unsupported: the abstract (and the supplied review materials) contains no numerical results, error bars, dataset splits, latency measurements, or implementation details, so the empirical contribution cannot be verified.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed feedback. We address the single major comment below.

Referee: [Abstract] The central claim of 'substantial improvements' and a 'compelling trade-off' between mAP and latency is unsupported: the abstract (and the supplied review materials) contains no numerical results, error bars, dataset splits, latency measurements, or implementation details, so the empirical contribution cannot be verified.

Authors: We agree that the abstract does not contain the requested numerical results, error bars, or implementation details, which limits immediate verifiability of the claims. The full manuscript provides these in the experiments section (KITTI and nuScenes results with mAP, latency, dataset splits, and comparisons to PointPillars). We will revise the abstract to include key quantitative metrics supporting the mAP-latency trade-off. revision: yes

Circularity Check

0 steps flagged

Empirical architecture proposal with no derivation chain

The paper proposes a hybrid PillarDETR model by combining a YOLOv8 CSP backbone with an RT-DETR decoder for pillar-based LiDAR detection, then reports empirical mAP/latency results on KITTI and nuScenes plus ablations versus PointPillars. No equations, first-principles derivations, fitted-parameter predictions, or self-citation load-bearing steps appear. All central claims are presented as experimental outcomes of the design choice, not quantities defined in terms of themselves. This is the normal non-circular case for an applied CV architecture paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract was available; no specific free parameters, axioms, or invented entities are described in sufficient detail to populate the ledger.

pith-pipeline@v0.9.1-grok · 5793 in / 1167 out tokens · 24455 ms · 2026-06-28T15:37:04.948808+00:00 · methodology

Reference graph

Works this paper leans on

25 extracted references

[1] C. R. Qi, H. Su, K. Mo, and L. J. Guibas, “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 652–660.
[2] C. R. Qi, L. Yi, H. Su, and L. J. Guibas, “PointNet++: Deep Hierarchical Feature Learning on Point Sets in a Metric Space,” in Adv. Neural Inf. Process. Syst. (NeurIPS), 2017, pp. 5099–5108.
[3] Y. Zhou and O. Tuzel, “VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 4490–4499.
[4] Y. Yan, Y. Mao, and B. Li, “SECOND: Sparsely Embedded Convolutional Detection,” Sensors, vol. 18, no. 10, p. 3337, 2018.
[5] A. H. Lang, S. Vora, H. Caesar, L. Lublin, R. Meyers, and O. Beijbom, “PointPillars: Fast Encoders for Object Detection from Point Clouds,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 12697–12705.
[6] G. Jocher, A. Chaurasia, and J. Qiu, “Ultralytics YOLOv8,” 2023. [Online]. Available: https://github.com/ultralytics/ultralytics
[7] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-End Object Detection with Transformers,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 213–229.
[8] Y. Zhao, W. Lv, S. Xu, J. Wei, G. Wang, Q. Dang, Y. Liu, and J. Chen, “DETRs Beat YOLOs on Real-time Object Detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2024.
[9] S. Shi, X. Wang, and H. Li, “PointRCNN: 3D Object Proposal Generation and Detection from Point Cloud,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 770–779.
[10] M. Simon, S. Milz, K. Amende, and H. Gross, “Complex-YOLO: An Euler-Region-Proposal for 3D Object Detection on Point Clouds,” in Proc. Eur. Conf. Comput. Vis. (ECCV) Workshops, 2018.
[11] S. S. Menon, “LIDAR BASED 3D OBJECT DETECTION USING YOLOV8,” M.S. thesis, Purdue University, 2024.
[12] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable DETR: Deformable Transformers for End-to-End Object Detection,” in Int. Conf. Learn. Represent. (ICLR), 2021.
[13] T. Yin, X. Zhou, and P. Krahenbuhl, “Center-based 3D Object Detection and Tracking,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2021, pp. 11784–11793.
[14] X. Bai, Z. Hu, X. Zhu, Q. Huang, Y. Chen, H. Fu, and C. Tai, “TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 1090–1099.
[15] Z. Liu, H. Tang, A. Amini, X. Yang, H. Mao, D. Rus, and S. Han, “BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird’s-Eye View Representation,” in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2023, pp. 2774–2781.
[16] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? The KITTI vision benchmark suite,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2012, pp. 3354–3361.
[17] H. Caesar et al., “nuScenes: A Multimodal Dataset for Autonomous Driving,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 11621–11631.
[18] OpenPCDet Development Team, “OpenPCDet: An Open-source Toolbox for 3D Object Detection from Point Clouds,” 2020. [Online]. Available: https://github.com/open-mmlab/OpenPCDet
[19] S. Liu, L. Qi, H. Qin, J. Shi, and J. Jia, “Path Aggregation Network for Instance Segmentation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 8759–8768.
[20] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, “Focal Loss for Dense Object Detection,” in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), 2017, pp. 2980–2988.
[21] P. Sun et al., “Scalability in perception for autonomous driving: Waymo open dataset,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 2446–2454.
[22] S. Song, S. P. Lichtenberg, and J. Xiao, “SUN RGB-D: A RGB-D scene understanding benchmark suite,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2015, pp. 567–576.
[23] C. R. Qi et al., “Deep hough voting for 3d object detection in point clouds,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2019, pp. 9277–9286.
[24] Z. Liu et al., “Group-free 3d object detection via transformers,” in Proc. IEEE/CVF Int. Conf. Comput. Vis. (ICCV), 2021, pp. 2949–2958.
[25] S. Gumber et al., “Going beyond density functional theory accuracy: Leveraging experimental data to refine pre-trained machine learning interatomic potentials,” arXiv preprint arXiv:2506.10211, 2026.
