FALO: Fast and Accurate LiDAR 3D Object Detection on Resource-Constrained Devices

Fatih Porikli; Hong Cai; Hsin-Pai Cheng; Jihad Masri; Shizhong Han; Soyeb Nagori

arxiv: 2506.04499 · v2 · pith:NDZ2MQHUnew · submitted 2025-06-04 · 💻 cs.CV

Shizhong Han, Hsin-Pai Cheng, Hong Cai, Jihad Masri, Soyeb Nagori, Fatih Porikli

Pith reviewed 2026-05-19 10:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords LiDAR 3D object detection · resource-constrained devices · mobile GPU · mobile NPU · voxel sequence · ConvDotMix · nuScenes · Waymo


The pith

FALO sorts sparse LiDAR voxels into a 1D sequence and processes them with ConvDotMix blocks, reaching competitive detection accuracy while running 1.6 to 9.8 times faster than recent state-of-the-art detectors on mobile GPU and NPU hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing LiDAR 3D object detection methods rely on sparse convolutions or transformers that create irregular memory access and high compute demands, making them hard to run on edge devices. FALO voxelizes the point cloud, arranges the sparse voxels into a 1D sequence ordered by coordinate proximity, and feeds the sequence through ConvDotMix blocks built from large-kernel convolutions, Hadamard products, and linear layers. Implicit grouping inside these blocks balances tensor dimensions as the receptive field expands. The design supplies spatial and embedding mixing plus higher-order nonlinear interactions without explicit 3D neighborhood modeling. On nuScenes and Waymo benchmarks the method reaches competitive accuracy while running 1.6 to 9.8 times faster than recent state-of-the-art models on mobile GPU and NPU hardware.
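The voxelize-then-serialize step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the summary only says the ordering is based on coordinates and proximity, so the Morton (Z-order) key used here is an assumption, and the names `voxelize`, `morton_key`, and `serialize` are hypothetical.

```python
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]


def voxelize(points: List[Tuple[float, float, float]],
             voxel_size: float) -> Dict[Voxel, int]:
    """Map each point to an integer voxel index and count points per voxel."""
    counts: Dict[Voxel, int] = {}
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        counts[key] = counts.get(key, 0) + 1
    return counts


def morton_key(voxel: Voxel, bits: int = 10) -> int:
    """Interleave the bits of (x, y, z) so nearby voxels tend to get nearby keys.

    Assumption: indices are non-negative and smaller than 2**bits.
    """
    x, y, z = voxel
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key


def serialize(voxels: Dict[Voxel, int]) -> List[Voxel]:
    """Return only the non-empty voxels, as a 1D sequence ordered by Morton key."""
    return sorted(voxels, key=morton_key)


points = [(0.1, 0.1, 0.1), (0.2, 0.1, 0.1), (1.5, 0.2, 0.1),
          (0.2, 1.7, 0.3), (3.9, 3.9, 0.0)]
order = serialize(voxelize(points, voxel_size=1.0))
print(order)  # [(0, 0, 0), (1, 0, 0), (0, 1, 0), (3, 3, 0)]
```

Only occupied voxels enter the sequence, so its length tracks the number of non-empty voxels rather than the size of the full 3D grid.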

Core claim

FALO arranges sparse 3D voxels into a 1D sequence based on their coordinates and proximity after voxelization. The sequence is processed by ConvDotMix blocks consisting of large-kernel convolutions, Hadamard products, and linear layers. Implicit grouping is introduced to balance tensor dimensions and account for the growing receptive field. These operations provide sufficient mixing capability in both spatial and embedding dimensions and introduce higher-order nonlinear interaction among spatial features. The resulting model achieves competitive performance on nuScenes and Waymo while running 1.6 to 9.8 times faster than the latest state-of-the-art detectors on mobile GPU and mobile NPU.

What carries the argument

ConvDotMix blocks that combine large-kernel convolutions, Hadamard products, and linear layers with implicit grouping to mix spatial and embedding features on a 1D voxel sequence.
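To make the three ingredients concrete, here is a rough pure-Python sketch of one block operating on a sequence of N tokens with C channels. The exact wiring of ConvDotMix is not given in this summary, so the arrangement below (a convolved branch multiplied element-wise with a linear branch, followed by an output linear layer and a residual connection) is an assumption, and implicit grouping is left out entirely. All function and parameter names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # shape (N tokens, C channels)


def linear(x: Matrix, weight: Matrix) -> Matrix:
    """Per-token linear layer that mixes channels; weight has shape (C_in, C_out)."""
    return [[sum(tok[i] * weight[i][j] for i in range(len(tok)))
             for j in range(len(weight[0]))] for tok in x]


def conv1d_same(x: Matrix, kernel: List[float]) -> Matrix:
    """1D convolution along the sequence axis with zero padding.

    Simplification: one odd-length kernel is shared by every channel.
    """
    n, half = len(x), len(kernel) // 2
    out = []
    for t in range(n):
        row = [0.0] * len(x[0])
        for k, w in enumerate(kernel):
            s = t + k - half
            if 0 <= s < n:
                for c in range(len(row)):
                    row[c] += w * x[s][c]
        out.append(row)
    return out


def hadamard(a: Matrix, b: Matrix) -> Matrix:
    """Element-wise product of two equally shaped matrices."""
    return [[u * v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]


def conv_dot_mix(x: Matrix, w_a: Matrix, w_v: Matrix, w_out: Matrix,
                 kernel: List[float]) -> Matrix:
    a = conv1d_same(linear(x, w_a), kernel)  # spatial mixing along the sequence
    v = linear(x, w_v)                       # per-token projection
    mixed = hadamard(a, v)                   # product term: second-order in x
    y = linear(mixed, w_out)                 # embedding (channel) mixing
    # Residual connection; requires w_out to map back to C channels.
    return [[xi + yi for xi, yi in zip(rx, ry)] for rx, ry in zip(x, y)]
```

Because `mixed` multiplies two linear functions of the input, each output contains products of features from different sequence positions, which is one way to read the "higher-order nonlinear interaction" the paper attributes to the Hadamard product.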

If this is right

FALO can be deployed directly on compact embedded devices with mobile GPU or NPU hardware.
The approach avoids the irregular memory patterns of sparse convolutions and the high costs of transformers.
Detection accuracy remains competitive on established benchmarks such as nuScenes and Waymo.
Inference speed improves by a factor of 1.6 to 9.8 relative to recent state-of-the-art methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The 1D proximity ordering plus ConvDotMix mixing may offer a lightweight substitute for explicit 3D neighborhood operations in other sparse-data perception tasks.
Hardware-specific tuning of the implicit grouping step could further improve efficiency on particular NPU architectures.
If the higher-order interactions from the Hadamard products are the key accuracy driver, similar element-wise operations could be tested in related vision pipelines for speed gains.

Load-bearing premise

Arranging sparse voxels into a 1D sequence by coordinate proximity together with the ConvDotMix operations supplies enough spatial and embedding mixing to match the accuracy of sparse-convolution or transformer baselines without explicit 3D neighborhood modeling.

What would settle it

Run FALO on the nuScenes and Waymo validation sets, compare its detection metrics directly to the latest sparse-convolution and transformer detectors, and measure wall-clock inference latency on a mobile GPU or NPU to check whether the claimed accuracy and 1.6-9.8x speedup both hold.
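The latency half of that check could be organized along the following lines. This is a host-side sketch only: `run_once` is a hypothetical callable standing in for one forward pass, and a real mobile GPU or NPU measurement would go through the vendor runtime and must wait for the device to finish before stopping the clock, which this code does not model.

```python
import statistics
import time
from typing import Callable, List


def measure_latency_ms(run_once: Callable[[], object],
                       warmup: int = 5, iters: int = 20) -> List[float]:
    """Time repeated calls to run_once and return per-call latencies in ms."""
    for _ in range(warmup):  # untimed calls so caches and lazy init settle
        run_once()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1e3)
    return samples


def speedup(baseline_ms: List[float], candidate_ms: List[float]) -> float:
    """Ratio of median latencies; a value above 1 means the candidate is faster."""
    return statistics.median(baseline_ms) / statistics.median(candidate_ms)
```

Medians are used rather than means so that a single slow outlier run does not dominate the reported ratio. The speedup claim would hold for a given baseline and device if `speedup` lands inside the 1.6 to 9.8 range while the accuracy metrics stay comparable.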

Figures

Figures reproduced from arXiv: 2506.04499 by Fatih Porikli, Hong Cai, Hsin-Pai Cheng, Jihad Masri, Shizhong Han, Soyeb Nagori.

**Figure 2.** Overview of our proposed FALO approach. Given the input 3D point cloud and after voxelization, we perform 3D sparse feature …

**Figure 3.** Sparse voxel feature serialization example. The space …

**Figure 4.** Implicit voxel grouping. The 1D sequence of non-empty voxel tokens with shape …

Original abstract

Existing LiDAR 3D object detection methods predominantely rely on sparse convolutions and/or transformers, which can be challenging to run on resource-constrained edge devices, due to irregular memory access patterns and high computational costs. In this paper, we propose FALO, a hardware-friendly approach to LiDAR 3D detection, which offers both state-of-the-art (SOTA) detection accuracy and fast inference speed. More specifically, given the 3D point cloud and after voxelization, FALO first arranges sparse 3D voxels into a 1D sequence based on their coordinates and proximity. The sequence is then processed by our proposed ConvDotMix blocks, consisting of large-kernel convolutions, Hadamard products, and linear layers. ConvDotMix provides sufficient mixing capability in both spatial and embedding dimensions, and introduces higher-order nonlinear interaction among spatial features. Furthermore, when going through the ConvDotMix layers, we introduce implicit grouping, which balances the tensor dimensions for more efficient inference and takes into account the growing receptive field. All these operations are friendly to run on resource-constrained platforms and proposed FALO can readily deploy on compact, embedded devices. Our extensive evaluation on LiDAR 3D detection benchmarks such as nuScenes and Waymo shows that FALO achieves competitive performance. Meanwhile, FALO is 1.6~9.8x faster than the latest SOTA on mobile Graphics Processing Unit (GPU) and mobile Neural Processing Unit (NPU).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

FALO's 1D voxel sequencing plus ConvDotMix blocks aim for fast edge LiDAR detection, but the 3D neighborhood mixing through proximity sorting remains the least secure part of the argument.


The punchline is that FALO turns 3D voxel data into a 1D sequence using coordinate proximity and runs it through ConvDotMix blocks to get fast inference on mobile GPUs and NPUs while aiming for competitive accuracy on nuScenes and Waymo. The new part is the combination of this sequencing with blocks that mix via large-kernel 1D convolutions, Hadamard products for higher-order interactions, and implicit grouping to handle receptive fields and tensor sizes efficiently. This setup avoids the irregular access patterns of sparse convolutions and the compute of transformers, which is a practical step for resource-constrained devices. The paper does well in laying out why these operations are hardware-friendly and how they provide spatial and embedding mixing. The claims of 1.6 to 9.8x speedups suggest real gains if the benchmarks hold.

That said, the soft spots center on whether the 1D proximity sort and ConvDotMix truly deliver equivalent 3D neighborhood mixing. A linear sort based on coordinates will break some 3D adjacencies and create artificial ones in 1D, and while large kernels and Hadamard products add mixing, they may not fully compensate without explicit 3D structure. If accuracy lags on complex scenes, this could be why. The reported results are competitive but without error bars or extensive ablations visible in the summary, it's tough to gauge robustness. The training details also seem light.

This paper suits engineers and researchers working on edge deployment for autonomous systems or robotics who need faster alternatives to heavy models. Someone looking for implementation ideas on efficient perception would get value here. It deserves serious referee time because the problem is important and the approach is distinct, even if it needs more validation to confirm the mixing works as intended. I recommend sending it for review with a focus on additional experiments to test the 3D locality assumption.

Referee Report

2 major / 2 minor

Summary. The paper proposes FALO for LiDAR 3D object detection on resource-constrained devices. After voxelization of the input point cloud, sparse voxels are sorted into a 1D sequence according to coordinate proximity. This sequence is processed by ConvDotMix blocks that combine large-kernel 1D convolutions, Hadamard products, and linear layers, with implicit grouping applied across layers to balance dimensions and grow receptive fields. The method is claimed to deliver competitive detection accuracy on nuScenes and Waymo while providing 1.6–9.8× speedups over recent SOTA models on mobile GPU and NPU hardware.

Significance. If the accuracy claims are substantiated, the work would be significant for enabling real-time 3D detection on edge platforms. Converting irregular 3D sparse data into a hardware-friendly 1D sequence with custom mixing operations addresses a practical deployment gap that current sparse-convolution and transformer approaches have not fully closed.

major comments (2)

[Experiments] Experimental section: The manuscript asserts competitive benchmark results on nuScenes and Waymo yet reports no error bars, no ablation studies isolating the contributions of proximity sorting, large-kernel convolution, Hadamard product, or implicit grouping, and no detailed training protocol or hyper-parameter settings. These omissions prevent verification that the observed accuracy is robust and attributable to the proposed components rather than implementation details.
[Method] Method section (ConvDotMix and implicit grouping): The central accuracy claim rests on the premise that coordinate-proximity 1D sorting plus large-kernel 1D operations and implicit grouping supply sufficient 3D spatial mixing. A linear sort necessarily severs some true 3D adjacencies while creating spurious 1D neighbors; the paper provides no quantitative analysis, receptive-field visualization, or cross-object contamination study demonstrating that the resulting mixing matches the locality modeling of sparse 3D convolutions or transformers.
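One way to put a number on the severed-adjacency concern is to ask what fraction of truly 3D-adjacent voxel pairs end up within a given distance of each other in the 1D sequence. The sketch below is an illustration of that kind of analysis, not something reported in the paper; `adjacency_recall` is a hypothetical name, and the choice of 26-neighborhood adjacency and a fixed window (loosely, half a convolution kernel's width) are assumptions.

```python
from itertools import product
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]


def adjacency_recall(sequence: List[Voxel], window: int) -> float:
    """Fraction of 3D-adjacent voxel pairs (26-neighborhood) whose positions
    in the 1D sequence differ by at most `window`."""
    position: Dict[Voxel, int] = {v: i for i, v in enumerate(sequence)}
    offsets = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]
    total = kept = 0
    for (x, y, z), i in position.items():
        for dx, dy, dz in offsets:
            j = position.get((x + dx, y + dy, z + dz))
            if j is not None and j > i:  # count each unordered pair once
                total += 1
                kept += abs(i - j) <= window
    return kept / total if total else 1.0


# A 2x2 patch in row-major order: all six pairs are 3D neighbors.
patch = [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]
print(adjacency_recall(patch, window=1))  # 0.5
print(adjacency_recall(patch, window=3))  # 1.0
```

Sweeping `window` over the kernel sizes actually used, and comparing different orderings on real scenes, would show how much 3D locality a single layer can see before deeper layers and grouping have to make up the difference.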

minor comments (2)

[Abstract] Abstract: Typo 'predominantely' should read 'predominantly'.
[Abstract] Abstract: The speedup range '1.6~9.8x' should specify the exact baseline models and hardware configurations for each end of the range.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects for improving experimental rigor and methodological clarity. We address each major comment below and indicate planned revisions to the manuscript.


Referee: [Experiments] Experimental section: The manuscript asserts competitive benchmark results on nuScenes and Waymo yet reports no error bars, no ablation studies isolating the contributions of proximity sorting, large-kernel convolution, Hadamard product, or implicit grouping, and no detailed training protocol or hyper-parameter settings. These omissions prevent verification that the observed accuracy is robust and attributable to the proposed components rather than implementation details.

Authors: We agree that these additions would improve reproducibility and help attribute performance gains to specific components. In the revised manuscript, we will report error bars from multiple independent training runs with different random seeds. We will also add ablation studies that isolate the individual contributions of proximity-based voxel sorting, large-kernel 1D convolutions, Hadamard products, and implicit grouping. Finally, we will include a dedicated subsection detailing the full training protocol, including all hyper-parameters, optimizer settings, learning rate schedules, data augmentations, and hardware used for training. revision: yes
Referee: [Method] Method section (ConvDotMix and implicit grouping): The central accuracy claim rests on the premise that coordinate-proximity 1D sorting plus large-kernel 1D operations and implicit grouping supply sufficient 3D spatial mixing. A linear sort necessarily severs some true 3D adjacencies while creating spurious 1D neighbors; the paper provides no quantitative analysis, receptive-field visualization, or cross-object contamination study demonstrating that the resulting mixing matches the locality modeling of sparse 3D convolutions or transformers.

Authors: We acknowledge that a 1D proximity sort can disrupt some true 3D adjacencies and introduce spurious neighbors. However, the large-kernel 1D convolutions in ConvDotMix enable broad information propagation along the sequence, while implicit grouping progressively expands the effective receptive field across layers to recover 3D context. In the revision, we will add receptive-field visualizations demonstrating feature mixing from 3D-proximate voxels and a quantitative analysis of effective neighborhood sizes compared to sparse convolutions. We will also include a brief discussion of cross-object contamination, supported by the observation that coordinate-proximity sorting largely preserves object boundaries; a more extensive contamination study can be considered if space permits. revision: partial

Circularity Check

0 steps flagged

No circularity: architecture and benchmarks are independent of fitted inputs


The paper proposes a new voxel-to-1D-sequence arrangement followed by ConvDotMix blocks (large-kernel conv, Hadamard product, linear layers, implicit grouping) and reports accuracy/speed on public nuScenes and Waymo benchmarks. No equations, self-citations, or fitted parameters are shown that define the claimed mixing capability or performance by construction from the same inputs. The derivation chain remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no explicit free parameters, axioms, or invented entities beyond standard neural-network building blocks and public benchmark datasets.


Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 1 internal anchor

[1]

Transfusion: Ro- bust lidar-camera fusion for 3d object detection with trans- formers

Xuyang Bai, Zeyu Hu, Xinge Zhu, Qingqiu Huang, Yilun Chen, Hongbo Fu, and Chiew-Lan Tai. Transfusion: Ro- bust lidar-camera fusion for 3d object detection with trans- formers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1090–1099,

work page
[2]

nuscenes: A multi- modal dataset for autonomous driving

Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Gi- ancarlo Baldan, and Oscar Beijbom. nuscenes: A multi- modal dataset for autonomous driving. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020. 6

work page 2020
[3]

Sasa: Semantics-augmented set abstraction for point-based 3d ob- ject detection

Chen Chen, Zhe Chen, Jing Zhang, and Dacheng Tao. Sasa: Semantics-augmented set abstraction for point-based 3d ob- ject detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 221–229, 2022. 3

work page 2022
[4]

Fast point r-cnn

Yilun Chen, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Fast point r-cnn. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9775–9784, 2019. 3

work page 2019
[5]

Largekernel3d: Scaling up kernels in 3d sparse cnns

Yukang Chen, Jianhui Liu, Xiangyu Zhang, Xiaojuan Qi, and Jiaya Jia. Largekernel3d: Scaling up kernels in 3d sparse cnns. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 13488–13498,

work page
[6]

V oxelnext: Fully sparse voxelnet for 3d object de- tection and tracking

Yukang Chen, Jianhui Liu, Xiangyu Zhang, Xiaojuan Qi, and Jiaya Jia. V oxelnext: Fully sparse voxelnet for 3d object de- tection and tracking. In Proceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 21674–21683, 2023. 3, 6, 7

work page 2023
[7]

Focal- former3d: focusing on hard instance for 3d object detection

Yilun Chen, Zhiding Yu, Yukang Chen, Shiyi Lan, An- ima Anandkumar, Jiaya Jia, and Jose M Alvarez. Focal- former3d: focusing on hard instance for 3d object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8394–8405, 2023. 7

work page 2023
[8]

Back-tracing representative points for voting- based 3d object detection in point clouds

Bowen Cheng, Lu Sheng, Shaoshuai Shi, Ming Yang, and Dong Xu. Back-tracing representative points for voting- based 3d object detection in point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8963–8972, 2021. 3

work page 2021
[9]

V oxel r-cnn: Towards high performance voxel-based 3d object detection

Jiajun Deng, Shaoshuai Shi, Peiwei Li, Wengang Zhou, Yanyong Zhang, and Houqiang Li. V oxel r-cnn: Towards high performance voxel-based 3d object detection. In Pro- ceedings of the AAAI conference on artificial intelligence , pages 1201–1209, 2021. 2, 3

work page 2021
[10]

Vista: Boosting 3d object detection via dual cross-view spatial at- tention

Shengheng Deng, Zhihao Liang, Lin Sun, and Kui Jia. Vista: Boosting 3d object detection via dual cross-view spatial at- tention. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8448–8457,

work page
[11]

Mssvt: Mixed-scale sparse voxel transformer for 3d object detection on point clouds

Shaocong Dong, Lihe Ding, Haiyang Wang, Tingfa Xu, Xinli Xu, Jie Wang, Ziyang Bian, Ying Wang, and Jianan Li. Mssvt: Mixed-scale sparse voxel transformer for 3d object detection on point clouds. Advances in Neural Information Processing Systems, 35:11615–11628, 2022. 2, 3

work page 2022
[12]

Embracing single stride 3d object detector with sparse trans- former

Lue Fan, Ziqi Pang, Tianyuan Zhang, Yu-Xiong Wang, Hang Zhao, Feng Wang, Naiyan Wang, and Zhaoxiang Zhang. Embracing single stride 3d object detector with sparse trans- former. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8458–8468,

work page
[13]

Fsd v2: Improving fully sparse 3d object detection with vir- tual voxels

Lue Fan, Feng Wang, Naiyan Wang, and Zhaoxiang Zhang. Fsd v2: Improving fully sparse 3d object detection with vir- tual voxels. arXiv preprint arXiv:2308.03755, 2023. 6

work page arXiv 2023
[14]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023. 3

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

M3detr: Multi- representation, multi-scale, mutual-relation 3d object detec- tion with transformers

Tianrui Guan, Jun Wang, Shiyi Lan, Rohan Chandra, Zux- uan Wu, Larry Davis, and Dinesh Manocha. M3detr: Multi- representation, multi-scale, mutual-relation 3d object detec- tion with transformers. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 772–782, 2022. 2, 3

work page 2022
[16]

Structure aware single-stage 3d object detec- tion from point cloud

Chenhang He, Hui Zeng, Jianqiang Huang, Xian-Sheng Hua, and Lei Zhang. Structure aware single-stage 3d object detec- tion from point cloud. In Proceedings of the IEEE/CVF con- ference on computer vision and pattern recognition , pages 11873–11882, 2020. 3

work page 2020
[17]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceed- ings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 7

work page 2016
[18]

Conv2former: A simple transformer-style convnet for visual recognition

Qibin Hou, Cheng-Ze Lu, Ming-Ming Cheng, and Jiashi Feng. Conv2former: A simple transformer-style convnet for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024. 2, 5

work page 2024
[19]

Pointpillars: Fast encoders for object detection from point clouds

Alex H Lang, Sourabh V ora, Holger Caesar, Lubing Zhou, Jiong Yang, and Oscar Beijbom. Pointpillars: Fast encoders for object detection from point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12697–12705, 2019. 2, 3, 6, 7

work page 2019
[20]

Padre: A unifying polynomial attention drop-in replace- ment for efficient vision transformer

Pierre-David Letourneau, Manish Kumar Singh, Hsin- Pai Cheng, Shizhong Han, Yunxiao Shi, Dalton Jones, Matthew Harper Langston, Hong Cai, and Fatih Porikli. Padre: A unifying polynomial attention drop-in replace- ment for efficient vision transformer. arXiv preprint arXiv:2407.11306, 2024. 2, 5

work page arXiv 2024
[21]

Pillarnext: Re- thinking network designs for 3d object detection in lidar point clouds

Jinyu Li, Chenxu Luo, and Xiaodong Yang. Pillarnext: Re- thinking network designs for 3d object detection in lidar point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17567– 17576, 2023. 2, 3, 6, 7

work page 2023
[22]

Tanet: Robust 3d object detection from point clouds with triple attention

Zhe Liu, Xin Zhao, Tengteng Huang, Ruolan Hu, Yu Zhou, and Xiang Bai. Tanet: Robust 3d object detection from point clouds with triple attention. In Proceedings of the AAAI con- ference on artificial intelligence, pages 11677–11684, 2020. 2, 3

work page 2020
[23]

A convnet for the 2020s

Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feicht- enhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on com- puter vision and pattern recognition , pages 11976–11986,

work page
[24]

Flatformer: Flattened window attention for efficient point cloud transformer

Zhijian Liu, Xinyu Yang, Haotian Tang, Shang Yang, and Song Han. Flatformer: Flattened window attention for efficient point cloud transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1200–1211, 2023. 2, 7

work page 2023
[25]

Lion:Lineargrouprnnfor3dobjectdetectioninpointclouds

Zhe Liu, Jinghua Hou, Xinyu Wang, Xiaoqing Ye, Jingdong Wang, Hengshuang Zhao, and Xiang Bai. Lion: Linear group rnn for 3d object detection in point clouds. arXiv preprint arXiv:2407.18232, 2024. 2, 3, 4

work page arXiv 2024
[26]

Link: Linear kernel for lidar-based 3d perception

Tao Lu, Xiang Ding, Haisong Liu, Gangshan Wu, and Limin Wang. Link: Linear kernel for lidar-based 3d perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) , pages 1105–1115,

work page
[27]

Pillarnest: Embracing backbone scaling and pretraining for pillar-based 3d object detection

Weixin Mao, Tiancai Wang, Diankun Zhang, Junjie Yan, and Osamu Yoshie. Pillarnest: Embracing backbone scaling and pretraining for pillar-based 3d object detection. IEEE Trans- actions on Intelligent Vehicles, 2024. 3, 6, 7

work page 2024
[28]

3d object detection with pointformer

Xuran Pan, Zhuofan Xia, Shiji Song, Li Erran Li, and Gao Huang. 3d object detection with pointformer. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7463–7472, 2021. 3

work page 2021
[29]

Pointnet: Deep learning on point sets for 3d classification and segmentation

Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 652–660,

work page
[30]

Pointnet++: Deep hierarchical feature learning on point sets in a metric space

Charles Ruizhongtai Qi, Li Yi, Hao Su, and Leonidas J Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. Advances in neural information processing systems, 30, 2017. 3

work page 2017
[31]

Frustum pointnets for 3d object detection from rgb- d data

Charles R Qi, Wei Liu, Chenxia Wu, Hao Su, and Leonidas J Guibas. Frustum pointnets for 3d object detection from rgb- d data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 918–927, 2018. 3

work page 2018
[32]

Deep hough voting for 3d object detection in point clouds

Charles R Qi, Or Litany, Kaiming He, and Leonidas J Guibas. Deep hough voting for 3d object detection in point clouds. In proceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 9277–9286, 2019. 3

work page 2019
[33]

Pillarnet: Real- time and high-performance pillar-based 3d object detection

Guangsheng Shi, Ruifeng Li, and Chao Ma. Pillarnet: Real- time and high-performance pillar-based 3d object detection. In European Conference on Computer Vision, pages 35–52. Springer, 2022. 1, 2, 3, 6, 7

work page 2022
[34]

Pv-rcnn: Point- voxel feature set abstraction for 3d object detection

Shaoshuai Shi, Chaoxu Guo, Li Jiang, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv-rcnn: Point- voxel feature set abstraction for 3d object detection. In Pro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10529–10538, 2020. 3

work page 2020
[35]

From points to parts: 3d object detection from point cloud with part-aware and part-aggregation net- work

Shaoshuai Shi, Zhe Wang, Jianping Shi, Xiaogang Wang, and Hongsheng Li. From points to parts: 3d object detection from point cloud with part-aware and part-aggregation net- work. IEEE transactions on pattern analysis and machine intelligence, 43(8):2647–2664, 2020. 2, 3

work page 2020
[36]

Pv- rcnn++: Point-voxel feature set abstraction with local vector representation for 3d object detection

Shaoshuai Shi, Li Jiang, Jiajun Deng, Zhe Wang, Chaoxu Guo, Jianping Shi, Xiaogang Wang, and Hongsheng Li. Pv- rcnn++: Point-voxel feature set abstraction with local vector representation for 3d object detection. International Journal of Computer Vision, 131(2):531–551, 2023. 3

work page 2023
[37]

Scalability in perception for autonomous driving: Waymo open dataset

Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, et al. Scalability in perception for autonomous driving: Waymo open dataset. In Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2446–2454, 2020. 6

work page 2020
[38]

Swformer: Sparse window transformer for 3d object detection in point clouds

Pei Sun, Mingxing Tan, Weiyue Wang, Chenxi Liu, Fei Xia, Zhaoqi Leng, and Dragomir Anguelov. Swformer: Sparse window transformer for 3d object detection in point clouds. In European Conference on Computer Vision , pages 426–

work page
[39]

Ca- group3d: Class-aware grouping for 3d object detection on point clouds

Haiyang Wang, Lihe Ding, Shaocong Dong, Shaoshuai Shi, Aoxue Li, Jianan Li, Zhenguo Li, and Liwei Wang. Ca- group3d: Class-aware grouping for 3d object detection on point clouds. Advances in Neural Information Processing Systems, 35:29975–29988, 2022. 2, 3

work page 2022
[40]

Dsvt: Dy- namic sparse voxel transformer with rotated sets

Haiyang Wang, Chen Shi, Shaoshuai Shi, Meng Lei, Sen Wang, Di He, Bernt Schiele, and Liwei Wang. Dsvt: Dy- namic sparse voxel transformer with rotated sets. In Pro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13520–13529, 2023. 1, 2, 3, 4, 6, 7, 8

work page 2023
[41]

Uni3detr: Unified 3d detection transformer

Zhenyu Wang, Ya-Li Li, Xi Chen, Hengshuang Zhao, and Shengjin Wang. Uni3detr: Unified 3d detection transformer. Advances in Neural Information Processing Systems , 36,

work page
[42]

Second: Sparsely embed- ded convolutional detection

Yan Yan, Yuxing Mao, and Bo Li. Second: Sparsely embed- ded convolutional detection. Sensors, 18(10):3337, 2018. 2, 3, 7

work page 2018
[43]

Pvt-ssd: Single-stage 3d object detector with point-voxel transformer

Honghui Yang, Wenxiao Wang, Minghao Chen, Binbin Lin, Tong He, Hua Chen, Xiaofei He, and Wanli Ouyang. Pvt-ssd: Single-stage 3d object detector with point-voxel transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13476–13487, 2023. 3

work page 2023
[44]

Dbq-ssd: Dynamic ball query for efficient 3d object detection

Jinrong Yang, Lin Song, Songtao Liu, Weixin Mao, Zem- ing Li, Xiaoping Li, Hongbin Sun, Jian Sun, and Nanning Zheng. Dbq-ssd: Dynamic ball query for efficient 3d object detection. In International Conference on Learning Repre- sentations, 2023. 3

work page 2023
[45]

Std: Sparse-to-dense 3d object detector for point cloud

Zetong Yang, Yanan Sun, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Std: Sparse-to-dense 3d object detector for point cloud. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1951–1960, 2019

work page 1951
[46]

3dssd: Point-based 3d single stage object detector

Zetong Yang, Yanan Sun, Shu Liu, and Jiaya Jia. 3dssd: Point-based 3d single stage object detector. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11040–11048, 2020. 3

work page 2020
[47]

Center- based 3d object detection and tracking

Tianwei Yin, Xingyi Zhou, and Philipp Krahenbuhl. Center- based 3d object detection and tracking. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11784–11793, 2021. 2, 3, 6, 7

work page 2021
[48]

Metaformer is actually what you need for vision

Weihao Yu, Mi Luo, Pan Zhou, Chenyang Si, Yichen Zhou, Xinchao Wang, Jiashi Feng, and Shuicheng Yan. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10819–10829, 2022. 5

work page 2022
[49]

Safdnet: A simple and effective net- work for fully sparse 3d object detection

Gang Zhang, Junnan Chen, Guohuan Gao, Jianmin Li, Si Liu, and Xiaolin Hu. Safdnet: A simple and effective net- work for fully sparse 3d object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14477–14486, 2024. 2, 3, 6, 7

work page 2024
[50]

Voxel mamba: Group-free state space models for point cloud based 3d object detection

Guowen Zhang, Lue Fan, Chenhang He, Zhen Lei, Zhaox- iang Zhang, and Lei Zhang. V oxel mamba: Group-free state space models for point cloud based 3d object detection. arXiv preprint arXiv:2406.10700, 2024. 2, 3, 4

work page arXiv 2024
[51]

Hednet: A hierarchical encoder-decoder net- work for 3d object detection in point clouds

Gang Zhang, Chen Junnan, Guohuan Gao, Jianmin Li, and Xiaolin Hu. Hednet: A hierarchical encoder-decoder net- work for 3d object detection in point clouds. Advances in Neural Information Processing Systems, 36, 2024. 3, 7

[52]

Not all points are equal: Learning highly efficient point-based detectors for 3d lidar point clouds

Yifan Zhang, Qingyong Hu, Guoquan Xu, Yanxin Ma, Jianwei Wan, and Yulan Guo. Not all points are equal: Learning highly efficient point-based detectors for 3d lidar point clouds. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18953–18962,

[53]

Octr: Octree-based transformer for 3d object detection

Chao Zhou, Yanan Zhang, Jiaxin Chen, and Di Huang. Octr: Octree-based transformer for 3d object detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5166–5175, 2023. 7

[54]

Fastpillars: A deployment-friendly pillar-based 3d detector

Sifan Zhou, Zhi Tian, Xiangxiang Chu, Xinyu Zhang, Bo Zhang, Xiaobo Lu, Chengjian Feng, Zequn Jie, Patrick Yin Chiang, and Lin Ma. Fastpillars: A deployment-friendly pillar-based 3d detector. arXiv preprint arXiv:2302.02367, 9, 2023. 2, 3, 6, 7

[55]

Voxelnet: End-to-end learning for point cloud based 3d object detection

Yin Zhou and Oncel Tuzel. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4490–4499, 2018. 2

