arxiv: 2506.04499 · v2 · pith:NDZ2MQHUnew · submitted 2025-06-04 · 💻 cs.CV

FALO: Fast and Accurate LiDAR 3D Object Detection on Resource-Constrained Devices

Pith reviewed 2026-05-19 10:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords LiDAR 3D object detection · resource-constrained devices · mobile GPU · mobile NPU · voxel sequence · ConvDotMix · nuScenes · Waymo

The pith

FALO sorts sparse LiDAR voxels into a 1D sequence and processes them with ConvDotMix blocks, reaching competitive detection accuracy while running 1.6 to 9.8 times faster than recent state-of-the-art detectors on mobile GPU and NPU hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing LiDAR 3D object detection methods rely on sparse convolutions or transformers that create irregular memory access and high compute demands, making them hard to run on edge devices. FALO voxelizes the point cloud, arranges the sparse voxels into a 1D sequence ordered by coordinate proximity, and feeds the sequence through ConvDotMix blocks built from large-kernel convolutions, Hadamard products, and linear layers. Implicit grouping inside these blocks balances tensor dimensions as the receptive field expands. The design supplies spatial and embedding mixing plus higher-order nonlinear interactions without explicit 3D neighborhood modeling. On nuScenes and Waymo benchmarks the method reaches competitive accuracy while running 1.6 to 9.8 times faster than recent state-of-the-art models on mobile GPU and NPU hardware.
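
The voxelize-then-serialize step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the summary does not state the exact ordering rule, so the sketch assumes a Morton (Z-order) curve as one common way to sort voxels by coordinate proximity. The names `voxelize`, `morton_code`, and `serialize`, the `voxel_size` value, and the use of a per-voxel mean as the feature are all hypothetical.

```python
from collections import defaultdict

def voxelize(points, voxel_size):
    """Bucket (x, y, z) points into integer voxel coordinates.

    Returns a dict mapping each non-empty voxel to the mean of its points,
    used here as a stand-in for a learned voxel feature (an assumption).
    """
    buckets = defaultdict(list)
    for x, y, z in points:
        key = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        buckets[key].append((x, y, z))
    return {
        key: tuple(sum(axis) / len(pts) for axis in zip(*pts))
        for key, pts in buckets.items()
    }

def morton_code(ix, iy, iz, bits=10):
    """Interleave the bits of three non-negative ints (Z-order curve)."""
    code = 0
    for b in range(bits):
        code |= ((ix >> b) & 1) << (3 * b)
        code |= ((iy >> b) & 1) << (3 * b + 1)
        code |= ((iz >> b) & 1) << (3 * b + 2)
    return code

def serialize(voxels):
    """Sort non-empty voxels into a 1D sequence by Morton code."""
    # Shift coordinates so they are non-negative before interleaving bits.
    mins = [min(k[d] for k in voxels) for d in range(3)]

    def sort_key(k):
        return morton_code(k[0] - mins[0], k[1] - mins[1], k[2] - mins[2])

    order = sorted(voxels, key=sort_key)
    return order, [voxels[k] for k in order]

# Four points that fall into three non-empty voxels.
points = [(0.1, 0.2, 0.0), (0.15, 0.25, 0.05), (3.9, 0.1, 0.2), (0.2, 3.8, 0.1)]
coords, feats = serialize(voxelize(points, voxel_size=1.0))
print(coords)
```

Only non-empty voxels enter the sequence, so its length tracks the number of occupied voxels rather than the size of the full 3D grid. Any other locality-preserving order could be swapped in by changing `sort_key`.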

Core claim

FALO arranges sparse 3D voxels into a 1D sequence based on their coordinates and proximity after voxelization. The sequence is processed by ConvDotMix blocks consisting of large-kernel convolutions, Hadamard products, and linear layers. Implicit grouping is introduced to balance tensor dimensions and account for the growing receptive field. These operations provide sufficient mixing capability in both spatial and embedding dimensions and introduce higher-order nonlinear interaction among spatial features. The resulting model achieves competitive performance on nuScenes and Waymo while running 1.6 to 9.8 times faster than the latest state-of-the-art detectors on mobile GPU and mobile NPU.

What carries the argument

ConvDotMix blocks that combine large-kernel convolutions, Hadamard products, and linear layers with implicit grouping to mix spatial and embedding features on a 1D voxel sequence.
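
As a rough illustration of how these three ingredients can fit together, the sketch below builds a toy block over a sequence of `N` tokens with `C` channels. The exact wiring of ConvDotMix is not given in this summary, so the arrangement here is an assumption: two linear projections, a depthwise large-kernel 1D convolution on one branch, an element-wise product of the two branches, and an output projection with a residual connection. Implicit grouping is left out. All function names and weights are hypothetical.

```python
import random

def linear(x, w):
    """Channel-mixing linear layer. x: [N][C_in], w: [C_in][C_out]."""
    c_in, c_out = len(w), len(w[0])
    return [[sum(tok[i] * w[i][j] for i in range(c_in)) for j in range(c_out)]
            for tok in x]

def depthwise_conv1d(x, kernels):
    """Depthwise 1D convolution along the sequence axis.

    x: [N][C]; kernels: [C][K] with odd K. Zero padding keeps length N.
    """
    n, c, k = len(x), len(kernels), len(kernels[0])
    half = k // 2
    out = [[0.0] * c for _ in range(n)]
    for t in range(n):
        for ch in range(c):
            acc = 0.0
            for j in range(k):
                src = t + j - half
                if 0 <= src < n:
                    acc += kernels[ch][j] * x[src][ch]
            out[t][ch] = acc
    return out

def hadamard(a, b):
    """Element-wise (Hadamard) product of two [N][C] tensors."""
    return [[u * v for u, v in zip(ra, rb)] for ra, rb in zip(a, b)]

def conv_dot_mix(x, w_a, w_v, kernels, w_out):
    """Toy block: spatial branch times embedding branch, then projection."""
    a = depthwise_conv1d(linear(x, w_a), kernels)  # mixing along the sequence
    v = linear(x, w_v)                             # mixing across channels
    mixed = linear(hadamard(a, v), w_out)          # product is quadratic in x
    return [[xi + mi for xi, mi in zip(rx, rm)] for rx, rm in zip(x, mixed)]

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
N, C, K = 16, 4, 7  # hypothetical sizes chosen only for the demo
x = rand_matrix(N, C)
y = conv_dot_mix(x, rand_matrix(C, C), rand_matrix(C, C),
                 rand_matrix(C, K), rand_matrix(C, C))
print(len(y), len(y[0]))  # sequence length and channel count are preserved
```

Every operation here is a dense convolution, matrix multiply, or element-wise product over a regular `[N][C]` tensor, which is the property the summary credits for avoiding the irregular memory access of sparse convolutions. The product of two linear functions of the input is what supplies the second-order interaction.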

If this is right

  • FALO can be deployed directly on compact embedded devices with mobile GPU or NPU hardware.
  • The approach avoids the irregular memory patterns of sparse convolutions and the high costs of transformers.
  • Detection accuracy remains competitive on established benchmarks such as nuScenes and Waymo.
  • Inference speed improves by a factor of 1.6 to 9.8 relative to recent state-of-the-art methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The 1D proximity ordering plus ConvDotMix mixing may offer a lightweight substitute for explicit 3D neighborhood operations in other sparse-data perception tasks.
  • Hardware-specific tuning of the implicit grouping step could further improve efficiency on particular NPU architectures.
  • If the higher-order interactions from the Hadamard products are the key accuracy driver, similar element-wise operations could be tested in related vision pipelines for speed gains.

Load-bearing premise

Arranging sparse voxels into a 1D sequence by coordinate proximity together with the ConvDotMix operations supplies enough spatial and embedding mixing to match the accuracy of sparse-convolution or transformer baselines without explicit 3D neighborhood modeling.

What would settle it

Run FALO on the nuScenes and Waymo validation sets, compare its detection metrics directly to the latest sparse-convolution and transformer detectors, and measure wall-clock inference latency on a mobile GPU or NPU to check whether the claimed accuracy and 1.6-9.8x speedup both hold.
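
The latency half of that check could be scripted along these lines. This is a generic timing harness, not the paper's benchmarking setup: `measure_latency_ms`, `speedup`, and the two stand-in workloads are hypothetical. On a real mobile GPU or NPU, the callables would wrap the vendor runtime's inference call and include any device synchronization that runtime requires.

```python
import statistics
import time

def measure_latency_ms(run_once, warmup=5, repeats=20):
    """Return the median wall-clock latency of run_once() in milliseconds."""
    for _ in range(warmup):  # discard warm-up runs (caches, lazy init)
        run_once()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)

def speedup(baseline_ms, candidate_ms):
    """Speedup factor of the candidate relative to the baseline."""
    return baseline_ms / candidate_ms

# Stand-in workloads; replace with real model inference calls.
def baseline_model():
    sum(i * i for i in range(100_000))

def candidate_model():
    sum(i * i for i in range(25_000))

base_ms = measure_latency_ms(baseline_model)
cand_ms = measure_latency_ms(candidate_model)
print(f"baseline {base_ms:.2f} ms, candidate {cand_ms:.2f} ms, "
      f"speedup {speedup(base_ms, cand_ms):.1f}x")
```

The median is used instead of the mean so that occasional scheduler stalls do not dominate the result. Reporting the baseline model and device next to each speedup figure would also answer the question of which comparison produces each end of the 1.6 to 9.8 range.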

Figures

Figures reproduced from arXiv: 2506.04499 by Fatih Porikli, Hong Cai, Hsin-Pai Cheng, Jihad Masri, Shizhong Han, Soyeb Nagori.

Figure 1: Comparison of 3D object detection accuracy (nuScenes …
Figure 2: Overview of our proposed FALO approach. Given the input 3D point cloud and after voxelization, we perform 3D sparse feature …
Figure 3: Sparse voxel feature serialization example. The space …
Figure 4: Implicit voxel grouping. The 1D sequence of non-empty voxel tokens with shape …
Original abstract

Existing LiDAR 3D object detection methods predominantely rely on sparse convolutions and/or transformers, which can be challenging to run on resource-constrained edge devices, due to irregular memory access patterns and high computational costs. In this paper, we propose FALO, a hardware-friendly approach to LiDAR 3D detection, which offers both state-of-the-art (SOTA) detection accuracy and fast inference speed. More specifically, given the 3D point cloud and after voxelization, FALO first arranges sparse 3D voxels into a 1D sequence based on their coordinates and proximity. The sequence is then processed by our proposed ConvDotMix blocks, consisting of large-kernel convolutions, Hadamard products, and linear layers. ConvDotMix provides sufficient mixing capability in both spatial and embedding dimensions, and introduces higher-order nonlinear interaction among spatial features. Furthermore, when going through the ConvDotMix layers, we introduce implicit grouping, which balances the tensor dimensions for more efficient inference and takes into account the growing receptive field. All these operations are friendly to run on resource-constrained platforms and proposed FALO can readily deploy on compact, embedded devices. Our extensive evaluation on LiDAR 3D detection benchmarks such as nuScenes and Waymo shows that FALO achieves competitive performance. Meanwhile, FALO is 1.6~9.8x faster than the latest SOTA on mobile Graphics Processing Unit (GPU) and mobile Neural Processing Unit (NPU).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes FALO for LiDAR 3D object detection on resource-constrained devices. After voxelization of the input point cloud, sparse voxels are sorted into a 1D sequence according to coordinate proximity. This sequence is processed by ConvDotMix blocks that combine large-kernel 1D convolutions, Hadamard products, and linear layers, with implicit grouping applied across layers to balance dimensions and grow receptive fields. The method is claimed to deliver competitive detection accuracy on nuScenes and Waymo while providing 1.6–9.8× speedups over recent SOTA models on mobile GPU and NPU hardware.

Significance. If the accuracy claims are substantiated, the work would be significant for enabling real-time 3D detection on edge platforms. Converting irregular 3D sparse data into a hardware-friendly 1D sequence with custom mixing operations addresses a practical deployment gap that current sparse-convolution and transformer approaches have not fully closed.

major comments (2)
  1. [Experiments] Experimental section: The manuscript asserts competitive benchmark results on nuScenes and Waymo yet reports no error bars, no ablation studies isolating the contributions of proximity sorting, large-kernel convolution, Hadamard product, or implicit grouping, and no detailed training protocol or hyper-parameter settings. These omissions prevent verification that the observed accuracy is robust and attributable to the proposed components rather than implementation details.
  2. [Method] Method section (ConvDotMix and implicit grouping): The central accuracy claim rests on the premise that coordinate-proximity 1D sorting plus large-kernel 1D operations and implicit grouping supply sufficient 3D spatial mixing. A linear sort necessarily severs some true 3D adjacencies while creating spurious 1D neighbors; the paper provides no quantitative analysis, receptive-field visualization, or cross-object contamination study demonstrating that the resulting mixing matches the locality modeling of sparse 3D convolutions or transformers.
minor comments (2)
  1. [Abstract] Abstract: Typo 'predominantely' should read 'predominantly'.
  2. [Abstract] Abstract: The speedup range '1.6~9.8x' should specify the exact baseline models and hardware configurations for each end of the range.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects for improving experimental rigor and methodological clarity. We address each major comment below and indicate planned revisions to the manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experimental section: The manuscript asserts competitive benchmark results on nuScenes and Waymo yet reports no error bars, no ablation studies isolating the contributions of proximity sorting, large-kernel convolution, Hadamard product, or implicit grouping, and no detailed training protocol or hyper-parameter settings. These omissions prevent verification that the observed accuracy is robust and attributable to the proposed components rather than implementation details.

    Authors: We agree that these additions would improve reproducibility and help attribute performance gains to specific components. In the revised manuscript, we will report error bars from multiple independent training runs with different random seeds. We will also add ablation studies that isolate the individual contributions of proximity-based voxel sorting, large-kernel 1D convolutions, Hadamard products, and implicit grouping. Finally, we will include a dedicated subsection detailing the full training protocol, including all hyper-parameters, optimizer settings, learning rate schedules, data augmentations, and hardware used for training. revision: yes

  2. Referee: [Method] Method section (ConvDotMix and implicit grouping): The central accuracy claim rests on the premise that coordinate-proximity 1D sorting plus large-kernel 1D operations and implicit grouping supply sufficient 3D spatial mixing. A linear sort necessarily severs some true 3D adjacencies while creating spurious 1D neighbors; the paper provides no quantitative analysis, receptive-field visualization, or cross-object contamination study demonstrating that the resulting mixing matches the locality modeling of sparse 3D convolutions or transformers.

    Authors: We acknowledge that a 1D proximity sort can disrupt some true 3D adjacencies and introduce spurious neighbors. However, the large-kernel 1D convolutions in ConvDotMix enable broad information propagation along the sequence, while implicit grouping progressively expands the effective receptive field across layers to recover 3D context. In the revision, we will add receptive-field visualizations demonstrating feature mixing from 3D-proximate voxels and a quantitative analysis of effective neighborhood sizes compared to sparse convolutions. We will also include a brief discussion of cross-object contamination, supported by the observation that coordinate-proximity sorting largely preserves object boundaries; a more extensive contamination study can be considered if space permits. revision: partial
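
One simple form the promised quantitative analysis could take is to count how many true 3D neighbors stay within reach of a 1D kernel after serialization. The sketch below is an illustration under stated assumptions, not the authors' analysis: it uses a plain lexicographic sort as a stand-in for the paper's ordering, treats face-adjacent voxels as the true neighbors, and the function name `neighbor_recall` is hypothetical.

```python
def neighbor_recall(coords, kernel_size):
    """Fraction of face-adjacent voxel pairs that lie within half a kernel
    of each other in the 1D sequence.

    coords: list of (ix, iy, iz) tuples, already in sequence order.
    """
    position = {c: i for i, c in enumerate(coords)}
    half = kernel_size // 2
    total = kept = 0
    for (x, y, z), i in position.items():
        # Check only the +x, +y, +z neighbors so each pair is counted once.
        for nb in ((x + 1, y, z), (x, y + 1, z), (x, y, z + 1)):
            j = position.get(nb)
            if j is None:
                continue  # neighbor voxel is empty
            total += 1
            if abs(i - j) <= half:
                kept += 1
    return kept / total if total else 1.0

# Toy occupied grid: a dense 4 x 4 x 1 slab, serialized lexicographically.
slab = sorted((x, y, 0) for x in range(4) for y in range(4))
for k in (3, 9):
    print(k, neighbor_recall(slab, k))
```

On this toy slab, a kernel of size 3 reaches half of the adjacent pairs and a kernel of size 9 reaches all of them, which illustrates the rebuttal's argument that large kernels can recover adjacencies a linear sort pulls apart. Running the same count on real serialized scenes, with the paper's actual ordering and kernel sizes, would turn that argument into a measurement.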

Circularity Check

0 steps flagged

No circularity: architecture and benchmarks are independent of fitted inputs

full rationale

The paper proposes a new voxel-to-1D-sequence arrangement followed by ConvDotMix blocks (large-kernel conv, Hadamard product, linear layers, implicit grouping) and reports accuracy/speed on public nuScenes and Waymo benchmarks. No equations, self-citations, or fitted parameters are shown that define the claimed mixing capability or performance by construction from the same inputs. The derivation chain remains self-contained and externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract introduces no explicit free parameters, axioms, or invented entities beyond standard neural-network building blocks and public benchmark datasets.


