STD: Sparse-to-Dense 3D Object Detector for Point Cloud

Jiaya Jia; Shu Liu; Xiaoyong Shen; Yanan Sun; Zetong Yang

arxiv: 1907.10471 · v1 · pith:P3WBJRNRnew · submitted 2019-07-22 · 💻 cs.CV

STD: Sparse-to-Dense 3D Object Detector for Point Cloud

Zetong Yang , Yanan Sun , Shu Liu , Xiaoyong Shen , Jiaya Jia This is my paper

Pith reviewed 2026-05-24 18:08 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D object detectionpoint cloudproposal generationKITTI datasetspherical anchorPointsPool

0 comments

The pith

Seeding spherical anchors at each raw point generates high-recall 3D proposals that PointsPool compacts for accurate box prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a two-stage detector that starts with raw point clouds and places a spherical anchor around every input point to create candidate boxes. This bottom-up stage reaches high recall while using less computation than earlier proposal methods. A PointsPool step then converts the sparse interior points of each proposal into a compact feature representation. In the second stage a parallel IoU branch runs alongside box regression to improve awareness of localization quality. Experiments on the KITTI dataset show gains over prior detectors, especially on hard examples, at inference speeds above 10 frames per second.

Core claim

The central claim is that a bottom-up proposal network seeding spherical anchors at every point, followed by PointsPool for sparse-to-compact feature conversion and a parallel IoU branch, produces more accurate 3D object and Bird's Eye View detections than previous methods while running faster than 10 FPS.

What carries the argument

Spherical anchor placed at each input point for proposal generation, together with the PointsPool operation that transforms sparse interior point features into a compact representation.

If this is right

The proposal stage reaches high recall with reduced computation compared with earlier bottom-up approaches.
PointsPool further lowers computation by turning sparse proposal points into compact features.
The parallel IoU branch improves localization accuracy awareness during box prediction.
Detection performance improves by a large margin over prior methods, especially on the hard subset.
The full pipeline runs faster than 10 FPS on KITTI while delivering the accuracy gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The anchoring idea might transfer to other sparse 3D inputs such as radar point sets without major redesign.
Avoiding dense voxel grids could simplify end-to-end pipelines that currently convert point clouds to regular volumes first.
Extending the two-stage structure to multi-frame sequences would test whether the speed benefit scales to online tracking.

Load-bearing premise

Placing spherical anchors at every point and running PointsPool will reliably produce high-recall proposals and useful features from real-world point clouds.

What would settle it

Measuring proposal recall on the KITTI validation set using only the first stage; if recall does not exceed that of prior bottom-up generators while using comparable or less compute, the efficiency advantage does not hold.

Figures

Figures reproduced from arXiv: 1907.10471 by Jiaya Jia, Shu Liu, Xiaoyong Shen, Yanan Sun, Zetong Yang.

**Figure 1.** Figure 1: Illustration of our framework consisting of three different parts. The first is a proposal generation module (PGM) to generate [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Illustration of networks in the proposal generation module. (a) 3D segmentation network (PointNet++). It takes a raw point cloud [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Small objects such as indicators are easy to detect on [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Visualization of our results on KITTI test set. Cars, pedestrians and cyclists are highlighted in yellow, red and green respectively. [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

read the original abstract

We present a new two-stage 3D object detection framework, named sparse-to-dense 3D Object Detector (STD). The first stage is a bottom-up proposal generation network that uses raw point cloud as input to generate accurate proposals by seeding each point with a new spherical anchor. It achieves a high recall with less computation compared with prior works. Then, PointsPool is applied for generating proposal features by transforming their interior point features from sparse expression to compact representation, which saves even more computation time. In box prediction, which is the second stage, we implement a parallel intersection-over-union (IoU) branch to increase awareness of localization accuracy, resulting in further improved performance. We conduct experiments on KITTI dataset, and evaluate our method in terms of 3D object and Bird's Eye View (BEV) detection. Our method outperforms other state-of-the-arts by a large margin, especially on the hard set, with inference speed more than 10 FPS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces STD, a two-stage 3D object detector for point clouds. The proposal generation stage seeds each input point with a spherical anchor to produce high-recall proposals efficiently from raw point clouds. PointsPool then converts sparse interior point features into compact proposal representations. The second stage adds a parallel IoU prediction branch alongside box regression to improve localization awareness. Experiments on the KITTI benchmark report superior 3D and BEV detection performance over prior state-of-the-art methods, especially on the hard difficulty subset, while maintaining inference speed above 10 FPS.

Significance. If the reported gains hold under the provided ablations and KITTI comparisons, the work offers a concrete advance in efficient, high-recall proposal generation for point-cloud detection. The explicit isolation of the spherical-anchor and PointsPool contributions, together with the IoU branch, supplies falsifiable evidence that these components drive the observed margin on the hard set without introducing additional free parameters beyond standard training.

minor comments (2)

[§4.2] §4.2: the description of the spherical anchor radius schedule could be clarified with an explicit equation or pseudocode to facilitate exact reproduction.
[Figure 4] Figure 4: the recall-vs-proposal-number curves would benefit from error bars or multiple runs to quantify variability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment and recommendation to accept. The provided summary correctly reflects the key elements of our method and results.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical two-stage 3D object detection architecture (spherical anchors + PointsPool + parallel IoU branch) evaluated via direct comparisons and ablations on the external KITTI benchmark. No equations, predictions, or first-principles derivations are claimed that reduce by construction to fitted parameters, self-definitions, or self-citation chains; all performance margins are reported against independent test data and prior external methods.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Review performed on abstract only; therefore free parameters, axioms, and invented entities cannot be exhaustively audited. The abstract introduces two new named constructs whose details and justification are not supplied.

invented entities (2)

spherical anchor no independent evidence
purpose: seed proposals from individual points in the first stage
Introduced in the abstract as the core of the proposal generation network; no independent evidence supplied.
PointsPool no independent evidence
purpose: transform sparse interior point features into compact proposal representation
Named operation described in the abstract; no independent evidence supplied.

pith-pipeline@v0.9.0 · 5707 in / 1185 out tokens · 30193 ms · 2026-05-24T18:08:00.870205+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 1 internal anchor

[1]

http: //www.cvlibs.net/datasets/kitti/eval_ object.php?obj_benchmark=3d, 2019

”kitti 3d object detection benchmark”. http: //www.cvlibs.net/datasets/kitti/eval_ object.php?obj_benchmark=3d, 2019

work page 2019
[2]

Abadi, A

M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Is- ard, Y . Jia, R. J´ozefowicz, L. Kaiser, M. Kudlur, J. Leven- berg, D. Man´e, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tuck...

work page 2016
[3]

L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell., 2018

work page 2018
[4]

X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3d object detection network for autonomous driving. In CVPR, 2017

work page 2017
[5]

A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017

work page 2017
[6]

Dai and M

A. Dai and M. Nießner. 3dmv: Joint 3d-multi-view predic- tion for 3d semantic scene segmentation. In ECCV, 2018

work page 2018
[7]

Engelcke, D

M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner. V ote3deep: Fast object detection in 3d point clouds using efﬁcient convolutional neural networks. InICRA, 2017

work page 2017
[8]

Geiger, P

A. Geiger, P. Lenz, and R. Urtasun. Are we ready for au- tonomous driving? the KITTI vision benchmark suite. In CVPR, 2012

work page 2012
[9]

Gonz ´alez, G

A. Gonz ´alez, G. Villalonga, J. Xu, D. V ´azquez, J. Amores, and A. M. L ´opez. Multiview random forest of local experts combining RGB and LIDAR data for pedestrian detection. In IV, 2015

work page 2015
[10]

Graham, M

B. Graham, M. Engelcke, and L. van der Maaten. 3d se- mantic segmentation with submanifold sparse convolutional networks. In CVPR, 2018

work page 2018
[11]

Jiang, R

B. Jiang, R. Luo, J. Mao, T. Xiao, and Y . Jiang. Acquisition of localization conﬁdence for accurate object detection. In ECCV, 2018

work page 2018
[12]

Jiang, Y

M. Jiang, Y . Wu, and C. Lu. Pointsift: A sift-like network module for 3d point cloud semantic segmentation. CoRR, 2018

work page 2018
[13]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, 2014

work page 2014
[14]

J. Ku, M. Moziﬁan, J. Lee, A. Harakeh, and S. L. Waslander. Joint 3d proposal generation and object detection from view aggregation. CoRR, 2017

work page 2017
[15]

A. H. Lang, S. V ora, H. Caesar, L. Zhou, J. Yang, and O. Bei- jbom. Pointpillars: Fast encoders for object detection from point clouds. CVPR, 2019

work page 2019
[16]

B. Li. 3d fully convolutional network for vehicle detection in point cloud. In IROS, 2017

work page 2017
[17]

J. Li, B. M. Chen, and G. H. Lee. So-net: Self-organizing network for point cloud analysis. CoRR, 2018

work page 2018
[18]

Y . Li, R. Bu, M. Sun, and B. Chen. Pointcnn. CoRR, 2018

work page 2018
[19]

Liang*, B

M. Liang*, B. Yang*, Y . Chen, R. Hu, and R. Urtasun. Multi- task multi-sensor fusion for 3d object detection. In CVPR, 2019

work page 2019
[20]

T. Lin, P. Doll ´ar, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detec- tion. In CVPR, 2017

work page 2017
[21]

T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Doll ´ar. Focal loss for dense object detection. In ICCV, 2017

work page 2017
[22]

S. Liu, C. Lu, and J. Jia. Box aggregation for proposal deci- mation: Last mile of object detection. In ICCV, 2015

work page 2015
[23]

Maturana and S

D. Maturana and S. Scherer. V oxnet: A 3d convolutional neural network for real-time object recognition. In IROS, 2015

work page 2015
[24]

Y . Park, V . Lepetit, and W. Woo. Multiple 3d object tracking for augmented reality. In ISMAR, 2008

work page 2008
[25]

Premebida, J

C. Premebida, J. Carreira, J. Batista, and U. Nunes. Pedes- trian detection combining RGB and dense LIDAR data. In ICoR, 2014

work page 2014
[26]

C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum pointnets for 3d object detection from RGB-D data. CoRR, 2017

work page 2017
[27]

C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classiﬁcation and segmentation. In CVPR, 2017

work page 2017
[28]

C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017

work page 2017
[29]

L. Qi, S. Liu, J. Shi, and J. Jia. Sequential context encoding for duplicate removal. In NIPS, 2018

work page 2018
[30]

S. Shi, X. Wang, and H. Li. Pointrcnn: 3d object proposal generation and detection from point cloud. In CVPR, 2019

work page 2019
[31]

K. Shin, Y . Kwon, and M. Tomizuka. Roarnet: A robust 3d object detection based on region approximation reﬁnement. arXiv preprint arXiv:1811.03818, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[32]

D. Z. Wang and I. Posner. V oting for voting in online point cloud object detection. In Robotics: Science and Systems XI, 2015

work page 2015
[33]

B. Wu, A. Wan, X. Yue, and K. Keutzer. Squeezeseg: Con- volutional neural nets with recurrent CRF for real-time road- object segmentation from 3d lidar point cloud. In ICRA, 2018

work page 2018
[34]

Y . Yan, Y . Mao, and B. Li. Second: Sparsely embedded convolutional detection. Sensors, 2018

work page 2018
[35]

B. Yang, W. Luo, and R. Urtasun. PIXOR: real-time 3d ob- ject detection from point clouds. In CVPR, 2018

work page 2018
[36]

H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017

work page 2017
[37]

Zhou and O

Y . Zhou and O. Tuzel. V oxelnet: End-to-end learning for point cloud based 3d object detection. CoRR, 2017. 9

work page 2017

[1] [1]

http: //www.cvlibs.net/datasets/kitti/eval_ object.php?obj_benchmark=3d, 2019

”kitti 3d object detection benchmark”. http: //www.cvlibs.net/datasets/kitti/eval_ object.php?obj_benchmark=3d, 2019

work page 2019

[2] [2]

Abadi, A

M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Is- ard, Y . Jia, R. J´ozefowicz, L. Kaiser, M. Kudlur, J. Leven- berg, D. Man´e, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tuck...

work page 2016

[3] [3]

L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell., 2018

work page 2018

[4] [4]

X. Chen, H. Ma, J. Wan, B. Li, and T. Xia. Multi-view 3d object detection network for autonomous driving. In CVPR, 2017

work page 2017

[5] [5]

A. Dai, A. X. Chang, M. Savva, M. Halber, T. A. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017

work page 2017

[6] [6]

Dai and M

A. Dai and M. Nießner. 3dmv: Joint 3d-multi-view predic- tion for 3d semantic scene segmentation. In ECCV, 2018

work page 2018

[7] [7]

Engelcke, D

M. Engelcke, D. Rao, D. Z. Wang, C. H. Tong, and I. Posner. V ote3deep: Fast object detection in 3d point clouds using efﬁcient convolutional neural networks. InICRA, 2017

work page 2017

[8] [8]

Geiger, P

A. Geiger, P. Lenz, and R. Urtasun. Are we ready for au- tonomous driving? the KITTI vision benchmark suite. In CVPR, 2012

work page 2012

[9] [9]

Gonz ´alez, G

A. Gonz ´alez, G. Villalonga, J. Xu, D. V ´azquez, J. Amores, and A. M. L ´opez. Multiview random forest of local experts combining RGB and LIDAR data for pedestrian detection. In IV, 2015

work page 2015

[10] [10]

Graham, M

B. Graham, M. Engelcke, and L. van der Maaten. 3d se- mantic segmentation with submanifold sparse convolutional networks. In CVPR, 2018

work page 2018

[11] [11]

Jiang, R

B. Jiang, R. Luo, J. Mao, T. Xiao, and Y . Jiang. Acquisition of localization conﬁdence for accurate object detection. In ECCV, 2018

work page 2018

[12] [12]

Jiang, Y

M. Jiang, Y . Wu, and C. Lu. Pointsift: A sift-like network module for 3d point cloud semantic segmentation. CoRR, 2018

work page 2018

[13] [13]

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, 2014

work page 2014

[14] [14]

J. Ku, M. Moziﬁan, J. Lee, A. Harakeh, and S. L. Waslander. Joint 3d proposal generation and object detection from view aggregation. CoRR, 2017

work page 2017

[15] [15]

A. H. Lang, S. V ora, H. Caesar, L. Zhou, J. Yang, and O. Bei- jbom. Pointpillars: Fast encoders for object detection from point clouds. CVPR, 2019

work page 2019

[16] [16]

B. Li. 3d fully convolutional network for vehicle detection in point cloud. In IROS, 2017

work page 2017

[17] [17]

J. Li, B. M. Chen, and G. H. Lee. So-net: Self-organizing network for point cloud analysis. CoRR, 2018

work page 2018

[18] [18]

Y . Li, R. Bu, M. Sun, and B. Chen. Pointcnn. CoRR, 2018

work page 2018

[19] [19]

Liang*, B

M. Liang*, B. Yang*, Y . Chen, R. Hu, and R. Urtasun. Multi- task multi-sensor fusion for 3d object detection. In CVPR, 2019

work page 2019

[20] [20]

T. Lin, P. Doll ´ar, R. B. Girshick, K. He, B. Hariharan, and S. J. Belongie. Feature pyramid networks for object detec- tion. In CVPR, 2017

work page 2017

[21] [21]

T. Lin, P. Goyal, R. B. Girshick, K. He, and P. Doll ´ar. Focal loss for dense object detection. In ICCV, 2017

work page 2017

[22] [22]

S. Liu, C. Lu, and J. Jia. Box aggregation for proposal deci- mation: Last mile of object detection. In ICCV, 2015

work page 2015

[23] [23]

Maturana and S

D. Maturana and S. Scherer. V oxnet: A 3d convolutional neural network for real-time object recognition. In IROS, 2015

work page 2015

[24] [24]

Y . Park, V . Lepetit, and W. Woo. Multiple 3d object tracking for augmented reality. In ISMAR, 2008

work page 2008

[25] [25]

Premebida, J

C. Premebida, J. Carreira, J. Batista, and U. Nunes. Pedes- trian detection combining RGB and dense LIDAR data. In ICoR, 2014

work page 2014

[26] [26]

C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas. Frustum pointnets for 3d object detection from RGB-D data. CoRR, 2017

work page 2017

[27] [27]

C. R. Qi, H. Su, K. Mo, and L. J. Guibas. Pointnet: Deep learning on point sets for 3d classiﬁcation and segmentation. In CVPR, 2017

work page 2017

[28] [28]

C. R. Qi, L. Yi, H. Su, and L. J. Guibas. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. In NIPS, 2017

work page 2017

[29] [29]

L. Qi, S. Liu, J. Shi, and J. Jia. Sequential context encoding for duplicate removal. In NIPS, 2018

work page 2018

[30] [30]

S. Shi, X. Wang, and H. Li. Pointrcnn: 3d object proposal generation and detection from point cloud. In CVPR, 2019

work page 2019

[31] [31]

K. Shin, Y . Kwon, and M. Tomizuka. Roarnet: A robust 3d object detection based on region approximation reﬁnement. arXiv preprint arXiv:1811.03818, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[32] [32]

D. Z. Wang and I. Posner. V oting for voting in online point cloud object detection. In Robotics: Science and Systems XI, 2015

work page 2015

[33] [33]

B. Wu, A. Wan, X. Yue, and K. Keutzer. Squeezeseg: Con- volutional neural nets with recurrent CRF for real-time road- object segmentation from 3d lidar point cloud. In ICRA, 2018

work page 2018

[34] [34]

Y . Yan, Y . Mao, and B. Li. Second: Sparsely embedded convolutional detection. Sensors, 2018

work page 2018

[35] [35]

B. Yang, W. Luo, and R. Urtasun. PIXOR: real-time 3d ob- ject detection from point clouds. In CVPR, 2018

work page 2018

[36] [36]

H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017

work page 2017

[37] [37]

Zhou and O

Y . Zhou and O. Tuzel. V oxelnet: End-to-end learning for point cloud based 3d object detection. CoRR, 2017. 9

work page 2017