Voxel-FPN: multi-scale voxel feature aggregation in 3D object detection from point clouds

Bei Wang; Jianping An; Jiayan Cao

arxiv: 1907.05286 · v2 · pith:C2RBEY2Xnew · submitted 2019-06-28 · 💻 cs.CV · cs.LG · stat.ML

Pith reviewed 2026-05-25 13:51 UTC · model grok-4.3

classification 💻 cs.CV · cs.LG · stat.ML

keywords 3D object detection · point clouds · voxel features · multi-scale fusion · LIDAR · one-stage detector · KITTI-3D · autonomous driving

The pith

Voxel-FPN uses bottom-up encoding and top-down decoding to aggregate multi-scale voxel features for 3D object detection from point clouds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Voxel-FPN, a one-stage 3D object detector that processes only raw LIDAR point cloud data through an encoder that extracts multi-scale voxel information in a bottom-up manner and a decoder that fuses feature maps from various scales in a top-down manner before a region proposal network. This setup is presented as a way to improve feature extraction from point data compared with prior approaches. A sympathetic reader would care if the fusion step yields measurably stronger detections while preserving speed, because 3D object detection from LIDAR is a core task in autonomous driving systems. The work supports its claim with experiments on the KITTI-3D benchmark showing gains over some baselines in both accuracy and runtime.

Core claim

The Voxel-FPN framework consists of three parts: an encoder network that extracts multi-scale voxel information in a bottom-up manner, a corresponding decoder that fuses feature maps from various scales in a top-down way, and a region proposal network. The paper claims that this architecture extracts features from point data more effectively, outperforms some baselines on the KITTI-3D benchmark, and achieves good speed and accuracy in real-world scenarios.

What carries the argument

The encoder-decoder structure that extracts multi-scale voxel information bottom-up and fuses feature maps top-down before region proposal.
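
As a rough illustration of the bottom-up and top-down pattern described here, the following sketch uses plain 2D lists in place of real voxel feature maps. It is not the paper's implementation. The function names, the 2x2 average pooling, the nearest-neighbour upsampling, and the element-wise sum used as the merge are all assumptions made for the example. The learned convolutions a real network would apply at each level are omitted.

```python
# Illustrative sketch only: plain 2D lists stand in for feature maps.
# Every name and every pooling/upsampling choice is an assumption.

def downsample2x(grid):
    """Bottom-up step: 2x2 average pooling (height and width must be even)."""
    h, w = len(grid), len(grid[0])
    return [
        [
            (grid[r][c] + grid[r][c + 1] + grid[r + 1][c] + grid[r + 1][c + 1]) / 4.0
            for c in range(0, w, 2)
        ]
        for r in range(0, h, 2)
    ]

def upsample2x(grid):
    """Top-down step: nearest-neighbour upsampling by a factor of two."""
    out = []
    for row in grid:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out

def add(a, b):
    """Element-wise sum of two equally sized grids (the merge step)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def build_pyramid(fine, levels=3):
    """Encoder: return feature maps ordered from finest to coarsest."""
    pyramid = [fine]
    for _ in range(levels - 1):
        pyramid.append(downsample2x(pyramid[-1]))
    return pyramid

def fuse_top_down(pyramid):
    """Decoder: start at the coarsest map and merge into each finer one."""
    fused = [pyramid[-1]]
    for lateral in reversed(pyramid[:-1]):
        fused.append(add(lateral, upsample2x(fused[-1])))
    return fused[::-1]  # finest first, matching the encoder order

fine = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
pyramid = build_pyramid(fine, levels=3)
fused = fuse_top_down(pyramid)
print([(len(g), len(g[0])) for g in fused])
```

Each fused map keeps the resolution of its encoder level while also carrying information from every coarser level. That is the property a downstream region proposal network would rely on in a design of this kind.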

If this is right

The multi-scale voxel aggregation produces better feature extraction from point clouds than the compared baselines.
The one-stage design maintains competitive speed while improving accuracy on the KITTI-3D benchmark.
The method operates using only LIDAR data and therefore simplifies sensor requirements for 3D detection.
Real-world scenarios benefit from the reported combination of speed and accuracy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same bottom-up and top-down voxel fusion pattern could be tested on other 3D perception tasks such as segmentation.
The architecture's reliance on voxel grids suggests it may be sensitive to the choice of grid resolution, which could be varied in follow-up experiments.
Because the paper focuses on the KITTI-3D benchmark, direct evaluation on additional point-cloud datasets would be needed to assess broader applicability.
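
To make the grid-resolution point concrete, here is a small sketch of how the same points occupy different numbers of voxels as the voxel size changes. The `voxelize` helper, the toy coordinates, and the three voxel sizes are hypothetical and chosen only for illustration. None of them comes from the paper.

```python
# Hypothetical helper: maps each point to an integer voxel index
# and groups the points that share a voxel.
import math
from collections import defaultdict

def voxelize(points, voxel_size):
    """Group (x, y, z) points by the voxel they fall into."""
    voxels = defaultdict(list)
    for x, y, z in points:
        key = (
            math.floor(x / voxel_size),
            math.floor(y / voxel_size),
            math.floor(z / voxel_size),
        )
        voxels[key].append((x, y, z))
    return dict(voxels)

# Toy points in metres; the values and the sizes below are made up.
points = [(0.1, 0.1, 0.0), (0.3, 0.1, 0.0), (0.5, 0.7, 0.0), (1.1, 0.2, 0.0)]
occupied = {size: len(voxelize(points, size)) for size in (0.2, 0.4, 0.8)}
print(occupied)
```

Coarser voxels merge neighbouring points into fewer cells, so a follow-up experiment could sweep the voxel size and record how detection accuracy responds.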

Load-bearing premise

That the described encoder-decoder multi-scale voxel fusion will produce measurably superior feature extraction and detection results relative to baselines when evaluated on the KITTI-3D benchmark.

What would settle it

A direct comparison on the KITTI-3D benchmark in which Voxel-FPN shows no accuracy or speed advantage over the baselines it is tested against.

Original abstract

Object detection in point cloud data is one of the key components in computer vision systems, especially for autonomous driving applications. In this work, we present Voxel-FPN, a novel one-stage 3D object detector that utilizes raw data from LIDAR sensors only. The core framework consists of an encoder network and a corresponding decoder followed by a region proposal network. Encoder extracts multi-scale voxel information in a bottom-up manner while decoder fuses multiple feature maps from various scales in a top-down way. Extensive experiments show that the proposed method has better performance on extracting features from point data and demonstrates its superiority over some baselines on the challenging KITTI-3D benchmark, obtaining good performance on both speed and accuracy in real-world scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Voxel-FPN adapts FPN-style multi-scale fusion to voxel features in a one-stage 3D detector and shows KITTI results, but the gains are not isolated to the fusion step.

Voxel-FPN takes the encoder-decoder fusion pattern from 2D feature pyramids and applies it to voxel features extracted from LIDAR point clouds. The encoder pulls multi-scale voxel information bottom-up, the decoder merges those maps top-down, and the output goes to a region proposal network for one-stage detection. The authors test the full system on KITTI-3D and report better feature extraction and competitive speed-accuracy numbers against some baselines.

Referee Report

2 major / 2 minor

Summary. The paper proposes Voxel-FPN, a one-stage 3D object detector that processes raw LIDAR point clouds via an encoder-decoder architecture: the encoder extracts multi-scale voxel features bottom-up, the decoder fuses them top-down, and the result feeds a region proposal network. It claims this yields superior feature extraction from point data and better speed-accuracy trade-offs than unspecified baselines on the KITTI-3D benchmark.

Significance. A well-controlled demonstration that the proposed bottom-up/top-down voxel fusion measurably improves 3D detection mAP over single-scale voxel baselines would be a useful incremental contribution to voxel-based detectors for autonomous driving. The one-stage design and focus on raw point clouds are practical strengths if the performance gains can be attributed to the fusion module.

major comments (2)

[Experiments] Experiments section: the central claim that the encoder-decoder multi-scale voxel fusion is responsible for better feature extraction and higher detection accuracy is unsupported. No ablation replaces the top-down decoder with a single-scale pathway while holding voxelization parameters, RPN design, data augmentation, and training schedule fixed; all reported gains are from complete end-to-end systems against unspecified baselines.
[Abstract] Abstract and method description: baselines are described only as 'some baselines' with no enumeration of the compared methods, their voxel resolutions, or their reported mAP/speed numbers, preventing verification that observed improvements originate from the fusion rather than other implementation differences.
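
One way to picture the controlled ablation requested in the first major comment is as a pair of run configurations that differ in exactly one field. The sketch below is hypothetical. The `RunConfig` class, its field names, and its default values are placeholders invented for illustration and do not reflect settings reported in the paper.

```python
from dataclasses import dataclass, replace, asdict

@dataclass(frozen=True)
class RunConfig:
    # Every field name and default below is a placeholder, not a paper value.
    voxel_size_m: float = 0.2
    use_top_down_decoder: bool = True
    rpn_variant: str = "shared-rpn"
    augmentation: str = "default"
    epochs: int = 100
    seed: int = 0

def differing_fields(a, b):
    """Return the names of fields whose values differ between two configs."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if da[k] != db[k])

full_model = RunConfig()
single_scale = replace(full_model, use_top_down_decoder=False)

# A controlled ablation should differ in exactly one field.
changed = differing_fields(full_model, single_scale)
print(changed)
```

Checking the difference between configurations before training guards against the confound the referee describes, where a gain attributed to the fusion step actually comes from another changed setting.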

minor comments (2)

The abstract contains no quantitative metrics, error bars, or specific KITTI-3D results, which should be added to allow immediate assessment of the claimed superiority.
Figure captions and method diagrams should explicitly label the bottom-up encoder paths versus top-down decoder fusion paths to clarify the multi-scale aggregation.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major comment point by point below and agree that revisions will improve the clarity and rigor of the manuscript.

Referee: [Experiments] Experiments section: the central claim that the encoder-decoder multi-scale voxel fusion is responsible for better feature extraction and higher detection accuracy is unsupported. No ablation replaces the top-down decoder with a single-scale pathway while holding voxelization parameters, RPN design, data augmentation, and training schedule fixed; all reported gains are from complete end-to-end systems against unspecified baselines.

Authors: We agree that the manuscript does not contain an ablation that isolates the top-down decoder by replacing it with a single-scale pathway while holding voxelization, RPN, augmentation, and training fixed. The reported results compare the full end-to-end Voxel-FPN system against other published detectors. To directly support the claim, we will add this controlled ablation study in the revised experiments section.

Revision: yes

Referee: [Abstract] Abstract and method description: baselines are described only as 'some baselines' with no enumeration of the compared methods, their voxel resolutions, or their reported mAP/speed numbers, preventing verification that observed improvements originate from the fusion rather than other implementation differences.

Authors: We acknowledge that the abstract and method sections refer to baselines only generically. In the revision we will explicitly enumerate the compared methods, state their voxel resolutions, and report the corresponding mAP and speed numbers so that readers can verify the source of the observed gains.

Revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on external benchmark evaluation.

The paper introduces an encoder-decoder voxel feature fusion architecture and supports its claims solely through end-to-end experimental comparisons on the KITTI-3D benchmark. No equations, predictions, or first-principles derivations appear in the provided text. No self-citations are invoked as load-bearing premises, no fitted parameters are relabeled as predictions, and no uniqueness theorems or ansatzes are smuggled in. The derivation chain is therefore self-contained against external data rather than internally forced.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities are described.

