Improving 3D Object Detection for Pedestrians with Virtual Multi-View Synthesis Orientation Estimation

Alex D. Pon; Jason Ku; Sean Walsh; Steven L. Waslander

arxiv: 1907.06777 · v1 · pith:QGIXZ6VWnew · submitted 2019-07-15 · 💻 cs.CV

Jason Ku, Alex D. Pon, Sean Walsh, Steven L. Waslander

Pith reviewed 2026-05-24 21:08 UTC · model grok-4.3

classification 💻 cs.CV

keywords virtual multi-view synthesis · 3D object detection · pedestrian orientation estimation · KITTI benchmark · point cloud densification · autonomous driving · orientation estimation · depth completion

The pith

A virtual multi-view synthesis module improves pedestrian orientation estimation in 3D object detection by generating novel viewpoints from densified point clouds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a Virtual Multi-View Synthesis module that can be added to existing 3D object detectors. The module densifies the input point cloud using structure-preserving depth completion, colorizes each point with matching RGB values from the image, and positions virtual cameras around candidate objects to produce new viewpoints. These views supply additional appearance details that help determine pedestrian orientation more precisely than standard single-view approaches allow. When the module is combined with the AVOD-FPN detector, the combined system exceeds prior published results on the KITTI pedestrian orientation, 3D detection, and bird's eye view benchmarks. A reader would care because reliable orientation estimates support better tracking and motion prediction for pedestrians in driving scenes.

Core claim

The Virtual Multi-View Synthesis module acquires the fine-grained semantic information required for accurate orientation estimation through a multi-step process that first densifies the scene's point cloud with a structure-preserving depth completion algorithm, colorizes each point using its corresponding RGB pixel, and then places virtual cameras around each object in the densified point cloud to generate novel viewpoints that preserve the object's appearance.

What carries the argument

The Virtual Multi-View Synthesis module, which densifies and colorizes the point cloud then places virtual cameras around each object to create novel viewpoints supplying semantic cues for orientation.
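
As a rough illustration of the colorization step described above, the sketch below back-projects an image-aligned dense depth map into 3D points and attaches each pixel's RGB value to its point. It assumes a pinhole camera and assumes that the depth completion output is a per-pixel depth map in the camera image. The function name `depth_map_to_colored_points` and the intrinsics parameters `fx`, `fy`, `cx`, `cy` are hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]
Color = Tuple[int, int, int]


def depth_map_to_colored_points(
    depth: Sequence[Sequence[float]],
    image: Sequence[Sequence[Color]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Tuple[Point, Color]]:
    """Back-project each pixel with valid depth and pair it with its RGB value.

    Assumed camera frame: x right, y down, z forward (pinhole model).
    Pixels with non-positive depth are treated as missing and skipped.
    """
    colored_points: List[Tuple[Point, Color]] = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0.0:
                continue  # no depth estimate at this pixel
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            colored_points.append(((x, y, z), image[v][u]))
    return colored_points
```

Because the point and its color come from the same pixel, no nearest-neighbor lookup is needed in this sketch. A pipeline that keeps the raw LiDAR points would instead project each point into the image to find its pixel.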

If this is right

Orientation estimation accuracy rises for the pedestrian class on the KITTI benchmark.
3D bounding box and bird's eye view metrics also improve for pedestrians when the module is paired with AVOD-FPN.
The module can be inserted into other 3D object detection pipelines without redesigning the base detector.
Better orientation estimates support improved tracking and behavior prediction for pedestrians.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The synthesis approach could be applied to other oriented object classes such as cyclists if appearance preservation remains reliable.
Efficiency of the densification and virtual-view generation steps would determine whether the module fits real-time autonomous driving constraints.
Testing the module on additional datasets beyond KITTI would reveal how well the viewpoint preservation generalizes.
Pairing the module with multi-modal inputs might further reduce orientation errors.

Load-bearing premise

The novel viewpoints generated by placing virtual cameras around objects in the densified and colorized point cloud will preserve appearance well enough to provide the semantic details needed for accurate orientation estimation.

What would settle it

Adding the Virtual Multi-View Synthesis module to AVOD-FPN and measuring no gain in orientation, 3D, or bird's eye view accuracy on the KITTI pedestrian test set compared with the baseline detector would falsify the central claim.

Figures

Figures reproduced from arXiv: 1907.06777 by Alex D. Pon, Jason Ku, Sean Walsh, Steven L. Waslander.

**Figure 1.** Virtual Multi-View Synthesis. The core idea of the method is to generate a set of virtual views for each detected pedestrian, and exploit these views in both the training and inference procedures to produce an accurate orientation estimation.

**Figure 2.** Pedestrian Appearances at 20 m (top row) and 30 m (bottom row). From left to right: RGB image, LiDAR scan colored by intensity, depth completed point cloud colored with corresponding RGB pixels. Even for humans, the classification of objects such as the tree, and the orientations of the pedestrians are not readily apparent in the LiDAR scan. In our method, rich semantic image features are preserved and fu…

**Figure 3.** Architecture Diagram. A 3D detector is used to generate 3D detections, which are passed into the Virtual Multi-View Synthesis Module. The module places virtual cameras within the scene represented by a colorized depth completed LiDAR scan to generate N novel viewpoints. Finally, the Orientation Estimation Module predicts the object orientation from the generated views.

**Figure 4.** Virtual Camera Placement. Virtual cameras are placed at positions equidistant from each object centroid, with viewpoints ranging from −25° to 25° relative to the ray from the original camera center to the object centroid, shown by the dotted black line. Here, only three of the eleven camera positions are shown.
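
The camera placement in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the eleven viewpoints are evenly spaced over the −25° to 25° range, that cameras sit at the centroid's height, and that the rotation happens in the ground (x-z) plane of a camera frame with z pointing forward. The function name `virtual_camera_positions` and the `radius` parameter are hypothetical, and which side counts as negative is an arbitrary convention here.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float, float]


def virtual_camera_positions(
    centroid: Point,
    radius: float,
    num_views: int = 11,
    max_angle_deg: float = 25.0,
    camera_center: Point = (0.0, 0.0, 0.0),
) -> List[Point]:
    """Place cameras on an arc around the centroid, all at distance `radius`."""
    cx, cy, cz = centroid
    ox, _, oz = camera_center
    # Unit direction from the object back toward the original camera (x-z plane).
    dx, dz = ox - cx, oz - cz
    norm = math.hypot(dx, dz)
    if norm == 0.0:
        raise ValueError("centroid coincides with the camera center in the x-z plane")
    dx, dz = dx / norm, dz / norm

    positions: List[Point] = []
    for i in range(num_views):
        if num_views == 1:
            angle_deg = 0.0
        else:
            # Evenly spaced angles from -max_angle_deg to +max_angle_deg (assumption).
            angle_deg = -max_angle_deg + 2.0 * max_angle_deg * i / (num_views - 1)
        a = math.radians(angle_deg)
        # Rotate the base direction about the vertical axis by angle a.
        rx = dx * math.cos(a) - dz * math.sin(a)
        rz = dx * math.sin(a) + dz * math.cos(a)
        positions.append((cx + radius * rx, cy, cz + radius * rz))
    return positions
```

With the default arguments, the middle view lies on the ray between the original camera and the centroid, and the remaining views fan out symmetrically on either side.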

**Figure 5.** Qualitative Results. We show a comparison of AVOD-FPN detections with and without our orientation module. From left to right: AVOD-FPN [2], Ours, Ground Truth. AVOD-FPN detects all pedestrian instances, but the orientation estimation is poor for several objects and also includes false positives in the detections. Our method estimates orientations that more closely match the ground truth, while also removin…

read the original abstract

Accurately estimating the orientation of pedestrians is an important and challenging task for autonomous driving because this information is essential for tracking and predicting pedestrian behavior. This paper presents a flexible Virtual Multi-View Synthesis module that can be adopted into 3D object detection methods to improve orientation estimation. The module uses a multi-step process to acquire the fine-grained semantic information required for accurate orientation estimation. First, the scene's point cloud is densified using a structure preserving depth completion algorithm and each point is colorized using its corresponding RGB pixel. Next, virtual cameras are placed around each object in the densified point cloud to generate novel viewpoints, which preserve the object's appearance. We show that this module greatly improves the orientation estimation on the challenging pedestrian class on the KITTI benchmark. When used with the open-source 3D detector AVOD-FPN, we outperform all other published methods on the pedestrian Orientation, 3D, and Bird's Eye View benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper adds a virtual multi-view synthesis step that lifts pedestrian orientation numbers on KITTI when plugged into AVOD-FPN, but the gains rest on an untested claim that the synthesized views stay faithful enough to help.

The core idea is a module that first densifies the input point cloud with structure-preserving depth completion, colorizes the points from the RGB image, then places virtual cameras around each pedestrian to feed extra viewpoints into the detector. When attached to the open AVOD-FPN baseline it beats prior published numbers on the pedestrian orientation, 3D, and bird's-eye-view tasks on KITTI. That is the concrete result worth noting.

The combination of those three steps aimed at the pedestrian class is not described in earlier work, so the module itself counts as new for this narrow setting. The paper does a reasonable job of identifying that orientation is a weak point for pedestrians and showing an end-to-end lift on a standard benchmark. Practitioners who already run AVOD-FPN or similar detectors may find the numbers useful as a data point.

The soft spot is the missing support for the central mechanism. The abstract asserts that the densified and colorized clouds plus virtual cameras preserve appearance well enough to supply fine-grained cues, yet it gives no ablation that isolates the contribution of each step, no error breakdown by distance or pose, and no visual check on whether the synthesized images actually look like real pedestrian views. The stress-test concern about fidelity therefore lands: if depth completion or color transfer distorts limb position or shading, the extra views add noise rather than signal. Without those checks the improvement could come from other changes in the pipeline.

This is the sort of targeted engineering paper that people working on 3D detection for autonomous driving will want to read. A reader who needs better pedestrian handling on KITTI-style data will get something from the benchmark comparison. It is coherent on its own terms and engages the right literature, so it deserves a serious referee even though the evidence needs more detail to be convincing.

Referee Report

2 major / 0 minor

Summary. The paper proposes a Virtual Multi-View Synthesis module that densifies a scene's point cloud via structure-preserving depth completion, colorizes each point from its corresponding RGB pixel, and places virtual cameras around each object to generate novel viewpoints. These synthesized views are intended to supply fine-grained semantic cues that improve orientation estimation when the module is plugged into existing 3D detectors. The central empirical claim is that the module, when used with the open-source AVOD-FPN detector, yields state-of-the-art results on the KITTI pedestrian Orientation, 3D, and Bird's Eye View benchmarks.

Significance. If the synthesized viewpoints demonstrably preserve pedestrian limb configuration, clothing detail, and viewpoint-dependent shading without introducing systematic distortion, the module would constitute a practical, detector-agnostic enhancement for a notoriously difficult class in autonomous-driving perception. The approach targets a genuine bottleneck (orientation accuracy) rather than incremental mAP gains and could be adopted by other pipelines if the fidelity assumption holds.

major comments (2)

Abstract: the claim that the module 'greatly improves the orientation estimation on the challenging pedestrian class' and 'outperform[s] all other published methods' is presented without any numerical results, tables, ablation studies, or error analysis, so the magnitude and attribution of the reported gains cannot be evaluated.
Abstract (multi-step process description): the central assumption that 'virtual cameras are placed around each object in the densified point cloud to generate novel viewpoints, which preserve the object's appearance' receives no quantitative support (e.g., orientation accuracy delta with vs. without densification/colorization, or visual inspection of synthesized images), leaving the load-bearing fidelity claim unverified.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments on the manuscript. We respond to each major comment below, referencing the supporting material in the full paper.

Referee: Abstract: the claim that the module 'greatly improves the orientation estimation on the challenging pedestrian class' and 'outperform[s] all other published methods' is presented without any numerical results, tables, ablation studies, or error analysis, so the magnitude and attribution of the reported gains cannot be evaluated.

Authors: The abstract provides a concise summary of the contributions. Detailed numerical results, tables comparing against prior methods on KITTI pedestrian Orientation, 3D, and BEV metrics, ablation studies, and error analysis are presented in the Experiments section of the manuscript, which quantify the improvements and attribute gains to the Virtual Multi-View Synthesis module when integrated with AVOD-FPN. revision: no
Referee: Abstract (multi-step process description): the central assumption that 'virtual cameras are placed around each object in the densified point cloud to generate novel viewpoints, which preserve the object's appearance' receives no quantitative support (e.g., orientation accuracy delta with vs. without densification/colorization, or visual inspection of synthesized images), leaving the load-bearing fidelity claim unverified.

Authors: The effectiveness of the multi-step process, including densification, colorization, and virtual viewpoint synthesis, is supported by the reported state-of-the-art results on the KITTI benchmarks for pedestrian orientation estimation. These empirical gains when the module is plugged into an existing detector provide validation for the fidelity of the synthesized views. Specific component-wise deltas or visual examples of synthesized images are not presented in the current version. revision: no

Circularity Check

0 steps flagged

No circularity: empirical module with external benchmark validation

The paper proposes a Virtual Multi-View Synthesis module consisting of depth completion, colorization, and virtual camera placement to supply orientation cues for pedestrian detection. This is evaluated empirically by integration with the independent AVOD-FPN detector and reporting gains on the public KITTI pedestrian Orientation/3D/BEV benchmarks. No equations, fitted parameters, or predictions are defined in terms of themselves; no self-citations are used to justify uniqueness or ansatz choices; the central claim remains an externally falsifiable performance delta rather than a definitional identity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no specific free parameters, axioms, or invented entities can be identified from the text; the module itself is a new procedural component rather than a postulated physical entity.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "The module uses a multi-step process... densified using a structure preserving depth completion algorithm... virtual cameras are placed around each object... to generate novel viewpoints"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "orientation is instead predicted as two values in a vector format, (xθ, yθ) = (cos(θ), sin(θ))"
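
The angle-vector format quoted in the passage above can be illustrated with a short sketch. The helper names `angle_to_vector` and `vector_to_angle` are hypothetical, and the decoding via `atan2` is a standard choice assumed here rather than something the quoted passage states.

```python
import math
from typing import Tuple


def angle_to_vector(theta: float) -> Tuple[float, float]:
    """Encode an orientation angle (radians) as (x_theta, y_theta) = (cos, sin)."""
    return (math.cos(theta), math.sin(theta))


def vector_to_angle(x_theta: float, y_theta: float) -> float:
    """Decode a (possibly unnormalized) vector back to an angle in [-pi, pi]."""
    return math.atan2(y_theta, x_theta)
```

One commonly cited benefit of this encoding is that angles near π and −π map to nearly identical vectors, so a regression target does not jump at the wrap-around point.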

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 1 internal anchor

[1] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in CVPR, 2012
[2] J. Ku, M. Mozifian, J. Lee, A. Harakeh, and S. Waslander, “Joint 3d proposal generation and object detection from view aggregation,” IROS, 2018
[3] Y. Yan, Y. Mao, and B. Li, “Second: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, p. 3337, 2018
[4] A. Kundu, Y. Li, and J. M. Rehg, “3d-rcnn: Instance-level 3d object reconstruction via render-and-compare,” in CVPR, June 2018
[5] Y. Xiang, W. Choi, Y. Lin, and S. Savarese, “Subcategory-aware convolutional neural networks for object proposals and detection,” in Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE, 2017, pp. 924–933
[6] S. Tulsiani and J. Malik, “Viewpoints and keypoints,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1510–1519
[7] C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum pointnets for 3d object detection from rgb-d data,” in CVPR, June 2018
[8] J. Lee, S. Walsh, A. Harakeh, and S. L. Waslander, “Leveraging pre-trained 3d object detection models for fast ground truth generation,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. 2504–2510
[9] H. Su, C. R. Qi, Y. Li, and L. J. Guibas, “Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2686–2694
[10] F. Chabot, M. Chaouch, J. Rabarisoa, C. Teulière, and T. Chateau, “Deep manta: A coarse-to-fine many-task network for joint 2d and 3d vehicle analysis from monocular image,” in CVPR, 2017
[11] Y. Xiang, W. Kim, W. Chen, J. Ji, C. Choy, H. Su, R. Mottaghi, L. Guibas, and S. Savarese, “Objectnet3d: A large scale database for 3d object recognition,” in European Conference on Computer Vision. Springer, 2016, pp. 160–176
[12] Y. Xiang, R. Mottaghi, and S. Savarese, “Beyond pascal: A benchmark for 3d object detection in the wild,” in IEEE Winter Conference on Applications of Computer Vision. IEEE, 2014, pp. 75–82
[13] K. Matzen and N. Snavely, “Nyc3dcars: A dataset of 3d vehicles in geographic context,” in Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 761–768
[14] L. Zhang, L. Lin, X. Liang, and K. He, “Is faster r-cnn doing well for pedestrian detection?” in European conference on computer vision. Springer, 2016, pp. 443–457
[15] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2018
[16] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden, “Pyramid methods in image processing,” RCA engineer, vol. 29, no. 6, pp. 33–41, 1984
[17] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125
[18] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969
[19] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 91–99
[20] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang, “Acquisition of localization confidence for accurate object detection,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 784–799
[21] A. Toshev and C. Szegedy, “Deeppose: Human pose estimation via deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1653–1660
[22] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, “Efficient object localization using convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 648–656
[23] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik, “Human pose estimation with iterative error feedback,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4733–4742
[24] K. J. Shih, A. Mallya, S. Singh, and D. Hoiem, “Part localization using multi-proposal consensus for fine-grained categorization,” BMVC, 2015
[25] G. Pavlakos, X. Zhou, A. Chan, K. G. Derpanis, and K. Daniilidis, “6-dof object pose from semantic keypoints,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 2011–2018
[26] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman, “Single image 3d interpreter network,” in European Conference on Computer Vision. Springer, 2016, pp. 365–382
[27] V. Lepetit, F. Moreno-Noguer, and P. Fua, “Epnp: An accurate o(n) solution to the pnp problem,” International journal of computer vision, vol. 81, no. 2, p. 155, 2009
[28] C.-P. Lu, G. D. Hager, and E. Mjolsness, “Fast and globally convergent pose estimation from video images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 6, pp. 610–622, 2000
[29] A. Ansar and K. Daniilidis, “Linear pose estimation from points or lines,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 5, pp. 578–589, 2003
[30] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multi-view convolutional neural networks for 3d shape recognition,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 945–953
[31] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik, “Multi-view supervision for single-view reconstruction via differentiable ray consistency,” in CVPR, vol. 1, no. 2, 2017, p. 3
[32] S. Tulsiani, A. A. Efros, and J. Malik, “Multi-view consistency as supervisory signal for learning shape and pose prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2897–2905
[33] P. Mandikal, M. Agarwal, R. V. Babu et al., “Capnet: Continuous approximation projection for 3d point cloud reconstruction using 2d supervision,” arXiv preprint arXiv:1811.11731, 2018
[34] M. Braun, Q. Rao, Y. Wang, and F. Flohr, “Pose-rcnn: Joint object detection and pose estimation using 3d object proposals,” in 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2016, pp. 1546–1551
[35] L. Beyer, A. Hermans, and B. Leibe, “Biternion nets: Continuous head pose regression from discrete training labels,” in German Conference on Pattern Recognition. Springer, 2015, pp. 157–168
[36] A. Mousavian, D. Anguelov, J. J. Flynn, and J. Kosecka, “3d bounding box estimation using deep learning and geometry,” CVPR, pp. 5632–5640, 2017
[37] J. Ku, A. Harakeh, and S. L. Waslander, “In defense of classical image processing: Fast depth completion on the cpu,” in CRV, 2018
[38] K. Rematas, I. Kemelmacher-Shlizerman, B. Curless, and S. Seitz, “Soccer on your tabletop,” in CVPR, 2018
[39] Z. Cai, Q. Fan, R. Feris, and N. Vasconcelos, “A unified multi-scale deep convolutional neural network for fast object detection,” in ECCV, 2016
[40] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CVPR, pp. 770–778, 2016
[41] X. Chen, K. Kundu, Y. Zhu, A. Berneshawi, H. Ma, S. Fidler, and R. Urtasun, “3d object proposals for accurate object class detection,” in NIPS, 2015
[42] J. Ku, A. D. Pon, and S. L. Waslander, “Monocular 3d object detection leveraging accurate proposals and shape reconstruction,” in CVPR, 2019

[1] [1]

Are we ready for autonomous driving? the kitti vision benchmark suite,

A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” in CVPR, 2012

work page 2012

[2] [2]

Joint 3d proposal generation and object detection from view aggregation,

J. Ku, M. Moziﬁan, J. Lee, A. Harakeh, and S. Waslander, “Joint 3d proposal generation and object detection from view aggregation,” IROS, 2018

work page 2018

[3] [3]

Second: Sparsely embedded convolutional detection,

Y . Yan, Y . Mao, and B. Li, “Second: Sparsely embedded convolutional detection,” Sensors, vol. 18, no. 10, p. 3337, 2018

work page 2018

[4] [4]

3d-rcnn: Instance-level 3d object reconstruction via render-and-compare,

A. Kundu, Y . Li, and J. M. Rehg, “3d-rcnn: Instance-level 3d object reconstruction via render-and-compare,” in CVPR, June 2018

work page 2018

[5] [5]

Subcategory-aware convolutional neural networks for object proposals and detection,

Y . Xiang, W. Choi, Y . Lin, and S. Savarese, “Subcategory-aware convolutional neural networks for object proposals and detection,” in Applications of Computer Vision (WACV), 2017 IEEE Winter Conference on. IEEE, 2017, pp. 924–933

work page 2017

[6] [6]

Viewpoints and keypoints,

S. Tulsiani and J. Malik, “Viewpoints and keypoints,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition , 2015, pp. 1510–1519

work page 2015

[7] [7]

Frustum pointnets for 3d object detection from rgb-d data,

C. R. Qi, W. Liu, C. Wu, H. Su, and L. J. Guibas, “Frustum pointnets for 3d object detection from rgb-d data,” in CVPR, June 2018

work page 2018

[8] [8]

Leveraging pre- trained 3d object detection models for fast ground truth generation,

J. Lee, S. Walsh, A. Harakeh, and S. L. Waslander, “Leveraging pre- trained 3d object detection models for fast ground truth generation,” in 2018 21st International Conference on Intelligent Transportation Systems (ITSC) . IEEE, 2018, pp. 2504–2510

work page 2018

[9] [9]

Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views,

H. Su, C. R. Qi, Y . Li, and L. J. Guibas, “Render for cnn: Viewpoint estimation in images using cnns trained with rendered 3d model views,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2686–2694

work page 2015

[10] [10]

Deep manta: A coarse-to-ﬁne many-task network for joint 2d and 3d vehicle analysis from monocular image,

F. Chabot, M. Chaouch, J. Rabarisoa, C. Teulire, and T. Chateau, “Deep manta: A coarse-to-ﬁne many-task network for joint 2d and 3d vehicle analysis from monocular image,” in CVPR, 2017

work page 2017

[11] [11]

Objectnet3d: A large scale database for 3d object recognition,

Y . Xiang, W. Kim, W. Chen, J. Ji, C. Choy, H. Su, R. Mottaghi, L. Guibas, and S. Savarese, “Objectnet3d: A large scale database for 3d object recognition,” in European Conference on Computer Vision . Springer, 2016, pp. 160–176

work page 2016

[12] [12]

Beyond pascal: A benchmark for 3d object detection in the wild,

Y . Xiang, R. Mottaghi, and S. Savarese, “Beyond pascal: A benchmark for 3d object detection in the wild,” in IEEE Winter Conference on Applications of Computer Vision . IEEE, 2014, pp. 75–82

work page 2014

[13] [13]

Nyc3dcars: A dataset of 3d vehicles in geographic context,

K. Matzen and N. Snavely, “Nyc3dcars: A dataset of 3d vehicles in geographic context,” in Proceedings of the IEEE International Conference on Computer Vision , 2013, pp. 761–768

work page 2013

[14] [14]

Is faster r-cnn doing well for pedestrian detection?

L. Zhang, L. Lin, X. Liang, and K. He, “Is faster r-cnn doing well for pedestrian detection?” in European conference on computer vision . Springer, 2016, pp. 443–457

work page 2016

[15] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2018

[16] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden, “Pyramid methods in image processing,” RCA engineer, vol. 29, no. 6, pp. 33–41, 1984

[17] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, “Feature pyramid networks for object detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125

[18] K. He, G. Gkioxari, P. Dollár, and R. Girshick, “Mask r-cnn,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969

[19] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time object detection with region proposal networks,” in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., 2015, pp. 91–99

[20] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang, “Acquisition of localization confidence for accurate object detection,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 784–799

[21] A. Toshev and C. Szegedy, “Deeppose: Human pose estimation via deep neural networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 1653–1660

[22] J. Tompson, R. Goroshin, A. Jain, Y. LeCun, and C. Bregler, “Efficient object localization using convolutional networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 648–656

[23] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik, “Human pose estimation with iterative error feedback,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4733–4742

[24] K. J. Shih, A. Mallya, S. Singh, and D. Hoiem, “Part localization using multi-proposal consensus for fine-grained categorization,” BMVC, 2015

[25] G. Pavlakos, X. Zhou, A. Chan, K. G. Derpanis, and K. Daniilidis, “6-dof object pose from semantic keypoints,” in Robotics and Automation (ICRA), 2017 IEEE International Conference on. IEEE, 2017, pp. 2011–2018

[26] J. Wu, T. Xue, J. J. Lim, Y. Tian, J. B. Tenenbaum, A. Torralba, and W. T. Freeman, “Single image 3d interpreter network,” in European Conference on Computer Vision. Springer, 2016, pp. 365–382

[27] V. Lepetit, F. Moreno-Noguer, and P. Fua, “Epnp: An accurate o(n) solution to the pnp problem,” International journal of computer vision, vol. 81, no. 2, p. 155, 2009

[28] C.-P. Lu, G. D. Hager, and E. Mjolsness, “Fast and globally convergent pose estimation from video images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 6, pp. 610–622, 2000

[29] A. Ansar and K. Daniilidis, “Linear pose estimation from points or lines,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 5, pp. 578–589, 2003

[30] H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller, “Multi-view convolutional neural networks for 3d shape recognition,” in Proceedings of the IEEE international conference on computer vision, 2015, pp. 945–953

[31] S. Tulsiani, T. Zhou, A. A. Efros, and J. Malik, “Multi-view supervision for single-view reconstruction via differentiable ray consistency,” in CVPR, vol. 1, no. 2, 2017, p. 3

[32] S. Tulsiani, A. A. Efros, and J. Malik, “Multi-view consistency as supervisory signal for learning shape and pose prediction,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2897–2905

[33] P. Mandikal, M. Agarwal, R. V. Babu et al., “Capnet: Continuous approximation projection for 3d point cloud reconstruction using 2d supervision,” arXiv preprint arXiv:1811.11731, 2018

[34] M. Braun, Q. Rao, Y. Wang, and F. Flohr, “Pose-rcnn: Joint object detection and pose estimation using 3d object proposals,” in 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC). IEEE, 2016, pp. 1546–1551

[35] L. Beyer, A. Hermans, and B. Leibe, “Biternion nets: Continuous head pose regression from discrete training labels,” in German Conference on Pattern Recognition. Springer, 2015, pp. 157–168

[36] A. Mousavian, D. Anguelov, J. J. Flynn, and J. Kosecka, “3d bounding box estimation using deep learning and geometry,” CVPR, pp. 5632–5640, 2017

[37] J. Ku, A. Harakeh, and S. L. Waslander, “In defense of classical image processing: Fast depth completion on the cpu,” in CRV, 2018

[38] K. Rematas, I. Kemelmacher-Shlizerman, B. Curless, and S. Seitz, “Soccer on your tabletop,” in CVPR, 2018

[39] Z. Cai, Q. Fan, R. Feris, and N. Vasconcelos, “A unified multi-scale deep convolutional neural network for fast object detection,” in ECCV, 2016

[40] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” CVPR, pp. 770–778, 2016

[41] X. Chen, K. Kundu, Y. Zhu, A. Berneshawi, H. Ma, S. Fidler, and R. Urtasun, “3d object proposals for accurate object class detection,” in NIPS, 2015

[42] J. Ku, A. D. Pon, and S. L. Waslander, “Monocular 3d object detection leveraging accurate proposals and shape reconstruction,” in CVPR, 2019