MoCA3D: Monocular 3D Bounding Box Prediction in the Image Plane
Pith reviewed 2026-05-21 10:16 UTC · model grok-4.3
The pith
MoCA3D predicts projected 3D bounding box corners and depths directly in the image plane without camera intrinsics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MoCA3D formulates pixel-space localization and depth assignment as dense prediction via corner heatmaps and depth maps, allowing it to output projected 3D bounding box corners and per-corner depths without camera intrinsics at inference time.
What carries the argument
Dense prediction through corner heatmaps and depth maps that directly output image-plane corner positions and depths.
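As a rough illustration of how such dense outputs could be turned into corners, the sketch below decodes eight corner heatmaps with a hard argmax and reads each corner's depth at the peak location. The tensor layout `(8, H, W)`, the per-corner depth maps, and the function name `decode_corners` are assumptions made for this example; the paper may decode differently (for instance with a sub-pixel refinement step).

```python
import numpy as np

def decode_corners(heatmaps, depth_maps):
    """Decode corner pixels and depths from dense maps.

    heatmaps, depth_maps: arrays of shape (8, H, W) (assumed layout).
    Returns corners_uv of shape (8, 2) as (u, v) pixels and depths of shape (8,).
    """
    num_corners, height, width = heatmaps.shape
    # Hard argmax over each corner's heatmap (an assumption, not the paper's decoder).
    flat_idx = heatmaps.reshape(num_corners, -1).argmax(axis=1)
    v, u = np.unravel_index(flat_idx, (height, width))
    # Read the depth of each corner at its own peak location.
    depths = depth_maps[np.arange(num_corners), v, u]
    corners_uv = np.stack([u, v], axis=1).astype(np.float64)
    return corners_uv, depths
```

Nothing in this decoding step takes camera parameters as input, which is the property the claim rests on.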
If this is right
- Improves image-plane corner PAG by 22.8 percent while staying comparable on 3D IoU.
- Uses up to 57 times fewer trainable parameters than prior methods.
- Supports downstream tasks that require projected box corners under unknown intrinsics.
- Enables class-agnostic 3D prediction that works in the wild without calibration.
Where Pith is reading between the lines
- The method could reduce the need for per-camera calibration in robotics or augmented reality pipelines.
- Extending the heatmap formulation to other geometric outputs like keypoints or wireframes may follow naturally.
- Performance on long-tail object categories could be tested to check whether the class-agnostic design generalizes.
Load-bearing premise
Dense prediction of corner heatmaps and depth maps produces geometrically consistent projected 3D boxes without camera intrinsics or post-processing that implicitly relies on them.
What would settle it
Compare the predicted image-plane corners against the true projections of ground-truth 3D boxes on a dataset with known intrinsics, with the model receiving no intrinsics and no post-processing applied to its outputs.
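A minimal sketch of that check, assuming a pinhole camera without distortion, ground-truth corners given in the camera frame, and a plain mean pixel distance as the error. The function names are hypothetical, and this is not the paper's PAG metric, only a simple stand-in for the comparison described above.

```python
import numpy as np

def project_points(points_cam, K):
    """Project (N, 3) camera-frame points with z > 0 to (N, 2) pixels using intrinsics K (3, 3)."""
    homogeneous = points_cam @ K.T
    return homogeneous[:, :2] / homogeneous[:, 2:3]

def mean_corner_error(pred_uv, gt_corners_cam, K):
    """Mean Euclidean pixel distance between predicted corners and projected ground truth."""
    gt_uv = project_points(gt_corners_cam, K)
    return float(np.linalg.norm(pred_uv - gt_uv, axis=1).mean())
```

The intrinsics appear only on the ground-truth side of the comparison; the predicted corners are used exactly as the model emits them.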
Original abstract
Monocular 3D object understanding has largely been cast as a 2D RoI-to-3D box lifting problem. However, emerging downstream applications require image-plane geometry (e.g., projected 3D box corners) which cannot be easily obtained without known intrinsics, a problem for object detection in the wild. We introduce MoCA3D, a Monocular, Class-Agnostic 3D model that predicts projected 3D bounding box corners and per-corner depths without requiring camera intrinsics at inference time. MoCA3D formulates pixel-space localization and depth assignment as dense prediction via corner heatmaps and depth maps. To evaluate image-plane geometric fidelity, we propose Pixel-Aligned Geometry (PAG), which directly measures image-plane corner and depth consistency. Extensive experiments demonstrate that MoCA3D achieves state-of-the-art performance, improving image-plane corner PAG by 22.8% while remaining comparable on 3D IoU, using up to 57 times fewer trainable parameters. Finally, we apply MoCA3D to downstream tasks which were previously impractical under unknown intrinsics, highlighting its utility beyond standard baseline models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MoCA3D, a monocular class-agnostic model for predicting projected 3D bounding box corners directly in the image plane along with per-corner depths. It formulates the task as dense prediction using corner heatmaps and depth maps, avoiding explicit camera intrinsics at inference. A new Pixel-Aligned Geometry (PAG) metric is proposed to evaluate image-plane corner and depth consistency. Experiments report state-of-the-art results with a 22.8% PAG improvement on image-plane corners, comparable 3D IoU to baselines, up to 57x fewer trainable parameters, and demonstrations on downstream tasks previously limited by unknown intrinsics.
Significance. If the geometric consistency claims hold under the no-intrinsics setting, the work would enable practical 3D object understanding for applications in the wild where camera parameters are unavailable. The parameter reduction and direct image-plane formulation are strengths, as is the introduction of the PAG metric for evaluating projected geometry. These elements could influence downstream tasks like augmented reality or robotics under variable camera conditions.
major comments (3)
- §3.2 (Method, dense prediction formulation): The central claim that independent predictions of 8 corner heatmaps and associated depth maps yield geometrically valid 3D boxes without camera intrinsics at inference requires an explicit description of the back-projection or lifting procedure. If this step implicitly relies on fixed focal length or principal point statistics from the training distribution (as is common in monocular depth methods), the reported 22.8% PAG gain and 'no intrinsics' property become dataset-specific rather than general; a robustness test across varied intrinsics should be added.
- Table 3 (Quantitative results): The PAG improvement of 22.8% is load-bearing for the image-plane claim, yet the table lacks per-category breakdown or error analysis on depth consistency for the corner predictions. Without this, it is unclear whether the gains stem from better localization, depth estimation, or post-processing that could reintroduce implicit intrinsics assumptions.
- §4.3 (Ablation studies): The parameter reduction (up to 57x) is highlighted as a strength, but the ablation does not isolate the contribution of the class-agnostic design versus the dense heatmap formulation; this is needed to confirm that the efficiency does not trade off against the geometric consistency required by the PAG metric.
minor comments (2)
- Figure 2 (Qualitative results): The visualized projected boxes would benefit from overlaying the predicted depth values at corners to directly illustrate PAG consistency.
- §2 (Related work): The discussion of prior monocular 3D methods could include more recent dense-prediction approaches for direct comparison of the no-intrinsics advantage.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments on our manuscript. We have carefully considered each point and provide detailed responses below. We plan to incorporate several revisions to address the concerns raised.
- Referee: §3.2 (Method, dense prediction formulation): The central claim that independent predictions of 8 corner heatmaps and associated depth maps yield geometrically valid 3D boxes without camera intrinsics at inference requires an explicit description of the back-projection or lifting procedure. If this step implicitly relies on fixed focal length or principal point statistics from the training distribution (as is common in monocular depth methods), the reported 22.8% PAG gain and 'no intrinsics' property become dataset-specific rather than general; a robustness test across varied intrinsics should be added.
Authors: We thank the referee for highlighting this important clarification. In the revised manuscript, we will expand §3.2 to provide a detailed description of the inference procedure. Specifically, MoCA3D predicts the 2D image-plane locations of the eight bounding box corners using heatmaps and assigns depths to each corner via depth maps. These predictions are made directly in pixel space without any camera parameters as input to the network. For tasks that operate purely in the image plane, such as computing the PAG metric or certain downstream applications like augmented reality overlays, no back-projection is required, and thus no intrinsics are needed at inference. When 3D metrics like 3D IoU are computed for comparison with baselines, we use the ground-truth camera intrinsics from the evaluation dataset to lift the predicted corners and depths into 3D space; however, this lifting is performed post-inference and is not part of the model's forward pass. Regarding potential implicit reliance on training distribution statistics, we acknowledge that monocular depth estimation can be sensitive to focal length variations. We will add a discussion of this limitation in the revised paper. We will also include qualitative examples from datasets with different camera parameters to support the claims.
Revision: partial
- Referee: Table 3 (Quantitative results): The PAG improvement of 22.8% is load-bearing for the image-plane claim, yet the table lacks per-category breakdown or error analysis on depth consistency for the corner predictions. Without this, it is unclear whether the gains stem from better localization, depth estimation, or post-processing that could reintroduce implicit intrinsics assumptions.
Authors: We agree that a more detailed breakdown would strengthen the presentation of results. In the revised version, we will augment Table 3 with per-category performance metrics for the PAG scores. Additionally, we will include a new subsection or paragraph in the experiments section providing an error analysis focused on depth consistency for the predicted corners. This analysis will help attribute the observed improvements to specific components of the model, such as localization accuracy versus depth estimation quality, and confirm that no post-processing steps reintroduce intrinsics assumptions.
Revision: yes
- Referee: §4.3 (Ablation studies): The parameter reduction (up to 57x) is highlighted as a strength, but the ablation does not isolate the contribution of the class-agnostic design versus the dense heatmap formulation; this is needed to confirm that the efficiency does not trade off against the geometric consistency required by the PAG metric.
Authors: We appreciate this suggestion for improving the ablation analysis. We will revise §4.3 to include additional ablation experiments that separately evaluate the impact of the class-agnostic aspect (e.g., by comparing against class-specific variants) and the dense heatmap formulation. These new ablations will report PAG metrics to demonstrate that the efficiency gains from our design choices maintain or improve geometric consistency without trade-offs.
Revision: yes
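The post-inference lifting step described in the response to the §3.2 comment can be sketched as a standard pinhole back-projection. This is an illustration under stated assumptions, not the authors' code: the predicted depth is taken to be the z-coordinate along the optical axis (not distance along the ray), the camera has no skew or lens distortion, and `lift_to_camera` is a hypothetical name.

```python
import numpy as np

def lift_to_camera(corners_uv, depths, K):
    """Back-project (N, 2) pixel corners with (N,) z-depths into (N, 3) camera-frame points.

    K is a 3x3 pinhole intrinsics matrix; zero skew is assumed.
    """
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (corners_uv[:, 0] - cx) * depths / fx
    y = (corners_uv[:, 1] - cy) * depths / fy
    return np.stack([x, y, depths], axis=1)
```

Under this model, the same pixel corners and depths lifted with a different focal length produce a different metric box, which is why the referee's question about sensitivity to intrinsics matters for the 3D IoU comparison even though the image-plane outputs themselves do not use K.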
Circularity Check
No circularity: predictions learned from supervision, not reduced by construction
The paper introduces a neural network that performs dense prediction of corner heatmaps and per-corner depth maps to output image-plane 3D box corners. All claims rest on empirical training and evaluation against ground-truth annotations using standard losses and metrics (PAG, 3D IoU). No equations, self-citations, or ansatzes are invoked to derive the outputs from themselves; the architecture is a standard encoder-decoder trained end-to-end. The 'no intrinsics' property is an explicit design choice enabled by predicting depths directly in image space rather than lifting via camera parameters. Performance gains are demonstrated via ablation and comparison tables, not by re-labeling fitted quantities as predictions. This is a self-contained empirical contribution with no load-bearing self-referential steps.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
unclear: Relation between the paper passage and the cited Recognition theorem.
Paper passage: MoCA3D formulates pixel-space localization and depth assignment as dense prediction via corner heatmaps and depth maps... predicts projected 3D bounding box corners and per-corner depths without requiring camera intrinsics at inference time.
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
unclear: Relation between the paper passage and the cited Recognition theorem.
Paper passage: We supervise MoCA3D with a peak-weighted heatmap regression loss and a coordinate-level refinement loss... virtual depth parameterization
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.