
arxiv: 2603.19538 · v2 · pith:A3WUHTWXnew · submitted 2026-03-20 · 💻 cs.CV

MoCA3D: Monocular 3D Bounding Box Prediction in the Image Plane

Pith reviewed 2026-05-21 10:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords monocular 3D detection · image-plane geometry · corner heatmaps · depth maps · pixel-aligned geometry · projected bounding boxes · class-agnostic 3D prediction

The pith

MoCA3D predicts projected 3D bounding box corners and depths directly in the image plane without camera intrinsics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes monocular 3D object detection as a pixel-space task rather than a 2D-to-3D lifting problem. It shows that dense prediction of corner heatmaps and depth maps can localize projected box corners and assign depths while remaining independent of camera parameters at inference. This matters for applications that need accurate image-plane geometry in unconstrained environments where intrinsics are unavailable. The approach delivers stronger geometric consistency on a new Pixel-Aligned Geometry metric while matching standard 3D IoU scores with far fewer parameters.

Core claim

MoCA3D formulates pixel-space localization and depth assignment as dense prediction via corner heatmaps and depth maps, allowing it to output projected 3D bounding box corners and per-corner depths without camera intrinsics at inference time.

What carries the argument

Dense prediction through corner heatmaps and depth maps that directly output image-plane corner positions and depths.
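
As a rough illustration of how dense maps can be turned into corner positions and depths, the sketch below decodes eight heatmaps with a soft-argmax (the Figure 2 caption mentions soft-argmax) and reads each corner's depth as a heatmap-weighted average of its depth map. The depth readout, the function names `soft_argmax_2d` and `decode_corners`, the `temperature` parameter, and the array shapes are assumptions made for illustration; the page does not describe the paper's actual decoding step.

```python
import numpy as np

def soft_argmax_2d(logits, temperature=1.0):
    """Return the expected (u, v) pixel location and the softmax weights."""
    h, w = logits.shape
    flat = logits.reshape(-1) / temperature
    flat = flat - flat.max()              # subtract max for numerical stability
    probs = np.exp(flat)
    probs /= probs.sum()
    probs = probs.reshape(h, w)
    us = np.arange(w, dtype=np.float64)   # column index = u (x direction)
    vs = np.arange(h, dtype=np.float64)   # row index = v (y direction)
    u = float((probs.sum(axis=0) * us).sum())
    v = float((probs.sum(axis=1) * vs).sum())
    return (u, v), probs

def decode_corners(heatmap_logits, depth_maps, temperature=1.0):
    """heatmap_logits, depth_maps: arrays of shape (8, H, W).

    Returns (8, 2) pixel coordinates and (8,) depths. No camera
    intrinsics are used anywhere in this decoding.
    """
    corners, depths = [], []
    for logits, depth in zip(heatmap_logits, depth_maps):
        (u, v), probs = soft_argmax_2d(logits, temperature)
        corners.append((u, v))
        # Assumption: depth is read out as a heatmap-weighted average.
        depths.append(float((probs * depth).sum()))
    return np.array(corners), np.array(depths)

# Toy input: one sharp peak per corner and a constant depth map per corner.
H, W = 32, 48
logits = np.full((8, H, W), -50.0)
depth_maps = np.zeros((8, H, W))
for i in range(8):
    logits[i, 4 + i, 10 + 2 * i] = 50.0
    depth_maps[i] += 5.0 + i

corners_uv, corner_depths = decode_corners(logits, depth_maps)
```

Because the readout is an expectation over pixels, sharper heatmaps give more stable coordinates, which is consistent with the peak-weighting remark in the Figure 2 caption.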

If this is right

  • Improves image-plane corner PAG by 22.8 percent while staying comparable on 3D IoU.
  • Uses up to 57 times fewer trainable parameters than prior methods.
  • Supports downstream tasks that require projected box corners under unknown intrinsics.
  • Enables class-agnostic 3D prediction that works in the wild without calibration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could reduce the need for per-camera calibration in robotics or augmented reality pipelines.
  • Extending the heatmap formulation to other geometric outputs like keypoints or wireframes may follow naturally.
  • Performance on long-tail object categories could be tested to check whether the class-agnostic design generalizes.

Load-bearing premise

Dense prediction of corner heatmaps and depth maps produces geometrically consistent projected 3D boxes without camera intrinsics or post-processing that implicitly relies on them.

What would settle it

Compare predicted image-plane corners against the true projections of ground-truth 3D boxes on a dataset with known intrinsics, when the model receives no intrinsics and no post-processing step.
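
A minimal sketch of such a check follows. It projects a ground-truth box with known intrinsics and scores predicted corners by mean pixel distance and mean relative depth error. These two scores are stand-ins chosen to match the units shown for PAGuv (px) and PAGd (%) in the table excerpt near Figure 4; the exact PAG definition is not given on this page. The box parameterization, the corner ordering, and the helper names `box_corners_3d`, `project`, and `corner_errors` are hypothetical.

```python
import numpy as np

def box_corners_3d(center, dims, yaw):
    """Eight corners of a box in camera coordinates (x right, y down, z forward).

    dims = (length, height, width); yaw is a rotation about the y axis.
    """
    l, h, w = dims
    x = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * l / 2
    y = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * h / 2
    z = np.array([1, -1, 1, -1, 1, -1, 1, -1]) * w / 2
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])
    return (R @ np.stack([x, y, z])).T + np.asarray(center)

def project(points_cam, K):
    """Pinhole projection; returns (N, 2) pixels and (N,) z-depths."""
    uvw = (K @ points_cam.T).T
    return uvw[:, :2] / uvw[:, 2:3], points_cam[:, 2]

def corner_errors(pred_uv, pred_depth, gt_uv, gt_depth):
    """Mean pixel distance and mean relative depth error in percent."""
    pixel_err = np.linalg.norm(pred_uv - gt_uv, axis=1).mean()
    depth_err_pct = 100.0 * (np.abs(pred_depth - gt_depth) / gt_depth).mean()
    return pixel_err, depth_err_pct

# Ground truth uses the known intrinsics; the prediction never sees K.
K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
gt_corners = box_corners_3d(center=(0.0, 0.0, 10.0), dims=(4.0, 1.5, 2.0), yaw=0.3)
gt_uv, gt_depth = project(gt_corners, K)

# Stand-in "prediction": shifted by (3, 4) pixels and 10% too deep.
pred_uv = gt_uv + np.array([3.0, 4.0])
pred_depth = gt_depth * 1.1

pixel_err, depth_err_pct = corner_errors(pred_uv, pred_depth, gt_uv, gt_depth)
```

In a real test the stand-in prediction would be replaced by model outputs, and corner ordering would have to be matched between prediction and ground truth before scoring.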

Figures

Figures reproduced from arXiv: 2603.19538 by Achuta Kadambi, Changwoo Jeon, Rishi Upadhyay.

Figure 1
Figure 1: MoCA3D architecture. Given an RGB image and a tight oracle 2D bounding box, MoCA3D uses a frozen DINOv3 backbone and a box-conditioned 3D Geometry Transformer with dense modules to predict eight corner heatmaps and per-corner depth maps, yielding pixel-aligned projected 3D box corners and depths.
Excerpt from the surrounding text (§3.1 Overview): Given a single RGB image $I \in \mathbb{R}^{H \times W \times 3}$ and a 2D bounding box $b = (x_1, y_1, x_2, y_2)$ around an…
Figure 2
Figure 2: Heatmap Comparison. Predicted corner heatmaps with (a) peak weight = 50.0 and (b) = 1.0 (uniform). Larger peak weight sharpens and localizes responses near GT corners, improving soft-argmax stability, while uniform weighting yields flatter heatmaps.
Excerpt from the surrounding text: … exceeds a threshold $\tau$:
\[
\mathbf{A}_i(x,y) =
\begin{cases}
\lambda, & \mathbf{W}_i(x,y) > \tau, \\
1, & \text{otherwise},
\end{cases}
\]
…
Figure 3
Figure 3: Qualitative Results. MoCA3D vs. DetAny3D predictions under oracle 2D boxes on samples from the KITTI, Omni3D, and Hypersim datasets. Detections in green have a 3D IoU of less than 0.1, making them low-quality detections.
Excerpt from the surrounding text: … 3D lifting to predict a 3D box. This enables an apples-to-apples comparison with MoCA3D; both can be mapped into a shared representation using camera in-…
Figure 4
Figure 4: Efficiency of MoCA3D. We compare (a) trainable parameters and (b) trade-off between efficiency and performance (PAG_uv). Efficiency is defined as the inverse of end-to-end inference time per example on CV-Bench [43].
Adjacent table (truncated in source):
Method              PAG_uv ↓ (px)   PAG_d ↓ (%)   IoU_3D ↑   #Params (M)   Train (GPU-hrs)
MoCA3D (baseline)   16.05           10.83         0.3768     19.0          27.0
MoCA3D w/ DA        16.14           10.92         0.3682     19.8          –
∆ vs. MoCA3D        +0.09           +0.09         -0.0086
MoCA3D …
Figure 5
Figure 5: Driving scene variation guided by MoCA3D.
Figure 6
Figure 6: MoCA3D-Cube Adapter. MoCA3D-Cube combines MoCA3D's predicted corner-depth pairs with an RoI-aligned decoder feature and camera intrinsics K to regress a parametric 3D bounding box. The adapter reuses image-plane geometry as the primary cue while adding a lightweight Cube MLP for conventional box prediction.
Excerpt from the surrounding text: … where $f_v$ and $H_v$ are fixed hyperparameters ($f_v = H_v = 512.0$ in our implementation) shared across the dat…
Figure 7
Figure 7: Qualitative comparisons under oracle 2D boxes.
Figure 8
Figure 8: Additional driving scene generation results.
Figure 9
Figure 9: Failure Cases. We show representative failure cases of MoCA3D. For each example, the left panel shows the input image with the tight oracle 2D box, and the right panel shows the corresponding MoCA3D prediction.
Excerpt from the surrounding text: … predicts pixel-aligned geometry directly in the image plane, its predictions are naturally constrained by the visible image extent. As a result, when a portion of the object lies outside the frame, …
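
The peak-weighting rule quoted in the Figure 2 excerpt (weight λ where W_i exceeds τ, and 1 elsewhere) can be sketched as below. The value λ = 50.0 comes from the Figure 2 caption. Everything else is an assumption for illustration: that W_i is a Gaussian target heatmap, the threshold value `tau=0.1`, the weighted mean-squared-error form of the loss, and the helper names `peak_weight_map` and `weighted_mse`.

```python
import numpy as np

def peak_weight_map(target_heatmap, tau=0.1, lam=50.0):
    """A_i(x, y) = lam where the target exceeds tau, else 1."""
    return np.where(target_heatmap > tau, lam, 1.0)

def weighted_mse(pred, target, tau=0.1, lam=50.0):
    """Assumed loss form: squared error averaged with the peak weights."""
    weights = peak_weight_map(target, tau, lam)
    return float((weights * (pred - target) ** 2).sum() / weights.sum())

# Toy target: a Gaussian bump centred at (u=12, v=8) on a 16 x 24 grid.
ys, xs = np.mgrid[0:16, 0:24]
target = np.exp(-((xs - 12) ** 2 + (ys - 8) ** 2) / (2 * 2.0 ** 2))
weights = peak_weight_map(target)

# An all-zero prediction misses the peak entirely.
pred = np.zeros_like(target)
loss_peaked = weighted_mse(pred, target, lam=50.0)
loss_uniform = weighted_mse(pred, target, lam=1.0)  # uniform weighting
```

With λ = 1 the rule reduces to uniform weighting, the (b) setting in Figure 2. A larger λ makes errors near the corner dominate the average, so a flat prediction is penalized more heavily.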
Original abstract

Monocular 3D object understanding has largely been cast as a 2D RoI-to-3D box lifting problem. However, emerging downstream applications require image-plane geometry (e.g., projected 3D box corners) which cannot be easily obtained without known intrinsics, a problem for object detection in the wild. We introduce MoCA3D, a Monocular, Class-Agnostic 3D model that predicts projected 3D bounding box corners and per-corner depths without requiring camera intrinsics at inference time. MoCA3D formulates pixel-space localization and depth assignment as dense prediction via corner heatmaps and depth maps. To evaluate image-plane geometric fidelity, we propose Pixel-Aligned Geometry (PAG), which directly measures image-plane corner and depth consistency. Extensive experiments demonstrate that MoCA3D achieves state-of-the-art performance, improving image-plane corner PAG by 22.8% while remaining comparable on 3D IoU, using up to 57 times fewer trainable parameters. Finally, we apply MoCA3D to downstream tasks which were previously impractical under unknown intrinsics, highlighting its utility beyond standard baseline models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces MoCA3D, a monocular class-agnostic model for predicting projected 3D bounding box corners directly in the image plane along with per-corner depths. It formulates the task as dense prediction using corner heatmaps and depth maps, avoiding explicit camera intrinsics at inference. A new Pixel-Aligned Geometry (PAG) metric is proposed to evaluate image-plane corner and depth consistency. Experiments report state-of-the-art results with a 22.8% PAG improvement on image-plane corners, comparable 3D IoU to baselines, up to 57x fewer trainable parameters, and demonstrations on downstream tasks previously limited by unknown intrinsics.

Significance. If the geometric consistency claims hold under the no-intrinsics setting, the work would enable practical 3D object understanding for applications in the wild where camera parameters are unavailable. The parameter reduction and direct image-plane formulation are strengths, as is the introduction of the PAG metric for evaluating projected geometry. These elements could influence downstream tasks like augmented reality or robotics under variable camera conditions.

major comments (3)
  1. [§3.2] §3.2 (Method, dense prediction formulation): The central claim that independent predictions of 8 corner heatmaps and associated depth maps yield geometrically valid 3D boxes without camera intrinsics at inference requires an explicit description of the back-projection or lifting procedure. If this step implicitly relies on fixed focal length or principal point statistics from the training distribution (as is common in monocular depth methods), the reported 22.8% PAG gain and 'no intrinsics' property become dataset-specific rather than general; a robustness test across varied intrinsics should be added.
  2. [Table 3] Table 3 (Quantitative results): The PAG improvement of 22.8% is load-bearing for the image-plane claim, yet the table lacks per-category breakdown or error analysis on depth consistency for the corner predictions. Without this, it is unclear whether the gains stem from better localization, depth estimation, or post-processing that could reintroduce implicit intrinsics assumptions.
  3. [§4.3] §4.3 (Ablation studies): The parameter reduction (up to 57x) is highlighted as a strength, but the ablation does not isolate the contribution of the class-agnostic design versus the dense heatmap formulation; this is needed to confirm that the efficiency does not trade off against the geometric consistency required by the PAG metric.
minor comments (2)
  1. [Figure 2] Figure 2 (Qualitative results): The visualized projected boxes would benefit from overlaying the predicted depth values at corners to directly illustrate PAG consistency.
  2. [§2] §2 (Related work): The discussion of prior monocular 3D methods could include more recent dense-prediction approaches for direct comparison of the no-intrinsics advantage.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We have carefully considered each point and provide detailed responses below. We plan to incorporate several revisions to address the concerns raised.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Method, dense prediction formulation): The central claim that independent predictions of 8 corner heatmaps and associated depth maps yield geometrically valid 3D boxes without camera intrinsics at inference requires an explicit description of the back-projection or lifting procedure. If this step implicitly relies on fixed focal length or principal point statistics from the training distribution (as is common in monocular depth methods), the reported 22.8% PAG gain and 'no intrinsics' property become dataset-specific rather than general; a robustness test across varied intrinsics should be added.

    Authors: We thank the referee for highlighting this important clarification. In the revised manuscript, we will expand §3.2 to provide a detailed description of the inference procedure. Specifically, MoCA3D predicts the 2D image-plane locations of the eight bounding box corners using heatmaps and assigns depths to each corner via depth maps. These predictions are made directly in pixel space without any camera parameters as input to the network. For tasks that operate purely in the image plane, such as computing the PAG metric or certain downstream applications like augmented reality overlays, no back-projection is required, and thus no intrinsics are needed at inference. When 3D metrics like 3D IoU are computed for comparison with baselines, we use the ground-truth camera intrinsics from the evaluation dataset to lift the predicted corners and depths into 3D space; however, this lifting is performed post-inference and is not part of the model's forward pass. Regarding potential implicit reliance on training distribution statistics, we acknowledge that monocular depth estimation can be sensitive to focal length variations. We will add a discussion of this limitation in the revised paper. We will also include qualitative examples from datasets with different camera parameters to support the claims. revision: partial

  2. Referee: [Table 3] Table 3 (Quantitative results): The PAG improvement of 22.8% is load-bearing for the image-plane claim, yet the table lacks per-category breakdown or error analysis on depth consistency for the corner predictions. Without this, it is unclear whether the gains stem from better localization, depth estimation, or post-processing that could reintroduce implicit intrinsics assumptions.

    Authors: We agree that a more detailed breakdown would strengthen the presentation of results. In the revised version, we will augment Table 3 with per-category performance metrics for the PAG scores. Additionally, we will include a new subsection or paragraph in the experiments section providing an error analysis focused on depth consistency for the predicted corners. This analysis will help attribute the observed improvements to specific components of the model, such as localization accuracy versus depth estimation quality, and confirm that no post-processing steps reintroduce intrinsics assumptions. revision: yes

  3. Referee: [§4.3] §4.3 (Ablation studies): The parameter reduction (up to 57x) is highlighted as a strength, but the ablation does not isolate the contribution of the class-agnostic design versus the dense heatmap formulation; this is needed to confirm that the efficiency does not trade off against the geometric consistency required by the PAG metric.

    Authors: We appreciate this suggestion for improving the ablation analysis. We will revise §4.3 to include additional ablation experiments that separately evaluate the impact of the class-agnostic aspect (e.g., by comparing against class-specific variants) and the dense heatmap formulation. These new ablations will report PAG metrics to demonstrate that the efficiency gains from our design choices maintain or improve geometric consistency without trade-offs. revision: yes
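
The post-inference lifting described in the first response (ground-truth intrinsics applied to predicted corners and depths) can be sketched with a standard pinhole back-projection. This is an illustration rather than the authors' evaluation code: it assumes the predicted depth is z-depth along the optical axis, and `backproject`, `K_wrong`, and the toy values are hypothetical.

```python
import numpy as np

def backproject(corners_uv, depths, K):
    """Lift (u, v, z-depth) to camera-frame 3D points: X = z * K^{-1} [u, v, 1]^T."""
    ones = np.ones((corners_uv.shape[0], 1))
    pixels_h = np.hstack([corners_uv, ones])
    rays = (np.linalg.inv(K) @ pixels_h.T).T   # third component equals 1
    return rays * depths[:, None]

# Round trip with the correct intrinsics.
K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
points = np.array([[1.0, -0.5, 8.0], [-2.0, 0.3, 12.0]])
uvw = (K @ points.T).T
corners_uv = uvw[:, :2] / uvw[:, 2:3]
depths = points[:, 2]
recovered = backproject(corners_uv, depths, K)

# Same image-plane outputs lifted with a doubled focal length.
K_wrong = K.copy()
K_wrong[0, 0] = K_wrong[1, 1] = 1400.0
recovered_wrong = backproject(corners_uv, depths, K_wrong)
```

The second call shows why intrinsics matter only at the lifting stage: the image-plane corners and depths are unchanged, but doubling the focal length halves the recovered lateral coordinates. Whether the predicted depths themselves shift with focal length is the separate robustness question raised in the referee's first comment.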

Circularity Check

0 steps flagged

No circularity: predictions learned from supervision, not reduced by construction

Full rationale

The paper introduces a neural network that performs dense prediction of corner heatmaps and per-corner depth maps to output image-plane 3D box corners. All claims rest on empirical training and evaluation against ground-truth annotations using standard losses and metrics (PAG, 3D IoU). No equations, self-citations, or ansatzes are invoked to derive the outputs from themselves; the architecture is a standard encoder-decoder trained end-to-end. The 'no intrinsics' property is an explicit design choice enabled by predicting depths directly in image space rather than lifting via camera parameters. Performance gains are demonstrated via ablation and comparison tables, not by re-labeling fitted quantities as predictions. This is a self-contained empirical contribution with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.0 · 5741 in / 1007 out tokens · 36314 ms · 2026-05-21T10:16:52.773870+00:00 · methodology



Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 8 internal anchors

  [1] Ahmadyan, A., Zhang, L., Ablavatski, A., Wei, J., Grundmann, M.: Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. CVPR (2021)
  [2] Baruch, G., Chen, Z., Dehghan, A., Dimry, T., Feigin, Y., Fu, P., Gebauer, T., Joffe, B., Kurz, D., Schwartz, A., Shulman, E.: ARKitscenes - a diverse real-world dataset for 3d indoor scene understanding using mobile RGB-d data. In: NeurIPS Datasets and Benchmarks Track (Round 1) (2021)
  [3] Bhat, S.F., Mitra, N., Wonka, P.: Loosecontrol: Lifting controlnet for generalized depth conditioning. In: ACM SIGGRAPH 2024 Conference Papers. pp. 1–11 (2024)
  [4] Bian, W., Wang, Z., Vedaldi, A.: Catfree3d: Category-agnostic 3d object detection with diffusion. In: 2025 International Conference on 3D Vision (3DV). pp. 101–111. IEEE (2025)
  [5] Brazil, G., Kumar, A., Straub, J., Ravi, N., Johnson, J., Gkioxari, G.: Omni3D: A large benchmark and model for 3D object detection in the wild. In: CVPR. IEEE, Vancouver, Canada (June 2023)
  [6] Caesar, H., Bankiti, V., Lang, A.H., Vora, S., Liong, V.E., Xu, Q., Krishnan, A., Pan, Y., Baldan, G., Beijbom, O.: nuscenes: A multimodal dataset for autonomous driving. In: CVPR (2020)
  [7] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European conference on computer vision. pp. 213–229. Springer (2020)
  [8] Choi, C., Baek, S.M., Lee, S.: Real-time 3d object pose estimation and tracking for natural landmark based visual servo. In: 2008 IEEE/RSJ International Conference on Intelligent Robots and Systems. pp. 3983–3989. IEEE (2008)
  [9] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848
  [10] Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? the kitti vision benchmark suite. In: CVPR (2012)
  [11] Glorot, X., Bordes, A., Bengio, Y.: Deep sparse rectifier neural networks. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics. pp. 315–323. JMLR Workshop and Conference Proceedings (2011)
  [12] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  [13] He, X., Zhou, S., Venkateswaran, T., Zheng, K., Wan, Z., Kadambi, A., Wang, X.E.: Morphosim: An interactive, controllable, and editable language-guided 4d world simulator. arXiv preprint arXiv:2510.04390 (2025)
  [14] Kabsch, W.: A solution for the best rotation to relate two sets of vectors. Foundations of Crystallography 32(5), 922–923 (1976)
  [15] Kim, J., Lee, G., Kim, J.S., Kim, H.J., Kim, K.: Monocular 3d object detection for an indoor robot environment. In: 2020 29th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). pp. 438–445. IEEE (2020)
  [16] Krishnan, A., Kundu, A., Maninis, K.K., Hays, J., Brown, M.: Omninocs: A unified nocs dataset and model for 3d lifting of 2d objects. In: European Conference on Computer Vision. pp. 127–145. Springer (2024)
  [17] Labbé, Y., Carpentier, J., Aubry, M., Sivic, J.: Cosypose: Consistent multi-view multi-object 6d pose estimation. In: European conference on computer vision. pp. 574–591. Springer (2020)
  [18] Lassner, C., Zollhöfer, M.: Pulsar: Efficient sphere-based neural rendering. arXiv:2004.07484 (2020)
  [19] Law, H., Deng, J.: Cornernet: Detecting objects as paired keypoints. In: Proceedings of the European conference on computer vision (ECCV). pp. 734–750 (2018)
  [20] Li, F., Zhang, H., Liu, S., Guo, J., Ni, L.M., Zhang, L.: Dn-detr: Accelerate detr training by introducing query denoising. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 13619–13627 (2022)
  [21] Li, P., Zhao, H.: Monocular 3d detection with geometric constraint embedding and semi-supervised training. IEEE Robotics and Automation Letters 6(3), 5565–5572 (2021)
  [22] Li, P., Zhao, H., Liu, P., Cao, F.: Rtm3d: Real-time monocular 3d detection from object keypoints for autonomous driving. In: European Conference on Computer Vision. pp. 644–660. Springer (2020)
  [23] Lin, H., Chen, S., Liew, J.H., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B.: Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647 (2025)
  [24] Liu, L., Li, H., Gruteser, M.: Edge assisted real-time object detection for mobile augmented reality. In: The 25th annual international conference on mobile computing and networking. pp. 1–16 (2019)
  [25] Liu, S., Li, F., Zhang, H., Yang, X., Qi, X., Su, H., Zhu, J., Zhang, L.: Dab-detr: Dynamic anchor boxes are better queries for detr. arXiv preprint arXiv:2201.12329 (2022)
  [26] Liu, Z., Wu, Z., Tóth, R.: Smoke: Single-stage monocular 3d object detection via keypoint estimation. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops. pp. 996–997 (2020)
  [27] Loshchilov, I., Hutter, F.: Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
  [28] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
  [29] Luvizon, D.C., Tabia, H., Picard, D.: Human pose regression by combining indirect part detection and contextual information. Computers & Graphics 85, 15–22 (2019)
  [30] Meng, D., Chen, X., Fan, Z., Zeng, G., Li, H., Yuan, Y., Sun, L., Wang, J.: Conditional detr for fast training convergence. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3651–3660 (2021)
  [31] Mousavian, A., Anguelov, D., Flynn, J., Kosecka, J.: 3d bounding box estimation using deep learning and geometry. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. pp. 7074–7082 (2017)
  [32] Pandey, K., Guerrero, P., Gadelha, M., Hold-Geoffroy, Y., Singh, K., Mitra, N.J.: Diffusion handles enabling 3d edits for diffusion models by lifting activations to 3d. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 7695–7704 (2024)
  [33] Park, Y., Lepetit, V., Woo, W.: Multiple 3d object tracking for augmented reality. In: 2008 7th IEEE/ACM International Symposium on Mixed and Augmented Reality. pp. 117–120. IEEE (2008)
  [34] Qin, Z., Wang, J., Lu, Y.: Monogrnet: A general framework for monocular 3d object detection. IEEE transactions on pattern analysis and machine intelligence 44(9), 5170–5184 (2021)
  [35] Ravi, N., Gabeur, V., Hu, Y.T., Hu, R., Ryali, C., Ma, T., Khedr, H., Rädle, R., Rolland, C., Gustafson, L., Mintun, E., Pan, J., Alwala, K.V., Carion, N., Wu, C.Y., Girshick, R., Dollár, P., Feichtenhofer, C.: Sam 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714 (2024), https://arxiv.org/abs/2408.00714
  [36] Roberts, M., Ramapuram, J., Ranjan, A., Kumar, A., Bautista, M.A., Paczan, N., Webb, R., Susskind, J.M.: Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In: ICCV (2021)
  [37] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)
  [38] Sajnani, R., Vanbaar, J., Min, J., Katyal, K.D., Sridhar, S.: Geodiffuser: Geometry-based image editing with diffusion models. In: Proceedings of the Winter Conference on Applications of Computer Vision. pp. 472–482 (2025)
  [39] Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., Massa, F., Haziza, D., Wehrstedt, L., Wang, J., Darcet, T., Moutakanni, T., Sentana, L., Roberts, C., Vedaldi, A., Tolan, J., Brandt, J., Couprie, C., Mairal, J., Jégou, H., Labatut, P., Bojanowski, P.: DINOv3 (2025), https://ar...
  [40] Song, S., Lichtenberg, S.P., Xiao, J.: Sun rgb-d: A rgb-d scene understanding benchmark suite. In: CVPR (2015)
  [41] Swerdlow, A., Xu, R., Zhou, B.: Street-view image generation from a bird's-eye view layout. IEEE Robotics and Automation Letters 9(4), 3578–3585 (2024)
  [42] Tekin, B., Sinha, S.N., Fua, P.: Real-time seamless single shot 6d object pose prediction. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 292–301 (2018)
  [43] Tong, S., Brown, E., Wu, P., Woo, S., Middepogu, M., Akula, S.C., Yang, J., Yang, S., Iyer, A., Pan, X., Wang, A., Fergus, R., LeCun, Y., Xie, S.: Cambrian-1: A fully open, vision-centric exploration of multimodal llms (2024)
  [44] Tremblay, J., To, T., Sundaralingam, B., Xiang, Y., Fox, D., Birchfield, S.: Deep object pose estimation for semantic robotic grasping of household objects. arXiv preprint arXiv:1809.10790 (2018)
  [45] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)
  [46] Wan, B., Shi, Y., Xu, K.: Socs: Semantically-aware object coordinate space for category-level 6d object pose estimation under large shape variations. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 14065–14074 (2023)
  [47] Wang, G., Wu, J., Tian, B., Teng, S., Chen, L., Cao, D.: Centernet3d: An anchor free object detector for point cloud. IEEE Transactions on Intelligent Transportation Systems 23(8), 12953–12965 (2021)
  [48] Wang, H., Sridhar, S., Huang, J., Valentin, J., Song, S., Guibas, L.J.: Normalized object coordinate space for category-level 6d object pose and size estimation. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
  [49] Wu, Z., Rubanova, Y., Kabra, R., Hudson, D.A., Gilitschenski, I., Aytar, Y., Van Steenkiste, S., Allen, K.R., Kipf, T.: Neural assets: 3d-aware multi-object scene synthesis with image diffusion models. Advances in Neural Information Processing Systems 37, 76289–76318 (2024)
  [50] Yang, K., Ma, E., Peng, J., Guo, Q., Lin, D., Yu, K.: Bevcontrol: Accurately controlling street-view elements with multi-perspective consistency via bev sketch layout. arXiv preprint arXiv:2308.01661 (2023)
  [51] Yang, Y.H., Piccinelli, L., Segu, M., Li, S., Huang, R., Fu, Y., Pollefeys, M., Blum, H., Bauer, Z.: 3d-mood: Lifting 2d to 3d for monocular open-set object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7429–7439 (2025)
  [52] Yang, Z., Guo, X., Ding, C., Wang, C., Wu, W., Zhang, Y.: Instadrive: Instance-aware driving world models for realistic and consistent video generation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 25410–25420 (2025)
  [53] Yao, J., Gu, H., Chen, X., Wang, J., Cheng, Z.: Open vocabulary monocular 3d object detection. arXiv preprint arXiv:2411.16833 (2024)
  [54] Yu, Y., He, X., Zhao, C., Yu, J., Yang, J., Hu, R., Shen, Y., Zhu, X., Zhou, X., Peng, S.: Boxdreamer: Dreaming box corners for generalizable object pose estimation. arXiv preprint arXiv:2504.07955 (2025)
  [55] Zhang, H., Jiang, H., Yao, Q., Sun, Y., Zhang, R., Zhao, H., Li, H., Zhu, H., Yang, Z.: Detect anything 3d in the wild. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5048–5059 (2025)
  [56] Zhang, H., Alaluf, Y., Ma, S., Kadambi, A., Wang, J., Aberman, K.: Instantrestore: Single-step personalized face restoration with shared-image attention. In: Proceedings of the Special Interest Group on Computer Graphics and Interactive Techniques Conference Conference Papers. SIGGRAPH Conference Papers '25, Association for Computing Machinery, New York...
  [57] Zhang, J., Sheng, H., Cai, S., Deng, B., Liang, Q., Li, W., Fu, Y., Ye, J., Gu, S.: Perldiff: Controllable street view synthesis using perspective-layout diffusion model. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 26306–26315 (2025)
  [58] Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. arXiv preprint arXiv:1904.07850 (2019)
  [59] Zhu, X., Su, W., Lu, L., Li, B., Wang, X., Dai, J.: Deformable detr: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159 (2020)