WildDet3D: Scaling Promptable 3D Detection in the Wild
Recognition: 2 theorem links
Pith reviewed 2026-05-10 17:31 UTC · model grok-4.3
The pith
A unified model detects 3D objects from single images using text, point, or box prompts and gains further accuracy from depth cues at inference time.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WildDet3D is a geometry-aware architecture that natively accepts text, point, and box prompts for monocular 3D object detection and incorporates auxiliary depth signals at inference time. It is paired with WildDet3D-Data, the largest open 3D detection dataset, built by generating candidate 3D boxes from 2D annotations and retaining only human-verified ones across 13.5K categories in diverse real-world scenes. Together they reach 22.6/24.8 AP3D on the new WildDet3D-Bench with text and box prompts, 34.2/36.4 AP3D on Omni3D, and 40.3/48.9 ODS in zero-shot evaluation on Argoverse 2 and ScanNet, with depth cues adding +20.7 AP on average across settings.
What carries the argument
WildDet3D, a unified geometry-aware architecture that accepts multiple prompt modalities and integrates depth signals during inference.
Load-bearing premise
Generating candidate 3D boxes from 2D annotations and human verification produces accurate, unbiased 3D ground truth across 13.5K categories and real scenes without systematic errors or selection bias.
What would settle it
Independent re-annotation of a random sample of the dataset's 3D boxes to measure error rates against the verified labels, or evaluation on a new benchmark containing categories and scenes entirely absent from the construction process.
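A minimal sketch of what such a re-annotation audit could compute, assuming each matched 3D box is stored as center, size, and yaw; the array layout, tolerance, and field names here are illustrative assumptions, not details from the paper.

```python
# Hypothetical audit sketch: compare independently re-annotated 3D boxes against the
# dataset's human-verified labels on a random sample. The (center, size, yaw) layout
# and the 0.25 m center tolerance are illustrative assumptions, not from the paper.
import numpy as np

def reannotation_error_report(verified: np.ndarray, reannotated: np.ndarray,
                              center_tol_m: float = 0.25) -> dict:
    """Both arrays have shape (N, 7): [cx, cy, cz, w, h, l, yaw], matched row-by-row."""
    center_err = np.linalg.norm(verified[:, :3] - reannotated[:, :3], axis=1)  # meters
    size_err = np.abs(verified[:, 3:6] - reannotated[:, 3:6]).mean(axis=1)     # meters
    dyaw = verified[:, 6] - reannotated[:, 6]
    yaw_err = np.abs(np.arctan2(np.sin(dyaw), np.cos(dyaw)))                   # wrapped to [0, pi]
    return {
        "mean_center_err_m": float(center_err.mean()),
        "mean_size_err_m": float(size_err.mean()),
        "mean_yaw_err_rad": float(yaw_err.mean()),
        "frac_center_within_tol": float((center_err <= center_tol_m).mean()),
    }

# Example on placeholder data standing in for 500 sampled, matched boxes.
rng = np.random.default_rng(0)
verified = rng.normal(size=(500, 7))
reannotated = verified + rng.normal(scale=0.05, size=(500, 7))
print(reannotation_error_report(verified, reannotated))
```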
Original abstract
Understanding objects in 3D from a single image is a cornerstone of spatial intelligence. A key step toward this goal is monocular 3D object detection--recovering the extent, location, and orientation of objects from an input RGB image. To be practical in the open world, such a detector must generalize beyond closed-set categories, support diverse prompt modalities, and leverage geometric cues when available. Progress is hampered by two bottlenecks: existing methods are designed for a single prompt type and lack a mechanism to incorporate additional geometric cues, and current 3D datasets cover only narrow categories in controlled environments, limiting open-world transfer. In this work we address both gaps. First, we introduce WildDet3D, a unified geometry-aware architecture that natively accepts text, point, and box prompts and can incorporate auxiliary depth signals at inference time. Second, we present WildDet3D-Data, the largest open 3D detection dataset to date, constructed by generating candidate 3D boxes from existing 2D annotations and retaining only human-verified ones, yielding over 1M images across 13.5K categories in diverse real-world scenes. WildDet3D establishes a new state-of-the-art across multiple benchmarks and settings. In the open-world setting, it achieves 22.6/24.8 AP3D on our newly introduced WildDet3D-Bench with text and box prompts. On Omni3D, it reaches 34.2/36.4 AP3D with text and box prompts, respectively. In zero-shot evaluation, it achieves 40.3/48.9 ODS on Argoverse 2 and ScanNet. Notably, incorporating depth cues at inference time yields substantial additional gains (+20.7 AP on average across settings).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WildDet3D, a unified geometry-aware architecture for monocular 3D object detection that natively supports text, point, and box prompts and incorporates auxiliary depth signals at inference. It also presents WildDet3D-Data, the largest open 3D detection dataset constructed by lifting 2D annotations to candidate 3D boxes and retaining only human-verified instances, spanning over 1M images and 13.5K categories in diverse scenes. The work claims new state-of-the-art results including 22.6/24.8 AP3D on the new WildDet3D-Bench (text/box prompts), 34.2/36.4 AP3D on Omni3D, zero-shot ODS of 40.3/48.9 on Argoverse 2 and ScanNet, and average gains of +20.7 AP from depth cues.
Significance. If the dataset supplies reliable, unbiased 3D ground truth and the empirical gains are reproducible, the work would meaningfully advance open-world monocular 3D detection by scaling promptable detection to thousands of categories while integrating geometric cues. The dataset scale and prompt flexibility address documented bottlenecks in the field. Credit is due for the empirical breadth across open-world, closed-set, and zero-shot settings.
major comments (2)
- [WildDet3D-Data section] Dataset construction: the pipeline of generating candidate 3D boxes from 2D annotations followed by human verification is described at a high level only, with no reported quantitative error analysis, inter-annotator agreement, per-category statistics, or comparison to LiDAR/multi-view references (an illustrative agreement-metric sketch follows this list). This is load-bearing for the central claim, as every reported AP3D and ODS number depends on the accuracy and lack of systematic bias in this supervision across 13.5K categories and 1M+ images.
- [Experiments section] The abstract and results tables report concrete AP3D/ODS numbers and depth gains without architecture diagrams, training-procedure details, ablation studies on prompt/depth components, or error bars/statistical tests. This prevents assessing whether the SOTA claims arise from the unified architecture or from dataset effects.
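As a concrete illustration of the kind of agreement statistic the first comment asks for, the sketch below computes Cohen's kappa over accept/reject verification decisions on a doubly-annotated sample; the protocol, sample size, and library choice are assumptions, not something the paper reports.

```python
# Illustrative sketch (assumed protocol, not from the paper): Cohen's kappa between two
# annotators' accept/reject decisions on a doubly-verified sample of candidate 3D boxes.
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Placeholder decisions for 1,000 sampled candidate boxes: 1 = accept, 0 = reject.
rng = np.random.default_rng(42)
annotator_a = rng.integers(0, 2, size=1000)
# Simulate a second annotator who agrees with the first roughly 90% of the time.
annotator_b = np.where(rng.random(1000) < 0.9, annotator_a, 1 - annotator_a)

kappa = cohen_kappa_score(annotator_a, annotator_b)       # chance-corrected agreement
raw_agreement = float((annotator_a == annotator_b).mean())
print(f"raw agreement = {raw_agreement:.3f}, Cohen's kappa = {kappa:.3f}")
```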
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below, agreeing where revisions are warranted and providing clarifications on the current manuscript content.
Point-by-point responses
- Referee: [WildDet3D-Data section] Dataset construction: the pipeline of generating candidate 3D boxes from 2D annotations followed by human verification is described at a high level only, with no reported quantitative error analysis, inter-annotator agreement, per-category statistics, or comparison to LiDAR/multi-view references. This is load-bearing for the central claim, as every reported AP3D and ODS number depends on the accuracy and lack of systematic bias in this supervision across 13.5K categories and 1M+ images.
Authors: We agree that the dataset construction requires more quantitative support to validate the 3D ground-truth quality. In the revised version we will expand the WildDet3D-Data section with: inter-annotator agreement metrics on a sampled subset, quantitative error analysis comparing verified boxes to available LiDAR or multi-view references, per-category statistics on instance counts and verification pass rates, and a discussion of potential systematic biases. We will also release the annotation guidelines and a verification subset to enable external assessment. Revision: yes.
- Referee: [Experiments section] The abstract and results tables report concrete AP3D/ODS numbers and depth gains without architecture diagrams, training-procedure details, ablation studies on prompt/depth components, or error bars/statistical tests. This prevents assessing whether the SOTA claims arise from the unified architecture or from dataset effects.
Authors: The manuscript already contains an architecture diagram (Figure 2) and training-procedure details (Section 4). However, we acknowledge the absence of targeted ablations on prompt modalities and depth integration, as well as error bars and statistical tests. We will add these in the revision: ablations isolating each prompt type and the depth cue, error bars from multiple runs, and significance tests on the reported gains (a minimal sketch of such a test follows these responses). This will clarify the sources of improvement and strengthen reproducibility. Revision: yes.
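A minimal sketch of the analysis the authors promise, assuming per-seed AP3D scores from repeated training runs are available; the numbers and the paired t-test choice are illustrative, not results from the paper.

```python
# Illustrative sketch (assumed analysis, not from the paper): error bars and a paired
# significance test for an AP3D gain across training runs with different random seeds.
import numpy as np
from scipy import stats

# Hypothetical per-seed AP3D scores for a baseline and for the full model (placeholders).
baseline   = np.array([31.8, 32.4, 31.5, 32.1, 31.9])
full_model = np.array([34.0, 34.5, 33.9, 34.6, 34.1])

gains = full_model - baseline
mean_gain = gains.mean()
std_err = gains.std(ddof=1) / np.sqrt(len(gains))         # standard error of the mean gain
t_stat, p_value = stats.ttest_rel(full_model, baseline)   # paired t-test over seeds

print(f"gain = {mean_gain:.2f} +/- {std_err:.2f} AP3D (SE), paired t-test p = {p_value:.4f}")
```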
Circularity Check
No circularity: purely empirical architecture and dataset with independent evaluation
Full rationale
The paper introduces WildDet3D as a promptable 3D detector and WildDet3D-Data via 2D-to-3D lifting plus human verification, then reports empirical AP3D/ODS metrics on new and existing benchmarks including zero-shot transfer. No equations, fitted parameters, or self-citations are presented that reduce any reported gain to a quantity defined by construction from the inputs; the derivation chain consists of standard training and evaluation steps whose outputs are not tautological with the dataset construction or model design. This is self-contained empirical work with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] Human verification of candidate 3D boxes produces accurate and unbiased 3D ground truth across diverse categories and scenes.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking (D=3 from linking) · tag: echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
Matched passage: "we generate per-pixel ray directions r_{i,j} = K^{-1} [u, v, 1]^T and encode them using 8th-order real spherical harmonics: φ(r) = RSH_8(r / ||r||) ∈ R^{81}"
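A minimal sketch of the quoted construction, assuming a pinhole intrinsics matrix K and using e3nn for the real-spherical-harmonic basis (the paper does not name a library); degrees l = 0..8 give the 81-dimensional encoding.

```python
# Sketch of the quoted passage (not the authors' code): back-project each pixel to a
# camera ray r_{i,j} = K^{-1} [u, v, 1]^T and encode it with real spherical harmonics
# of degree 0..8, giving sum_{l=0..8} (2l+1) = 81 features per pixel.
# e3nn is an assumed library choice; any real-SH implementation would do.
import torch
from e3nn import o3

def ray_sh_features(K: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Return an (H, W, 81) tensor of RSH-encoded unit ray directions."""
    v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                          torch.arange(W, dtype=torch.float32), indexing="ij")
    pix = torch.stack([u, v, torch.ones_like(u)], dim=-1)   # [u, v, 1]^T per pixel
    rays = pix @ torch.linalg.inv(K).T                      # r_{i,j} = K^{-1} [u, v, 1]^T
    # phi(r) = RSH_8(r / ||r||) in R^81; normalize=True divides by ||r|| internally.
    return o3.spherical_harmonics(list(range(9)), rays, normalize=True)

# Example: 640x480 image, focal length 500 px, principal point at the image center.
K = torch.tensor([[500.0,   0.0, 320.0],
                  [  0.0, 500.0, 240.0],
                  [  0.0,   0.0,   1.0]])
feats = ray_sh_features(K, H=480, W=640)    # feats.shape == (480, 640, 81)
```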
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.