pith. machine review for the scientific record.

arxiv: 2604.05212 · v1 · submitted 2026-04-06 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

Boxer: Robust Lifting of Open-World 2D Bounding Boxes to 3D

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D bounding box estimation · open-world detection · 2D to 3D lifting · transformer network · multi-view fusion · depth encoding · object localization · computer vision

The pith

A transformer-based network lifts 2D open-vocabulary bounding boxes to accurate 3D bounding boxes using camera poses and optional depth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Boxer as a way to turn 2D detections of arbitrary object categories into static 3D bounding boxes. It uses existing 2D detectors for the hard part of finding objects and trains a dedicated network only on the lifting step, then combines results from multiple views to remove duplicates and enforce consistency in world space. A reader would care because 3D localization has remained difficult for new categories while 2D detection has advanced quickly; this split reduces the need for large 3D-annotated datasets. If the approach holds, spatial awareness in robotics or augmented reality could extend to far more object types from ordinary image collections.

Core claim

Boxer estimates static 3D bounding boxes from 2D open-vocabulary object detections, posed images, and optional depth supplied either as a sparse point cloud or a dense depth map. Its core is a transformer-based network that lifts 2D bounding box proposals into 3D, followed by multi-view fusion and geometric filtering to produce globally consistent, de-duplicated 3DBBs in metric world space. The network adds uncertainty modeling for robust regression and a median depth patch encoding to handle sparse inputs, and is trained at large scale on over 1.2 million unique 3D boxes.
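
For intuition about what the lifting step must recover, here is a minimal non-learned baseline, not BoxerNet: unproject the 2D box center at a single depth value and scale the pixel extent into metric size. All names are illustrative and the depth-axis extent is a crude guess; BoxerNet replaces this brittle geometry with a learned 7-DoF regression conditioned on image, calibration, pose, and depth inputs.

```python
import numpy as np

def naive_lift(box_2d, K, cam_to_world, depth):
    """Unproject a 2D box to a rough world-space 3D box from one depth value.

    box_2d: (x_min, y_min, x_max, y_max) in pixels
    K: (3, 3) camera intrinsics; cam_to_world: (4, 4) camera pose
    depth: scalar depth, e.g. the median of sparse points inside the box
    """
    x0, y0, x1, y1 = box_2d
    u, v = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    # Back-project the box center along its viewing ray to the given depth.
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    center_cam = ray * depth
    center_world = (cam_to_world @ np.append(center_cam, 1.0))[:3]
    # Scale pixel extents by depth / focal length to estimate metric size.
    w = (x1 - x0) * depth / K[0, 0]
    h = (y1 - y0) * depth / K[1, 1]
    return center_world, np.array([w, h, (w + h) / 2.0])  # crude z-extent
```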

What carries the argument

BoxerNet, a transformer-based network that lifts 2D bounding box proposals into 3D while modeling aleatoric uncertainty and encoding median depth patches for varied input density.
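
The median depth patch encoding is named but not specified in this review. One plausible reading, sketched below: pool the sparse depth points that project into each ViT-style image patch by their median, which tolerates outliers and variable point density. The patch size, the sentinel value for empty patches, and the function name are assumptions.

```python
import numpy as np

def median_depth_patches(points_uvz, image_hw, patch=16):
    """One assumed form of a median-depth patch encoding for sparse input.

    points_uvz: (N, 3) array of projected sparse points as (u, v, depth)
    Returns an (H // patch, W // patch) grid; -1.0 marks empty patches.
    """
    H, W = image_hw
    gh, gw = H // patch, W // patch
    buckets = [[[] for _ in range(gw)] for _ in range(gh)]
    for u, v, z in points_uvz:
        i, j = int(v) // patch, int(u) // patch
        if 0 <= i < gh and 0 <= j < gw and z > 0:
            buckets[i][j].append(z)
    grid = np.full((gh, gw), -1.0)
    for i in range(gh):
        for j in range(gw):
            if buckets[i][j]:
                grid[i, j] = float(np.median(buckets[i][j]))
    return grid
```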

If this is right

  • Existing 2D open-vocabulary detectors can be reused directly for 3D localization without retraining the full system.
  • Multi-view fusion and filtering produce globally consistent boxes even when depth is only sparse (a minimal fusion sketch follows this list).
  • The separation of detection and lifting lowers the cost of creating training data for new 3D tasks.
  • Performance remains higher than prior lifting methods both without dense depth and when dense depth is available.
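
The fusion and filtering machinery is not detailed in this review. Below is a minimal sketch of one plausible shape, a greedy, score-ranked de-duplication over world-space boxes; the axis-aligned IoU and the 0.3 threshold are simplifying assumptions, not the paper's procedure, and oriented 7-DoF boxes would need a rotated IoU instead.

```python
import numpy as np

def aabb_iou_3d(a, b):
    """IoU of two axis-aligned world-space boxes, each a (center, size) pair.
    Yaw is ignored here for brevity; oriented boxes need a rotated IoU."""
    (ca, sa), (cb, sb) = a, b
    lo = np.maximum(ca - sa / 2, cb - sb / 2)
    hi = np.minimum(ca + sa / 2, cb + sb / 2)
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    union = np.prod(sa) + np.prod(sb) - inter
    return float(inter / union) if union > 0 else 0.0

def fuse_world_boxes(boxes, scores, iou_thresh=0.3):
    """Greedy NMS-style de-duplication: keep the best box per cluster."""
    order = np.argsort(scores)[::-1]
    kept = []
    for i in order:
        if all(aabb_iou_3d(boxes[i], boxes[k]) < iou_thresh for k in kept):
            kept.append(int(i))
    return kept
```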

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same lifting-plus-fusion pattern could be tested on video streams to track slowly changing object positions across time.
  • Feeding the resulting 3D boxes into existing reconstruction pipelines might improve overall scene geometry without extra labeling.
  • If depth inputs come from a separate dense estimator, end-to-end accuracy could rise further in textureless regions.

Load-bearing premise

The approach depends on 2D detectors supplying accurate and complete proposals and on objects remaining static so that multi-view fusion can produce consistent boxes.

What would settle it

Apply the full pipeline to a sequence containing moving objects and measure whether the output 3D boxes show large positional drift or duplication errors relative to independent ground-truth measurements.
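
One assumed way to operationalize that measurement: collect the per-frame world-space center estimates that were fused into a single object and report their spread, which should stay near zero for truly static objects and grow under motion.

```python
import numpy as np

def center_drift(per_frame_centers):
    """Max deviation (meters) of an object's per-frame center estimates
    from their mean; per_frame_centers is a (T, 3) array in world space."""
    c = np.asarray(per_frame_centers, dtype=float)
    return float(np.linalg.norm(c - c.mean(axis=0), axis=1).max())
```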

Figures

Figures reproduced from arXiv: 2604.05212 by Daniel DeTone, Fan Zhang, Jakob Engel, Julian Straub, Lingni Ma, Richard Newcombe, Tianwei Shen.

Figure 1. Boxer takes as input posed images with optional depth and off-the-shelf 2DBB open-world detections, estimating static, global 3D bounding boxes. Boxer is run on a sequence and various scenes are highlighted to show the accuracy and open-world coverage of objects such as spice jar, hairdryer, sink drain and TV remote.
Figure 2. Boxer algorithm overview. Boxer operates on a set of posed and calibrated images with optional dense depth or a sparse point cloud to produce metric, static, 3D bounding boxes for open-set objects.
Figure 3. BoxerNet lifting module. BoxerNet conditions on the image, camera calibration and poses, and an optional depth input to lift 2D bounding boxes into metric 7-DoF 3D bounding box predictions. Each box attends independently to the patch tokens, with no attention between box tokens, making the formulation permutation invariant. This representation is passed to two prediction heads, each a two-layer MLP.
Figure 4. Per-frame 3D IoU visualization.
Figure 5. Lifted per-frame 3DBB pseudo-heatmaps. Ground-truth per-scene 3DBBs (left) are compared with all per-frame 3DBBs from CuTR (middle) and Boxer (right), prompted with GT 2DBB input, brought into a consistent coordinate frame and rendered on top of one another to form a pseudo-heatmap. Boxer exhibits a sharper heatmap than CuTR, corresponding to more consistent predictions.
Figure 6. Importance of aleatoric uncertainty for ranking.
Figure 7. Data augmentation examples. Four augmentation types are shown by row (Photometric, Camera, 3D Point, and 2D Box), with four different examples per type shown by column.
Figure 8. Pseudo open-set annotation examples on ScanNet.
Figure 9. Example visualization of SAM3D on ADT. Example output of SAM3+SAM3D (RGB+Depth) on Aria Digital Twin. Top-left: SAM3 masks; top-right: projected 3DBBs; bottom row: two 3D views of predictions overlaid on the point cloud (bottom-left: behind view; bottom-right: bird's-eye view).
Figure 10. PR curve by object size. Precision-recall curves for three object-size buckets, small (left), medium (middle), and large (right), for BoxerNet vs. CuTR.
read the original abstract

Detecting and localizing objects in space is a fundamental computer vision problem. While much progress has been made to solve 2D object detection, 3D object localization is much less explored and far from solved, especially for open-world categories. To address this research challenge, we propose Boxer, an algorithm to estimate static 3D bounding boxes (3DBBs) from 2D open-vocabulary object detections, posed images and optional depth either represented as a sparse point cloud or dense depth. At its core is BoxerNet, a transformer-based network which lifts 2D bounding box (2DBB) proposals into 3D, followed by multi-view fusion and geometric filtering to produce globally consistent de-duplicated 3DBBs in metric world space. Boxer leverages the power of existing 2DBB detection algorithms (e.g. DETIC, OWLv2, SAM3) to localize objects in 2D. This allows the main BoxerNet model to focus on lifting to 3D rather than detecting, ultimately reducing the demand for costly annotated 3DBB training data. Extending the CuTR formulation, we incorporate an aleatoric uncertainty for robust regression, a median depth patch encoding to support sparse depth inputs, and large-scale training with over 1.2 million unique 3DBBs. BoxerNet outperforms state-of-the-art baselines in open-world 3DBB lifting, including CuTR in egocentric settings without dense depth (0.532 vs. 0.010 mAP) and on CA-1M with dense depth available (0.412 vs. 0.250 mAP).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Boxer, an algorithm for lifting 2D open-vocabulary bounding box detections to static 3D bounding boxes using posed images and optional depth (sparse or dense). Its core is BoxerNet, a transformer network that performs the 2D-to-3D lift with aleatoric uncertainty modeling and median depth patch encoding; this is followed by multi-view fusion and geometric filtering for globally consistent, de-duplicated 3DBBs in metric space. The approach delegates detection to external models (DETIC, OWLv2, SAM3) and reports large-scale training on >1.2M unique 3DBBs. It claims substantial gains over baselines including CuTR: 0.532 vs. 0.010 mAP in egocentric settings without dense depth and 0.412 vs. 0.250 mAP on CA-1M with dense depth available.

Significance. If the quantitative claims hold under scrutiny, the work is significant for open-world 3D localization because it decouples detection from lifting, leverages mature 2D open-vocabulary detectors, and incorporates uncertainty and sparse-depth handling. The scale of training data and explicit extension of the CuTR formulation with aleatoric uncertainty are concrete technical contributions that could reduce reliance on expensive 3D annotations.

major comments (2)
  1. [Experimental evaluation (results on egocentric and CA-1M benchmarks)] The headline performance claims (0.532 mAP without dense depth, 0.412 mAP with dense depth) are obtained by feeding outputs from specific external 2D detectors into BoxerNet + fusion, yet the manuscript provides no ablation or sensitivity analysis on detector noise, missing detections, or box jitter. Because open-world categories are precisely where 2D detectors are least reliable, this omission directly undermines the robustness and generalizability of the reported 3D mAP gains.
  2. [Method description of multi-view fusion] The method explicitly assumes objects are static to enable multi-view fusion for globally consistent boxes, but no quantitative evaluation of this assumption (e.g., performance degradation under object motion or viewpoint changes) is presented, leaving the central claim of 'robust' lifting partially untested.
minor comments (2)
  1. [Abstract and training details] The abstract states training on 'over 1.2 million unique 3DBBs' but does not specify the exact data sources, splits, or filtering criteria used to reach this count; adding these details would improve reproducibility.
  2. [BoxerNet architecture description] Notation for the aleatoric uncertainty parameters and the median depth patch encoding could be introduced earlier with a clear equation reference to aid readers unfamiliar with the CuTR extension.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and proposed changes to improve the manuscript.

read point-by-point responses
  1. Referee: [Experimental evaluation (results on egocentric and CA-1M benchmarks)] The headline performance claims (0.532 mAP without dense depth, 0.412 mAP with dense depth) are obtained by feeding outputs from specific external 2D detectors into BoxerNet + fusion, yet the manuscript provides no ablation or sensitivity analysis on detector noise, missing detections, or box jitter. Because open-world categories are precisely where 2D detectors are least reliable, this omission directly undermines the robustness and generalizability of the reported 3D mAP gains.

    Authors: We agree that robustness to 2D detector imperfections is critical for open-world applicability. Our reported results already employ real outputs from DETIC, OWLv2, and SAM3 on the respective benchmarks, which include natural noise, misses, and jitter. To directly address the concern, we will add a new ablation section in the revised manuscript that introduces controlled perturbations (Gaussian box jitter, random missing detections at varying rates) to the 2D inputs and measures the resulting degradation in 3D mAP. This will quantify sensitivity and further support the generalizability claims (a sketch of such perturbations follows these responses). revision: yes

  2. Referee: [Method description of multi-view fusion] The method explicitly assumes objects are static to enable multi-view fusion for globally consistent boxes, but no quantitative evaluation of this assumption (e.g., performance degradation under object motion or viewpoint changes) is presented, leaving the central claim of 'robust' lifting partially untested.

    Authors: Boxer is explicitly designed and stated for static scenes, where multi-view fusion and geometric filtering produce consistent 3DBBs; this is a core premise of the approach and is noted in the abstract and method sections. The term 'robust' in the title and claims refers to handling of aleatoric uncertainty, sparse depth, and open-vocabulary detection noise rather than object motion. For dynamic objects, integration with tracking would be required, which lies outside the paper's scope. We will expand the limitations discussion to explicitly restate the static assumption and its implications, but no new motion-based experiments will be added. revision: partial
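
For concreteness, the perturbation ablation promised in response 1 could take roughly the following shape; the jitter magnitude and drop rate are illustrative choices, not the authors' protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter_boxes(boxes_2d, sigma_px=4.0):
    """Gaussian corner jitter on an (N, 4) array of 2D boxes (pixels)."""
    return boxes_2d + rng.normal(0.0, sigma_px, size=boxes_2d.shape)

def drop_detections(boxes_2d, drop_rate=0.2):
    """Randomly remove detections at the given rate, simulating misses."""
    keep = rng.random(len(boxes_2d)) >= drop_rate
    return boxes_2d[keep]
```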

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents BoxerNet as a new transformer-based architecture for lifting 2D open-vocabulary detections to 3D bounding boxes, with explicit extensions to the CuTR formulation (aleatoric uncertainty, median depth patch encoding) and training on an independent large-scale dataset of 1.2M 3DBBs. It delegates detection to external models (DETIC, OWLv2, SAM3) and evaluates via direct comparison to published baselines without any equations, fitted parameters, or self-citations that reduce the claimed 3D outputs or performance metrics back to the inputs by construction. The derivation remains self-contained as a learned regression model with geometric post-processing.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the learned lifting function of BoxerNet, which depends on large-scale 3D training data and the assumption that 2D detections are reliable inputs; no new physical entities are postulated.

free parameters (2)
  • BoxerNet model parameters
    Learned via training on over 1.2 million unique 3DBBs to map 2D proposals to 3D
  • aleatoric uncertainty parameters
    Incorporated for robust regression as an extension to CuTR (one standard loss form is sketched after this ledger)
axioms (2)
  • domain assumption: Transformer-based regression can accurately lift 2D boxes to 3D given posed images and optional depth
    Core modeling choice of BoxerNet
  • domain assumption: Multi-view fusion and geometric filtering produce globally consistent de-duplicated 3DBBs
    Post-processing step after per-view lifting
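
The exact uncertainty-aware loss is not reproduced in this review. Below is a minimal sketch of a standard heteroscedastic regression term in the spirit of Kendall and Gal [17], which an aleatoric extension of the CuTR regression plausibly resembles; the Laplace (L1) form is an assumption.

```python
import torch

def aleatoric_l1_loss(pred, log_b, target):
    """Laplace-style heteroscedastic loss: the network predicts a scale
    log_b per regression target, down-weighting residuals it is unsure
    about while paying a log-scale penalty for claiming uncertainty."""
    return (torch.abs(pred - target) / torch.exp(log_b) + log_b).mean()
```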

pith-pipeline@v0.9.0 · 5618 in / 1464 out tokens · 88251 ms · 2026-05-10T19:09:59.586468+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 14 canonical work pages · 5 internal anchors

  1. [1]

    CVPR (2021)

    Ahmadyan, A., Zhang, L., Ablavatski, A., Wei, J., Grundmann, M.: Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. CVPR (2021)

  2. [2]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Avetisyan, A., Dahnert, M., Dai, A., Savva, M., Chang, A.X., Niessner, M.: Scan2CAD: Learning CAD model alignment in rgb-d scans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2619–2628 (2019)

  3. [3]

    In: Proceedings of the European Conference on Computer Vision (ECCV) (2024)

    Avetisyan, A., Xie, C., Howard-Jenkins, H., Yang, T.Y., Aroudj, S., Patra, S., Zhang, F., Frost, D., Holland, L., Orme, C., Engel, J., Miller, E., Newcombe, R., Balntas, V.: Scenescript: Reconstructing scenes with an autoregressive structured language model. In: Proceedings of the European Conference on Computer Vision (ECCV) (2024)

  4. [4]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  5. [5]

    ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

Baruch, G., Chen, Z., Dehghan, A., Dimry, T., Feigin, Y., Fu, P., Gebauer, T., Joffe, B., Kurz, D., Schwartz, A., et al.: Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. arXiv preprint arXiv:2111.08897 (2021)

  6. [6]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Brazil, G., Kumar, A., Straub, J., Ravi, N., Johnson, J., Gkioxari, G.: Omni3D: A large benchmark and model for 3D object detection in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9630–9641 (2023)

  7. [7]

IEEE Transactions on Robotics 37(6), 1874–1890 (2021)

Campos, C., Elvira, R., Rodríguez, J.J.G., Montiel, J.M.M., Tardós, J.D.: ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multi-map SLAM. IEEE Transactions on Robotics 37(6), 1874–1890 (2021)

  8. [8]

    In: International Conference on Learning Representations (ICLR) (2026)

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, R., Do...

  9. [9]

    SAM 3D: 3Dfy Anything in Images

    Chen, X., Chu, F.J., Gleize, P., Liang, K.J., Sax, A., Tang, H., Wang, W., Guo, M., Hardin, T., Li, X., et al.: Sam 3d: 3dfy anything in images. arXiv preprint arXiv:2511.16624 (2025)

  10. [10]

arXiv preprint arXiv:2405.03685 (2024)

    Cho, J.H., Ivanovic, B., Cao, Y., Schmerling, E., Wang, Y., Weng, X., Li, B., You, Y., Krähenbühl, P., Wang, Y., et al.: Language-image models with 3d understanding. arXiv preprint arXiv:2405.03685 (2024)

  11. [11]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5828–5839 (2017)

  12. [12]

    arXiv preprint arXiv:2603.18496 (2026)

    DeTone, D., Bogo, F., Le, E.T., Frost, D., Straub, J., Siddiqui, Y., Ye, Y., Engel, J., Newcombe, R., Ma, L.: NymeriaPlus: Enriching nymeria dataset with additional annotations and data. arXiv preprint arXiv:2603.18496 (2026)

  13. [13]

    Project Aria: A New Tool for Egocentric Multi-Modal AI Research

Engel, J., Somasundaram, K., Goesele, M., Sun, A., Gamino, A., Turner, A., Talattof, A., Yuan, A., Souti, B., Meredith, B., et al.: Project aria: A new tool for egocentric multi-modal ai research. arXiv preprint arXiv:2308.13561 (2023)

  14. [14]

    In: 2024 IEEE International Conference on Robotics and Automation (ICRA)

Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 5021–5028. IEEE (2024)

  15. [15]

    In: European conference on computer vision

    Gu, Q., Lv, Z., Frost, D., Green, S., Straub, J., Sweeney, C.: Egolifter: Open-world 3d segmentation for egocentric perception. In: European conference on computer vision. pp. 382–400. Springer (2024)

  16. [16]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

    Gupta, A., Dollár, P., Girshick, R.: Lvis: A dataset for large vocabulary instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  17. [17]

    Kendall, A., Gal, Y.: What uncertainties do we need in bayesian deep learning for computer vision? In: Advances in Neural Information Processing Systems (2017)

  18. [18]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2023)

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2023)

  19. [19]

arXiv preprint arXiv:2510.16134 (2025)

    Kong, C., Fort, J., Kang, A., Wittmer, J., Green, S., Shen, T., Zhao, Y., Peng, C., Solaira, G., Berkovich, A., et al.: Aria gen 2 pilot dataset. arXiv preprint arXiv:2510.16134 (2025)

  20. [20]

International journal of computer vision 128(7), 1956–1981 (2020)

Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., et al.: The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International journal of computer vision 128(7), 1956–1981 (2020)

  21. [21]

arXiv preprint arXiv:2412.04458 (2024)

    Lazarow, J., Griffiths, D., Kohavi, G., Crespo, F., Dehghan, A.: Cubify anything: Scaling indoor 3d object detection. arXiv preprint arXiv:2412.04458 (2024)

  22. [22]

    In: Proceedings of the AAAI conference on artificial intelligence

    Li, Y., Ge, Z., Yu, G., Yang, J., Wang, Z., Shi, Y., Sun, J., Li, Z.: Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In: Proceedings of the AAAI conference on artificial intelligence. vol. 37, pp. 1477–1485 (2023)

  23. [23]

IEEE Transactions on Pattern Analysis and Machine Intelligence 47(3), 2020–2036 (2024)

Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., Yu, Q., Dai, J.: Bevformer: learning bird's-eye-view representation from lidar-camera via spatiotemporal transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence 47(3), 2020–2036 (2024)

  24. [24]

    In: Proceedings of the European Conference on Computer Vision (ECCV)

    Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 740–755 (2014)

  25. [25]

    In: European conference on computer vision

    Liu, Y., Wang, T., Zhang, X., Sun, J.: Petr: Position embedding transformation for multi-view 3d object detection. In: European conference on computer vision. pp. 531–548. Springer (2022)

  26. [26]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

    Lu, Y., et al.: Geometry uncertainty projection network for monocular 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

  27. [27]

    In: Advances in Neural Information Processing Systems (2023)

    Luo, T., Rockwell, C., Lee, H., Johnson, J.: Scalable 3d captioning with pretrained models. In: Advances in Neural Information Processing Systems (2023)

  28. [28]

Ma, L., Ye, Y., Hong, F., Guzov, V., Jiang, Y., Postyeni, R., Pesqueira, L., Gamino, A., Baiyya, V., Kim, H.J., Bailey, K., Fosas, D.S., Liu, C.K., Liu, Z., Engel, J., Nardi, R.D., Newcombe, R.: Nymeria: A massive collection of multimodal egocentric daily motion in the wild (2024), https://arxiv.org/abs/2406.09905

  29. [29]

arXiv preprint arXiv:2511.20648 (2025)

Man, Y., Wang, S., Zhang, G., Bjorck, J., Li, Z., Gui, L.Y., Fan, J., Kautz, J., Wang, Y.X., Yu, Z.: Locateanything3d: Vision-language 3d detection with chain-of-sight. arXiv preprint arXiv:2511.20648 (2025)

  30. [30]

    In: Advances in Neural Information Processing Systems (2025)

    Mao, Y., Zhong, J., Fang, C., Zheng, J., Tang, R., Zhu, H., Tan, P., Zhou, Z.: Spatiallm: Training large language models for structured indoor modeling. In: Advances in Neural Information Processing Systems (2025)

  31. [31]

    In: Advances in Neural Information Processing Systems

    Minderer, M., Gritsenko, A., Houlsby, N.: Scaling open-vocabulary object detection. In: Advances in Neural Information Processing Systems. vol. 36, pp. 72832–72859 (2023)

  32. [32]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

    Misra, I., et al.: 3detr: An end-to-end transformer model for 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

  33. [33]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Pan, X., Charron, N., Yang, Y., Peters, S., Whelan, T., Kong, C., Parkhi, O., Newcombe, R., Ren, Y.C.: Aria digital twin: A new benchmark dataset for egocentric 3d machine perception. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20133–20143 (2023)

  34. [34]

    In: European conference on computer vision

    Philion, J., Fidler, S.: Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In: European conference on computer vision. pp. 194–210. Springer (2020)

  35. [35]

    https://www.projectaria.com/ariagen2devicepaper (2025), accessed: 2026-03-02

    Project Aria Team: Aria gen 2: An advanced research device for egocentric ai research. https://www.projectaria.com/ariagen2devicepaper (2025), accessed: 2026-03-02

  36. [36]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019)

    Qi, C.R., Litany, O., He, K., Guibas, L.J.: Votenet: Deep hough voting for 3d object detection in point clouds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019)

  37. [37]

    In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP)

Reimers, N., Gurevych, I.: Sentence-bert: Sentence embeddings using siamese bert-networks. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). pp. 3982–3992 (2019)

  38. [38]

    In: ICCV (2021)

    Roberts, M., Ramapuram, J., Ranjan, A., Kumar, A., Bautista, M.A., Paczan, N., Webb, R., Susskind, J.M.: Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In: ICCV (2021)

  39. [39]

    In: European Conference on Computer Vision

Rukhovich, D., Vorontsova, A., Konushin, A.: Fcaf3d: Fully convolutional anchor-free 3d object detection. In: European Conference on Computer Vision. pp. 477–493. Springer (2022)

  40. [40]

    In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

    Rukhovich, D., Vorontsova, A., Konushin, A.: Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 2397–2406 (2022)

  41. [41]

    In: 2023 IEEE International Conference on Image Processing (ICIP)

    Rukhovich, D., Vorontsova, A., Konushin, A.: Tr3d: Towards real-time indoor 3d object detection. In: 2023 IEEE International Conference on Image Processing (ICIP). pp. 281–285. IEEE (2023)

  42. [42]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4104–4113 (2016)

  43. [43]

    DINOv3

Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025)

  44. [44]

arXiv preprint arXiv:2406.10224 (2024)

    Straub, J., DeTone, D., Shen, T., Yang, N., Sweeney, C., Newcombe, R.: Efm3d: A benchmark for measuring progress towards 3d egocentric foundation models. arXiv preprint arXiv:2406.10224 (2024)

  45. [45]

In: European Conference on Computer Vision

Veicht, A., Sarlin, P.E., Lindenberger, P., Pollefeys, M.: Geocalib: Learning single-image calibration with geometric optimization. In: European Conference on Computer Vision. pp. 1–20. Springer (2024)

  46. [46]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Wang, Y., Doersch, C., Arandjelovic, R., Carreira, J., Zisserman, A.: Input-level inductive biases for 3d reconstruction. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6166–6176 (2022)

  47. [47]

    In: Conference on robot learning

    Wang, Y., Guizilini, V.C., Zhang, T., Wang, Y., Zhao, H., Solomon, J.: Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In: Conference on robot learning. pp. 180–191. PMLR (2022)

  48. [48]

arXiv preprint arXiv:2512.16561 (2025)

    Wang, Y., Ke, L., Zhang, B., Qu, T., Yu, H., Huang, Z., Yu, M., Xu, D., Yu, D.: N3d-vlm: Native 3d grounding enables accurate spatial reasoning in vision-language models. arXiv preprint arXiv:2512.16561 (2025)

  49. [49]

    In: ICCV (2023)

    Xie, Y., Jiang, H., Gkioxari, G., Straub, J.: Pixel-aligned recurrent queries for multi-view 3D object detection. In: ICCV (2023)

  50. [50]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Yang, Y.H., Piccinelli, L., Segu, M., Li, S., Huang, R., Fu, Y., Pollefeys, M., Blum, H., Bauer, Z.: 3d-mood: Lifting 2d to 3d for monocular open-set object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7429–7439 (2025)

  51. [51]

    arXiv preprint arXiv:2411.16833 (2024)

    Yao, J., Gu, H., Chen, X., Wang, J., Cheng, Z.: Open vocabulary monocular 3d object detection. arXiv preprint arXiv:2411.16833 (2024)

  52. [52]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Zhang, H., Jiang, H., Yao, Q., Sun, Y., Zhang, R., Zhao, H., Li, H., Zhu, H., Yang, Z.: Detect anything 3d in the wild. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5048–5059 (2025)

  53. [53]

    In: European conference on computer vision

    Zhang, Z., Sun, B., Yang, H., Huang, Q.: H3dnet: 3d object detection using hybrid geometric primitives. In: European conference on computer vision. pp. 311–329. Springer (2020)

  54. [54]

    In: Proceedings of the European Conference on Computer Vision (ECCV) (2022)

Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I.: Detecting twenty-thousand classes using image-level supervision. In: Proceedings of the European Conference on Computer Vision (ECCV) (2022)