pith. machine review for the scientific record.

arxiv: 2604.05212 · v1 · submitted 2026-04-06 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

Boxer: Robust Lifting of Open-World 2D Bounding Boxes to 3D

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D bounding box estimation · open-world detection · 2D to 3D lifting · transformer network · multi-view fusion · depth encoding · object localization · computer vision

The pith

A transformer-based network lifts 2D open-vocabulary bounding boxes to accurate 3D bounding boxes using camera poses and optional depth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Boxer as a way to turn 2D detections of arbitrary object categories into static 3D bounding boxes. It uses existing 2D detectors for the hard part of finding objects and trains a dedicated network only on the lifting step, then combines results from multiple views to remove duplicates and enforce consistency in world space. A reader would care because 3D localization has remained difficult for new categories while 2D detection has advanced quickly; this split reduces the need for large 3D-annotated datasets. If the approach holds, spatial awareness in robotics or augmented reality could extend to far more object types from ordinary image collections.

Core claim

Boxer estimates static 3D bounding boxes from 2D open-vocabulary object detections, posed images, and optional depth supplied either as a sparse point cloud or a dense depth map. Its core is a transformer-based network that lifts 2D bounding box proposals into 3D, followed by multi-view fusion and geometric filtering to produce globally consistent, de-duplicated 3DBBs in metric world space. The network adds uncertainty modeling for robust regression and a median depth patch encoding to handle sparse inputs, and is trained at large scale on over 1.2 million unique 3D boxes.
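
For intuition about what the lifting step must recover, here is a minimal non-learned baseline, not BoxerNet: unproject the 2D box center at a single depth value and scale the pixel extent into metric size. All names are illustrative and the depth-axis extent is a crude guess; BoxerNet replaces this brittle geometry with a learned 7-DoF regression conditioned on image, calibration, pose, and depth inputs.

```python
import numpy as np

def naive_lift(box_2d, K, cam_to_world, depth):
    """Unproject a 2D box to a rough world-space 3D box from one depth value.

    box_2d: (x_min, y_min, x_max, y_max) in pixels
    K: (3, 3) camera intrinsics; cam_to_world: (4, 4) camera pose
    depth: scalar depth, e.g. the median of sparse points inside the box
    """
    x0, y0, x1, y1 = box_2d
    u, v = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    # Back-project the box center along its viewing ray to the given depth.
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    center_cam = ray * depth
    center_world = (cam_to_world @ np.append(center_cam, 1.0))[:3]
    # Scale pixel extents by depth / focal length to estimate metric size.
    w = (x1 - x0) * depth / K[0, 0]
    h = (y1 - y0) * depth / K[1, 1]
    return center_world, np.array([w, h, (w + h) / 2.0])  # crude z-extent
```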

What carries the argument

BoxerNet, a transformer-based network that lifts 2D bounding box proposals into 3D while modeling aleatoric uncertainty and encoding median depth patches for varied input density.
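
The median depth patch encoding is named but not specified in this review. One plausible reading, sketched below: pool the sparse depth points that project into each ViT-style image patch by their median, which tolerates outliers and variable point density. The patch size, the sentinel value for empty patches, and the function name are assumptions.

```python
import numpy as np

def median_depth_patches(points_uvz, image_hw, patch=16):
    """One assumed form of a median-depth patch encoding for sparse input.

    points_uvz: (N, 3) array of projected sparse points as (u, v, depth)
    Returns an (H // patch, W // patch) grid; -1.0 marks empty patches.
    """
    H, W = image_hw
    gh, gw = H // patch, W // patch
    buckets = [[[] for _ in range(gw)] for _ in range(gh)]
    for u, v, z in points_uvz:
        i, j = int(v) // patch, int(u) // patch
        if 0 <= i < gh and 0 <= j < gw and z > 0:
            buckets[i][j].append(z)
    grid = np.full((gh, gw), -1.0)
    for i in range(gh):
        for j in range(gw):
            if buckets[i][j]:
                grid[i, j] = float(np.median(buckets[i][j]))
    return grid
```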

If this is right

  • Existing 2D open-vocabulary detectors can be reused directly for 3D localization without retraining the full system.
  • Multi-view fusion and filtering produce globally consistent boxes even when depth is only sparse (a minimal fusion sketch follows this list).
  • The separation of detection and lifting lowers the cost of creating training data for new 3D tasks.
  • Performance remains higher than prior lifting methods both without dense depth and when dense depth is available.
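
The fusion and filtering machinery is not detailed in this review. Below is a minimal sketch of one plausible shape, a greedy, score-ranked de-duplication over world-space boxes; the axis-aligned IoU and the 0.3 threshold are simplifying assumptions, not the paper's procedure, and oriented 7-DoF boxes would need a rotated IoU instead.

```python
import numpy as np

def aabb_iou_3d(a, b):
    """IoU of two axis-aligned world-space boxes, each a (center, size) pair.
    Yaw is ignored here for brevity; oriented boxes need a rotated IoU."""
    (ca, sa), (cb, sb) = a, b
    lo = np.maximum(ca - sa / 2, cb - sb / 2)
    hi = np.minimum(ca + sa / 2, cb + sb / 2)
    inter = np.prod(np.clip(hi - lo, 0.0, None))
    union = np.prod(sa) + np.prod(sb) - inter
    return float(inter / union) if union > 0 else 0.0

def fuse_world_boxes(boxes, scores, iou_thresh=0.3):
    """Greedy NMS-style de-duplication: keep the best box per cluster."""
    order = np.argsort(scores)[::-1]
    kept = []
    for i in order:
        if all(aabb_iou_3d(boxes[i], boxes[k]) < iou_thresh for k in kept):
            kept.append(int(i))
    return kept
```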

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same lifting-plus-fusion pattern could be tested on video streams to track slowly changing object positions across time.
  • Feeding the resulting 3D boxes into existing reconstruction pipelines might improve overall scene geometry without extra labeling.
  • If depth inputs come from a separate dense estimator, end-to-end accuracy could rise further in textureless regions.

Load-bearing premise

The approach depends on 2D detectors supplying accurate and complete proposals and on objects remaining static so that multi-view fusion can produce consistent boxes.

What would settle it

Apply the full pipeline to a sequence containing moving objects and measure whether the output 3D boxes show large positional drift or duplication errors relative to independent ground-truth measurements.
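
One assumed way to operationalize that measurement: collect the per-frame world-space center estimates that were fused into a single object and report their spread, which should stay near zero for truly static objects and grow under motion.

```python
import numpy as np

def center_drift(per_frame_centers):
    """Max deviation (meters) of an object's per-frame center estimates
    from their mean; per_frame_centers is a (T, 3) array in world space."""
    c = np.asarray(per_frame_centers, dtype=float)
    return float(np.linalg.norm(c - c.mean(axis=0), axis=1).max())
```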

Figures

Figures reproduced from arXiv: 2604.05212 by Daniel DeTone, Fan Zhang, Jakob Engel, Julian Straub, Lingni Ma, Richard Newcombe, Tianwei Shen.

Figure 1. Boxer takes as input posed images with optional depth and off-the-shelf 2DBB open-world detections, estimating static, global 3D bounding boxes. Boxer is run on a sequence and various scenes are highlighted to show the accuracy and open-world coverage of objects such as spice jar, hairdryer, sink drain and TV remote.
Figure 2. Boxer algorithm overview. Boxer operates on a set of posed and calibrated images with optional dense depth or a sparse point cloud to produce metric, static, 3D bounding boxes for open-set objects.
Figure 3. BoxerNet lifting module. BoxerNet conditions on the image, camera calibration and poses, and an optional depth input to lift 2D bounding boxes into metric 7-DoF 3D bounding box predictions. Each box attends independently to the patch tokens, with no attention between box tokens, making the formulation permutation invariant. This representation is passed to two prediction heads, each a two-layer MLP.
Figure 4. Per-frame 3D IoU visualization.
Figure 5. Lifted per-frame 3DBB pseudo-heatmaps. Ground-truth per-scene 3DBBs (left) are compared with all per-frame 3DBBs from CuTR (middle) and Boxer (right), prompted with GT 2DBB input, brought into a consistent coordinate frame and rendered on top of one another to form a pseudo-heatmap. Boxer exhibits a sharper heatmap than CuTR, corresponding to more consistent predictions.
Figure 6. Importance of aleatoric uncertainty for ranking.
Figure 7. Data augmentation examples. Four augmentation types are shown by row (Photometric, Camera, 3D Point, and 2D Box), with four different examples per type shown by column.
Figure 8. Pseudo open-set annotation examples on ScanNet.
Figure 9. Example visualization of SAM3D on ADT. Example output of SAM3+SAM3D (RGB+Depth) on Aria Digital Twin. Top-left: SAM3 masks; top-right: projected 3DBBs; bottom row: two 3D views of predictions overlaid on the point cloud (bottom-left: behind view; bottom-right: bird's-eye view).
Figure 10. PR curve by object size. Precision-recall curves for three object-size buckets, small (left), medium (middle), and large (right), for BoxerNet vs. CuTR.
read the original abstract

Detecting and localizing objects in space is a fundamental computer vision problem. While much progress has been made to solve 2D object detection, 3D object localization is much less explored and far from solved, especially for open-world categories. To address this research challenge, we propose Boxer, an algorithm to estimate static 3D bounding boxes (3DBBs) from 2D open-vocabulary object detections, posed images and optional depth either represented as a sparse point cloud or dense depth. At its core is BoxerNet, a transformer-based network which lifts 2D bounding box (2DBB) proposals into 3D, followed by multi-view fusion and geometric filtering to produce globally consistent de-duplicated 3DBBs in metric world space. Boxer leverages the power of existing 2DBB detection algorithms (e.g. DETIC, OWLv2, SAM3) to localize objects in 2D. This allows the main BoxerNet model to focus on lifting to 3D rather than detecting, ultimately reducing the demand for costly annotated 3DBB training data. Extending the CuTR formulation, we incorporate an aleatoric uncertainty for robust regression, a median depth patch encoding to support sparse depth inputs, and large-scale training with over 1.2 million unique 3DBBs. BoxerNet outperforms state-of-the-art baselines in open-world 3DBB lifting, including CuTR in egocentric settings without dense depth (0.532 vs. 0.010 mAP) and on CA-1M with dense depth available (0.412 vs. 0.250 mAP).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Boxer, an algorithm for lifting 2D open-vocabulary bounding box detections to static 3D bounding boxes using posed images and optional depth (sparse or dense). Its core is BoxerNet, a transformer network that performs the 2D-to-3D lift with aleatoric uncertainty modeling and median depth patch encoding; this is followed by multi-view fusion and geometric filtering for globally consistent, de-duplicated 3DBBs in metric space. The approach delegates detection to external models (DETIC, OWLv2, SAM3) and reports large-scale training on >1.2M unique 3DBBs. It claims substantial gains over baselines including CuTR: 0.532 vs. 0.010 mAP in egocentric settings without dense depth and 0.412 vs. 0.250 mAP on CA-1M with dense depth available.

Significance. If the quantitative claims hold under scrutiny, the work is significant for open-world 3D localization because it decouples detection from lifting, leverages mature 2D open-vocabulary detectors, and incorporates uncertainty and sparse-depth handling. The scale of training data and explicit extension of the CuTR formulation with aleatoric uncertainty are concrete technical contributions that could reduce reliance on expensive 3D annotations.

major comments (2)
  1. [Experimental evaluation (results on egocentric and CA-1M benchmarks)] The headline performance claims (0.532 mAP without dense depth, 0.412 mAP with dense depth) are obtained by feeding outputs from specific external 2D detectors into BoxerNet + fusion, yet the manuscript provides no ablation or sensitivity analysis on detector noise, missing detections, or box jitter. Because open-world categories are precisely where 2D detectors are least reliable, this omission directly undermines the robustness and generalizability of the reported 3D mAP gains.
  2. [Method description of multi-view fusion] The method explicitly assumes objects are static to enable multi-view fusion for globally consistent boxes, but no quantitative evaluation of this assumption (e.g., performance degradation under object motion or viewpoint changes) is presented, leaving the central claim of 'robust' lifting partially untested.
minor comments (2)
  1. [Abstract and training details] The abstract states training on 'over 1.2 million unique 3DBBs' but does not specify the exact data sources, splits, or filtering criteria used to reach this count; adding these details would improve reproducibility.
  2. [BoxerNet architecture description] Notation for the aleatoric uncertainty parameters and the median depth patch encoding could be introduced earlier with a clear equation reference to aid readers unfamiliar with the CuTR extension.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below with clarifications and proposed changes to improve the manuscript.

read point-by-point responses
  1. Referee: [Experimental evaluation (results on egocentric and CA-1M benchmarks)] The headline performance claims (0.532 mAP without dense depth, 0.412 mAP with dense depth) are obtained by feeding outputs from specific external 2D detectors into BoxerNet + fusion, yet the manuscript provides no ablation or sensitivity analysis on detector noise, missing detections, or box jitter. Because open-world categories are precisely where 2D detectors are least reliable, this omission directly undermines the robustness and generalizability of the reported 3D mAP gains.

    Authors: We agree that robustness to 2D detector imperfections is critical for open-world applicability. Our reported results already employ real outputs from DETIC, OWLv2, and SAM3 on the respective benchmarks, which include natural noise, misses, and jitter. To directly address the concern, we will add a new ablation section in the revised manuscript that introduces controlled perturbations (Gaussian box jitter, random missing detections at varying rates) to the 2D inputs and measures the resulting degradation in 3D mAP. This will quantify sensitivity and further support the generalizability claims (a sketch of such perturbations follows these responses). revision: yes

  2. Referee: [Method description of multi-view fusion] The method explicitly assumes objects are static to enable multi-view fusion for globally consistent boxes, but no quantitative evaluation of this assumption (e.g., performance degradation under object motion or viewpoint changes) is presented, leaving the central claim of 'robust' lifting partially untested.

    Authors: Boxer is explicitly designed and stated for static scenes, where multi-view fusion and geometric filtering produce consistent 3DBBs; this is a core premise of the approach and is noted in the abstract and method sections. The term 'robust' in the title and claims refers to handling of aleatoric uncertainty, sparse depth, and open-vocabulary detection noise rather than object motion. For dynamic objects, integration with tracking would be required, which lies outside the paper's scope. We will expand the limitations discussion to explicitly restate the static assumption and its implications, but no new motion-based experiments will be added. revision: partial
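
For concreteness, the perturbation ablation promised in response 1 could take roughly the following shape; the jitter magnitude and drop rate are illustrative choices, not the authors' protocol.

```python
import numpy as np

rng = np.random.default_rng(0)

def jitter_boxes(boxes_2d, sigma_px=4.0):
    """Gaussian corner jitter on an (N, 4) array of 2D boxes (pixels)."""
    return boxes_2d + rng.normal(0.0, sigma_px, size=boxes_2d.shape)

def drop_detections(boxes_2d, drop_rate=0.2):
    """Randomly remove detections at the given rate, simulating misses."""
    keep = rng.random(len(boxes_2d)) >= drop_rate
    return boxes_2d[keep]
```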

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper presents BoxerNet as a new transformer-based architecture for lifting 2D open-vocabulary detections to 3D bounding boxes, with explicit extensions to the CuTR formulation (aleatoric uncertainty, median depth patch encoding) and training on an independent large-scale dataset of 1.2M 3DBBs. It delegates detection to external models (DETIC, OWLv2, SAM3) and evaluates via direct comparison to published baselines without any equations, fitted parameters, or self-citations that reduce the claimed 3D outputs or performance metrics back to the inputs by construction. The derivation remains self-contained as a learned regression model with geometric post-processing.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the learned lifting function of BoxerNet, which depends on large-scale 3D training data and the assumption that 2D detections are reliable inputs; no new physical entities are postulated.

free parameters (2)
  • BoxerNet model parameters
    Learned via training on over 1.2 million unique 3DBBs to map 2D proposals to 3D
  • aleatoric uncertainty parameters
    Incorporated for robust regression as an extension to CuTR (one standard loss form is sketched after this ledger)
axioms (2)
  • domain assumption: Transformer-based regression can accurately lift 2D boxes to 3D given posed images and optional depth
    Core modeling choice of BoxerNet
  • domain assumption: Multi-view fusion and geometric filtering produce globally consistent de-duplicated 3DBBs
    Post-processing step after per-view lifting
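
The exact uncertainty-aware loss is not reproduced in this review. Below is a minimal sketch of a standard heteroscedastic regression term in the spirit of Kendall and Gal [17], which an aleatoric extension of the CuTR regression plausibly resembles; the Laplace (L1) form is an assumption.

```python
import torch

def aleatoric_l1_loss(pred, log_b, target):
    """Laplace-style heteroscedastic loss: the network predicts a scale
    log_b per regression target, down-weighting residuals it is unsure
    about while paying a log-scale penalty for claiming uncertainty."""
    return (torch.abs(pred - target) / torch.exp(log_b) + log_b).mean()
```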

pith-pipeline@v0.9.0 · 5618 in / 1464 out tokens · 88251 ms · 2026-05-10T19:09:59.586468+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 14 canonical work pages · 5 internal anchors

  1. [1]

    CVPR (2021)

    Ahmadyan, A., Zhang, L., Ablavatski, A., Wei, J., Grundmann, M.: Objectron: A large scale dataset of object-centric videos in the wild with pose annotations. CVPR (2021)

  2. [2]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Avetisyan, A., Dahnert, M., Dai, A., Savva, M., Chang, A.X., Niessner, M.: Scan2CAD: Learning CAD model alignment in rgb-d scans. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2619–2628 (2019)

  3. [3]

    In: Proceedings of the European Conference on Computer Vision (ECCV) (2024)

    Avetisyan, A., Xie, C., Howard-Jenkins, H., Yang, T.Y., Aroudj, S., Patra, S., Zhang, F., Frost, D., Holland, L., Orme, C., Engel, J., Miller, E., Newcombe, R., Balntas, V.: Scenescript: Reconstructing scenes with an autoregressive structured language model. In: Proceedings of the European Conference on Computer Vision (ECCV) (2024)

  4. [4]

    Qwen3-VL Technical Report

    Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., et al.: Qwen3-vl technical report. arXiv preprint arXiv:2511.21631 (2025)

  5. [5]

    ARKitScenes: A Diverse Real-World Dataset For 3D Indoor Scene Understanding Using Mobile RGB-D Data

Baruch, G., Chen, Z., Dehghan, A., Dimry, T., Feigin, Y., Fu, P., Gebauer, T., Joffe, B., Kurz, D., Schwartz, A., et al.: Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. arXiv preprint arXiv:2111.08897 (2021)

  6. [6]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Brazil, G., Kumar, A., Straub, J., Ravi, N., Johnson, J., Gkioxari, G.: Omni3D: A large benchmark and model for 3D object detection in the wild. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 9630–9641 (2023)

  7. [7]

IEEE Transactions on Robotics 37(6), 1874–1890 (2021)

Campos, C., Elvira, R., Rodríguez, J.J.G., Montiel, J.M.M., Tardós, J.D.: ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multi-map SLAM. IEEE Transactions on Robotics 37(6), 1874–1890 (2021)

  8. [8]

    In: International Conference on Learning Representations (ICLR) (2026)

    Carion, N., Gustafson, L., Hu, Y.T., Debnath, S., Hu, R., Suris, D., Ryali, C., Alwala, K.V., Khedr, H., Huang, A., Lei, J., Ma, T., Guo, B., Kalla, A., Marks, M., Greer, J., Wang, M., Sun, P., Rädle, R., Afouras, T., Mavroudi, E., Xu, K., Wu, T.H., Zhou, Y., Momeni, L., Hazra, R., Ding, S., Vaze, S., Porcher, F., Li, F., Li, S., Kamath, A., Cheng, R., Do...

  9. [9]

    SAM 3D: 3Dfy Anything in Images

    Chen, X., Chu, F.J., Gleize, P., Liang, K.J., Sax, A., Tang, H., Wang, W., Guo, M., Hardin, T., Li, X., et al.: Sam 3d: 3dfy anything in images. arXiv preprint arXiv:2511.16624 (2025)

  10. [10]

arXiv preprint arXiv:2405.03685 (2024)

    Cho, J.H., Ivanovic, B., Cao, Y., Schmerling, E., Wang, Y., Weng, X., Li, B., You, Y., Krähenbühl, P., Wang, Y., et al.: Language-image models with 3d understanding. arXiv preprint arXiv:2405.03685 (2024)

  11. [11]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

    Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scannet: Richly-annotated 3d reconstructions of indoor scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5828–5839 (2017)

  12. [12]

    arXiv preprint arXiv:2603.18496 (2026)

    DeTone, D., Bogo, F., Le, E.T., Frost, D., Straub, J., Siddiqui, Y., Ye, Y., Engel, J., Newcombe, R., Ma, L.: NymeriaPlus: Enriching nymeria dataset with additional annotations and data. arXiv preprint arXiv:2603.18496 (2026)

  13. [13]

    Project Aria: A New Tool for Egocentric Multi-Modal AI Research

Engel, J., Somasundaram, K., Goesele, M., Sun, A., Gamino, A., Turner, A., Talattof, A., Yuan, A., Souti, B., Meredith, B., et al.: Project aria: A new tool for egocentric multi-modal ai research. arXiv preprint arXiv:2308.13561 (2023)

  14. [14]

    In: 2024 IEEE International Conference on Robotics and Automation (ICRA)

Gu, Q., Kuwajerwala, A., Morin, S., Jatavallabhula, K.M., Sen, B., Agarwal, A., Rivera, C., Paul, W., Ellis, K., Chellappa, R., et al.: Conceptgraphs: Open-vocabulary 3d scene graphs for perception and planning. In: 2024 IEEE International Conference on Robotics and Automation (ICRA). pp. 5021–5028. IEEE (2024)

  15. [15]

    In: European conference on computer vision

    Gu, Q., Lv, Z., Frost, D., Green, S., Straub, J., Sweeney, C.: Egolifter: Open-world 3d segmentation for egocentric perception. In: European conference on computer vision. pp. 382–400. Springer (2024)

  16. [16]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

    Gupta, A., Dollár, P., Girshick, R.: Lvis: A dataset for large vocabulary instance segmentation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

  17. [17]

    Kendall, A., Gal, Y.: What uncertainties do we need in bayesian deep learning for computer vision? In: Advances in Neural Information Processing Systems (2017)

  18. [18]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2023)

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., Dollár, P., Girshick, R.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2023)

  19. [19]

arXiv preprint arXiv:2510.16134 (2025)

    Kong, C., Fort, J., Kang, A., Wittmer, J., Green, S., Shen, T., Zhao, Y., Peng, C., Solaira, G., Berkovich, A., et al.: Aria gen 2 pilot dataset. arXiv preprint arXiv:2510.16134 (2025)

  20. [20]

International journal of computer vision 128(7), 1956–1981 (2020)

Kuznetsova, A., Rom, H., Alldrin, N., Uijlings, J., Krasin, I., Pont-Tuset, J., Kamali, S., Popov, S., Malloci, M., Kolesnikov, A., et al.: The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International journal of computer vision 128(7), 1956–1981 (2020)

  21. [21]

arXiv preprint arXiv:2412.04458 (2024)

    Lazarow, J., Griffiths, D., Kohavi, G., Crespo, F., Dehghan, A.: Cubify anything: Scaling indoor 3d object detection. arXiv preprint arXiv:2412.04458 (2024)

  22. [22]

    In: Proceedings of the AAAI conference on artificial intelligence

    Li, Y., Ge, Z., Yu, G., Yang, J., Wang, Z., Shi, Y., Sun, J., Li, Z.: Bevdepth: Acquisition of reliable depth for multi-view 3d object detection. In: Proceedings of the AAAI conference on artificial intelligence. vol. 37, pp. 1477–1485 (2023)

  23. [23]

IEEE Transactions on Pattern Analysis and Machine Intelligence 47(3), 2020–2036 (2024)

Li, Z., Wang, W., Li, H., Xie, E., Sima, C., Lu, T., Yu, Q., Dai, J.: Bevformer: learning bird's-eye-view representation from lidar-camera via spatiotemporal transformers. IEEE Transactions on Pattern Analysis and Machine Intelligence 47(3), 2020–2036 (2024)

  24. [24]

    In: Proceedings of the European Conference on Computer Vision (ECCV)

    Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 740–755 (2014)

  25. [25]

    In: European conference on computer vision

    Liu, Y., Wang, T., Zhang, X., Sun, J.: Petr: Position embedding transformation for multi-view 3d object detection. In: European conference on computer vision. pp. 531–548. Springer (2022)

  26. [26]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

    Lu, Y., et al.: Geometry uncertainty projection network for monocular 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

  27. [27]

    In: Advances in Neural Information Processing Systems (2023)

    Luo, T., Rockwell, C., Lee, H., Johnson, J.: Scalable 3d captioning with pretrained models. In: Advances in Neural Information Processing Systems (2023)

  28. [28]

Ma, L., Ye, Y., Hong, F., Guzov, V., Jiang, Y., Postyeni, R., Pesqueira, L., Gamino, A., Baiyya, V., Kim, H.J., Bailey, K., Fosas, D.S., Liu, C.K., Liu, Z., Engel, J., Nardi, R.D., Newcombe, R.: Nymeria: A massive collection of multimodal egocentric daily motion in the wild (2024), https://arxiv.org/abs/2406.09905

  29. [29]

arXiv preprint arXiv:2511.20648 (2025)

Man, Y., Wang, S., Zhang, G., Bjorck, J., Li, Z., Gui, L.Y., Fan, J., Kautz, J., Wang, Y.X., Yu, Z.: Locateanything3d: Vision-language 3d detection with chain-of-sight. arXiv preprint arXiv:2511.20648 (2025)

  30. [30]

    In: Advances in Neural Information Processing Systems (2025)

    Mao, Y., Zhong, J., Fang, C., Zheng, J., Tang, R., Zhu, H., Tan, P., Zhou, Z.: Spatiallm: Training large language models for structured indoor modeling. In: Advances in Neural Information Processing Systems (2025)

  31. [31]

    In: Advances in Neural Information Processing Systems

    Minderer, M., Gritsenko, A., Houlsby, N.: Scaling open-vocabulary object detection. In: Advances in Neural Information Processing Systems. vol. 36, pp. 72832–72859 (2023)

  32. [32]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

    Misra, I., et al.: 3detr: An end-to-end transformer model for 3d object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2021)

  33. [33]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Pan, X., Charron, N., Yang, Y., Peters, S., Whelan, T., Kong, C., Parkhi, O., Newcombe, R., Ren, Y.C.: Aria digital twin: A new benchmark dataset for egocentric 3d machine perception. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 20133–20143 (2023)

  34. [34]

    In: European conference on computer vision

    Philion, J., Fidler, S.: Lift, splat, shoot: Encoding images from arbitrary camera rigs by implicitly unprojecting to 3d. In: European conference on computer vision. pp. 194–210. Springer (2020)

  35. [35]

    https://www.projectaria.com/ariagen2devicepaper (2025), accessed: 2026-03-02

    Project Aria Team: Aria gen 2: An advanced research device for egocentric ai research. https://www.projectaria.com/ariagen2devicepaper (2025), accessed: 2026-03-02

  36. [36]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019)

    Qi, C.R., Litany, O., He, K., Guibas, L.J.: Votenet: Deep hough voting for 3d object detection in point clouds. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) (2019)

  37. [37]

    In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP)

Reimers, N., Gurevych, I.: Sentence-bert: Sentence embeddings using siamese bert-networks. In: Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). pp. 3982–3992 (2019)

  38. [38]

    In: ICCV (2021)

    Roberts, M., Ramapuram, J., Ranjan, A., Kumar, A., Bautista, M.A., Paczan, N., Webb, R., Susskind, J.M.: Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In: ICCV (2021)

  39. [39]

    In: European Conference on Computer Vision

Rukhovich, D., Vorontsova, A., Konushin, A.: Fcaf3d: Fully convolutional anchor-free 3d object detection. In: European Conference on Computer Vision. pp. 477–493. Springer (2022)

  40. [40]

    In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)

    Rukhovich, D., Vorontsova, A., Konushin, A.: Imvoxelnet: Image to voxels projection for monocular and multi-view general-purpose 3d object detection. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). pp. 2397–2406 (2022)

  41. [41]

    In: 2023 IEEE International Conference on Image Processing (ICIP)

    Rukhovich, D., Vorontsova, A., Konushin, A.: Tr3d: Towards real-time indoor 3d object detection. In: 2023 IEEE International Conference on Image Processing (ICIP). pp. 281–285. IEEE (2023)

  42. [42]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Schonberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 4104–4113 (2016)

  43. [43]

    DINOv3

Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: Dinov3. arXiv preprint arXiv:2508.10104 (2025)

  44. [44]

arXiv preprint arXiv:2406.10224 (2024)

    Straub, J., DeTone, D., Shen, T., Yang, N., Sweeney, C., Newcombe, R.: Efm3d: A benchmark for measuring progress towards 3d egocentric foundation models. arXiv preprint arXiv:2406.10224 (2024)

  45. [45]

In: European Conference on Computer Vision

Veicht, A., Sarlin, P.E., Lindenberger, P., Pollefeys, M.: Geocalib: Learning single-image calibration with geometric optimization. In: European Conference on Computer Vision. pp. 1–20. Springer (2024)

  46. [46]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Wang, Y., Doersch, C., Arandjelovic, R., Carreira, J., Zisserman, A.: Input-level inductive biases for 3d reconstruction. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6166–6176 (2022)

  47. [47]

    In: Conference on robot learning

    Wang, Y., Guizilini, V.C., Zhang, T., Wang, Y., Zhao, H., Solomon, J.: Detr3d: 3d object detection from multi-view images via 3d-to-2d queries. In: Conference on robot learning. pp. 180–191. PMLR (2022)

  48. [48]

arXiv preprint arXiv:2512.16561 (2025)

    Wang, Y., Ke, L., Zhang, B., Qu, T., Yu, H., Huang, Z., Yu, M., Xu, D., Yu, D.: N3d-vlm: Native 3d grounding enables accurate spatial reasoning in vision-language models. arXiv preprint arXiv:2512.16561 (2025)

  49. [49]

    In: ICCV (2023)

    Xie, Y., Jiang, H., Gkioxari, G., Straub, J.: Pixel-aligned recurrent queries for multi-view 3D object detection. In: ICCV (2023)

  50. [50]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Yang, Y.H., Piccinelli, L., Segu, M., Li, S., Huang, R., Fu, Y., Pollefeys, M., Blum, H., Bauer, Z.: 3d-mood: Lifting 2d to 3d for monocular open-set object detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 7429–7439 (2025)

  51. [51]

    arXiv preprint arXiv:2411.16833 (2024)

    Yao, J., Gu, H., Chen, X., Wang, J., Cheng, Z.: Open vocabulary monocular 3d object detection. arXiv preprint arXiv:2411.16833 (2024)

  52. [52]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Zhang, H., Jiang, H., Yao, Q., Sun, Y., Zhang, R., Zhao, H., Li, H., Zhu, H., Yang, Z.: Detect anything 3d in the wild. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5048–5059 (2025)

  53. [53]

    In: European conference on computer vision

    Zhang, Z., Sun, B., Yang, H., Huang, Q.: H3dnet: 3d object detection using hybrid geometric primitives. In: European conference on computer vision. pp. 311–329. Springer (2020)

  54. [54]

    In: Proceedings of the European Conference on Computer Vision (ECCV) (2022)

Zhou, X., Girdhar, R., Joulin, A., Krähenbühl, P., Misra, I.: Detecting twenty-thousand classes using image-level supervision. In: Proceedings of the European Conference on Computer Vision (ECCV) (2022)