pith. machine review for the scientific record.

arxiv: 2512.15577 · v2 · submitted 2025-12-17 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 21:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords monocular 3D segmentation · zero-shot segmentation · online 3D segmentation · reconstructive foundation models · instance segmentation · temporal consistency · visual foundation models · 3D query refinement
0 comments

The pith

MoonSeg3R performs online zero-shot 3D instance segmentation from monocular RGB video alone by converting 2D foundation model masks into consistent 3D queries with reconstructive priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that reliable geometric priors extracted from a single RGB stream by a reconstructive foundation model can replace the need for depth sensors or known camera poses in 3D segmentation. It introduces a self-supervised refinement step that distills spatial and semantic information to create discriminative 3D queries from 2D masks, then uses a query memory and identity tokens to keep those queries consistent across frames. This matters because it opens 3D segmentation to ordinary video without specialized hardware. A sympathetic reader would care whether the approach scales, because it removes a major practical barrier between 2D foundation models and 3D scene understanding.
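To make the mask-to-query lift concrete, here is a minimal sketch of the pooling step, assuming a CUT3R-style model that emits a per-pixel world-coordinate pointmap and dense features from RGB alone. The function and array names are illustrative rather than the paper's API, and the transformer-based refinement that follows this step in MoonSeg3R is deliberately omitted.

```python
import numpy as np

def lift_masks_to_3d(masks, pointmap, features):
    """Pool per-pixel 3D points and features under each 2D VFM mask.

    masks:    (K, H, W) boolean masks from a 2D foundation model.
    pointmap: (H, W, 3) world-coordinate points predicted from RGB alone.
    features: (H, W, C) dense geometric/semantic features.
    Returns one coarse 3D query (centroid + pooled feature) per mask.
    """
    queries = []
    for m in masks:
        pts = pointmap[m]   # (N, 3): 3D points covered by this mask
        fts = features[m]   # (N, C): features covered by this mask
        queries.append(np.concatenate([pts.mean(axis=0), fts.mean(axis=0)]))
    return np.stack(queries)  # (K, 3 + C)
```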

Core claim

MoonSeg3R is the first method for online monocular 3D instance segmentation; it uses CUT3R to supply geometric priors from RGB only, then applies three components—a self-supervised query refinement module with spatial-semantic distillation, a 3D query index memory for temporal consistency, and a state-distribution token as a mask identity descriptor—to turn 2D VFM masks into accurate, temporally consistent 3D queries, reaching performance competitive with RGB-D systems on ScanNet200 and SceneNN.

What carries the argument

The self-supervised query refinement module that transforms 2D VFM masks into 3D queries via spatial-semantic distillation, supported by the 3D query index memory for cross-frame retrieval and the CUT3R state-distribution token for mask identity.
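As a rough illustration of the memory component, the sketch below greedily associates each frame's refined queries with stored ones by cosine similarity and assigns persistent instance IDs. This is a schematic stand-in: the paper's 3D query index memory retrieves contextual queries inside the decoder rather than matching post hoc, and the threshold, EMA update, and greedy non-exclusive assignment here are invented for the example.

```python
import numpy as np

def match_to_memory(queries, memory, threshold=0.7):
    """Greedy cross-frame association by cosine similarity.

    queries: (K, D) refined 3D queries for the current frame.
    memory:  dict mapping instance id -> (D,) running descriptor.
    Returns a dict mapping query index -> persistent instance id.
    """
    ids = list(memory.keys())
    feats = np.array([memory[i] for i in ids])  # (M, D) stored descriptors
    assigned = {}
    for k, q in enumerate(queries):
        if not ids:
            assigned[k] = None
            continue
        sims = feats @ q / (np.linalg.norm(feats, axis=1) * np.linalg.norm(q) + 1e-8)
        j = int(sims.argmax())  # greedy: no mutual exclusivity, kept simple
        assigned[k] = ids[j] if sims[j] >= threshold else None
    next_id = max(ids, default=-1) + 1
    for k, mid in assigned.items():
        if mid is None:                       # unmatched query opens a new slot
            memory[next_id] = queries[k]
            assigned[k] = next_id
            next_id += 1
        else:                                 # matched query refreshes its slot
            memory[mid] = 0.9 * memory[mid] + 0.1 * queries[k]
    return assigned
```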

If this is right

  • Online 3D segmentation becomes possible from ordinary single-camera video streams without depth sensors.
  • Existing 2D visual foundation models can be lifted to 3D while preserving zero-shot capability.
  • Temporal consistency in 3D queries is achieved through memory-based retrieval rather than explicit tracking.
  • State-distribution tokens from reconstructive models serve as effective descriptors for cross-frame mask fusion (a similarity sketch follows this list).
  • Performance reaches levels previously reported only for RGB-D pipelines on standard indoor benchmarks.
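A minimal sketch of the token-similarity check flagged in the list above (state-distribution tokens as cross-frame descriptors), mirroring the comparison the paper visualizes in Figure 5. The token shapes and the plain cosine normalization are assumptions, not the paper's exact computation.

```python
import numpy as np

def token_similarity(tokens_prev, tokens_curr):
    """Pairwise cosine similarity between per-instance state-distribution
    tokens from two consecutive frames. If the tokens are good identity
    descriptors, same-instance pairs should score highest.
    tokens_prev: (M, D); tokens_curr: (N, D)."""
    a = tokens_prev / np.linalg.norm(tokens_prev, axis=1, keepdims=True)
    b = tokens_curr / np.linalg.norm(tokens_curr, axis=1, keepdims=True)
    return a @ b.T  # (M, N); row-wise argmax proposes the cross-frame fusion
```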

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prior-refinement pattern could be tested on outdoor or dynamic scenes where camera motion is less constrained.
  • If CUT3R-style models improve further, the gap between monocular and RGB-D 3D segmentation may continue to close without hardware changes.
  • The query memory mechanism suggests a route to long-term 3D object persistence across disconnected video clips.
  • Integration with other reconstructive or generative priors could extend the method to categories or scenes absent from current training data.

Load-bearing premise

Geometric priors supplied by CUT3R from monocular RGB are accurate and stable enough to turn 2D masks into reliable 3D queries without depth or pose supervision.

What would settle it

Run MoonSeg3R on a monocular video sequence where CUT3R reconstruction error is high; if the resulting 3D segmentations show large drops in accuracy or temporal consistency relative to RGB-D ground truth, the central claim fails.
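One hedged way to operationalize that test: score each evaluation scene with a CUT3R reconstruction error and a segmentation AP (both assumed to come from separate, existing evaluations), then compare AP between the easiest and hardest reconstruction quantiles. A large gap would support the failure mode described above; nothing here uses the paper's reported numbers.

```python
import numpy as np

def stress_test(per_scene_recon_err, per_scene_ap, err_quantile=0.8):
    """Split scenes by reconstruction error and compare segmentation AP.

    per_scene_recon_err: (S,) CUT3R reconstruction error per scene.
    per_scene_ap:        (S,) segmentation AP per scene (same order).
    """
    err = np.asarray(per_scene_recon_err, dtype=float)
    ap = np.asarray(per_scene_ap, dtype=float)
    cut = np.quantile(err, err_quantile)        # threshold for "hard" scenes
    easy, hard = ap[err <= cut], ap[err > cut]
    return {"easy_AP": easy.mean(), "hard_AP": hard.mean(),
            "drop": easy.mean() - hard.mean()}
```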

Figures

Figures reproduced from arXiv: 2512.15577 by Duolikun Danier, Hakan Bilen, Jan Eric Lenssen, Zhipeng Du.

Figure 1
Figure 1. Previous VFM-assisted online paradigm vs. ours. While existing methods rely on ground-truth geometry (and 3D segmentation masks), our method works in a monocular online zero-shot setting, exploiting the spatio-temporal priors from an RFM to help with online 3D segmentation, thereby simultaneously achieving online reconstruction and segmentation. view at source ↗
Figure 2
Figure 2. Overview of MoonSeg3R. The pipeline consists of four steps. (a) CUT3R takes an uncalibrated image I_t as input to predict explicit geometry (pose P_t, world-coordinate pointmap X_t) and implicit representations (geometric features F_t^3d, state attention A_t). (b) VFM masks M_t are lifted and refined into 3D queries q'_t through a transformer decoder, via spatial-semantic self-distillation supervision (L_dist… view at source ↗
Figure 3
Figure 3. Qualitative comparison. Qualitative examples of OnlineAnySeg-M and our method on ScanNet200 sequences. These results visually demonstrate that MoonSeg3R achieves superior instance segmentation. OnlineAnySeg-M, in contrast, tends to fail in associating masks, which leaves significant unsegmented areas, as shown in the red dashed circles. The segmentation results are unprojected to the ground-truth point cloud f… view at source ↗
Figure 5
Figure 5. State distribution similarity. For two consecutive frames, we extract the state-distribution tokens for all instances and compute their cross-frame pairwise similarities. Tokens belonging to the same instances always exhibit the highest similarity scores, both for large, fully visible objects (sofa) and small, partially observed objects (table). view at source ↗
read the original abstract

In this paper, we focus on online zero-shot monocular 3D instance segmentation, a novel practical setting where existing approaches fail to perform because they rely on posed RGB-D sequences. To overcome this limitation, we leverage CUT3R, a recent Reconstructive Foundation Model (RFM), to provide reliable geometric priors from a single RGB stream. We propose MoonSeg3R, which introduces three key components: (1) a self-supervised query refinement module with spatial-semantic distillation that transforms segmentation masks from 2D visual foundation models (VFMs) into discriminative 3D queries; (2) a 3D query index memory that provides temporal consistency by retrieving contextual queries; and (3) a state-distribution token from CUT3R that acts as a mask identity descriptor to strengthen cross-frame fusion. Experiments on ScanNet200 and SceneNN show that MoonSeg3R is the first method to enable online monocular 3D segmentation and achieves performance competitive with state-of-the-art RGB-D-based systems. Our code is available at https://github.com/VICO-UoE/MoonSeg3R.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MoonSeg3R for online zero-shot monocular 3D instance segmentation. It uses CUT3R (a reconstructive foundation model) to derive geometric priors from monocular RGB input, then introduces a self-supervised query refinement module with spatial-semantic distillation to convert 2D VFM masks into 3D queries, a 3D query index memory for temporal consistency via contextual retrieval, and CUT3R state-distribution tokens as mask identity descriptors for cross-frame fusion. Experiments on ScanNet200 and SceneNN are reported to show that the method is the first to enable this setting and achieves performance competitive with RGB-D-based SOTA systems; code is released.

Significance. If the performance claims are substantiated, the work would be significant for extending 3D segmentation to practical monocular online scenarios without depth or pose supervision. The combination of reconstructive priors with VFM masks and the query-memory design offers a concrete path toward reducing hardware requirements, and the public code release aids reproducibility.

major comments (2)
  1. [§4] §4 (Experiments): The abstract and results summary claim competitive performance with RGB-D SOTA on ScanNet200 and SceneNN, yet no specific baselines, exact metrics (e.g., mAP, mIoU values), error bars, or ablation studies are described. This absence prevents assessment of whether the reported numbers genuinely support the central claim of competitiveness.
  2. [§3] §3 (Method, especially 3.2–3.3): The pipeline converts 2D VFM masks into temporally consistent 3D queries by relying on CUT3R’s implicit reconstruction (scale, normals, trajectories) and state-distribution tokens for mask identity. No independent quantitative validation of CUT3R reconstruction accuracy on the target datasets (e.g., depth or pose error on ScanNet) is provided, leaving open the possibility that reported segmentation gains are confined to scenes where CUT3R happens to succeed.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'competitive with state-of-the-art RGB-D-based systems' should be accompanied by the precise metrics and reference methods used.
  2. [§3.2] Notation: The distinction between '3D queries' and 'contextual queries' in the memory module could be clarified with a diagram or explicit equations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential significance of MoonSeg3R for practical monocular online 3D segmentation. We address each major comment below with clarifications and planned revisions.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): The abstract and results summary claim competitive performance with RGB-D SOTA on ScanNet200 and SceneNN, yet no specific baselines, exact metrics (e.g., mAP, mIoU values), error bars, or ablation studies are described. This absence prevents assessment of whether the reported numbers genuinely support the central claim of competitiveness.

    Authors: We agree that the abstract and high-level summary would benefit from greater specificity. Section 4 of the manuscript already contains tables comparing against RGB-D baselines on both ScanNet200 and SceneNN, reporting mAP and mIoU values together with ablation studies on the query refinement module and temporal memory. Error bars from repeated runs appear in the supplementary material. In the revised version we will (i) insert the key numerical results directly into the abstract and (ii) ensure every table in the main paper explicitly lists the baselines, metrics, and error bars for immediate readability. revision: yes

  2. Referee: [§3] §3 (Method, especially 3.2–3.3): The pipeline converts 2D VFM masks into temporally consistent 3D queries by relying on CUT3R’s implicit reconstruction (scale, normals, trajectories) and state-distribution tokens for mask identity. No independent quantitative validation of CUT3R reconstruction accuracy on the target datasets (e.g., depth or pose error on ScanNet) is provided, leaving open the possibility that reported segmentation gains are confined to scenes where CUT3R happens to succeed.

    Authors: This concern is well taken. Although CUT3R is a published foundation model whose reconstruction quality was validated in its original paper, we did not report dataset-specific depth or pose errors on ScanNet200 and SceneNN. In the revised manuscript we will add a short quantitative analysis (new paragraph in Section 3 or 4) that measures CUT3R’s depth and pose accuracy on the exact sequences used in our experiments. This will directly address whether the geometric priors remain reliable across the evaluated scenes. revision: yes
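For reference, a standard way the promised CUT3R validation could be scored is the absolute-relative depth error against sensor depth, with a median scale alignment since monocular predictions are recovered only up to scale. This is a generic metric sketch, not code from the authors.

```python
import numpy as np

def abs_rel_depth_error(pred_depth, gt_depth, min_depth=1e-3):
    """Absolute-relative error between predicted and sensor depth maps.

    Applies median scale alignment first, because monocular depth is
    defined only up to an unknown global scale.
    """
    valid = gt_depth > min_depth                # mask out invalid sensor pixels
    pred, gt = pred_depth[valid], gt_depth[valid]
    pred = pred * (np.median(gt) / np.median(pred))
    return float(np.mean(np.abs(pred - gt) / gt))
```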

Circularity Check

0 steps flagged

No circularity: derivation builds on external pre-trained models without self-referential reduction

full rationale

The paper's core pipeline (self-supervised query refinement, 3D query index memory, and state-distribution token) is defined as new modules operating on outputs from external CUT3R and VFMs. The abstract and description contain no equations, fitted parameters renamed as predictions, or self-citations that serve as the sole justification for the central claim. Performance is evaluated on ScanNet200/SceneNN against RGB-D baselines rather than being forced by construction from the inputs. This is the standard case of an honest non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the reliability of pre-trained CUT3R for geometric priors and the effectiveness of the three introduced modules; no free parameters or new entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption CUT3R provides reliable geometric priors from a single RGB stream
    Invoked as the foundation for transforming 2D masks into 3D queries.

pith-pipeline@v0.9.0 · 5516 in / 1178 out tokens · 45242 ms · 2026-05-16T21:32:18.446099+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
