pith. machine review for the scientific record.

arxiv: 2512.15577 · v2 · submitted 2025-12-17 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors

Authors on Pith · no claims yet

Pith reviewed 2026-05-16 21:32 UTC · model grok-4.3

classification 💻 cs.CV
keywords monocular 3D segmentation · zero-shot segmentation · online 3D segmentation · reconstructive foundation models · instance segmentation · temporal consistency · visual foundation models · 3D query refinement
0 comments

The pith

MoonSeg3R performs online zero-shot 3D instance segmentation from monocular RGB video alone by converting 2D foundation model masks into consistent 3D queries with reconstructive priors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that reliable geometric priors extracted from a single RGB stream by a reconstructive foundation model can replace the need for depth sensors or known camera poses in 3D segmentation. It introduces a self-supervised refinement step that distills spatial and semantic information to create discriminative 3D queries from 2D masks, then uses a query memory and identity tokens to keep those queries consistent across frames. This matters because it opens 3D segmentation to ordinary video without specialized hardware. A sympathetic reader would care whether the approach scales, because it removes a major practical barrier between 2D foundation models and 3D scene understanding.
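To make the mask-to-query lift concrete, here is a minimal sketch of the pooling step, assuming a CUT3R-style model that emits a per-pixel world-coordinate pointmap and dense features from RGB alone. The function and array names are illustrative rather than the paper's API, and the transformer-based refinement that follows this step in MoonSeg3R is deliberately omitted.

```python
import numpy as np

def lift_masks_to_3d(masks, pointmap, features):
    """Pool per-pixel 3D points and features under each 2D VFM mask.

    masks:    (K, H, W) boolean masks from a 2D foundation model.
    pointmap: (H, W, 3) world-coordinate points predicted from RGB alone.
    features: (H, W, C) dense geometric/semantic features.
    Returns one coarse 3D query (centroid + pooled feature) per mask.
    """
    queries = []
    for m in masks:
        pts = pointmap[m]   # (N, 3): 3D points covered by this mask
        fts = features[m]   # (N, C): features covered by this mask
        queries.append(np.concatenate([pts.mean(axis=0), fts.mean(axis=0)]))
    return np.stack(queries)  # (K, 3 + C)
```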

Core claim

MoonSeg3R is the first method for online monocular 3D instance segmentation; it uses CUT3R to supply geometric priors from RGB only, then applies three components—a self-supervised query refinement module with spatial-semantic distillation, a 3D query index memory for temporal consistency, and a state-distribution token as a mask identity descriptor—to turn 2D VFM masks into accurate, temporally consistent 3D queries, reaching performance competitive with RGB-D systems on ScanNet200 and SceneNN.

What carries the argument

The self-supervised query refinement module that transforms 2D VFM masks into 3D queries via spatial-semantic distillation, supported by the 3D query index memory for cross-frame retrieval and the CUT3R state-distribution token for mask identity.
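As a rough illustration of the memory component, the sketch below greedily associates each frame's refined queries with stored ones by cosine similarity and assigns persistent instance IDs. This is a schematic stand-in: the paper's 3D query index memory retrieves contextual queries inside the decoder rather than matching post hoc, and the threshold, EMA update, and greedy non-exclusive assignment here are invented for the example.

```python
import numpy as np

def match_to_memory(queries, memory, threshold=0.7):
    """Greedy cross-frame association by cosine similarity.

    queries: (K, D) refined 3D queries for the current frame.
    memory:  dict mapping instance id -> (D,) running descriptor.
    Returns a dict mapping query index -> persistent instance id.
    """
    ids = list(memory.keys())
    feats = np.array([memory[i] for i in ids])  # (M, D) stored descriptors
    assigned = {}
    for k, q in enumerate(queries):
        if not ids:
            assigned[k] = None
            continue
        sims = feats @ q / (np.linalg.norm(feats, axis=1) * np.linalg.norm(q) + 1e-8)
        j = int(sims.argmax())  # greedy: no mutual exclusivity, kept simple
        assigned[k] = ids[j] if sims[j] >= threshold else None
    next_id = max(ids, default=-1) + 1
    for k, mid in assigned.items():
        if mid is None:                       # unmatched query opens a new slot
            memory[next_id] = queries[k]
            assigned[k] = next_id
            next_id += 1
        else:                                 # matched query refreshes its slot
            memory[mid] = 0.9 * memory[mid] + 0.1 * queries[k]
    return assigned
```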

If this is right

  • Online 3D segmentation becomes possible from ordinary single-camera video streams without depth sensors.
  • Existing 2D visual foundation models can be lifted to 3D while preserving zero-shot capability.
  • Temporal consistency in 3D queries is achieved through memory-based retrieval rather than explicit tracking.
  • State-distribution tokens from reconstructive models serve as effective descriptors for cross-frame mask fusion (a similarity sketch follows this list).
  • Performance reaches levels previously reported only for RGB-D pipelines on standard indoor benchmarks.
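A minimal sketch of the token-similarity check flagged in the list above (state-distribution tokens as cross-frame descriptors), mirroring the comparison the paper visualizes in Figure 5. The token shapes and the plain cosine normalization are assumptions, not the paper's exact computation.

```python
import numpy as np

def token_similarity(tokens_prev, tokens_curr):
    """Pairwise cosine similarity between per-instance state-distribution
    tokens from two consecutive frames. If the tokens are good identity
    descriptors, same-instance pairs should score highest.
    tokens_prev: (M, D); tokens_curr: (N, D)."""
    a = tokens_prev / np.linalg.norm(tokens_prev, axis=1, keepdims=True)
    b = tokens_curr / np.linalg.norm(tokens_curr, axis=1, keepdims=True)
    return a @ b.T  # (M, N); row-wise argmax proposes the cross-frame fusion
```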

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prior-refinement pattern could be tested on outdoor or dynamic scenes where camera motion is less constrained.
  • If CUT3R-style models improve further, the gap between monocular and RGB-D 3D segmentation may continue to close without hardware changes.
  • The query memory mechanism suggests a route to long-term 3D object persistence across disconnected video clips.
  • Integration with other reconstructive or generative priors could extend the method to categories or scenes absent from current training data.

Load-bearing premise

Geometric priors supplied by CUT3R from monocular RGB are accurate and stable enough to turn 2D masks into reliable 3D queries without depth or pose supervision.

What would settle it

Run MoonSeg3R on a monocular video sequence where CUT3R reconstruction error is high; if the resulting 3D segmentations show large drops in accuracy or temporal consistency relative to RGB-D ground truth, the central claim fails.
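One hedged way to operationalize that test: score each evaluation scene with a CUT3R reconstruction error and a segmentation AP (both assumed to come from separate, existing evaluations), then compare AP between the easiest and hardest reconstruction quantiles. A large gap would support the failure mode described above; nothing here uses the paper's reported numbers.

```python
import numpy as np

def stress_test(per_scene_recon_err, per_scene_ap, err_quantile=0.8):
    """Split scenes by reconstruction error and compare segmentation AP.

    per_scene_recon_err: (S,) CUT3R reconstruction error per scene.
    per_scene_ap:        (S,) segmentation AP per scene (same order).
    """
    err = np.asarray(per_scene_recon_err, dtype=float)
    ap = np.asarray(per_scene_ap, dtype=float)
    cut = np.quantile(err, err_quantile)        # threshold for "hard" scenes
    easy, hard = ap[err <= cut], ap[err > cut]
    return {"easy_AP": easy.mean(), "hard_AP": hard.mean(),
            "drop": easy.mean() - hard.mean()}
```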

Figures

Figures reproduced from arXiv: 2512.15577 by Duolikun Danier, Hakan Bilen, Jan Eric Lenssen, Zhipeng Du.

Figure 1
Figure 1. Previous VFM-assisted online paradigm vs. ours. While existing methods rely on ground-truth geometry (and 3D segmentation masks), our method works in a monocular online zero-shot setting, exploiting the spatio-temporal priors from an RFM to help with online 3D segmentation, thereby simultaneously achieving online reconstruction and segmentation. view at source ↗
Figure 2
Figure 2. Overview of MoonSeg3R. The pipeline consists of four steps. (a) CUT3R takes an uncalibrated image I_t as input to predict explicit geometry (pose P_t, world-coordinate pointmap X_t) and implicit representations (geometric features F_t^3d, state attention A_t). (b) VFM masks M_t are lifted and refined into 3D queries q'_t through a transformer decoder, via spatial-semantic self-distillation supervision (L_dist… view at source ↗
Figure 3
Figure 3. Qualitative comparison. Qualitative examples of OnlineAnySeg-M and our method on ScanNet200 sequences. These results visually demonstrate that MoonSeg3R achieves superior instance segmentation. OnlineAnySeg-M, in contrast, tends to fail in associating masks, which leaves significant unsegmented areas, as shown in the red dashed circles. The segmentation results are unprojected to the ground-truth point cloud f… view at source ↗
Figure 5
Figure 5. State distribution similarity. For two consecutive frames, we extract the state-distribution tokens for all instances and compute their cross-frame pairwise similarities. Tokens belonging to the same instances always exhibit the highest similarity scores, both for large, fully visible objects (sofa) and small, partially observed objects (table). view at source ↗
read the original abstract

In this paper, we focus on online zero-shot monocular 3D instance segmentation, a novel practical setting where existing approaches fail to perform because they rely on posed RGB-D sequences. To overcome this limitation, we leverage CUT3R, a recent Reconstructive Foundation Model (RFM), to provide reliable geometric priors from a single RGB stream. We propose MoonSeg3R, which introduces three key components: (1) a self-supervised query refinement module with spatial-semantic distillation that transforms segmentation masks from 2D visual foundation models (VFMs) into discriminative 3D queries; (2) a 3D query index memory that provides temporal consistency by retrieving contextual queries; and (3) a state-distribution token from CUT3R that acts as a mask identity descriptor to strengthen cross-frame fusion. Experiments on ScanNet200 and SceneNN show that MoonSeg3R is the first method to enable online monocular 3D segmentation and achieves performance competitive with state-of-the-art RGB-D-based systems. Our code is available at https://github.com/VICO-UoE/MoonSeg3R.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes MoonSeg3R for online zero-shot monocular 3D instance segmentation. It uses CUT3R (a reconstructive foundation model) to derive geometric priors from monocular RGB input, then introduces a self-supervised query refinement module with spatial-semantic distillation to convert 2D VFM masks into 3D queries, a 3D query index memory for temporal consistency via contextual retrieval, and CUT3R state-distribution tokens as mask identity descriptors for cross-frame fusion. Experiments on ScanNet200 and SceneNN are reported to show that the method is the first to enable this setting and achieves performance competitive with RGB-D-based SOTA systems; code is released.

Significance. If the performance claims are substantiated, the work would be significant for extending 3D segmentation to practical monocular online scenarios without depth or pose supervision. The combination of reconstructive priors with VFM masks and the query-memory design offers a concrete path toward reducing hardware requirements, and the public code release aids reproducibility.

major comments (2)
  1. [§4] §4 (Experiments): The abstract and results summary claim competitive performance with RGB-D SOTA on ScanNet200 and SceneNN, yet no specific baselines, exact metrics (e.g., mAP, mIoU values), error bars, or ablation studies are described. This absence prevents assessment of whether the reported numbers genuinely support the central claim of competitiveness.
  2. [§3] §3 (Method, especially 3.2–3.3): The pipeline converts 2D VFM masks into temporally consistent 3D queries by relying on CUT3R’s implicit reconstruction (scale, normals, trajectories) and state-distribution tokens for mask identity. No independent quantitative validation of CUT3R reconstruction accuracy on the target datasets (e.g., depth or pose error on ScanNet) is provided, leaving open the possibility that reported segmentation gains are confined to scenes where CUT3R happens to succeed.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'competitive with state-of-the-art RGB-D-based systems' should be accompanied by the precise metrics and reference methods used.
  2. [§3.2] Notation: The distinction between '3D queries' and 'contextual queries' in the memory module could be clarified with a diagram or explicit equations.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential significance of MoonSeg3R for practical monocular online 3D segmentation. We address each major comment below with clarifications and planned revisions.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments): The abstract and results summary claim competitive performance with RGB-D SOTA on ScanNet200 and SceneNN, yet no specific baselines, exact metrics (e.g., mAP, mIoU values), error bars, or ablation studies are described. This absence prevents assessment of whether the reported numbers genuinely support the central claim of competitiveness.

    Authors: We agree that the abstract and high-level summary would benefit from greater specificity. Section 4 of the manuscript already contains tables comparing against RGB-D baselines on both ScanNet200 and SceneNN, reporting mAP and mIoU values together with ablation studies on the query refinement module and temporal memory. Error bars from repeated runs appear in the supplementary material. In the revised version we will (i) insert the key numerical results directly into the abstract and (ii) ensure every table in the main paper explicitly lists the baselines, metrics, and error bars for immediate readability. revision: yes

  2. Referee: [§3] §3 (Method, especially 3.2–3.3): The pipeline converts 2D VFM masks into temporally consistent 3D queries by relying on CUT3R’s implicit reconstruction (scale, normals, trajectories) and state-distribution tokens for mask identity. No independent quantitative validation of CUT3R reconstruction accuracy on the target datasets (e.g., depth or pose error on ScanNet) is provided, leaving open the possibility that reported segmentation gains are confined to scenes where CUT3R happens to succeed.

    Authors: This concern is well taken. Although CUT3R is a published foundation model whose reconstruction quality was validated in its original paper, we did not report dataset-specific depth or pose errors on ScanNet200 and SceneNN. In the revised manuscript we will add a short quantitative analysis (new paragraph in Section 3 or 4) that measures CUT3R’s depth and pose accuracy on the exact sequences used in our experiments. This will directly address whether the geometric priors remain reliable across the evaluated scenes. revision: yes
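For reference, a standard way the promised CUT3R validation could be scored is the absolute-relative depth error against sensor depth, with a median scale alignment since monocular predictions are recovered only up to scale. This is a generic metric sketch, not code from the authors.

```python
import numpy as np

def abs_rel_depth_error(pred_depth, gt_depth, min_depth=1e-3):
    """Absolute-relative error between predicted and sensor depth maps.

    Applies median scale alignment first, because monocular depth is
    defined only up to an unknown global scale.
    """
    valid = gt_depth > min_depth                # mask out invalid sensor pixels
    pred, gt = pred_depth[valid], gt_depth[valid]
    pred = pred * (np.median(gt) / np.median(pred))
    return float(np.mean(np.abs(pred - gt) / gt))
```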

Circularity Check

0 steps flagged

No circularity: derivation builds on external pre-trained models without self-referential reduction

full rationale

The paper's core pipeline (self-supervised query refinement, 3D query index memory, and state-distribution token) is defined as new modules operating on outputs from external CUT3R and VFMs. The abstract and description contain no equations, fitted parameters renamed as predictions, or self-citations that serve as the sole justification for the central claim. Performance is evaluated on ScanNet200/SceneNN against RGB-D baselines rather than being forced by construction from the inputs. This is the standard case of an honest non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the reliability of pre-trained CUT3R for geometric priors and the effectiveness of the three introduced modules; no free parameters or new entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption CUT3R provides reliable geometric priors from a single RGB stream
    Invoked as the foundation for transforming 2D masks into 3D queries.

pith-pipeline@v0.9.0 · 5516 in / 1178 out tokens · 45242 ms · 2026-05-16T21:32:18.446099+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
