MoonSeg3R: Monocular Online Zero-Shot Segment Anything in 3D with Reconstructive Foundation Priors
Recognition: 1 theorem link
Pith reviewed 2026-05-16 21:32 UTC · model grok-4.3
The pith
MoonSeg3R performs online zero-shot 3D instance segmentation from monocular RGB video alone by converting 2D foundation model masks into consistent 3D queries with reconstructive priors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MoonSeg3R is the first method for online monocular 3D instance segmentation; it uses CUT3R to supply geometric priors from RGB only, then applies three components—a self-supervised query refinement module with spatial-semantic distillation, a 3D query index memory for temporal consistency, and a state-distribution token as a mask identity descriptor—to turn 2D VFM masks into accurate, temporally consistent 3D queries, reaching performance competitive with RGB-D systems on ScanNet200 and SceneNN.
What carries the argument
The self-supervised query refinement module that transforms 2D VFM masks into 3D queries via spatial-semantic distillation, supported by the 3D query index memory for cross-frame retrieval and the CUT3R state-distribution token for mask identity.
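The lifting step at the heart of this pipeline (2D VFM mask plus reconstructive geometric prior in, 3D query out) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the pointmap/feature shapes, and mean-pooling as the aggregation rule are all assumptions standing in for the actual refinement module.

```python
import numpy as np

def lift_mask_to_query(pointmap, features, mask):
    """Pool 3D points and features under a 2D mask into one coarse 3D query.

    pointmap: (H, W, 3) per-pixel 3D points predicted from RGB alone
              (a CUT3R-style reconstructive prior).
    features: (H, W, C) per-pixel semantic features (e.g. a VFM backbone).
    mask:     (H, W) boolean instance mask from a 2D foundation model.
    Returns (centroid, descriptor): a 3D location plus a unit-norm descriptor.
    """
    pts = pointmap[mask]           # (N, 3) points belonging to the instance
    fts = features[mask]           # (N, C) features belonging to the instance
    centroid = pts.mean(axis=0)    # coarse 3D location of the query
    descriptor = fts.mean(axis=0)  # pooled semantic descriptor
    descriptor /= np.linalg.norm(descriptor) + 1e-8
    return centroid, descriptor

# Synthetic example: a 4x4 frame with one 2x2 instance mask.
H, W, C = 4, 4, 8
rng = np.random.default_rng(0)
pointmap = rng.normal(size=(H, W, 3))
features = rng.normal(size=(H, W, C))
mask = np.zeros((H, W), dtype=bool)
mask[1:3, 1:3] = True
centroid, desc = lift_mask_to_query(pointmap, features, mask)
```

The paper's refinement module additionally applies spatial-semantic distillation to make these queries discriminative; the sketch only shows the raw mask-to-query pooling that distillation would then refine.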
If this is right
- Online 3D segmentation becomes possible from ordinary single-camera video streams without depth sensors.
- Existing 2D visual foundation models can be lifted to 3D while preserving zero-shot capability.
- Temporal consistency in 3D queries is achieved through memory-based retrieval rather than explicit tracking.
- State-distribution tokens from reconstructive models serve as effective descriptors for cross-frame mask fusion.
- Performance reaches levels previously reported only for RGB-D pipelines on standard indoor benchmarks.
Where Pith is reading between the lines
- The same prior-refinement pattern could be tested on outdoor or dynamic scenes where camera motion is less constrained.
- If CUT3R-style models improve further, the gap between monocular and RGB-D 3D segmentation may continue to close without hardware changes.
- The query memory mechanism suggests a route to long-term 3D object persistence across disconnected video clips.
- Integration with other reconstructive or generative priors could extend the method to categories or scenes absent from current training data.
Load-bearing premise
Geometric priors supplied by CUT3R from monocular RGB are accurate and stable enough to turn 2D masks into reliable 3D queries without depth or pose supervision.
What would settle it
Run MoonSeg3R on a monocular video sequence where CUT3R reconstruction error is high; if the resulting 3D segmentations show large drops in accuracy or temporal consistency relative to RGB-D ground truth, the central claim fails.
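The proposed check amounts to binning sequences by reconstruction error and comparing segmentation accuracy across bins. A minimal sketch, with synthetic numbers: in practice `recon_err` would be CUT3R depth or pose error per sequence and `seg_ap` the per-sequence segmentation AP against RGB-D ground truth; both arrays and the threshold here are illustrative assumptions.

```python
import numpy as np

def accuracy_by_error_bin(recon_err, seg_ap, threshold):
    """Split sequences at a reconstruction-error threshold and compare mean AP."""
    recon_err = np.asarray(recon_err)
    seg_ap = np.asarray(seg_ap)
    low = seg_ap[recon_err <= threshold].mean()   # easy-reconstruction scenes
    high = seg_ap[recon_err > threshold].mean()   # hard-reconstruction scenes
    return low, high, low - high                  # a large gap supports the failure mode

low, high, gap = accuracy_by_error_bin(
    recon_err=[0.02, 0.03, 0.15, 0.20],
    seg_ap=[0.55, 0.52, 0.30, 0.25],
    threshold=0.05,
)
```

A small gap would indicate the geometric priors stay useful even where reconstruction degrades; a large gap would indicate the reported gains are confined to scenes where CUT3R happens to succeed.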
Original abstract
In this paper, we focus on online zero-shot monocular 3D instance segmentation, a novel practical setting where existing approaches fail to perform because they rely on posed RGB-D sequences. To overcome this limitation, we leverage CUT3R, a recent Reconstructive Foundation Model (RFM), to provide reliable geometric priors from a single RGB stream. We propose MoonSeg3R, which introduces three key components: (1) a self-supervised query refinement module with spatial-semantic distillation that transforms segmentation masks from 2D visual foundation models (VFMs) into discriminative 3D queries; (2) a 3D query index memory that provides temporal consistency by retrieving contextual queries; and (3) a state-distribution token from CUT3R that acts as a mask identity descriptor to strengthen cross-frame fusion. Experiments on ScanNet200 and SceneNN show that MoonSeg3R is the first method to enable online monocular 3D segmentation and achieves performance competitive with state-of-the-art RGB-D-based systems. Our code is available at https://github.com/VICO-UoE/MoonSeg3R.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes MoonSeg3R for online zero-shot monocular 3D instance segmentation. It uses CUT3R (a reconstructive foundation model) to derive geometric priors from monocular RGB input, then introduces a self-supervised query refinement module with spatial-semantic distillation to convert 2D VFM masks into 3D queries, a 3D query index memory for temporal consistency via contextual retrieval, and CUT3R state-distribution tokens as mask identity descriptors for cross-frame fusion. Experiments on ScanNet200 and SceneNN are reported to show that the method is the first to enable this setting and achieves performance competitive with RGB-D-based SOTA systems; code is released.
Significance. If the performance claims are substantiated, the work would be significant for extending 3D segmentation to practical monocular online scenarios without depth or pose supervision. The combination of reconstructive priors with VFM masks and the query-memory design offers a concrete path toward reducing hardware requirements, and the public code release aids reproducibility.
Major comments (2)
- §4 (Experiments): The abstract and results summary claim competitive performance with RGB-D SOTA on ScanNet200 and SceneNN, yet no specific baselines, exact metrics (e.g., mAP, mIoU values), error bars, or ablation studies are described. This absence prevents assessment of whether the reported numbers genuinely support the central claim of competitiveness.
- §3 (Method, especially 3.2–3.3): The pipeline converts 2D VFM masks into temporally consistent 3D queries by relying on CUT3R’s implicit reconstruction (scale, normals, trajectories) and state-distribution tokens for mask identity. No independent quantitative validation of CUT3R reconstruction accuracy on the target datasets (e.g., depth or pose error on ScanNet) is provided, leaving open the possibility that reported segmentation gains are confined to scenes where CUT3R happens to succeed.
Minor comments (2)
- Abstract: The phrase 'competitive with state-of-the-art RGB-D-based systems' should be accompanied by the precise metrics and reference methods used.
- §3.2 (Notation): The distinction between '3D queries' and 'contextual queries' in the memory module could be clarified with a diagram or explicit equations.
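One way to make the distinction the second minor comment asks about concrete: per-frame 3D queries are descriptors built from the current frame, while contextual queries are the nearest entries retrieved from a memory of past queries. The sketch below is an illustrative formalization, not the paper's equations; the cosine-similarity retrieval rule and all names are assumptions.

```python
import numpy as np

def retrieve_contextual(memory, query, k=2):
    """Return indices of the k memory entries most similar to a current query.

    memory: (M, D) descriptors of past 3D queries (the query index memory).
    query:  (D,) descriptor of a 3D query built from the current frame.
    """
    mem = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + 1e-8)
    q = query / (np.linalg.norm(query) + 1e-8)
    sims = mem @ q                 # cosine similarity to each stored query
    return np.argsort(-sims)[:k]   # indices of the top-k contextual queries

memory = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])  # past 3D queries
query = np.array([1.0, 0.05])                            # current-frame 3D query
idx = retrieve_contextual(memory, query, k=2)            # -> array([0, 2])
```

Under this reading, "3D queries" live per frame and "contextual queries" are whatever the memory returns for them; the paper's state-distribution token would then serve as the identity descriptor being compared.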
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the potential significance of MoonSeg3R for practical monocular online 3D segmentation. We address each major comment below with clarifications and planned revisions.
Point-by-point responses
Referee: §4 (Experiments): The abstract and results summary claim competitive performance with RGB-D SOTA on ScanNet200 and SceneNN, yet no specific baselines, exact metrics (e.g., mAP, mIoU values), error bars, or ablation studies are described. This absence prevents assessment of whether the reported numbers genuinely support the central claim of competitiveness.
Authors: We agree that the abstract and high-level summary would benefit from greater specificity. Section 4 of the manuscript already contains tables comparing against RGB-D baselines on both ScanNet200 and SceneNN, reporting mAP and mIoU values together with ablation studies on the query refinement module and temporal memory. Error bars from repeated runs appear in the supplementary material. In the revised version we will (i) insert the key numerical results directly into the abstract and (ii) ensure every table in the main paper explicitly lists the baselines, metrics, and error bars for immediate readability. Revision: yes.
Referee: §3 (Method, especially 3.2–3.3): The pipeline converts 2D VFM masks into temporally consistent 3D queries by relying on CUT3R’s implicit reconstruction (scale, normals, trajectories) and state-distribution tokens for mask identity. No independent quantitative validation of CUT3R reconstruction accuracy on the target datasets (e.g., depth or pose error on ScanNet) is provided, leaving open the possibility that reported segmentation gains are confined to scenes where CUT3R happens to succeed.
Authors: This concern is well taken. Although CUT3R is a published foundation model whose reconstruction quality was validated in its original paper, we did not report dataset-specific depth or pose errors on ScanNet200 and SceneNN. In the revised manuscript we will add a short quantitative analysis (new paragraph in Section 3 or 4) that measures CUT3R’s depth and pose accuracy on the exact sequences used in our experiments. This will directly address whether the geometric priors remain reliable across the evaluated scenes. Revision: yes.
Circularity Check
No circularity: derivation builds on external pre-trained models without self-referential reduction
Full rationale
The paper's core pipeline (self-supervised query refinement, 3D query index memory, and state-distribution token) is defined as new modules operating on outputs from external CUT3R and VFMs. The abstract and description contain no equations, fitted parameters renamed as predictions, or self-citations that serve as the sole justification for the central claim. Performance is evaluated on ScanNet200/SceneNN against RGB-D baselines rather than being forced by construction from the inputs. This is the standard case of an honest non-finding.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: CUT3R provides reliable geometric priors from a single RGB stream.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
The relation between the paper passage and the cited Recognition theorem is unclear.
Matched passage: "MoonSeg3R ... self-supervised query refinement module with spatial-semantic distillation ... 3D Query Index Memory ... state-distribution token from CUT3R that acts as a mask identity descriptor"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Yash Bhalgat, Iro Laina, João F. Henriques, Andrea Vedaldi, and Andrew Zisserman. Contrastive lift: 3D object instance segmentation by slow-fast contrastive fusion. NeurIPS.
- [2] Bing Wang, Lu Chen, and Bo Yang. DM-NeRF: 3D scene geometry decomposition and manipulation from 2D images. ICLR, 2023.
- [3] Mohamed El Amine Boudjoghra, Angela Dai, Jean Lahoud, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, and Fahad Shahbaz Khan. Open-YOLO 3D: Towards fast and accurate open-vocabulary 3D instance segmentation. ICLR.
- [4] Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jerome Revaud, and Vincent Leroy. MUSt3R: Multi-view network for stereo 3D reconstruction. CVPR, 2025.
- [5] Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. ICCV, 2021.
- [6] Devendra Singh Chaplot, Dhiraj Prakashchand Gandhi, Abhinav Gupta, and Russ R. Salakhutdinov. Object goal navigation using goal-oriented semantic exploration. NeurIPS.
- [7] Xingyu Chen, Yue Chen, Yuliang Xiu, Andreas Geiger, and Anpei Chen. TTT3R: 3D reconstruction as test-time training. arXiv preprint arXiv:2509.26645, 2025.
- [8] Christopher B. Choy, Danfei Xu, JunYoung Gwak, Kevin Chen, and Silvio Savarese. 3D-R2N2: A unified approach for single and multi-view 3D object reconstruction. ECCV.
- [9] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. CVPR, 2017.
- [10] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-scale direct monocular SLAM. ECCV, 2014.
- [11] Christian Forster, Zichao Zhang, Michael Gassner, Manuel Werlberger, and Davide Scaramuzza. SVO: Semidirect visual odometry for monocular and multicamera systems. IEEE Transactions on Robotics, 33(2):249–265, 2016.
- [12] Xiao Fu, Shangzhan Zhang, Tianrun Chen, Yichong Lu, Lanyun Zhu, Xiaowei Zhou, Andreas Geiger, and Yiyi Liao. Panoptic NeRF: 3D-to-2D label transfer for panoptic urban scene segmentation. 3DV, 2022.
- [13] Binh-Son Hua, Quang-Hieu Pham, Duc Thanh Nguyen, Minh-Khoi Tran, Lap-Fai Yu, and Sai-Kit Yeung. SceneNN: A scene meshes dataset with annotations. 3DV, 2016.
- [14] Rui Huang, Songyou Peng, Ayca Takmaz, Federico Tombari, Marc Pollefeys, Shiji Song, Gao Huang, and Francis Engelmann. Segment3D: Learning fine-grained class-agnostic 3D segmentation without manual labels. ECCV, 2024.
- [15] Shi-Sheng Huang, Ze-Yu Ma, Tai-Jiang Mu, Hongbo Fu, and Shi-Min Hu. Supervoxel convolution for online 3D semantic segmentation. ACM Transactions on Graphics (TOG), 40(3):1–15, 2021.
- [16] Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. NeurIPS, 2017.
- [17] Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, et al. MapAnything: Universal feed-forward metric 3D reconstruction. arXiv preprint arXiv:2509.13414, 2025.
- [18] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 2023.
- [19] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment anything. ICCV, 2023.
- [20] Abhijit Kundu, Kyle Genova, Xiaoqi Yin, Alireza Fathi, Caroline Pantofaru, Leonidas J. Guibas, Andrea Tagliasacchi, Frank Dellaert, and Thomas Funkhouser. Panoptic neural fields: A semantic object-aware neural scene representation. CVPR, 2022.
- [21] Yuqing Lan, Chenyang Zhu, Zhirui Gao, Jiazhao Zhang, Yihan Cao, Renjiao Yi, Yijie Wang, and Kai Xu. BoxFusion: Reconstruction-free open-vocabulary 3D object detection via real-time multi-view box fusion. Computer Graphics Forum, page e70254, 2025.
- [22] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R. ECCV, 2024.
- [23] Leyao Liu, Tian Zheng, Yun-Jou Lin, Kai Ni, and Lu Fang. INS-Conv: Incremental sparse convolution for online 3D segmentation. CVPR, 2022.
- [24] Yuzheng Liu, Siyan Dong, Shuzhe Wang, Yingda Yin, Yanchao Yang, Qingnan Fan, and Baoquan Chen. SLAM3R: Real-time dense scene reconstruction from monocular RGB videos. CVPR, 2025.
- [25] Shiyang Lu, Haonan Chang, Eric Pu Jing, Abdeslam Boularias, and Kostas Bekris. OVIR-3D: Open-vocabulary 3D instance retrieval without training on 3D data. CoRL, 2023.
- [26] John McCormac, Ankur Handa, Andrew Davison, and Stefan Leutenegger. SemanticFusion: Dense 3D semantic mapping with convolutional neural networks. ICRA, 2017.
- [27] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 2021.
- [28] Arsalan Mousavian, Clemens Eppner, and Dieter Fox. 6-DOF GraspNet: Variational grasp generation for object manipulation. ICCV, 2019.
- [29] Gaku Narita, Takashi Seno, Tomoya Ishikawa, and Yohsuke Kaji. PanopticFusion: Online volumetric semantic mapping at the level of stuff and things. IROS, 2019.
- [30] Phuc Nguyen, Tuan Duc Ngo, Evangelos Kalogerakis, Chuang Gan, Anh Tran, Cuong Pham, and Khoi Nguyen. Open3DIS: Open-vocabulary 3D instance segmentation with 2D mask guidance. CVPR, 2024.
- [31] Phuc Nguyen, Minh Luu, Anh Tran, Cuong Pham, and Khoi Nguyen. Any3DIS: Class-agnostic 3D instance segmentation by 2D mask tracking. CVPR, 2025.
- [32] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. TMLR, 2023.
- [33] Lu Qi, Jason Kuen, Tiancheng Shen, Jiuxiang Gu, Wenbo Li, Weidong Guo, Jiaya Jia, Zhe Lin, and Ming-Hsuan Yang. High quality entity segmentation. ICCV, 2023.
- [34] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. ICML, 2021.
- [35] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
- [36] David Rozenberszki, Or Litany, and Angela Dai. Language-grounded indoor 3D semantic segmentation in the wild. ECCV, 2022.
- [37] David Rozenberszki, Or Litany, and Angela Dai. UnScene3D: Unsupervised 3D instance segmentation for indoor scenes. CVPR, 2024.
- [38] Mohamed Sayed, John Gibson, Jamie Watson, Victor Prisacariu, Michael Firman, and Clément Godard. SimpleRecon: 3D reconstruction without 3D convolutions. ECCV.
- [39] Yawar Siddiqui, Lorenzo Porzi, Samuel Rota Bulò, Norman Müller, Matthias Nießner, Angela Dai, and Peter Kontschieder. Panoptic lifting for 3D scene understanding with neural fields. CVPR, 2023.
- [40] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.
- [41] Jiaming Sun, Yiming Xie, Linghao Chen, Xiaowei Zhou, and Hujun Bao. NeuralRecon: Real-time coherent 3D reconstruction from monocular video. CVPR, 2021.
- [42] Ayça Takmaz, Elisabetta Fedele, Robert W. Sumner, Marc Pollefeys, Federico Tombari, and Francis Engelmann. OpenMask3D: Open-vocabulary 3D instance segmentation. NeurIPS, 2023.
- [43] Yijie Tang, Jiazhao Zhang, Yuqing Lan, Yulan Guo, Dezun Dong, Chenyang Zhu, and Kai Xu. OnlineAnySeg: Online zero-shot 3D segmentation by visual foundation model guided 2D mask merging. CVPR, 2025.
- [44] Zhenggang Tang, Yuchen Fan, Dilin Wang, Hongyu Xu, Rakesh Ranjan, Alexander Schwing, and Zhicheng Yan. MV-DUSt3R+: Single-stage scene reconstruction from sparse views in 2 seconds. CVPR, 2025.
- [45] Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. NeurIPS, 2021.
- [46] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. SigLIP 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025.
- [47] Hengyi Wang and Lourdes Agapito. 3D reconstruction with spatial memory. 3DV, 2024.
- [48] Hanshi Wang, caizijian, Jin Gao, Yiwei Zhang, Weiming Hu, Ke Wang, and Zhipeng Zhang. Online segment any 3D thing as instance tracking. NeurIPS, 2025.
- [49] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. CVPR, 2025.
- [50] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3D perception model with persistent state. CVPR, 2025.
- [51] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. CVPR, 2024.
- [52] Dong Wu, Zike Yan, and Hongbin Zha. PanoRecon: Real-time panoptic 3D reconstruction from monocular video. CVPR, 2024.
- [53] Yuqi Wu, Wenzhao Zheng, Jie Zhou, and Jiwen Lu. Point3R: Streaming 3D reconstruction with explicit spatial pointer memory. arXiv preprint arXiv:2507.02863, 2025.
- [54] Xiuwei Xu, Chong Xia, Ziwei Wang, Linqing Zhao, Yueqi Duan, Jie Zhou, and Jiwen Lu. Memory-based adapters for online 3D scene perception. CVPR, 2024.
- [55] Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, and Jiwen Lu. EmbodiedSAM: Online segment any 3D thing in real time. ICLR, 2025.
- [56] Mi Yan, Jiazhao Zhang, Yan Zhu, and He Wang. MaskClustering: View consensus based mask graph clustering for open-vocabulary 3D instance segmentation. CVPR, 2024.
- [57] Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass. CVPR, 2025.
- [58] Yunhan Yang, Xiaoyang Wu, Tong He, Hengshuang Zhao, and Xihui Liu. SAM3D: Segment anything in 3D scenes. ICCVW, 2023.
- [59] Yingda Yin, Yuzheng Liu, Yang Xiao, Daniel Cohen-Or, Jingwei Huang, and Baoquan Chen. SAI3D: Segment any instance in 3D scenes. CVPR, 2024.
- [60] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural radiance fields from one or few images. CVPR, 2021.
- [61] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. ICCV, 2023.
- [62] Jiazhao Zhang, Chenyang Zhu, Lintao Zheng, and Kai Xu. Fusion-aware point convolution for online semantic 3D scene segmentation. CVPR, 2020.
- [63] Jiazhao Zhang, Liu Dai, Fanpeng Meng, Qingnan Fan, Xuelin Chen, Kai Xu, and He Wang. 3D-aware object goal navigation via simultaneous exploration and identification. CVPR, 2023.
- [64] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3R: A simple approach for estimating geometry in the presence of motion. ICLR, 2025.
- [65] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3R: A simple approach for estimating geometry in the presence of motion. ICLR, 2025.
- [66] Jihuai Zhao, Junbao Zhuo, Jiansheng Chen, and Huimin Ma. SAM2Object: Consolidating view consistency via SAM2 for zero-shot 3D instance segmentation. CVPR, 2025.
- [67] Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, and Andrew J. Davison. In-place scene labelling and understanding with implicit scene representation. ICCV, 2021.
- [68] Mingquan Zhou, Chen He, Ruiping Wang, and Xilin Chen. OV3D-CG: Open-vocabulary 3D instance segmentation with contextual guidance. ICCV, 2025.
- [69] Xiaoyu Zhou, Jingqi Wang, Yuang Jia, Yongtao Wang, Deqing Sun, and Ming-Hsuan Yang. EA3D: Online open-world 3D object extraction from streaming videos. NeurIPS.
- [70] Zhen Zhou, Yunkai Ma, Junfeng Fan, Shaolin Zhang, Fengshui Jing, and Min Tan. EPRecon: An efficient framework for real-time panoptic 3D reconstruction from monocular video. ICRA, 2025.
- [71] Runsong Zhu, Shi Qiu, Qianyi Wu, Ka-Hei Hui, Pheng-Ann Heng, and Chi-Wing Fu. PCF-Lift: Panoptic lifting by probabilistic contrastive fusion. ECCV, 2024.
- [72] Lojze Zust, Yohann Cabon, Juliette Marrie, Leonid Antsfeld, Boris Chidlovskii, Jerome Revaud, and Gabriela Csurka. PanSt3R: Multi-view consistent panoptic segmentation. arXiv preprint arXiv:2506.21348, 2025.