VGGT-Segmentor: Geometry-Enhanced Cross-View Segmentation
Pith reviewed 2026-05-10 13:21 UTC · model grok-4.3
The pith
VGGT-Segmentor adds a three-stage Union Segmentation Head to VGGT features for accurate instance masks across egocentric and exocentric views without paired annotations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VGGT-Segmentor unifies VGGT's cross-view geometric representations with a Union Segmentation Head that performs mask prompt fusion, point-guided prediction, and iterative refinement, translating high-level feature consistency into pixel-accurate instance segmentation masks. The accompanying single-image self-supervised training removes the dependence on paired annotations and supports generalization to the target benchmark.
What carries the argument
The Union Segmentation Head, which executes mask prompt fusion, point-guided prediction, and iterative mask refinement to turn VGGT cross-view feature alignments into dense pixel predictions.
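The paper does not publish code for the head, but its three described stages can be sketched as a toy pipeline. Everything below is illustrative: the function name, shapes, the box-filter stand-in for learned refinement, and the 0.5 threshold are all assumptions, not the authors' implementation.

```python
import numpy as np

def union_segmentation_head(feat_mask, point_prompts, n_iters=3):
    """Hypothetical sketch of the three stages described in the paper:
    (1) mask prompt fusion, (2) point-guided prediction, (3) iterative
    mask refinement. Shapes and operations are illustrative only."""
    # Stage 1: mask prompt fusion -- combine coarse mask prompts (K, H, W)
    # into a single fused prior (H, W).
    prior = np.mean(feat_mask, axis=0)
    # Stage 2: point-guided prediction -- bias the prior toward prompt points.
    pred = prior.copy()
    for (y, x) in point_prompts:
        pred[y, x] = 1.0                     # anchor positive point prompts
    # Stage 3: iterative refinement -- smooth and re-threshold the mask;
    # a 3x3 box filter stands in for a learned refinement step.
    h, w = pred.shape
    for _ in range(n_iters):
        padded = np.pad(pred, 1, mode="edge")
        pred = sum(padded[dy:dy + h, dx:dx + w]
                   for dy in range(3) for dx in range(3)) / 9.0
        for (y, x) in point_prompts:         # keep the anchors fixed
            pred[y, x] = 1.0
    return (pred > 0.5).astype(np.uint8)
```

In the real model each stage would be a learned module operating on VGGT features; the sketch only fixes the data flow the review describes.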
If this is right
- Achieves 67.7% average IoU for Ego-to-Exo and 68.0% for Exo-to-Ego on Ego-Exo4D, setting a new state of the art.
- A correspondence-free pretrained model surpasses most fully supervised baselines on the same benchmark.
- Training requires no paired annotations, lowering the cost of scaling cross-view segmentation.
- Supports downstream uses in embodied AI and remote collaboration where view discrepancies are common.
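The headline numbers above are average mask IoU. For concreteness, a minimal sketch of the metric (the helper names are ours; the benchmark's exact averaging protocol may differ):

```python
import numpy as np

def mask_iou(pred, gt):
    """Intersection-over-Union between two binary instance masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:                 # both masks empty: define IoU as 1
        return 1.0
    return np.logical_and(pred, gt).sum() / union

def average_iou(pairs):
    """Mean IoU over (prediction, ground-truth) mask pairs,
    as in a per-task benchmark average."""
    return float(np.mean([mask_iou(p, g) for p, g in pairs]))
```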
Where Pith is reading between the lines
- The same head design could be attached to other geometry-aware backbones to improve their dense-prediction performance.
- Single-image self-supervision may transfer to additional multi-view tasks such as tracking or depth estimation.
- Better handling of projection drift could benefit robotics systems that fuse cameras with differing intrinsics.
Load-bearing premise
The three-stage Union Segmentation Head can reliably produce pixel-accurate masks from VGGT features despite projection drift, and single-image self-supervised training yields strong results on the Ego-Exo4D benchmark.
What would settle it
Direct measurement of mask IoU on a held-out set of scenes chosen for high projection drift and occlusion would show whether the head actually overcomes the drift limitation.
Original abstract
Instance-level object segmentation across disparate egocentric and exocentric views is a fundamental challenge in visual understanding, critical for applications in embodied AI and remote collaboration. This task is exceptionally difficult due to severe changes in scale, perspective, and occlusion, which destabilize direct pixel-level matching. While recent geometry-aware models like VGGT provide a strong foundation for feature alignment, we find they often fail at dense prediction tasks due to significant pixel-level projection drift, even when their internal object-level attention remains consistent. To bridge this gap, we introduce VGGT-Segmentor (VGGT-S), a framework that unifies robust geometric modeling with pixel-accurate semantic segmentation. VGGT-S leverages VGGT's powerful cross-view feature representation and introduces a novel Union Segmentation Head. This head operates in three stages: mask prompt fusion, point-guided prediction, and iterative mask refinement, effectively translating high-level feature alignment into a precise segmentation mask. Furthermore, we propose a single-image self-supervised training strategy that eliminates the need for paired annotations and enables strong generalization. On the Ego-Exo4D benchmark, VGGT-S sets a new state-of-the-art, achieving 67.7% and 68.0% average IoU for Ego to Exo and Exo to Ego tasks, respectively, significantly outperforming prior methods. Notably, our correspondence-free pretrained model surpasses most fully-supervised baselines, demonstrating the effectiveness and scalability of our approach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents VGGT-Segmentor (VGGT-S), a framework for instance-level object segmentation between egocentric and exocentric views. It builds on VGGT by adding a Union Segmentation Head with three stages—mask prompt fusion, point-guided prediction, and iterative mask refinement—to convert cross-view feature alignments into pixel-accurate masks. A single-image self-supervised training strategy is proposed to avoid the need for paired annotations. The paper claims state-of-the-art performance on the Ego-Exo4D benchmark, with average IoU of 67.7% for Ego-to-Exo and 68.0% for Exo-to-Ego tasks, outperforming prior methods and even most fully-supervised baselines.
Significance. If the reported performance gains hold after detailed validation, this work would be significant for cross-view visual understanding. It shows that geometry-aware features from models like VGGT can be adapted via a lightweight head and self-supervision to achieve dense segmentation under severe viewpoint changes, without paired annotations. This has potential impact for embodied AI and multi-view applications by offering a scalable alternative to fully supervised cross-view methods.
Major comments (3)
- [Abstract and §4] The central claim that VGGT-S achieves 67.7%/68.0% IoU and surpasses most fully-supervised baselines on Ego-Exo4D is load-bearing, yet the manuscript provides no ablation isolating the contribution of the three-stage Union Segmentation Head to drift reduction, nor any quantitative drift metric (e.g., average pixel displacement) before versus after the head.
- [§3.2] The Union Segmentation Head is described as converting high-level alignments into masks despite projection drift, but the text lacks equations or pseudocode specifying how mask prompt fusion interacts with point-guided prediction, and no ablation table shows the IoU change when any one of the three stages is removed.
- [§3.4 and §4.3] The single-image self-supervised strategy is asserted to produce features that generalize to cross-view Ego-Exo4D tasks without paired data, but the experiments contain no proxy evaluation (e.g., attention map consistency on held-out paired views) or direct comparison against a supervised VGGT fine-tuning baseline to confirm that the self-supervised objective implicitly captures cross-view geometry.
Minor comments (2)
- [Abstract] The abbreviation 'VGGT-S' appears in the abstract without an explicit first-use definition linking it to 'VGGT-Segmentor'.
- [§4] Table 1 (or equivalent results table) would benefit from reporting the number of runs and standard deviation alongside the mean IoU values to allow assessment of result stability.
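The second minor comment asks for run counts and standard deviations next to the mean IoU. A minimal sketch of what that summary would compute (the function name and output keys are ours):

```python
import statistics

def summarize_runs(iou_runs):
    """Mean and sample standard deviation of per-run average IoU --
    the extra columns the referee asks for in the results table."""
    mean = statistics.mean(iou_runs)
    std = statistics.stdev(iou_runs) if len(iou_runs) > 1 else 0.0
    return {"n_runs": len(iou_runs), "mean_iou": mean, "std_iou": std}
```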
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, indicating the revisions we will incorporate to strengthen the paper.
Point-by-point responses
Referee: [Abstract and §4] The central claim that VGGT-S achieves 67.7%/68.0% IoU and surpasses most fully-supervised baselines on Ego-Exo4D is load-bearing, yet the manuscript provides no ablation isolating the contribution of the three-stage Union Segmentation Head to drift reduction, nor any quantitative drift metric (e.g., average pixel displacement) before versus after the head.
Authors: We agree that an explicit ablation and drift metric would provide stronger support for the central claim. In the revised manuscript, we will add an ablation study in §4 that isolates the Union Segmentation Head's contribution by reporting IoU with and without the full head (and its stages). We will also introduce a quantitative drift metric, such as average pixel displacement of projected points before versus after refinement, to directly measure the head's effect on reducing projection drift. Revision: yes.
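The proposed drift metric (average pixel displacement of projected points before versus after refinement) is simple to state precisely. A hedged sketch, with hypothetical names and an assumed (N, 2) point layout:

```python
import numpy as np

def average_pixel_displacement(proj_pts, ref_pts):
    """Hypothetical drift metric: mean Euclidean distance in pixels
    between projected correspondences before refinement (proj_pts)
    and after refinement (ref_pts). Both are (N, 2) arrays of (x, y)."""
    proj_pts = np.asarray(proj_pts, dtype=float)
    ref_pts = np.asarray(ref_pts, dtype=float)
    return float(np.linalg.norm(proj_pts - ref_pts, axis=1).mean())
```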
Referee: [§3.2] The Union Segmentation Head is described as converting high-level alignments into masks despite projection drift, but the text lacks equations or pseudocode specifying how mask prompt fusion interacts with point-guided prediction, and no ablation table shows the IoU change when any one of the three stages is removed.
Authors: We will revise §3.2 to include formal equations and pseudocode that specify the sequential interactions among mask prompt fusion, point-guided prediction, and iterative mask refinement. We will also add an ablation table in §4.3 that reports IoU changes when each of the three stages is removed individually, quantifying their respective contributions to the final segmentation performance. Revision: yes.
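The promised stage-removal ablation has a standard shape: run the full model once, then re-run with each stage disabled and report the IoU delta. A sketch in which `evaluate` is a stand-in for a real benchmark run (all names hypothetical):

```python
def ablation_table(evaluate, stages=("mask_prompt_fusion",
                                     "point_guided_prediction",
                                     "iterative_refinement")):
    """Sketch of the requested ablation: evaluate(disabled) returns the
    average IoU with the named stage switched off (None = full model)."""
    full = evaluate(None)
    rows = [("full model", full, 0.0)]
    for stage in stages:
        iou = evaluate(stage)
        rows.append((f"w/o {stage}", iou, full - iou))  # delta vs full head
    return rows
```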
Referee: [§3.4 and §4.3] The single-image self-supervised strategy is asserted to produce features that generalize to cross-view Ego-Exo4D tasks without paired data, but the experiments contain no proxy evaluation (e.g., attention map consistency on held-out paired views) or direct comparison against a supervised VGGT fine-tuning baseline to confirm that the self-supervised objective implicitly captures cross-view geometry.
Authors: We will add a proxy evaluation in the revised §4.3 consisting of attention map consistency analysis on held-out paired views from Ego-Exo4D to show that the self-supervised pretraining captures cross-view geometry. A direct comparison to supervised VGGT fine-tuning is not possible without paired annotations, which our correspondence-free approach is designed to avoid; the reported outperformance over most supervised baselines on the benchmark provides supporting evidence for generalization. Revision: partial.
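One plausible form of the attention-consistency proxy is cosine similarity between an egocentric attention map and the exocentric map warped into the ego frame. This is our hypothetical reading of the metric, not the authors' definition:

```python
import numpy as np

def attention_consistency(attn_ego, attn_exo_warped):
    """Hypothetical proxy: cosine similarity between an egocentric
    attention map and the exocentric map warped into the ego frame.
    Values near 1 suggest the features agree across views."""
    a = np.asarray(attn_ego, dtype=float).ravel()
    b = np.asarray(attn_exo_warped, dtype=float).ravel()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0
```

Averaging this score over held-out paired views would give the benchmark-free evidence the referee asks for.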
Circularity Check
No significant circularity; empirical results on external benchmark
Full rationale
The paper's chain consists of identifying a limitation in prior VGGT (projection drift at pixel level), proposing a new three-stage Union Segmentation Head plus single-image self-supervised pretraining to address it, and reporting measured IoU on the public Ego-Exo4D benchmark. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would make the SOTA numbers or generalization claims equivalent to the inputs by construction. The benchmark evaluation is independent and falsifiable.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: VGGT provides consistent object-level attention and robust cross-view feature representation despite pixel-level projection drift.
Invented entities (1)
- Union Segmentation Head (no independent evidence)