LangFlash: Feed-forward 3D Language Gaussian Splatting from Sparse Unposed Images

Chen Zhu-Tian; Hanspeter Pfister; Wanhua Li; Yilong Liu

arxiv: 2605.23287 · v1 · pith:WP6LWR7Onew · submitted 2026-05-22 · 💻 cs.CV

LangFlash: Feed-forward 3D Language Gaussian Splatting from Sparse Unposed Images

Yilong Liu , Wanhua Li , Chen Zhu-Tian , Hanspeter Pfister This is my paper

Pith reviewed 2026-05-25 04:23 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D Gaussian splattinglanguage groundingfeed-forward reconstructionsemantic featuresnovel view synthesisunposed imagesmultimodal scene understandingsparse view reconstruction

0 comments

The pith

LangFlash predicts 3D geometry and language semantics from sparse unposed images in one forward pass using Gaussian splatting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

LangFlash introduces a feed-forward method to create 3D scene representations from a few photos without known camera positions. It uses Gaussian primitives that carry both geometric and semantic language information. This allows quick reconstruction and consistent understanding of scenes with language labels. The approach avoids slow optimization steps common in previous 3D methods. By enriching training data with semantic labels, it supports large-scale learning of these representations.

Core claim

LangFlash directly predicts the geometry and semantics in a single forward pass from sparse unposed multi-view images, using Gaussian primitives enriched with language-aligned semantic features via a sparse semantic encoding scheme that combines a global semantic dictionary with locally varying per-primitive weights, and achieves superior novel view synthesis and semantic consistency.

What carries the argument

Feed-forward prediction of language-enriched 3D Gaussian primitives with a sparse semantic encoding scheme combining a global dictionary and per-primitive weights.

If this is right

LangFlash enables low-latency 3D reconstruction without iterative optimization.
The predicted features support language-consistent scene understanding.
The method achieves superior novel view synthesis and semantic consistency compared with prior approaches.
It establishes a new paradigm for pose-free, language-grounded 3D scene reconstruction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The encoding approach could scale to larger scenes by limiting memory use per primitive.
Integration with video sequences might extend the method to dynamic environments.
The single-pass design opens direct use in settings requiring immediate multimodal scene output.

Load-bearing premise

The sparse semantic encoding scheme combining a global semantic dictionary with locally varying per-primitive weights preserves high-level linguistic information while reducing representation complexity.

What would settle it

Evaluating whether the predicted semantic features maintain consistency with ground-truth language annotations across novel views on held-out scenes with sparse inputs would confirm or refute the central claim.

Figures

Figures reproduced from arXiv: 2605.23287 by Chen Zhu-Tian, Hanspeter Pfister, Wanhua Li, Yilong Liu.

**Figure 1.** Figure 1: LangFlash reconstructs 3D semantic Gaussian fields directly from sparse unposed multi-view images in a single forward pass. [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: The overall architecture of LangFlash is illustrated as follows. Features extracted by the shared image encoders are first passed [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Language-based 3D Segmentation Comparison on ScanNet [ [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative results on Re10K. We further visualize both the segmentation and novel-view synthesis results, which demonstrate [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Additional qualitative results on RE10k. We visualize both the semantic and novel-view synthesis results. [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

read the original abstract

We present LangFlash, a feed-forward framework for 3D Language Gaussian Splatting that reconstructs 3D scenes parameterized by Gaussian primitives enriched with language-aligned semantic features from sparse unposed multi-view images. Unlike optimization-based 3D methods, LangFlash directly predicts the geometry and semantics in a single forward pass, enabling low-latency 3D reconstruction and language-consistent scene understanding. To support large-scale training, we enriched the RealEstate10k dataset with coherent and dense semantic information for 3D semantic supervision. Furthermore, we propose a sparse semantic encoding scheme that combines a global semantic dictionary with locally varying per-primitive weights, preserving high-level linguistic information, while reducing representation complexity. Experimental results show that LangFlash achieves superior novel view synthesis and semantic consistency compared with previous methods. This study establishes a new paradigm for pose-free, language-grounded 3D scene reconstruction, advancing generalizable 3D vision and multimodal scene understanding. Demo is available at https://liylo.github.io/langflash.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

LangFlash shifts 3D language Gaussian splatting to a feed-forward setup from unposed sparse views, with the sparse semantic encoding as the main practical move.

read the letter

LangFlash claims a feed-forward network that outputs 3D Gaussians carrying both geometry and language-aligned features directly from a handful of unposed images. The shift away from per-scene optimization is the central change, and they back it with an enriched RealEstate10k dataset plus a sparse encoding that uses one global semantic dictionary and per-primitive weights to cut complexity while keeping the language signal usable for training at scale. The reported experiments show gains on novel view synthesis and semantic consistency over earlier methods. Those pieces are the actual additions worth noting. The encoding trick looks like a workable way to make language supervision tractable without blowing up the representation size. The feed-forward framing addresses a real latency issue in multimodal 3D work. The main soft spot is that the abstract stays high-level on numbers, architecture, and loss details, so it is still unclear how large the gains are or how much they trace to the new encoding versus other design choices. If the full paper has tight ablations and clear baselines, that would tighten the case; without them the performance claims stay hard to weigh. The semantic dictionary assumption is an implementation detail rather than a load-bearing flaw. This paper is aimed at groups working on generalizable 3D reconstruction that also want language grounding, especially for robotics or AR settings that need speed. Readers who track feed-forward NeRF or 3DGS papers will get the most out of it. It deserves a serious referee because the direction is relevant and the feed-forward angle has enough novelty to justify review time, even if the writeup needs more concrete evidence.

Referee Report

2 major / 0 minor

Summary. The paper presents LangFlash, a feed-forward framework for 3D Language Gaussian Splatting that reconstructs 3D scenes parameterized by Gaussian primitives enriched with language-aligned semantic features from sparse unposed multi-view images. It claims to directly predict geometry and semantics in a single forward pass for low-latency reconstruction and language-consistent scene understanding. The work enriches the RealEstate10k dataset with coherent semantic information for supervision and proposes a sparse semantic encoding scheme combining a global semantic dictionary with locally varying per-primitive weights. Experimental results are stated to show superior novel view synthesis and semantic consistency compared to previous methods, establishing a new paradigm for pose-free, language-grounded 3D scene reconstruction.

Significance. If the central claims hold with supporting evidence, the work would be significant for advancing generalizable 3D vision and multimodal scene understanding by shifting from optimization-based to feed-forward methods that integrate language features directly into Gaussian primitives. The single-pass prediction and dataset enrichment could enable practical low-latency applications, with the sparse encoding potentially aiding scalability.

major comments (2)

[Abstract] Abstract: the central claims of 'superior novel view synthesis and semantic consistency' and 'establishes a new paradigm' are presented without any quantitative results, tables, figures, derivations, or experimental details, rendering it impossible to assess whether the data supports the claims or to verify the feed-forward prediction mechanism.
[Abstract] Abstract: the assumption that the proposed sparse semantic encoding scheme (global dictionary + per-primitive weights) preserves high-level linguistic information while reducing complexity is stated but not supported by any formulation, ablation, or analysis showing it is load-bearing for the language-consistent understanding claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments on the abstract. We address each point below and indicate where revisions to the manuscript are planned.

read point-by-point responses

Referee: [Abstract] Abstract: the central claims of 'superior novel view synthesis and semantic consistency' and 'establishes a new paradigm' are presented without any quantitative results, tables, figures, derivations, or experimental details, rendering it impossible to assess whether the data supports the claims or to verify the feed-forward prediction mechanism.

Authors: We agree that the abstract presents high-level claims without embedding specific quantitative results or experimental details. This follows the conventional structure of abstracts, which prioritize brevity while directing readers to the full evidence in the manuscript body. Quantitative comparisons for novel view synthesis (PSNR, SSIM, LPIPS) and semantic consistency metrics appear in Tables 1–3 and Figures 4–6 of Section 4, with the feed-forward architecture and single-pass prediction detailed in Section 3.1. To improve clarity, we will revise the abstract to incorporate one or two key quantitative highlights (e.g., average PSNR gains) within length constraints. revision: yes
Referee: [Abstract] Abstract: the assumption that the proposed sparse semantic encoding scheme (global dictionary + per-primitive weights) preserves high-level linguistic information while reducing complexity is stated but not supported by any formulation, ablation, or analysis showing it is load-bearing for the language-consistent understanding claim.

Authors: The abstract provides a high-level summary of the sparse semantic encoding. The full formulation (global dictionary combined with per-primitive weights) is given in Section 3.2, including the mathematical definition and complexity analysis. Section 4.3 contains ablation studies that isolate the contribution of this scheme, demonstrating its role in maintaining semantic consistency while lowering representation size. These results directly support the language-consistent understanding claim. We will revise the abstract wording to more precisely reflect that the scheme is validated in the main text, without adding unsupported assertions. revision: partial

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The provided abstract and description present LangFlash as a standard feed-forward neural network that directly predicts 3D Gaussian primitives with language features from unposed images. It relies on dataset enrichment for supervision and a proposed sparse encoding scheme as implementation choices. No equations, derivation steps, predictions, or self-citations are shown that reduce any claimed output to its inputs by construction. The central claim of single-pass prediction is independent of the outputs and follows typical supervised learning patterns, rendering the approach self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Abstract-only review limits identification of additional parameters or axioms; the encoding scheme is highlighted as a key innovation.

invented entities (1)

sparse semantic encoding scheme no independent evidence
purpose: combines a global semantic dictionary with locally varying per-primitive weights to preserve high-level linguistic information while reducing representation complexity
Described in the abstract as a proposed component of the framework.

pith-pipeline@v0.9.0 · 5721 in / 1287 out tokens · 64292 ms · 2026-05-25T04:23:57.887309+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 6 internal anchors

[1]

End-to- end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- end object detection with transformers. InEuropean confer- ence on computer vision, pages 213–229. Springer, 2020. 3, 4

work page 2020
[2]

pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19457–19467, 2024. 1, 2, 6

work page 2024
[3]

Sl- gaussian: Fast language gaussian splatting in sparse views

Kangjie Chen, BingQuan Dai, Minghan Qin, Dongbin Zhang, Peihao Li, Yingshuang Zou, and Haoqian Wang. Sl- gaussian: Fast language gaussian splatting in sparse views. InProceedings of the 33rd ACM International Conference on Multimedia, pages 3047–3056, 2025. 2

work page 2025
[4]

Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. InEuropean conference on computer vision, pages 370–386. Springer, 2024. 1, 2

work page 2024
[5]

Per- pixel classification is not all you need for semantic segmen- tation.Advances in neural information processing systems, 34:17864–17875, 2021

Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per- pixel classification is not all you need for semantic segmen- tation.Advances in neural information processing systems, 34:17864–17875, 2021. 3

work page 2021
[6]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 5, 6, 7, 8

work page 2017
[7]

A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022

Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan. A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022. 1

work page 2022
[8]

Momentum-gs: Momentum gaussian self-distillation for high-quality large scene reconstruction

Jixuan Fan, Wanhua Li, Yifei Han, Tianru Dai, and Yansong Tang. Momentum-gs: Momentum gaussian self-distillation for high-quality large scene reconstruction. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 25250–25260, 2025. 1

work page 2025
[9]

Large spatial model: End-to-end unposed images to semantic 3d.Advances in neural information processing systems, 37:40212–40229,

Zhiwen Fan, Jian Zhang, Wenyan Cong, Peihao Wang, Renjie Li, Kairun Wen, Shijie Zhou, Achuta Kadambi, Zhangyang Wang, Danfei Xu, et al. Large spatial model: End-to-end unposed images to semantic 3d.Advances in neural information processing systems, 37:40212–40229,

work page
[10]

Pe3r: Perception-efficient 3d reconstruction.arXiv preprint arXiv:2503.07507, 2025

Jie Hu, Shizun Wang, and Xinchao Wang. Pe3r: Perception-efficient 3d reconstruction.arXiv preprint arXiv:2503.07507, 2025. 2

work page arXiv 2025
[11]

Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.ACM Transactions on Graphics (TOG), 44(6):1–16, 2025

Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.ACM Transactions on Graphics (TOG), 44(6):1–16, 2025. 1, 2

work page 2025
[12]

Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.ACM Transactions on Graphics (TOG), 44(6):1–16, 2025

Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.ACM Transactions on Graphics (TOG), 44(6):1–16, 2025. 2

work page 2025
[13]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1,

work page
[14]

Lerf: Language embedded radiance fields

Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. InProceedings of the IEEE/CVF interna- tional conference on computer vision, pages 19729–19739,

work page
[15]

Garfield: Group anything with radiance fields

Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken Gold- berg, Matthew Tancik, and Angjoo Kanazawa. Garfield: Group anything with radiance fields. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21530–21539, 2024. 2, 3

work page 2024
[16]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 4, 5

work page 2023
[17]

Decomposing nerf for editing via feature field distil- lation.Advances in neural information processing systems, 35:23311–23330, 2022

Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitz- mann. Decomposing nerf for editing via feature field distil- lation.Advances in neural information processing systems, 35:23311–23330, 2022. 6, 7

work page 2022
[18]

Ground- ing image matching in 3d with mast3r

Vincent Leroy, Yohann Cabon, and J´erˆome Revaud. Ground- ing image matching in 3d with mast3r. InEuropean confer- ence on computer vision, pages 71–91. Springer, 2024. 2

work page 2024
[19]

Language-driven Semantic Segmentation

Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Ren ´e Ranftl. Language-driven semantic seg- mentation.arXiv preprint arXiv:2201.03546, 2022. 1, 3, 4, 6, 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 2022
[20]

Mask dino: Towards a unified transformer-based framework for object detection and segmentation

Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 3041–3050, 2023. 3

work page 2023
[21]

Semanticsplat: Feed-forward 3d scene understanding with language-aware gaussian fields

Qijing Li, Jingxiang Sun, Liang An, Zhaoqi Su, Hongwen Zhang, and Yebin Liu. Semanticsplat: Feed-forward 3d scene understanding with language-aware gaussian fields. arXiv preprint arXiv:2506.09565, 2025. 2

work page arXiv 2025
[22]

Langsplatv2: High- dimensional 3d language gaussian splatting with 450+ fps

Wanhua Li, Yujie Zhao, Minghan Qin, Yang Liu, Yuanhao Cai, Chuang Gan, and Hanspeter Pfister. Langsplatv2: High- dimensional 3d language gaussian splatting with 450+ fps. arXiv preprint arXiv:2507.07136, 2025. 2

work page arXiv 2025
[23]

4d langsplat: 4d language gaussian splatting via multimodal large language models

Wanhua Li, Renping Zhou, Jiawei Zhou, Yingwei Song, Jo- hannes Herter, Minghan Qin, Gao Huang, and Hanspeter Pfister. 4d langsplat: 4d language gaussian splatting via multimodal large language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22001–22011, 2025. 2

work page 2025
[24]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. InPro- ceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. 4

work page 2017
[25]

Semantic ray: Learning a generalizable semantic field with cross-reprojection attention

Fangfu Liu, Chubin Zhang, Yu Zheng, and Yueqi Duan. Semantic ray: Learning a generalizable semantic field with cross-reprojection attention. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17386–17396, 2023. 2, 3

work page 2023
[26]

Weakly supervised 3d open- vocabulary segmentation.Advances in Neural Information Processing Systems, 36:53433–53456, 2023

Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu, Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, and Shijian Lu. Weakly supervised 3d open- vocabulary segmentation.Advances in Neural Information Processing Systems, 36:53433–53456, 2023. 5, 7

work page 2023
[27]

Dab-detr: Dynamic anchor boxes are better queries for detr.arXiv preprint arXiv:2201.12329, 2022

Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. Dab-detr: Dynamic anchor boxes are better queries for detr.arXiv preprint arXiv:2201.12329, 2022. 3

work page arXiv 2022
[28]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuro- pean conference on computer vision, pages 38–55. Springer,

work page
[29]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 6

work page internal anchor Pith review Pith/arXiv arXiv 2017
[30]

Open- vocabulary one-stage detection with hierarchical visual- language knowledge distillation

Zongyang Ma, Guan Luo, Jin Gao, Liang Li, Yuxin Chen, Shaoru Wang, Congxuan Zhang, and Weiming Hu. Open- vocabulary one-stage detection with hierarchical visual- language knowledge distillation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14074–14083, 2022. 1

work page 2022
[31]

Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 1, 2

work page 2021
[32]

V-net: Fully convolutional neural networks for volumetric medical image segmentation

Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In2016 fourth international conference on 3D vision (3DV), pages 565–571. Ieee, 2016. 4

work page 2016
[33]

Langsplat: 3d language gaussian splatting

Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20051–20060, 2024. 2, 3

work page 2024
[34]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 5

work page 2021
[35]

Vi- sion transformers for dense prediction

Ren ´e Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion transformers for dense prediction. InProceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021. 3

work page 2021
[36]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

Brandon Smart, Chuanxia Zheng, Iro Laina, and Vic- tor Adrian Prisacariu. Splatt3r: zero-shot gaussian splat- ting from uncalibrated image pairs (2024).URL https://arxiv. org/abs/2408.13912, 2024. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

Block-nerf: Scalable large scene neural view synthesis

Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Prad- han, Ben Mildenhall, Pratul P Srinivasan, Jonathan T Barron, and Henrik Kretzschmar. Block-nerf: Scalable large scene neural view synthesis. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 8248–8258, 2022. 1

work page 2022
[39]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 3

work page 2017
[40]

Vggt: Vi- sual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025. 2

work page 2025
[41]

Dust3r: Geometric 3d vi- sion made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vi- sion made easy. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697– 20709, 2024. 2

work page 2024
[42]

Freesplat: Generalizable 3d gaussian splatting towards free view synthesis of indoor scenes.Advances in Neural Information Processing Systems, 37:107326–107349, 2024

Yunsong Wang, Tianxin Huang, Hanlin Chen, and Gim Hee Lee. Freesplat: Generalizable 3d gaussian splatting towards free view synthesis of indoor scenes.Advances in Neural Information Processing Systems, 37:107326–107349, 2024. 2

work page 2024
[43]

Croco v2: Improved cross-view completion pre- training for stereo matching and optical flow

Philippe Weinzaepfel, Thomas Lucas, Vincent Leroy, Yohann Cabon, Vaibhav Arora, Romain Br ´egier, Gabriela Csurka, Leonid Antsfeld, Boris Chidlovskii, and J ´erˆome Revaud. Croco v2: Improved cross-view completion pre- training for stereo matching and optical flow. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 17969–17...

work page 2023
[44]

Point transformer v2: Grouped vector atten- tion and partition-based pooling.Advances in Neural Infor- mation Processing Systems, 35:33330–33342, 2022

Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, and Heng- shuang Zhao. Point transformer v2: Grouped vector atten- tion and partition-based pooling.Advances in Neural Infor- mation Processing Systems, 35:33330–33342, 2022. 7

work page 2022
[45]

Point transformer v3: Simpler faster stronger

Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xi- hui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point transformer v3: Simpler faster stronger. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4840–4851, 2024. 7

work page 2024
[46]

Opengaussian: Towards point-level 3d gaussian-based open vocabulary understanding.Advances in Neural Information Processing Systems, 37:19114–19138,

Yanmin Wu, Jiarui Meng, Haijie Li, Chenming Wu, Yahao Shi, Xinhua Cheng, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, et al. Opengaussian: Towards point-level 3d gaussian-based open vocabulary understanding.Advances in Neural Information Processing Systems, 37:19114–19138,

work page
[47]

Open-vocabulary panop- tic segmentation with text-to-image diffusion models

Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiao- long Wang, and Shalini De Mello. Open-vocabulary panop- tic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 2955–2966, 2023. 7

work page 2023
[48]

Multi-modal queried object detection in the wild.Advances in Neural Information Processing Systems, 36:4452–4469, 2023

Yifan Xu, Mengdan Zhang, Chaoyou Fu, Peixian Chen, Xi- aoshan Yang, Ke Li, and Changsheng Xu. Multi-modal queried object detection in the wild.Advances in Neural Information Processing Systems, 36:4452–4469, 2023. 1

work page 2023
[49]

No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images

Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images.arXiv preprint arXiv:2410.24207, 2024. 2, 3, 5

work page arXiv 2024
[50]

Gaussian grouping: Segment and edit anything in 3d scenes

Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke. Gaussian grouping: Segment and edit anything in 3d scenes. InEuropean conference on computer vision, pages 162–179. Springer, 2024. 2, 3

work page 2024
[51]

Open-vocabulary detr with conditional matching

Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. Open-vocabulary detr with conditional matching. InEuropean conference on computer vision, pages 106–122. Springer, 2022. 1

work page 2022
[52]

Pansplat: 4k panorama synthesis with feed-forward gaussian splatting

Cheng Zhang, Haofei Xu, Qianyi Wu, Camilo Cruz Gam- bardella, Dinh Phung, and Jianfei Cai. Pansplat: 4k panorama synthesis with feed-forward gaussian splatting. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11437–11447, 2025. 1, 2

work page 2025
[53]

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection.arXiv preprint arXiv:2203.03605, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022
[54]

Point transformer

Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 16259–16268, 2021. 7

work page 2021
[55]

In-place scene labelling and understanding with implicit scene representation

Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, and An- drew J Davison. In-place scene labelling and understanding with implicit scene representation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15838–15847, 2021. 2, 3

work page 2021
[56]

Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields

Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Ze- hao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024. 2, 3, 6

work page 2024
[57]

Stereo Magnification: Learning View Synthesis using Multiplane Images

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018. 2, 5, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2018
[58]

3d gaussian splatting in robotics: A survey

Siting Zhu, Guangming Wang, Xin Kong, Dezhi Kong, and Hesheng Wang. 3d gaussian splatting in robotics: A survey. arXiv preprint arXiv:2410.12262, 2024. 1 LangFlash: Feed-forward 3D Language Gaussian Splatting from Sparse Unposed Images Supplementary Material Table 6. Statistics of the processed RE10k dataset. Metric Value Total frames∼6M Number of scenes∼...

work page arXiv 2024
[59]

5 provide a qualitative overview of our performance on the RE10k dataset

RE10k Qualitative visualizations The visual results shown in Fig. 5 provide a qualitative overview of our performance on the RE10k dataset. These examples were selected to emphasize the characteristic challenges in the dataset: numerous small and overlapping object instances, wide lighting variation, and strong view- point changes that stress both 2D segm...

work page
[60]

6) summarizes the primary corpus- level statistics of the processed RE10k split used in this study

RE10k Dataset statistics The table above (Tab. 6) summarizes the primary corpus- level statistics of the processed RE10k split used in this study. In total, we retained approximately six million frames across roughly ten thousand scenes; on average, each image contained dozens of instance masks, with mask pixels covering the majority of the image area. Th...

work page
[61]

RE10k 3D semantic segmentation In addition to 4, we annotated five previously unseen scenes and report the per-scene mIoU as well as the average over- all score in Tab. 7. The baseline methods (LSeg and LSM) struggled on several scenes, whereas our method achieved substantially higher per-scene and overall mIoU, indicating more consistent cross-view seman...

work page

[1] [1]

End-to- end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- end object detection with transformers. InEuropean confer- ence on computer vision, pages 213–229. Springer, 2020. 3, 4

work page 2020

[2] [2]

pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19457–19467, 2024. 1, 2, 6

work page 2024

[3] [3]

Sl- gaussian: Fast language gaussian splatting in sparse views

Kangjie Chen, BingQuan Dai, Minghan Qin, Dongbin Zhang, Peihao Li, Yingshuang Zou, and Haoqian Wang. Sl- gaussian: Fast language gaussian splatting in sparse views. InProceedings of the 33rd ACM International Conference on Multimedia, pages 3047–3056, 2025. 2

work page 2025

[4] [4]

Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. InEuropean conference on computer vision, pages 370–386. Springer, 2024. 1, 2

work page 2024

[5] [5]

Per- pixel classification is not all you need for semantic segmen- tation.Advances in neural information processing systems, 34:17864–17875, 2021

Bowen Cheng, Alex Schwing, and Alexander Kirillov. Per- pixel classification is not all you need for semantic segmen- tation.Advances in neural information processing systems, 34:17864–17875, 2021. 3

work page 2021

[6] [6]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 5, 6, 7, 8

work page 2017

[7] [7]

A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022

Jiafei Duan, Samson Yu, Hui Li Tan, Hongyuan Zhu, and Cheston Tan. A survey of embodied ai: From simulators to research tasks.IEEE Transactions on Emerging Topics in Computational Intelligence, 6(2):230–244, 2022. 1

work page 2022

[8] [8]

Momentum-gs: Momentum gaussian self-distillation for high-quality large scene reconstruction

Jixuan Fan, Wanhua Li, Yifei Han, Tianru Dai, and Yansong Tang. Momentum-gs: Momentum gaussian self-distillation for high-quality large scene reconstruction. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 25250–25260, 2025. 1

work page 2025

[9] [9]

Large spatial model: End-to-end unposed images to semantic 3d.Advances in neural information processing systems, 37:40212–40229,

Zhiwen Fan, Jian Zhang, Wenyan Cong, Peihao Wang, Renjie Li, Kairun Wen, Shijie Zhou, Achuta Kadambi, Zhangyang Wang, Danfei Xu, et al. Large spatial model: End-to-end unposed images to semantic 3d.Advances in neural information processing systems, 37:40212–40229,

work page

[10] [10]

Pe3r: Perception-efficient 3d reconstruction.arXiv preprint arXiv:2503.07507, 2025

Jie Hu, Shizun Wang, and Xinchao Wang. Pe3r: Perception-efficient 3d reconstruction.arXiv preprint arXiv:2503.07507, 2025. 2

work page arXiv 2025

[11] [11]

Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.ACM Transactions on Graphics (TOG), 44(6):1–16, 2025

Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.ACM Transactions on Graphics (TOG), 44(6):1–16, 2025. 1, 2

work page 2025

[12] [12]

Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.ACM Transactions on Graphics (TOG), 44(6):1–16, 2025

Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.ACM Transactions on Graphics (TOG), 44(6):1–16, 2025. 2

work page 2025

[13] [13]

3d gaussian splatting for real-time radiance field rendering.ACM Trans

Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, George Drettakis, et al. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1,

work page

[14] [14]

Lerf: Language embedded radiance fields

Justin Kerr, Chung Min Kim, Ken Goldberg, Angjoo Kanazawa, and Matthew Tancik. Lerf: Language embedded radiance fields. InProceedings of the IEEE/CVF interna- tional conference on computer vision, pages 19729–19739,

work page

[15] [15]

Garfield: Group anything with radiance fields

Chung Min Kim, Mingxuan Wu, Justin Kerr, Ken Gold- berg, Matthew Tancik, and Angjoo Kanazawa. Garfield: Group anything with radiance fields. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21530–21539, 2024. 2, 3

work page 2024

[16] [16]

Segment any- thing

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 4, 5

work page 2023

[17] [17]

Decomposing nerf for editing via feature field distil- lation.Advances in neural information processing systems, 35:23311–23330, 2022

Sosuke Kobayashi, Eiichi Matsumoto, and Vincent Sitz- mann. Decomposing nerf for editing via feature field distil- lation.Advances in neural information processing systems, 35:23311–23330, 2022. 6, 7

work page 2022

[18] [18]

Ground- ing image matching in 3d with mast3r

Vincent Leroy, Yohann Cabon, and J´erˆome Revaud. Ground- ing image matching in 3d with mast3r. InEuropean confer- ence on computer vision, pages 71–91. Springer, 2024. 2

work page 2024

[19] [19]

Language-driven Semantic Segmentation

Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Ren ´e Ranftl. Language-driven semantic seg- mentation.arXiv preprint arXiv:2201.03546, 2022. 1, 3, 4, 6, 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 2022

[20] [20]

Mask dino: Towards a unified transformer-based framework for object detection and segmentation

Feng Li, Hao Zhang, Huaizhe Xu, Shilong Liu, Lei Zhang, Lionel M Ni, and Heung-Yeung Shum. Mask dino: Towards a unified transformer-based framework for object detection and segmentation. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 3041–3050, 2023. 3

work page 2023

[21] [21]

Semanticsplat: Feed-forward 3d scene understanding with language-aware gaussian fields

Qijing Li, Jingxiang Sun, Liang An, Zhaoqi Su, Hongwen Zhang, and Yebin Liu. Semanticsplat: Feed-forward 3d scene understanding with language-aware gaussian fields. arXiv preprint arXiv:2506.09565, 2025. 2

work page arXiv 2025

[22] [22]

Langsplatv2: High- dimensional 3d language gaussian splatting with 450+ fps

Wanhua Li, Yujie Zhao, Minghan Qin, Yang Liu, Yuanhao Cai, Chuang Gan, and Hanspeter Pfister. Langsplatv2: High- dimensional 3d language gaussian splatting with 450+ fps. arXiv preprint arXiv:2507.07136, 2025. 2

work page arXiv 2025

[23] [23]

4d langsplat: 4d language gaussian splatting via multimodal large language models

Wanhua Li, Renping Zhou, Jiawei Zhou, Yingwei Song, Jo- hannes Herter, Minghan Qin, Gao Huang, and Hanspeter Pfister. 4d langsplat: 4d language gaussian splatting via multimodal large language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 22001–22011, 2025. 2

work page 2025

[24] [24]

Focal loss for dense object detection

Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Doll´ar. Focal loss for dense object detection. InPro- ceedings of the IEEE international conference on computer vision, pages 2980–2988, 2017. 4

work page 2017

[25] [25]

Semantic ray: Learning a generalizable semantic field with cross-reprojection attention

Fangfu Liu, Chubin Zhang, Yu Zheng, and Yueqi Duan. Semantic ray: Learning a generalizable semantic field with cross-reprojection attention. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17386–17396, 2023. 2, 3

work page 2023

[26] [26]

Weakly supervised 3d open- vocabulary segmentation.Advances in Neural Information Processing Systems, 36:53433–53456, 2023

Kunhao Liu, Fangneng Zhan, Jiahui Zhang, Muyu Xu, Yingchen Yu, Abdulmotaleb El Saddik, Christian Theobalt, Eric Xing, and Shijian Lu. Weakly supervised 3d open- vocabulary segmentation.Advances in Neural Information Processing Systems, 36:53433–53456, 2023. 5, 7

work page 2023

[27] [27]

Dab-detr: Dynamic anchor boxes are better queries for detr.arXiv preprint arXiv:2201.12329, 2022

Shilong Liu, Feng Li, Hao Zhang, Xiao Yang, Xianbiao Qi, Hang Su, Jun Zhu, and Lei Zhang. Dab-detr: Dynamic anchor boxes are better queries for detr.arXiv preprint arXiv:2201.12329, 2022. 3

work page arXiv 2022

[28] [28]

Grounding dino: Marrying dino with grounded pre-training for open-set object detection

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuro- pean conference on computer vision, pages 38–55. Springer,

work page

[29] [29]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 6

work page internal anchor Pith review Pith/arXiv arXiv 2017

[30] [30]

Open- vocabulary one-stage detection with hierarchical visual- language knowledge distillation

Zongyang Ma, Guan Luo, Jin Gao, Liang Li, Yuxin Chen, Shaoru Wang, Congxuan Zhang, and Weiming Hu. Open- vocabulary one-stage detection with hierarchical visual- language knowledge distillation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14074–14083, 2022. 1

work page 2022

[31] [31]

Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021

Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view syn- thesis.Communications of the ACM, 65(1):99–106, 2021. 1, 2

work page 2021

[32] [32]

V-net: Fully convolutional neural networks for volumetric medical image segmentation

Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In2016 fourth international conference on 3D vision (3DV), pages 565–571. Ieee, 2016. 4

work page 2016

[33] [33]

Langsplat: 3d language gaussian splatting

Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20051–20060, 2024. 2, 3

work page 2024

[34] [34]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 5

work page 2021

[35] [35]

Vi- sion transformers for dense prediction

Ren ´e Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion transformers for dense prediction. InProceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021. 3

work page 2021

[36] [36]

SAM 2: Segment Anything in Images and Videos

Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman R¨adle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos.arXiv preprint arXiv:2408.00714, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024

[37] [37]

Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

Brandon Smart, Chuanxia Zheng, Iro Laina, and Vic- tor Adrian Prisacariu. Splatt3r: zero-shot gaussian splat- ting from uncalibrated image pairs (2024).URL https://arxiv. org/abs/2408.13912, 2024. 1, 2

work page internal anchor Pith review Pith/arXiv arXiv 2024

[38] [38]

Block-nerf: Scalable large scene neural view synthesis

Matthew Tancik, Vincent Casser, Xinchen Yan, Sabeek Prad- han, Ben Mildenhall, Pratul P Srinivasan, Jonathan T Barron, and Henrik Kretzschmar. Block-nerf: Scalable large scene neural view synthesis. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 8248–8258, 2022. 1

work page 2022

[39] [39]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 3

work page 2017

[40] [40]

Vggt: Vi- sual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025. 2

work page 2025

[41] [41]

Dust3r: Geometric 3d vi- sion made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vi- sion made easy. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 20697– 20709, 2024. 2

work page 2024

[42] [42]

Freesplat: Generalizable 3d gaussian splatting towards free view synthesis of indoor scenes.Advances in Neural Information Processing Systems, 37:107326–107349, 2024

Yunsong Wang, Tianxin Huang, Hanlin Chen, and Gim Hee Lee. Freesplat: Generalizable 3d gaussian splatting towards free view synthesis of indoor scenes.Advances in Neural Information Processing Systems, 37:107326–107349, 2024. 2

work page 2024

[43] [43]

Croco v2: Improved cross-view completion pre- training for stereo matching and optical flow

Philippe Weinzaepfel, Thomas Lucas, Vincent Leroy, Yohann Cabon, Vaibhav Arora, Romain Br ´egier, Gabriela Csurka, Leonid Antsfeld, Boris Chidlovskii, and J ´erˆome Revaud. Croco v2: Improved cross-view completion pre- training for stereo matching and optical flow. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, pages 17969–17...

work page 2023

[44] [44]

Point transformer v2: Grouped vector atten- tion and partition-based pooling.Advances in Neural Infor- mation Processing Systems, 35:33330–33342, 2022

Xiaoyang Wu, Yixing Lao, Li Jiang, Xihui Liu, and Heng- shuang Zhao. Point transformer v2: Grouped vector atten- tion and partition-based pooling.Advances in Neural Infor- mation Processing Systems, 35:33330–33342, 2022. 7

work page 2022

[45] [45]

Point transformer v3: Simpler faster stronger

Xiaoyang Wu, Li Jiang, Peng-Shuai Wang, Zhijian Liu, Xi- hui Liu, Yu Qiao, Wanli Ouyang, Tong He, and Hengshuang Zhao. Point transformer v3: Simpler faster stronger. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4840–4851, 2024. 7

work page 2024

[46] [46]

Opengaussian: Towards point-level 3d gaussian-based open vocabulary understanding.Advances in Neural Information Processing Systems, 37:19114–19138,

Yanmin Wu, Jiarui Meng, Haijie Li, Chenming Wu, Yahao Shi, Xinhua Cheng, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, et al. Opengaussian: Towards point-level 3d gaussian-based open vocabulary understanding.Advances in Neural Information Processing Systems, 37:19114–19138,

work page

[47] [47]

Open-vocabulary panop- tic segmentation with text-to-image diffusion models

Jiarui Xu, Sifei Liu, Arash Vahdat, Wonmin Byeon, Xiao- long Wang, and Shalini De Mello. Open-vocabulary panop- tic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 2955–2966, 2023. 7

work page 2023

[48] [48]

Multi-modal queried object detection in the wild.Advances in Neural Information Processing Systems, 36:4452–4469, 2023

Yifan Xu, Mengdan Zhang, Chaoyou Fu, Peixian Chen, Xi- aoshan Yang, Ke Li, and Changsheng Xu. Multi-modal queried object detection in the wild.Advances in Neural Information Processing Systems, 36:4452–4469, 2023. 1

work page 2023

[49] [49]

No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images

Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images.arXiv preprint arXiv:2410.24207, 2024. 2, 3, 5

work page arXiv 2024

[50] [50]

Gaussian grouping: Segment and edit anything in 3d scenes

Mingqiao Ye, Martin Danelljan, Fisher Yu, and Lei Ke. Gaussian grouping: Segment and edit anything in 3d scenes. InEuropean conference on computer vision, pages 162–179. Springer, 2024. 2, 3

work page 2024

[51] [51]

Open-vocabulary detr with conditional matching

Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. Open-vocabulary detr with conditional matching. InEuropean conference on computer vision, pages 106–122. Springer, 2022. 1

work page 2022

[52] [52]

Pansplat: 4k panorama synthesis with feed-forward gaussian splatting

Cheng Zhang, Haofei Xu, Qianyi Wu, Camilo Cruz Gam- bardella, Dinh Phung, and Jianfei Cai. Pansplat: 4k panorama synthesis with feed-forward gaussian splatting. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11437–11447, 2025. 1, 2

work page 2025

[53] [53]

DINO: DETR with Improved DeNoising Anchor Boxes for End-to-End Object Detection

Hao Zhang, Feng Li, Shilong Liu, Lei Zhang, Hang Su, Jun Zhu, Lionel M Ni, and Heung-Yeung Shum. Dino: Detr with improved denoising anchor boxes for end-to-end object detection.arXiv preprint arXiv:2203.03605, 2022. 3

work page internal anchor Pith review Pith/arXiv arXiv 2022

[54] [54]

Point transformer

Hengshuang Zhao, Li Jiang, Jiaya Jia, Philip HS Torr, and Vladlen Koltun. Point transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 16259–16268, 2021. 7

work page 2021

[55] [55]

In-place scene labelling and understanding with implicit scene representation

Shuaifeng Zhi, Tristan Laidlow, Stefan Leutenegger, and An- drew J Davison. In-place scene labelling and understanding with implicit scene representation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15838–15847, 2021. 2, 3

work page 2021

[56] [56]

Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields

Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Ze- hao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21676–21685, 2024. 2, 3, 6

work page 2024

[57] [57]

Stereo Magnification: Learning View Synthesis using Multiplane Images

Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images.arXiv preprint arXiv:1805.09817, 2018. 2, 5, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2018

[58] [58]

3d gaussian splatting in robotics: A survey

Siting Zhu, Guangming Wang, Xin Kong, Dezhi Kong, and Hesheng Wang. 3d gaussian splatting in robotics: A survey. arXiv preprint arXiv:2410.12262, 2024. 1 LangFlash: Feed-forward 3D Language Gaussian Splatting from Sparse Unposed Images Supplementary Material Table 6. Statistics of the processed RE10k dataset. Metric Value Total frames∼6M Number of scenes∼...

work page arXiv 2024

[59] [59]

5 provide a qualitative overview of our performance on the RE10k dataset

RE10k Qualitative visualizations The visual results shown in Fig. 5 provide a qualitative overview of our performance on the RE10k dataset. These examples were selected to emphasize the characteristic challenges in the dataset: numerous small and overlapping object instances, wide lighting variation, and strong view- point changes that stress both 2D segm...

work page

[60] [60]

6) summarizes the primary corpus- level statistics of the processed RE10k split used in this study

RE10k Dataset statistics The table above (Tab. 6) summarizes the primary corpus- level statistics of the processed RE10k split used in this study. In total, we retained approximately six million frames across roughly ten thousand scenes; on average, each image contained dozens of instance masks, with mask pixels covering the majority of the image area. Th...

work page

[61] [61]

RE10k 3D semantic segmentation In addition to 4, we annotated five previously unseen scenes and report the per-scene mIoU as well as the average over- all score in Tab. 7. The baseline methods (LSeg and LSM) struggled on several scenes, whereas our method achieved substantially higher per-scene and overall mIoU, indicating more consistent cross-view seman...

work page