Recognition: no theorem link
FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views via Compact Semantic Representation
Pith reviewed 2026-05-16 20:44 UTC · model grok-4.3
The pith
FLEG reconstructs language-embedded 3D Gaussians from arbitrary input views while storing only 5 percent of the language embeddings required by dense per-Gaussian schemes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FLEG is a feed-forward network that reconstructs language-embedded 3D Gaussians from arbitrary views. It introduces a geometry-semantic dual-branch distillation framework that enables flexible input from arbitrary multi-view images without camera parameters, a novel-view-based distillation strategy to mitigate overfitting, and a decoupled language embedding strategy that represents language information with a sparse set of semantic Gaussians using only 5 percent of the language embeddings while maintaining comparable semantic fidelity.
What carries the argument
geometry-semantic dual-branch distillation framework together with decoupled language embedding that assigns language information to a sparse set of semantic Gaussians rather than every Gaussian
If this is right
- Reconstruction becomes possible from any number of input views without camera parameters or fixed input counts.
- Storage for language information drops to 5 percent of that required by dense per-Gaussian schemes while semantic fidelity stays comparable.
- Both geometric reconstruction quality and language-aligned semantic performance exceed those of prior feed-forward methods.
- Novel-view distillation during training reduces overfitting to the specific input images supplied.
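The storage claim in these bullets can be made concrete with back-of-envelope arithmetic; the scene size, feature dimension, and precision below are illustrative assumptions, not figures from the paper:

```python
# Back-of-envelope storage comparison: dense per-Gaussian language
# embeddings vs. a decoupled sparse set. All constants are illustrative
# assumptions, not numbers reported in the paper.
N_GAUSSIANS = 1_000_000      # assumed scene size
EMBED_DIM = 512              # assumed CLIP-style feature dimension
BYTES_PER_FLOAT = 2          # float16 storage

dense_mb = N_GAUSSIANS * EMBED_DIM * BYTES_PER_FLOAT / 1e6
sparse_mb = 0.05 * dense_mb  # the paper's 5 percent figure

print(f"dense per-Gaussian embeddings: {dense_mb:.0f} MB")
print(f"5% decoupled semantic Gaussians: {sparse_mb:.1f} MB")
```

Under these assumptions the dense scheme stores about 1 GB of language features per scene, and the 5 percent budget cuts that to roughly 51 MB; the true savings depend on the actual Gaussian count and feature dimension FLEG uses.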
Where Pith is reading between the lines
- The same sparsity principle could be tested on other scene attributes such as material or lighting that are not uniformly dense.
- The reduced embedding count may make real-time language-guided 3D reconstruction practical on devices with limited memory.
- The method could be extended by letting the number of semantic Gaussians adapt automatically to scene complexity.
Load-bearing premise
Semantic information is sufficiently sparse that a small set of dedicated semantic Gaussians can represent language meaning across the entire scene without loss of fidelity compared to per-Gaussian embeddings.
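A minimal sketch of what this premise looks like in practice. The review does not specify FLEG's actual selection mechanism, so this toy version assumes the semantic Gaussians are a random subset of positions and that every Gaussian borrows the language feature of its nearest semantic Gaussian:

```python
import numpy as np

# Toy illustration of the decoupled-embedding premise: only a small set
# of "semantic Gaussians" stores language features; every other Gaussian
# shares its nearest neighbor's feature. Selection by random subset and
# nearest-neighbor lookup are assumptions for illustration only.
rng = np.random.default_rng(0)

N, M, D = 2000, 100, 64            # 100 / 2000 = 5% semantic Gaussians
centers = rng.normal(size=(N, 3))  # Gaussian positions
sem_idx = rng.choice(N, size=M, replace=False)
sem_centers = centers[sem_idx]
sem_feats = rng.normal(size=(M, D))  # the only stored language features
sem_feats /= np.linalg.norm(sem_feats, axis=1, keepdims=True)

# Each Gaussian is assigned the feature of its nearest semantic Gaussian,
# so M (not N) embeddings are ever materialized on disk.
d2 = ((centers[:, None, :] - sem_centers[None, :, :]) ** 2).sum(-1)
assign = d2.argmin(axis=1)         # (N,) index into the sparse set
feats = sem_feats[assign]          # (N, D) view, no N stored embeddings

# An open-vocabulary query scores every Gaussian by cosine similarity.
query = rng.normal(size=D)
query /= np.linalg.norm(query)
scores = feats @ query             # (N,) relevance per Gaussian
print(scores.shape, f"{M / N:.0%} of embeddings stored")
```

The premise holds exactly when this shared-feature approximation loses no queryable information, which is what the proposed ablations would need to measure.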
What would settle it
A controlled experiment that swaps the sparse semantic Gaussians for full per-Gaussian embeddings would settle it: if the dense variant achieves clearly higher semantic query accuracy or reconstruction metrics, the sparsity assumption does not hold.
Original abstract
We present FLEG, a feed-forward network that reconstructs language-embedded 3D Gaussians from arbitrary views. Previous feed-forward language-embedded Gaussian reconstruction methods are restricted to a fixed number of input views and typically attach a language-aligned semantic embedding to each Gaussian, resulting in impractical input settings and semantic redundancy. In contrast, we introduce a geometry-semantic dual-branch distillation framework that enables flexible input from arbitrary multi-view images without camera parameters. We also propose a novel-view-based distillation strategy during training that mitigates overfitting to input views. In addition, we observe that semantic representations are significantly sparser than geometric ones, and per-Gaussian language embedding is unnecessary. To exploit this sparsity, we design a decoupled language embedding strategy that represents language information with a sparse set of semantic Gaussians, rather than attaching embeddings to every Gaussian. Compared with dense pixel-aligned per-Gaussian embedding schemes, our method uses only 5% of the language embeddings while maintaining comparable semantic fidelity, effectively reducing storage costs. Extensive experiments demonstrate that FLEG outperforms state-of-the-art feed-forward reconstruction and language-embedded Gaussian methods in both reconstruction quality and language-aligned semantic representation. Project page: https://fangzhou2000.github.io/projects/fleg.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents FLEG, a feed-forward network for reconstructing language-embedded 3D Gaussians from arbitrary multi-view images without camera parameters. It introduces a geometry-semantic dual-branch distillation framework and a novel-view-based distillation strategy to mitigate overfitting. The core technical contribution is a decoupled language embedding approach that represents semantics via a sparse set of dedicated semantic Gaussians rather than per-Gaussian embeddings, exploiting the claimed sparsity of semantic information relative to geometry. The paper asserts that this uses only 5% of the language embeddings while preserving comparable semantic fidelity and that FLEG outperforms prior feed-forward reconstruction and language-embedded Gaussian methods on both geometric quality and language-aligned metrics.
Significance. If the experimental claims hold under rigorous validation, the work would be significant for efficient semantic 3D reconstruction. Reducing language embeddings to 5% while maintaining fidelity could substantially lower storage and compute costs for open-vocabulary 3D models, benefiting downstream tasks in robotics, AR, and scene understanding. The feed-forward arbitrary-view capability also advances generalizable 3D pipelines beyond fixed-view constraints.
Major comments (2)
- [Abstract] Abstract: The load-bearing claim that 'semantic representations are significantly sparser than geometric ones' and that a sparse set of semantic Gaussians suffices for 'comparable semantic fidelity' at 5% embeddings must be supported by explicit ablations. The manuscript should report language metrics (CLIP similarity, open-vocabulary segmentation) as a function of the number of semantic Gaussians across multiple scenes; without these, the risk that fine-grained language information is lost in complex scenes cannot be ruled out.
- [Experiments] Experiments section: The headline performance gains over state-of-the-art feed-forward and language-embedded Gaussian baselines require detailed tables with exact dataset splits, baseline implementations, and per-metric scores (PSNR, SSIM, LPIPS for geometry; CLIP-based metrics for semantics). The current abstract-level assertion is insufficient to confirm that the dual-branch distillation transfers all queryable language information without measurable degradation.
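Of the geometry metrics the referee names, PSNR at least is simple enough to pin down exactly; a minimal implementation for images scaled to [0, 1] (SSIM and LPIPS require dedicated implementations):

```python
import numpy as np

def psnr(pred: np.ndarray, target: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images scaled to [0, peak]."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

# A uniform error of 0.1 everywhere gives MSE = 0.01, i.e. PSNR = 20 dB.
img = np.zeros((4, 4, 3))
print(f"{psnr(img + 0.1, img):.1f} dB")
```

Reporting per-scene PSNR alongside SSIM and LPIPS, as the referee requests, would make the geometric comparison against baselines reproducible.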
Minor comments (2)
- [Method] Clarify the precise architecture of the dual-branch distillation (e.g., how gradients flow between geometry and semantic branches) and the selection mechanism for the sparse semantic Gaussians.
- [Figures] Ensure figures showing qualitative language alignment include side-by-side comparisons with dense per-Gaussian baselines on the same queries.
Simulated Author's Rebuttal
We sincerely thank the referee for the detailed and constructive feedback. We appreciate the recognition of the potential significance of our work in efficient semantic 3D reconstruction. We will address the major comments by providing the requested ablations and detailed experimental tables in the revised manuscript.
Point-by-point responses
Referee: [Abstract] Abstract: The load-bearing claim that 'semantic representations are significantly sparser than geometric ones' and that a sparse set of semantic Gaussians suffices for 'comparable semantic fidelity' at 5% embeddings must be supported by explicit ablations. The manuscript should report language metrics (CLIP similarity, open-vocabulary segmentation) as a function of the number of semantic Gaussians across multiple scenes; without these, the risk that fine-grained language information is lost in complex scenes cannot be ruled out.
Authors: We agree that explicit ablations are necessary to substantiate the sparsity claim and the sufficiency of the sparse semantic Gaussians. In the revised manuscript, we will include new ablation studies reporting CLIP similarity and open-vocabulary segmentation metrics as a function of the number of semantic Gaussians (e.g., varying from 1% to 20% of the original embeddings) across multiple scenes from the datasets. These will demonstrate that semantic fidelity is preserved at 5% without significant loss in fine-grained information, addressing the concern for complex scenes. revision: yes
Referee: [Experiments] Experiments section: The headline performance gains over state-of-the-art feed-forward and language-embedded Gaussian baselines require detailed tables with exact dataset splits, baseline implementations, and per-metric scores (PSNR, SSIM, LPIPS for geometry; CLIP-based metrics for semantics). The current abstract-level assertion is insufficient to confirm that the dual-branch distillation transfers all queryable language information without measurable degradation.
Authors: We acknowledge that the current presentation relies on abstract-level assertions and will enhance the Experiments section with comprehensive tables. These will detail exact dataset splits (e.g., train/test divisions for Replica, ScanNet, etc.), descriptions of baseline implementations (including how we reproduced prior methods), and full per-metric scores including PSNR, SSIM, LPIPS for geometry and CLIP-based metrics (such as CLIP similarity, open-vocabulary segmentation accuracy) for semantics. This will provide rigorous validation that the dual-branch distillation maintains language information fidelity without degradation. revision: yes
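The shape of the promised sweep can be sketched on synthetic data. Positions and features below are random stand-ins for a real scene, and the nearest-anchor assignment is an assumption, so the resulting numbers only illustrate the protocol (fidelity as a function of the embedding budget), not the paper's results:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 2000, 32
pos = rng.normal(size=(N, 3))
# Spatially smooth "ground-truth" features: a random linear map of
# position plus small noise, normalized for cosine comparison.
W = rng.normal(size=(3, D))
feats = pos @ W + 0.1 * rng.normal(size=(N, D))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

def fidelity(ratio: float) -> float:
    """Mean cosine similarity after replacing each feature with that of
    its nearest anchor, keeping only `ratio` of the embeddings."""
    m = max(1, int(ratio * N))
    idx = rng.choice(N, size=m, replace=False)
    d2 = ((pos[:, None] - pos[idx][None]) ** 2).sum(-1)
    recon = feats[idx][d2.argmin(axis=1)]
    return float((recon * feats).sum(axis=1).mean())

for r in (0.01, 0.05, 0.20):
    print(f"{r:.0%} embeddings -> mean cosine similarity {fidelity(r):.3f}")
```

On a real scene the y-axis would be CLIP similarity or open-vocabulary segmentation accuracy; the question the ablation answers is where this curve plateaus, and whether 5 percent sits on the plateau.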
Circularity Check
No significant circularity: new architecture with independent empirical basis
Full rationale
The paper presents FLEG as a novel feed-forward architecture using geometry-semantic dual-branch distillation, novel-view training, and a decoupled sparse semantic Gaussian representation. The central sparsity claim is framed as an empirical observation rather than a fitted parameter or self-referential derivation, and no equations reduce claimed performance metrics to inputs by construction. Self-citations, if present, are not load-bearing for the core claims, which rest on the proposed design choices and experimental validation against external baselines.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3674–3683, 2018.
- [2] Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Yuri Feigin, Peter Fu, Thomas Gebauer, Daniel Kurz, Tal Dimry, Brandon Joffe, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. In Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.
- [3] David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19457–19467, 2024.
- [4] Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. arXiv preprint arXiv:2403.14627, 2024.
- [5] Alan B Craig. Understanding augmented reality: Concepts and applications. 2013.
- [6] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- [7] Zhiwen Fan, Jian Zhang, Wenyan Cong, Peihao Wang, Renjie Li, Kairun Wen, Shijie Zhou, Achuta Kadambi, Zhangyang Wang, Danfei Xu, et al. Large spatial model: End-to-end unposed images to semantic 3d. Advances in Neural Information Processing Systems (NeurIPS), 37:40212–40229, 2025.
- [8] Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Visual language maps for robot navigation. In IEEE International Conference on Robotics and Automation (ICRA), pages 10608–10615. IEEE, 2023.
- [9] Yuzhou Ji, He Zhu, Junshu Tang, Wuyi Liu, Zhizhong Zhang, Xin Tan, and Yuan Xie. Fastlgs: Speeding up language embedded gaussians with feature grid mapping. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2025.
- [10] Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. arXiv preprint arXiv:2505.23716, 2025.
- [11] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139:1–139:14, 2023.
- [12] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European Conference on Computer Vision (ECCV), pages 71–91. Springer, 2024.
- [13] Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and René Ranftl. Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546, 2022.
- [14] Qijing Li, Jingxiang Sun, Liang An, Zhaoqi Su, Hongwen Zhang, and Yebin Liu. Semanticsplat: Feed-forward 3d scene understanding with language-aware gaussian fields. arXiv preprint arXiv:2506.09565, 2025.
- [15] Wanhua Li, Yujie Zhao, Minghan Qin, Yang Liu, Yuanhao Cai, Chuang Gan, and Hanspeter Pfister. Langsplatv2: High-dimensional 3d language gaussian splatting with 450+ fps. arXiv preprint arXiv:2507.07136, 2025.
- [16] Fangfu Liu, Hao Li, Jiawei Chi, Hanyang Wang, Minghui Yang, Fudong Wang, and Yueqi Duan. Langscene-x: Reconstruct generalizable 3d language-embedded scenes with trimap video diffusion. arXiv preprint arXiv:2507.02813, 2025.
- [17] Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- [18] Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.
- [19] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
- [20] Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20051–20060, 2024.
- [21] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
- [22] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12179–12188, 2021.
- [23] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. In International Conference on Learning Representations (ICLR), 2025.
- [24] William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, and Phillip Isola. Distilled feature fields enable few-shot language-guided manipulation. In Conference on Robot Learning (CoRL), pages 405–424. PMLR, 2023.
- [25] Yu Sheng, Jiajun Deng, Xinran Zhang, Yu Zhang, Bei Hua, Yanyong Zhang, and Jianmin Ji. Spatialsplat: Efficient semantic 3d from sparse unposed images. arXiv preprint arXiv:2505.23044, 2025.
- [26] Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao-Hua Guan. Language embedded 3d gaussians for open-vocabulary scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5333–5343, 2024.
- [27] Austin Stone, Ted Xiao, Yao Lu, Keerthana Gopalakrishnan, Kuang-Huei Lee, Quan Vuong, Paul Wohlhart, Sean Kirmani, Brianna Zitkovich, Fei Xia, et al. Open-world object manipulation using pre-trained vision-language models. In Conference on Robot Learning (CoRL), pages 3397–3417. PMLR, 2023.
- [28] Xiangyu Sun, Haoyi Jiang, Liu Liu, Seungtae Nam, Gyeongjin Kang, Xinjie Wang, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang, et al. Uni3r: Unified 3d reconstruction and semantic understanding via generalizable gaussian splatting from unposed multi-view images. arXiv preprint arXiv:2508.03643, 2025.
- [29] Qijian Tian, Xin Tan, Jingyu Gong, Yuan Xie, and Lizhuang Ma. Uniforward: Unified 3d scene and semantic field reconstruction via feed-forward gaussian splatting from only sparse-view images. arXiv preprint arXiv:2506.09378, 2025.
- [30] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [31] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [32] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20697–20709, 2024.
- [33] Xingrui Wang, Cuiling Lan, Hanxin Zhu, Zhibo Chen, and Yan Lu. Gsemsplat: Generalizable semantic 3d gaussian splatting from uncalibrated image pairs. arXiv preprint arXiv:2412.16932, 2024.
- [34] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12–22, 2023.
- [35] Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21676–21685, 2024.
- [36] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. ACM Transactions on Graphics (TOG), 37(4):1–12, 2018.
Discussion (0)