pith. machine review for the scientific record.

arxiv: 2512.17541 · v2 · submitted 2025-12-19 · 💻 cs.CV

Recognition: no theorem link

FLEG: Feed-Forward Language Embedded Gaussian Splatting from Any Views via Compact Semantic Representation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 20:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D Gaussian Splatting · language embeddings · feed-forward reconstruction · semantic representation · sparse embeddings · multi-view synthesis · novel view synthesis

The pith

FLEG reconstructs language-embedded 3D Gaussians from arbitrary input views while storing only about 5 percent of the language embeddings required by dense per-Gaussian schemes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FLEG as a feed-forward network that builds language-embedded 3D Gaussian scenes from any number of input images without needing camera parameters. It relies on a geometry-semantic dual-branch distillation framework plus a novel-view-based training strategy to handle flexible inputs and reduce overfitting to the supplied views. The core innovation exploits the observation that semantic information is far sparser than geometric information, so language features can be carried by a small set of dedicated semantic Gaussians rather than attached to every Gaussian. This decoupled strategy keeps semantic fidelity comparable to dense per-Gaussian embeddings while cutting embedding storage to roughly 5 percent. Experiments show the resulting models exceed prior feed-forward reconstruction and language-embedded Gaussian methods on both geometric quality and semantic alignment.
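
The decoupling is easy to picture in code. The sketch below is a hypothetical layout, not the paper's implementation: geometry Gaussians carry only rendering attributes, a much smaller pool of semantic Gaussians carries CLIP-aligned embeddings, and each geometry Gaussian holds an index into that pool. The array names, shapes, the 512-dimensional embedding, and the index-based link are all assumptions used to illustrate the storage arithmetic.

```python
import numpy as np

# Hypothetical scene: 100k geometry Gaussians, ~5% of them semantic (assumed ratio).
N_GEO, N_SEM, EMB_DIM = 100_000, 5_000, 512

# Geometry branch: standard 3DGS attributes, no language embedding attached.
geometry = {
    "mean":    np.zeros((N_GEO, 3), dtype=np.float32),   # xyz centers
    "scale":   np.ones((N_GEO, 3), dtype=np.float32),
    "rot":     np.zeros((N_GEO, 4), dtype=np.float32),   # quaternions
    "opacity": np.ones((N_GEO, 1), dtype=np.float32),
    "color":   np.zeros((N_GEO, 3), dtype=np.float32),   # DC color only, for brevity
    "sem_id":  np.zeros((N_GEO,), dtype=np.int32),        # link to a semantic Gaussian (assumption)
}

# Semantic branch: the sparse set that actually stores language information.
semantic = {
    "mean":      np.zeros((N_SEM, 3), dtype=np.float32),
    "embedding": np.zeros((N_SEM, EMB_DIM), dtype=np.float32),  # CLIP-aligned features
}

dense_bytes  = N_GEO * EMB_DIM * 4    # per-Gaussian embeddings, float32
sparse_bytes = N_SEM * EMB_DIM * 4    # decoupled scheme
print(f"dense {dense_bytes/1e6:.0f} MB vs sparse {sparse_bytes/1e6:.0f} MB "
      f"= {100 * sparse_bytes / dense_bytes:.0f}% of dense")
```

Under these assumed sizes the language storage falls from roughly 205 MB to 10 MB, which is the 5 percent figure read as a ratio of embedding counts.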

Core claim

FLEG is a feed-forward network that reconstructs language-embedded 3D Gaussians from arbitrary views. It introduces a geometry-semantic dual-branch distillation framework that enables flexible input from arbitrary multi-view images without camera parameters, a novel-view-based distillation strategy to mitigate overfitting, and a decoupled language embedding strategy that represents language information with a sparse set of semantic Gaussians using only 5 percent of the language embeddings while maintaining comparable semantic fidelity.
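
The novel-view-based distillation is only named in the abstract, so the following is one plausible reading sketched as a PyTorch-style training step: hold out some views of each training scene, predict Gaussians from the rest, and supervise both the rendered images and the rendered semantic features at the held-out views against a frozen 2D teacher. The context/target split, the teacher, and the loss weight are assumptions, not the paper's recipe.

```python
import torch
import torch.nn.functional as F

def train_step(model, teacher_2d, render, views, k_context, lam=0.5):
    """Hypothetical novel-view-based distillation step (not the paper's exact recipe).

    model:      feed-forward predictor, images -> Gaussians (pose-free)
    teacher_2d: frozen 2D vision-language encoder, image -> (C, H, W) features
    render:     differentiable rasterizer, (gaussians, camera) -> (rgb, feature map)
    """
    context, targets = views[:k_context], views[k_context:]   # targets act as novel views

    gaussians = model(torch.stack([v.image for v in context]))

    loss = 0.0
    for view in targets:                                   # supervise away from the inputs
        rgb, feat = render(gaussians, view.camera)         # (3, H, W) and (C, H, W)
        loss = loss + F.mse_loss(rgb, view.image)          # photometric term
        with torch.no_grad():
            teacher_feat = teacher_2d(view.image)          # 2D language features
        loss = loss + lam * (1 - F.cosine_similarity(feat, teacher_feat, dim=0)).mean()

    loss.backward()
    return loss.detach()
```

Distilling at held-out cameras rather than at the input views is what gives the strategy its claimed resistance to overfitting.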

What carries the argument

The geometry-semantic dual-branch distillation framework, together with a decoupled language embedding that assigns language information to a sparse set of semantic Gaussians rather than to every Gaussian.
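
To see why a sparse carrier set can still serve open-vocabulary queries, here is a minimal NumPy sketch of query time: the text embedding is compared only against the semantic Gaussians, and the scores are broadcast to geometry Gaussians through an assumed index-based assignment (the paper's actual association mechanism is not specified in the abstract).

```python
import numpy as np

def query_relevancy(text_emb, sem_emb, geo_sem_id):
    """Score every geometry Gaussian against one text query.

    text_emb:   (D,)   CLIP-style embedding of the query string
    sem_emb:    (M, D) embeddings of the sparse semantic Gaussians, M << N
    geo_sem_id: (N,)   index of the semantic Gaussian each geometry Gaussian uses
                       (assumed assignment, e.g. nearest semantic Gaussian)
    """
    t = text_emb / np.linalg.norm(text_emb)
    s = sem_emb / np.linalg.norm(sem_emb, axis=1, keepdims=True)
    sem_scores = s @ t                 # one cosine similarity per semantic Gaussian
    return sem_scores[geo_sem_id]      # broadcast to all N geometry Gaussians

# Usage sketch: keep Gaussians whose relevancy to "chair" is well above average.
# scores = query_relevancy(encode_text("chair"), semantic["embedding"], geometry["sem_id"])
# mask = scores > scores.mean() + scores.std()
```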

If this is right

  • Reconstruction becomes possible from any number of input views without camera parameters or fixed input counts.
  • Storage for language information drops to roughly 5 percent of that required by dense per-Gaussian schemes while semantic fidelity stays comparable.
  • Both geometric reconstruction quality and language-aligned semantic performance exceed those of prior feed-forward methods.
  • Novel-view distillation during training reduces overfitting to the specific input images supplied.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same sparsity principle could be tested on other scene attributes such as material or lighting that are not uniformly dense.
  • The reduced embedding count may make real-time language-guided 3D reconstruction practical on devices with limited memory.
  • The method could be extended by letting the number of semantic Gaussians adapt automatically to scene complexity.

Load-bearing premise

Semantic information is sufficiently sparse that a small set of dedicated semantic Gaussians can represent language meaning across the entire scene without loss of fidelity compared to per-Gaussian embeddings.
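
The premise is testable independently of the architecture: pool CLIP-aligned features over a whole scene and ask how many prototypes are needed to explain them. A rough diagnostic sketch follows; the clustering choice, k, and the feature source are assumptions, not anything the paper reports.

```python
import numpy as np
from sklearn.cluster import KMeans

def semantic_compressibility(features, k=64):
    """Fraction of feature variance captured by k prototypes.

    features: (P, D) CLIP-aligned features pooled over all views of one scene.
    A value near 1.0 at small k would support the sparsity premise; a low value
    on cluttered scenes would be the failure mode to watch for.
    """
    km = KMeans(n_clusters=k, n_init=10).fit(features)
    prototypes = km.cluster_centers_[km.labels_]              # nearest prototype per point
    residual = ((features - prototypes) ** 2).sum()
    total = ((features - features.mean(axis=0)) ** 2).sum()
    return 1.0 - residual / total
```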

What would settle it

A controlled ablation that trains matched sparse and dense variants on the same scenes would settle it: if replacing the sparse semantic Gaussians with full per-Gaussian embeddings yields a clear gain in semantic query accuracy or reconstruction metrics, the sparsity assumption does not hold.
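
Spelled out as a protocol, that test is a paired comparison of the two variants on identical scenes and queries. A sketch with all four callables left as placeholders:

```python
def sparse_vs_dense_ablation(scenes, queries, build_sparse, build_dense, score):
    """Paired test of the sparsity premise (protocol sketch, not the paper's code).

    build_sparse / build_dense reconstruct a scene with ~5% vs per-Gaussian
    embeddings; score(reconstruction, scene, query) returns a semantic metric
    such as open-vocabulary segmentation mIoU against ground truth.
    """
    gaps = []
    for scene in scenes:
        sparse_rec, dense_rec = build_sparse(scene), build_dense(scene)
        for q in queries:
            gaps.append(score(dense_rec, scene, q) - score(sparse_rec, scene, q))
    mean_gap = sum(gaps) / len(gaps)
    # A consistently positive mean_gap (dense clearly ahead) would falsify the
    # claim that 5% of the embeddings preserve comparable semantic fidelity.
    return mean_gap
```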

Figures

Figures reproduced from arXiv: 2512.17541 by Jiayu Ying, Lizhuang Ma, Qijian Tian, Xin Tan, Xuhong Wang, Yuan Xie.

Figure 1: FLEG reconstructs language-embedded Gaussians in a single feed-forward pass from any uncalibrated and unposed multi-view …
Figure 2: Overview of FLEG. Our FLEG adopts a large transformer with a DPT-based decoder and corresponding prediction heads to …
Figure 3: Qualitative comparisons with feed-forward methods on the ScanNet dataset under sparse-view inputs.
Figure 4: Qualitative comparisons with per-scene optimized methods …
read the original abstract

We present FLEG, a feed-forward network that reconstructs language-embedded 3D Gaussians from arbitrary views. Previous feed-forward language-embedded Gaussian reconstruction methods are restricted to a fixed number of input views and typically attach a language-aligned semantic embedding to each Gaussian, resulting in impractical input settings and semantic redundancy. In contrast, we introduce a geometry-semantic dual-branch distillation framework that enables flexible input from arbitrary multi-view images without camera parameters. We also propose a novel-view-based distillation strategy during training that mitigates overfitting to input views. In addition, we observe that semantic representations are significantly sparser than geometric ones, and per-Gaussian language embedding is unnecessary. To exploit this sparsity, we design a decoupled language embedding strategy that represents language information with a sparse set of semantic Gaussians, rather than attaching embeddings to every Gaussian. Compared with dense pixel-aligned per-Gaussian embedding schemes, our method uses only 5% of the language embeddings while maintaining comparable semantic fidelity, effectively reducing storage costs. Extensive experiments demonstrate that FLEG outperforms state-of-the-art feed-forward reconstruction and language-embedded Gaussian methods in both reconstruction quality and language-aligned semantic representation. Project page: https://fangzhou2000.github.io/projects/fleg.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents FLEG, a feed-forward network for reconstructing language-embedded 3D Gaussians from arbitrary multi-view images without camera parameters. It introduces a geometry-semantic dual-branch distillation framework and a novel-view-based distillation strategy to mitigate overfitting. The core technical contribution is a decoupled language embedding approach that represents semantics via a sparse set of dedicated semantic Gaussians rather than per-Gaussian embeddings, exploiting the claimed sparsity of semantic information relative to geometry. The paper asserts that this uses only 5% of the language embeddings while preserving comparable semantic fidelity and that FLEG outperforms prior feed-forward reconstruction and language-embedded Gaussian methods on both geometric quality and language-aligned metrics.

Significance. If the experimental claims hold under rigorous validation, the work would be significant for efficient semantic 3D reconstruction. Reducing language embeddings to 5% while maintaining fidelity could substantially lower storage and compute costs for open-vocabulary 3D models, benefiting downstream tasks in robotics, AR, and scene understanding. The feed-forward arbitrary-view capability also advances generalizable 3D pipelines beyond fixed-view constraints.

major comments (2)
  1. [Abstract] The load-bearing claim that 'semantic representations are significantly sparser than geometric ones' and that a sparse set of semantic Gaussians suffices for 'comparable semantic fidelity' at 5% embeddings must be supported by explicit ablations. The manuscript should report language metrics (CLIP similarity, open-vocabulary segmentation) as a function of the number of semantic Gaussians across multiple scenes (see the sweep sketch after these comments); without these, the risk that fine-grained language information is lost in complex scenes cannot be ruled out.
  2. [Experiments] The headline performance gains over state-of-the-art feed-forward and language-embedded Gaussian baselines require detailed tables with exact dataset splits, baseline implementations, and per-metric scores (PSNR, SSIM, LPIPS for geometry; CLIP-based metrics for semantics). The current abstract-level assertion is insufficient to confirm that the dual-branch distillation transfers all queryable language information without measurable degradation.
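
The sweep the referee asks for is a one-dimensional experiment over the semantic-Gaussian budget. The sketch below is a protocol outline only; build_model, evaluate, and the specific fractions (chosen to bracket the paper's 5 percent operating point) are placeholders, not anything reported in the manuscript.

```python
def embedding_fraction_sweep(scenes, queries, build_model, evaluate,
                             fractions=(0.01, 0.02, 0.05, 0.10, 0.20)):
    """Record semantic fidelity as a function of the semantic-Gaussian fraction.

    build_model(fraction) and evaluate(model, scene, query) are placeholders;
    the fractions are assumed values bracketing the paper's 5% operating point.
    """
    results = {}
    for f in fractions:
        model = build_model(f)
        scores = [evaluate(model, s, q) for s in scenes for q in queries]
        results[f] = sum(scores) / len(scores)
    return results   # mean fidelity per fraction; a flat curve beyond 5% would support the claim
```
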
minor comments (2)
  1. [Method] Clarify the precise architecture of the dual-branch distillation (e.g., how gradients flow between geometry and semantic branches) and the selection mechanism for the sparse semantic Gaussians.
  2. [Figures] Ensure figures showing qualitative language alignment include side-by-side comparisons with dense per-Gaussian baselines on the same queries.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the detailed and constructive feedback. We appreciate the recognition of the potential significance of our work in efficient semantic 3D reconstruction. We will address the major comments by providing the requested ablations and detailed experimental tables in the revised manuscript.

read point-by-point responses
  1. Referee: [Abstract] The load-bearing claim that 'semantic representations are significantly sparser than geometric ones' and that a sparse set of semantic Gaussians suffices for 'comparable semantic fidelity' at 5% embeddings must be supported by explicit ablations. The manuscript should report language metrics (CLIP similarity, open-vocabulary segmentation) as a function of the number of semantic Gaussians across multiple scenes; without these, the risk that fine-grained language information is lost in complex scenes cannot be ruled out.

    Authors: We agree that explicit ablations are necessary to substantiate the sparsity claim and the sufficiency of the sparse semantic Gaussians. In the revised manuscript, we will include new ablation studies reporting CLIP similarity and open-vocabulary segmentation metrics as a function of the number of semantic Gaussians (e.g., varying from 1% to 20% of the original embeddings) across multiple scenes from the datasets. These will demonstrate that semantic fidelity is preserved at 5% without significant loss in fine-grained information, addressing the concern for complex scenes. revision: yes

  2. Referee: [Experiments] The headline performance gains over state-of-the-art feed-forward and language-embedded Gaussian baselines require detailed tables with exact dataset splits, baseline implementations, and per-metric scores (PSNR, SSIM, LPIPS for geometry; CLIP-based metrics for semantics). The current abstract-level assertion is insufficient to confirm that the dual-branch distillation transfers all queryable language information without measurable degradation.

    Authors: We acknowledge that the current presentation relies on abstract-level assertions and will enhance the Experiments section with comprehensive tables. These will detail exact dataset splits (e.g., train/test divisions for Replica, ScanNet, etc.), descriptions of baseline implementations (including how we reproduced prior methods), and full per-metric scores including PSNR, SSIM, LPIPS for geometry and CLIP-based metrics (such as CLIP similarity, open-vocabulary segmentation accuracy) for semantics. This will provide rigorous validation that the dual-branch distillation maintains language information fidelity without degradation. revision: yes
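
On the geometric side of those tables, PSNR is simple enough to pin down exactly; SSIM, LPIPS, and the CLIP-based semantic metrics each need dedicated libraries and are omitted from this minimal sketch.

```python
import numpy as np

def psnr(rendered, reference, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = np.mean((rendered.astype(np.float64) - reference.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)
```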

Circularity Check

0 steps flagged

No significant circularity: new architecture with independent empirical basis

full rationale

The paper presents FLEG as a novel feed-forward architecture using geometry-semantic dual-branch distillation, novel-view training, and a decoupled sparse semantic Gaussian representation. The central sparsity claim is framed as an empirical observation rather than a fitted parameter or self-referential derivation, and no equations reduce claimed performance metrics to inputs by construction. Self-citations, if present, are not load-bearing for the core claims, which rest on the proposed design choices and experimental validation against external baselines.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The sparsity assumption and the effectiveness of the distillation strategy are implicit but not formalized.

pith-pipeline@v0.9.0 · 5530 in / 1105 out tokens · 14811 ms · 2026-05-16T20:44:03.338033+00:00 · methodology


Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 2 internal anchors

  1. [1]

    Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments

    Peter Anderson, Qi Wu, Damien Teney, Jake Bruce, Mark Johnson, Niko Sünderhauf, Ian Reid, Stephen Gould, and Anton Van Den Hengel. Vision-and-language navigation: Interpreting visually-grounded navigation instructions in real environments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3674–3683, 2018.

  2. [2]

    Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Yuri Feigin, Peter Fu, Thomas Gebauer, Daniel Kurz, Tal Dimry, Brandon Joffe, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. In Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 1), 2021.

  3. [3]

    pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

    David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19457–19467, 2024.

  4. [4]

    Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

    Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. arXiv preprint arXiv:2403.14627, 2024.

  5. [5]

    Understanding augmented reality: Concepts and applications

    Alan B Craig. Understanding augmented reality: Concepts and applications. 2013.

  6. [6]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

  7. [7]

    Large spatial model: End-to-end unposed images to semantic 3d

    Zhiwen Fan, Jian Zhang, Wenyan Cong, Peihao Wang, Renjie Li, Kairun Wen, Shijie Zhou, Achuta Kadambi, Zhangyang Wang, Danfei Xu, et al. Large spatial model: End-to-end unposed images to semantic 3d. Advances in Neural Information Processing Systems (NeurIPS), 37:40212–40229, 2025.

  8. [8]

    Visual language maps for robot navigation

    Chenguang Huang, Oier Mees, Andy Zeng, and Wolfram Burgard. Visual language maps for robot navigation. In IEEE International Conference on Robotics and Automation (ICRA), pages 10608–10615. IEEE, 2023.

  9. [9]

    Fastlgs: Speeding up language embedded gaussians with feature grid mapping

    Yuzhou Ji, He Zhu, Junshu Tang, Wuyi Liu, Zhizhong Zhang, Xin Tan, and Yuan Xie. Fastlgs: Speeding up language embedded gaussians with feature grid mapping. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), 2025.

  10. [10]

    Anysplat: Feed-forward 3d gaussian splatting from unconstrained views

    Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. arXiv preprint arXiv:2505.23716.

  11. [11]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139:1–139:14, 2023.

  12. [12]

    Grounding image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In European Conference on Computer Vision (ECCV), pages 71–91. Springer.

  13. [13]

    Language-driven Semantic Segmentation

    Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and René Ranftl. Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546, 2022.

  14. [14]

    Semanticsplat: Feed-forward 3d scene understanding with language-aware gaussian fields

    Qijing Li, Jingxiang Sun, Liang An, Zhaoqi Su, Hongwen Zhang, and Yebin Liu. Semanticsplat: Feed-forward 3d scene understanding with language-aware gaussian fields. arXiv preprint arXiv:2506.09565, 2025.

  15. [15]

    Langsplatv2: High-dimensional 3d language gaussian splatting with 450+ fps

    Wanhua Li, Yujie Zhao, Minghan Qin, Yang Liu, Yuanhao Cai, Chuang Gan, and Hanspeter Pfister. Langsplatv2: High-dimensional 3d language gaussian splatting with 450+ fps. arXiv preprint arXiv:2507.07136, 2025.

  16. [16]

    Langscene-x: Reconstruct generalizable 3d language-embedded scenes with trimap video diffusion

    Fangfu Liu, Hao Li, Jiawei Chi, Hanyang Wang, Minghui Yang, Fudong Wang, and Yueqi Duan. Langscene-x: Reconstruct generalizable 3d language-embedded scenes with trimap video diffusion. arXiv preprint arXiv:2507.02813.

  17. [17]

    Scaffold-gs: Structured 3d gaussians for view-adaptive rendering

    Tao Lu, Mulin Yu, Linning Xu, Yuanbo Xiangli, Limin Wang, Dahua Lin, and Bo Dai. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  18. [18]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. Communications of the ACM, 65(1):99–106, 2021.

  19. [19]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.

  20. [20]

    Langsplat: 3d language gaussian splatting

    Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20051–20060, 2024.

  21. [21]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  22. [22]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12179–12188, 2021.

  23. [23]

    Sam 2: Segment anything in images and videos

    Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. In International Conference on Learning Representations (ICLR), 2025.

  24. [24]

    Distilled feature fields enable few-shot language-guided manipulation

    William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, and Phillip Isola. Distilled feature fields enable few-shot language-guided manipulation. In Conference on Robot Learning (CoRL), pages 405–424. PMLR, 2023.

  25. [25]

    Spatialsplat: Efficient semantic 3d from sparse unposed images

    Yu Sheng, Jiajun Deng, Xinran Zhang, Yu Zhang, Bei Hua, Yanyong Zhang, and Jianmin Ji. Spatialsplat: Efficient semantic 3d from sparse unposed images. arXiv preprint arXiv:2505.23044, 2025.

  26. [26]

    Language embedded 3d gaussians for open-vocabulary scene understanding

    Jin-Chuan Shi, Miao Wang, Hao-Bin Duan, and Shao-Hua Guan. Language embedded 3d gaussians for open-vocabulary scene understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5333–5343, 2024.

  27. [27]

    Open-world object manipulation using pre-trained vision-language models

    Austin Stone, Ted Xiao, Yao Lu, Keerthana Gopalakrishnan, Kuang-Huei Lee, Quan Vuong, Paul Wohlhart, Sean Kirmani, Brianna Zitkovich, Fei Xia, et al. Open-world object manipulation using pre-trained vision-language models. In Conference on Robot Learning (CoRL), pages 3397–3417. PMLR, 2023.

  28. [28]

    Uni3r: Unified 3d reconstruction and semantic understanding via generalizable gaussian splatting from unposed multi-view images

    Xiangyu Sun, Haoyi Jiang, Liu Liu, Seungtae Nam, Gyeongjin Kang, Xinjie Wang, Wei Sui, Zhizhong Su, Wenyu Liu, Xinggang Wang, et al. Uni3r: Unified 3d reconstruction and semantic understanding via generalizable gaussian splatting from unposed multi-view images. arXiv preprint arXiv:2508.03643, 2025.

  29. [29]

    Uniforward: Unified 3d scene and semantic field reconstruction via feed-forward gaussian splatting from only sparse-view images

    Qijian Tian, Xin Tan, Jingyu Gong, Yuan Xie, and Lizhuang Ma. Uniforward: Unified 3d scene and semantic field reconstruction via feed-forward gaussian splatting from only sparse-view images. arXiv preprint arXiv:2506.09378, 2025.

  30. [30]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  31. [31]

    Continuous 3d perception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d perception model with persistent state. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  32. [32]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20697–20709, 2024.

  33. [33]

    Gsemsplat: Generalizable semantic 3d gaussian splatting from uncalibrated image pairs

    Xingrui Wang, Cuiling Lan, Hanxin Zhu, Zhibo Chen, and Yan Lu. Gsemsplat: Generalizable semantic 3d gaussian splatting from uncalibrated image pairs. arXiv preprint arXiv:2412.16932, 2024.

  34. [34]

    Scannet++: A high-fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 12–22, 2023.

  35. [35]

    Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields

    Shijie Zhou, Haoran Chang, Sicheng Jiang, Zhiwen Fan, Zehao Zhu, Dejia Xu, Pradyumna Chari, Suya You, Zhangyang Wang, and Achuta Kadambi. Feature 3dgs: Supercharging 3d gaussian splatting to enable distilled feature fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 21676–21685.

  36. [36]

    Stereo magnification: learning view synthesis using multiplane images

    Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: learning view synthesis using multiplane images. ACM Transactions on Graphics (TOG), 37(4):1–12, 2018.