pith. sign in

arxiv: 2512.04021 · v2 · submitted 2025-12-03 · 💻 cs.CV

C3G: Learning Compact 3D Representations with 2K Gaussians

Pith reviewed 2026-05-17 02:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D Gaussian Splattingcompact 3D representationfeed-forward reconstructionmulti-view feature aggregationnovel view synthesis3D segmentationsparse viewsattention tokens
0
0 comments X

The pith

C3G shows that only about 2,000 strategically placed 3D Gaussians suffice for high-quality scene reconstruction and understanding from sparse unposed views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces C3G as a feed-forward framework that generates a compact set of 3D Gaussians at key spatial locations rather than one per pixel. Learnable tokens aggregate features from multiple views using self-attention, and these patterns then direct both where the Gaussians appear and how their features are decoded for lifting. This reduces memory use while supporting tasks such as novel view synthesis and 3D segmentation. A sympathetic reader would care because the work argues that dense per-pixel representations are largely redundant once essential locations and cross-view features are selected properly.

Core claim

C3G estimates compact 3D Gaussians only at essential spatial locations by introducing learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, then exploits the learned attention patterns for efficient Gaussian decoding and feature lifting, yielding superior memory efficiency and feature fidelity on pose-free novel view synthesis, 3D open-vocabulary segmentation, and view-invariant feature aggregation.

What carries the argument

Learnable tokens that aggregate multi-view features via self-attention to guide the generation and decoding of a limited set of 3D Gaussians at essential locations.

If this is right

  • High-quality novel view synthesis becomes feasible with orders-of-magnitude lower memory than per-pixel Gaussian methods.
  • 3D open-vocabulary segmentation improves through more effective multi-view feature aggregation.
  • Feed-forward processing of unposed sparse views requires far less storage while retaining geometric fidelity.
  • Redundancy in 3D representations can be eliminated without sacrificing reconstruction or understanding performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same token-guided placement strategy could be tested on larger indoor or outdoor scenes to check whether compactness scales beyond the evaluated datasets.
  • Attention-driven selection of essential locations may transfer to other sparse-view 3D tasks such as object tracking or surface reconstruction.
  • Real-time applications on resource-constrained devices become more plausible once the Gaussian count is fixed near 2K.

Load-bearing premise

That learnable tokens aggregating multi-view features via self-attention can reliably guide Gaussian generation and decoding at essential locations without losing critical geometric details or introducing artifacts from sparse unposed views.

What would settle it

A benchmark comparison on standard datasets where the 2K-Gaussian model produces measurably worse novel-view PSNR or segmentation accuracy than dense per-pixel Gaussian baselines would falsify the sufficiency claim.

Figures

Figures reproduced from arXiv: 2512.04021 by Chaehyun Kim, Honggyu An, Hyuna Ko, Jaewoo Jung, Jisang Han, Junsu Kim, Kazumi Fukuda, Minkyeong Jeon, Mungyeom Kim, Seungryong Kim, Sunghwan Hong, Takuya Narihira, Yuki Mitsufuji.

Figure 1
Figure 1. Figure 1: Teaser. Our method learns compact 3D Gaussians from unposed multi-view images through a query-based Gaussian decoding pipeline. Compact representations enable efficient 2D-to-3D feature lifting for downstream applications, including 3D understanding, correspondence, and upsampling. Compared to prior works (LSM [14] and CF3 [33]), Our C3G results in the fewest Gaussians—about 2K, which is roughly 65× fewer … view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of per-pixel and compact scene representations. (Left): Existing per-pixel estimators [24, 71] predict one or multiple Gaussians per pixel, resulting in redundant Gaussians with misalignments across views. (Right): Our method uses learnable Gaussian queries to discover and decode only compact 3D Gaussians at essential locations, achieving a compact representation with only 2K Gaussians and 4.1M … view at source ↗
Figure 3
Figure 3. Figure 3: Architecture and emergent attention behaviors of our 3D Gaussian decoder (C3G-G). (a) Our framework first extracts multi￾view features using VGGT, then processes them with learnable query tokens through transformer blocks in our Gaussian decoder (C3G-G). The refined queries are subsequently decoded into compact 3D Gaussians via a Gaussian head, trained with the novel view synthesis loss Lnovel. (b) Visuali… view at source ↗
Figure 4
Figure 4. Figure 4: C3G-F training scheme. We leverage the learned atten￾tion patterns from the Gaussian decoder C3G-G (top) to efficiently learn a 3D feature decoder C3G-F(bottom) for feature lifting. We initialize C3G-F by copying C3G-G’s architecture and copy the attention weights from C3G-G, using learnable feature queries Q′ and features E ′ from any desired encoder. Only the value projec￾tions V ′ are trainable, enablin… view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative results of novel view synthesis on RealEstate10K [80]. Given multi-view input images, our method produces the highest-quality renderings, both with and without test-time Gaussian optimization. TTO denotes that test-time op￾timization is applied to the Gaussians. features [63], DINOv2 [45], and DINOv3 [58] to demon￾strate C3G-F’s effectiveness as a view-invariant feature de￾coder by evaluating i… view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative results of 3D scene understanding on ScanNet [10]. We conduct qualitative comparison for 3D scene understand￾ing via novel view synthesis and open-vocabulary segmentation. When compared to both per-scene optimization ((a), (b)) and feed-forward ((c), (d)) methods, ours show the most high-fidelity renderings and accurate segmentation maps compared to the ground-truth. (a) Context view (b) VGGT-T… view at source ↗
Figure 7
Figure 7. Figure 7: PCA visualization of multi-view features on Scan￾Net [10]. We visualize the PCA results of encoded multi-view features. Our method improves multi-view consistency compared to the original visual features [63]. 4.5. Multi-view feature upsampling Since 3D Gaussians can be projected to arbitrary camera poses and intrinsics, we can render images at different res￾olutions. This property enables upsampled featur… view at source ↗
Figure 9
Figure 9. Figure 9: Convergence speed improvement by eliminating au￾toencoder in our framework. ture loss only to Gaussian feature attributes. We extend the CUDA rasterizer to propagate feature loss to all Gaussian attributes. However, this also degrades geometry estimation results because foundation model features are not perfectly multi-view consistent. We hypothesize that with perfectly multi-view invariant features, featu… view at source ↗
Figure 10
Figure 10. Figure 10: Novel view synthesis via latent decoding. We ex￾plore the potential of combining our view-invariant feature de￾coder, C3G-F, with generative models. We lift DINOv2-base features (which serve as latents for a Representation Autoencoder (RAE) [77]) extracted from the input views ((a)) to 3D Gaussians and render them at novel viewpoints. (b) Ours – 3DGS Render￾ing: Standard RGB rendering from the estimated G… view at source ↗
Figure 11
Figure 11. Figure 11: Additional visualization of learned attention patterns between a target Gaussian and input images. Without explicit supervision, each query token (red dots) learns to attend to spatially coherent regions across multiple views, naturally discovering corre￾sponding regions. 16 [PITH_FULL_IMAGE:figures/full_fig_p016_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Additional qualitative results of 3D scene understanding on ScanNet [10]. We conduct qualitative comparison for 3D scene understanding via novel view synthesis and open-vocabulary segmentation. When compared to both per-scene optimization ((a), (b)) and feed-forward ((c), (d)) methods, ours show the most high-fidelity renderings and accurate segmentation maps compared to the ground-truth. 17 [PITH_FULL_I… view at source ↗
Figure 13
Figure 13. Figure 13: Additional PCA visualization of multi-view features on ScanNet [10]. We visualize the PCA results of encoded multi-view features. Our method improves multi-view consistency compared to the original visual features. 18 [PITH_FULL_IMAGE:figures/full_fig_p018_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Additional PCA visualization of upsampled feature on ScanNet [10]. We visualize the PCA results of the upsampled feature. Our model upsamples the features while maintaining the multi-view consistencies compared to other baselines. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Additional qualitative results of novel view synthesis on RealEstate10K [80]. We conduct a qualitative comparison for novel view synthesis with available multi-view images. Our method produces the highest quality rendering results, with or without test￾time Gaussian optimization. TTO denotes that test-time optimization is applied to the Gaussians. 20 [PITH_FULL_IMAGE:figures/full_fig_p020_15.png] view at source ↗
read the original abstract

Reconstructing and understanding 3D scenes from unposed sparse views in a feed-forward manner remains as a challenging task in 3D computer vision. Recent approaches use per-pixel 3D Gaussian Splatting for reconstruction, followed by a 2D-to-3D feature lifting stage for scene understanding. However, they generate excessive redundant Gaussians, causing high memory overhead and sub-optimal multi-view feature aggregation, leading to degraded novel view synthesis and scene understanding performance. We propose C3G, a novel feed-forward framework that estimates compact 3D Gaussians only at essential spatial locations, minimizing redundancy while enabling effective feature lifting. We introduce learnable tokens that aggregate multi-view features through self-attention to guide Gaussian generation, ensuring each Gaussian integrates relevant visual features across views. We then exploit the learned attention patterns for Gaussian decoding to efficiently lift features. Extensive experiments on pose-free novel view synthesis, 3D open-vocabulary segmentation, and view-invariant feature aggregation demonstrate our approach's effectiveness. Results show that a compact yet geometrically meaningful representation is sufficient for high-quality scene reconstruction and understanding, achieving superior memory efficiency and feature fidelity compared to existing methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper proposes C3G, a feed-forward framework for 3D scene reconstruction and understanding from unposed sparse views. It generates only 2K compact 3D Gaussians at essential locations by using learnable tokens that aggregate multi-view features via self-attention; these tokens also guide efficient feature lifting for downstream tasks including pose-free novel view synthesis, 3D open-vocabulary segmentation, and view-invariant feature aggregation. The central claim is that this compact yet geometrically meaningful representation suffices for high-quality results while improving memory efficiency and feature fidelity over dense Gaussian methods.

Significance. If the quantitative claims hold, the work establishes that a fixed small number of Gaussians (2K) can replace redundant per-pixel splatting for both reconstruction and semantic tasks in the challenging feed-forward, pose-free regime. This would represent a meaningful advance in memory-efficient 3D vision pipelines and could influence subsequent work on compact scene representations.

major comments (3)
  1. [§3.2] §3.2 (Learnable Token Aggregation): the self-attention mechanism for multi-view feature aggregation is presented as the key enabler for selecting essential Gaussian locations, yet no analysis or ablation is provided on its behavior under limited view overlap or weak correspondences typical of sparse unposed inputs; this directly bears on whether critical geometric details are preserved or artifacts are introduced in the decoder.
  2. [§4.1, Table 2] §4.1 and Table 2 (Novel View Synthesis Results): the superiority claim over baselines is stated without reporting absolute metrics (e.g., PSNR, SSIM, LPIPS) or statistical significance across multiple scenes; without these numbers the memory-efficiency advantage cannot be weighed against any potential quality trade-off.
  3. [§4.3] §4.3 (3D Open-Vocabulary Segmentation): the feature-lifting stage is said to exploit learned attention patterns, but the paper does not quantify how much of the reported mIoU gain is attributable to the compact 2K representation versus the attention guidance itself; an ablation removing the token-guided component would be required to support the central compactness claim.
minor comments (3)
  1. The abstract asserts 'superior memory efficiency' but does not define the exact memory metric (e.g., peak GPU memory or parameter count) used for comparison.
  2. Notation for the learnable tokens (e.g., T in Eq. (3)) is introduced without an explicit dimensionality or initialization description, which would aid reproducibility.
  3. Figure 3 caption refers to 'attention maps' but the figure itself lacks a color scale or legend explaining what the visualized values represent.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We have carefully considered each point and provide detailed responses below. Where appropriate, we will revise the manuscript to address the concerns raised.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Learnable Token Aggregation): the self-attention mechanism for multi-view feature aggregation is presented as the key enabler for selecting essential Gaussian locations, yet no analysis or ablation is provided on its behavior under limited view overlap or weak correspondences typical of sparse unposed inputs; this directly bears on whether critical geometric details are preserved or artifacts are introduced in the decoder.

    Authors: We agree that further analysis of the self-attention mechanism under conditions of limited view overlap would be valuable. Our current experiments are conducted on datasets featuring sparse, unposed views with naturally occurring limited overlap and weak correspondences. To address this, we will add a new subsection or paragraph in §3.2 discussing the robustness of the attention-based aggregation, supported by qualitative attention visualizations on challenging sparse-view examples. If feasible within page limits, we will also include a targeted ablation on synthetic data with controlled overlap levels. revision: partial

  2. Referee: [§4.1, Table 2] §4.1 and Table 2 (Novel View Synthesis Results): the superiority claim over baselines is stated without reporting absolute metrics (e.g., PSNR, SSIM, LPIPS) or statistical significance across multiple scenes; without these numbers the memory-efficiency advantage cannot be weighed against any potential quality trade-off.

    Authors: The referee is correct that absolute performance metrics are important for a balanced evaluation. While Table 2 highlights relative gains in memory efficiency and performance, we will update the table and accompanying text in §4.1 to include the absolute values of PSNR, SSIM, and LPIPS for C3G and all baselines. Additionally, we will report mean and standard deviation across multiple scenes to demonstrate statistical significance. revision: yes

  3. Referee: [§4.3] §4.3 (3D Open-Vocabulary Segmentation): the feature-lifting stage is said to exploit learned attention patterns, but the paper does not quantify how much of the reported mIoU gain is attributable to the compact 2K representation versus the attention guidance itself; an ablation removing the token-guided component would be required to support the central compactness claim.

    Authors: We appreciate this suggestion for clarifying the source of improvements. The central claim of our work is that the compact 2K Gaussian representation, enabled by the learnable tokens and attention, suffices for high-quality results. To better isolate the compactness aspect, we will conduct and report an ablation in the revised manuscript where we replace the compact representation with a dense per-pixel Gaussian baseline while keeping the attention-guided feature lifting. This will help attribute the mIoU gains more precisely. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation; method uses standard attention without self-referential reduction

full rationale

The paper introduces a feed-forward framework for compact 3D Gaussians guided by learnable tokens and self-attention, validated empirically on novel view synthesis and segmentation tasks. No equations, derivations, or load-bearing steps in the abstract or described method reduce predictions to fitted inputs by construction or via self-citation chains. The approach relies on established attention mechanisms and experimental comparisons rather than internal redefinitions or ansatzes smuggled through citations. This yields a self-contained proposal with independent empirical content, consistent with a low circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim depends on the premise that attention-based token aggregation can identify essential spatial locations and enable effective feature lifting without explicit geometric priors or dense sampling.

axioms (1)
  • domain assumption Self-attention on multi-view features can produce reliable guidance for Gaussian placement and decoding
    Invoked in the description of learnable tokens and attention patterns for Gaussian generation
invented entities (1)
  • Learnable tokens for multi-view aggregation no independent evidence
    purpose: To guide compact Gaussian generation and feature lifting
    Introduced as the key mechanism to minimize redundancy while integrating visual features across views

pith-pipeline@v0.9.0 · 5560 in / 1278 out tokens · 85297 ms · 2026-05-17T02:03:29.833913+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. No Pose, No Problem in 4D: Feed-Forward Dynamic Gaussians from Unposed Multi-View Videos

    cs.CV 2026-05 unverdicted novelty 7.0

    NoPo4D is the first feed-forward system for dynamic 4D Gaussian splatting from unposed multi-view videos, using velocity decomposition supervised by optical flow and a bidirectional motion encoder.

  2. PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations

    cs.CV 2026-05 unverdicted novelty 7.0

    PointForward uses sparse world-space 3D queries and scene graphs to deliver consistent single-pass reconstruction of dynamic driving scenes via point-aligned representations.

  3. SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis

    cs.CV 2026-05 unverdicted novelty 7.0

    SplatWeaver dynamically allocates Gaussian primitives via cardinality experts and pixel-level routing guided by high-frequency cues for improved generalizable novel view synthesis.

  4. SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis

    cs.CV 2026-05 unverdicted novelty 7.0

    SplatWeaver uses cardinality Gaussian experts and pixel-level routing to dynamically allocate varying numbers of Gaussian primitives for generalizable novel view synthesis.

  5. GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens

    cs.CV 2026-04 unverdicted novelty 7.0

    GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.

  6. Entropy-Gradient Grounding: Training-Free Evidence Retrieval in Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 6.0

    Entropy-gradient grounding uses model uncertainty to retrieve evidence regions in VLMs, improving performance on detail-critical and compositional tasks across multiple architectures.

Reference graph

Works this paper leans on

83 extracted references · 83 canonical work pages · cited by 5 Pith papers · 10 internal anchors

  1. [1]

    Cross-view completion models are zero-shot correspondence estimators

    Honggyu An, Jin Hyeon Kim, Seonghoon Park, Jaewoo Jung, Jisang Han, Sunghwan Hong, and Seungryong Kim. Cross-view completion models are zero-shot correspondence estimators. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 1103–1115, 2025. 5

  2. [2]

    Layer Normalization

    Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hin- ton. Layer normalization.arXiv preprint arXiv:1607.06450,

  3. [3]

    Spatial memory: how egocentric and allocen- tric combine.Trends in cognitive sciences, 10(12):551–557,

    Neil Burgess. Spatial memory: how egocentric and allocen- tric combine.Trends in cognitive sciences, 10(12):551–557,

  4. [4]

    pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

    David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19457–19467, 2024. 3, 4, 5, 11, 14, 15

  5. [5]

    Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

    Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. InEuropean Conference on Computer Vision, pages 370–386. Springer, 2024. 3, 11, 14, 15

  6. [6]

    Occam’s lgs: An efficient approach for language gaussian splatting.arXiv preprint arXiv:2412.01807, 2024

    Jiahuan Cheng, Jan-Nico Zaech, Luc Van Gool, and Danda Pani Paudel. Occam’s lgs: An efficient ap- proach for language gaussian splatting.arXiv preprint arXiv:2412.01807, 2024. 2, 5

  7. [7]

    Cats: Cost ag- gregation transformers for visual correspondence.Advances in Neural Information Processing Systems, 34:9011–9023,

    Seokju Cho, Sunghwan Hong, Sangryul Jeon, Yunsung Lee, Kwanghoon Sohn, and Seungryong Kim. Cats: Cost ag- gregation transformers for visual correspondence.Advances in Neural Information Processing Systems, 34:9011–9023,

  8. [8]

    Cats++: Boosting cost aggregation with convolutions and transformers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7174–7194, 2022

    Seokju Cho, Sunghwan Hong, and Seungryong Kim. Cats++: Boosting cost aggregation with convolutions and transformers.IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(6):7174–7194, 2022. 12

  9. [9]

    Cat- seg: Cost aggregation for open-vocabulary semantic seg- mentation

    Seokju Cho, Heeseong Shin, Sunghwan Hong, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. Cat- seg: Cost aggregation for open-vocabulary semantic seg- mentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4113– 4123, 2024. 7

  10. [10]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017. 7, 8, 9, 11, 12, 17, 18, 19

  11. [11]

    Learning to render novel views from wide-baseline stereo pairs

    Yilun Du, Cameron Smith, Ayush Tewari, and Vincent Sitz- mann. Learning to render novel views from wide-baseline stereo pairs. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4970– 4980, 2023. 3, 7

  12. [12]

    Roma: Robust dense fea- ture matching

    Johan Edstedt, Qiyu Sun, Georg B ¨okman, M ˚arten Wadenb¨ack, and Michael Felsberg. Roma: Robust dense fea- ture matching. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19790– 19800, 2024. 9

  13. [13]

    Prob- ing the 3d awareness of visual foundation models

    Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Ab- hishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, and Varun Jampani. Prob- ing the 3d awareness of visual foundation models. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21795–21806, 2024. 6, 7, 12

  14. [14]

    Large spatial model: End-to-end unposed images to semantic 3d.Advances in neural information processing systems, 37:40212–40229,

    Zhiwen Fan, Jian Zhang, Wenyan Cong, Peihao Wang, Renjie Li, Kairun Wen, Shijie Zhou, Achuta Kadambi, Zhangyang Wang, Danfei Xu, et al. Large spatial model: End-to-end unposed images to semantic 3d.Advances in neural information processing systems, 37:40212–40229,

  15. [15]

    1, 2, 3, 5, 7, 8, 9, 11, 12, 13, 14

  16. [16]

    arXiv preprint arXiv:2504.00992 , year=

    Elisabetta Fedele, Boyang Sun, Leonidas Guibas, Marc Pollefeys, and Francis Engelmann. Superdec: 3d scene de- composition with superquadric primitives.arXiv preprint arXiv:2504.00992, 2025. 2, 4

  17. [17]

    D^2USt3R: Enhancing 3D Reconstruction with 4D Pointmaps for Dynamic Scenes, April 2025

    Jisang Han, Honggyu An, Jaewoo Jung, Takuya Narihira, Junyoung Seo, Kazumi Fukuda, Chaehyun Kim, Sunghwan Hong, Yuki Mitsufuji, and Seungryong Kim. Dˆ 2ust3r: En- hancing 3d reconstruction with 4d pointmaps for dynamic scenes.arXiv preprint arXiv:2504.06264, 2025. 2

  18. [18]

    Deep matching prior: Test-time optimization for dense correspondence

    Sunghwan Hong and Seungryong Kim. Deep matching prior: Test-time optimization for dense correspondence. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9907–9917, 2021. 12

  19. [19]

    Cost aggregation with 4d convolutional swin transformer for few-shot segmentation

    Sunghwan Hong, Seokju Cho, Jisu Nam, Stephen Lin, and Seungryong Kim. Cost aggregation with 4d convolutional swin transformer for few-shot segmentation. InEuropean Conference on Computer Vision, pages 108–126. Springer, 2022

  20. [20]

    Neural matching fields: Implicit representation of matching fields for visual correspondence.Advances in Neural Information Processing Systems, 35:13512–13526, 2022

    Sunghwan Hong, Jisu Nam, Seokju Cho, Susung Hong, San- gryul Jeon, Dongbo Min, and Seungryong Kim. Neural matching fields: Implicit representation of matching fields for visual correspondence.Advances in Neural Information Processing Systems, 35:13512–13526, 2022

  21. [21]

    Unifying feature and cost aggregation with transformers for semantic and visual correspondence.arXiv preprint arXiv:2403.11120, 2024

    Sunghwan Hong, Seokju Cho, Seungryong Kim, and Stephen Lin. Unifying feature and cost aggregation with transformers for semantic and visual correspondence.arXiv preprint arXiv:2403.11120, 2024. 12

  22. [22]

    Pf3plat: Pose-free feed-forward 3d gaussian splatting, 2025

    Sunghwan Hong, Jaewoo Jung, Heeseong Shin, Jisang Han, Jiaolong Yang, Chong Luo, and Seungryong Kim. Pf3plat: Pose-free feed-forward 3d gaussian splatting.arXiv preprint arXiv:2410.22128, 2024. 2, 3, 5, 11, 14, 15

  23. [23]

    Unifying cor- respondence pose and nerf for generalized pose-free novel view synthesis

    Sunghwan Hong, Jaewoo Jung, Heeseong Shin, Jiaolong Yang, Seungryong Kim, and Chong Luo. Unifying cor- respondence pose and nerf for generalized pose-free novel view synthesis. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20196– 20206, 2024. 3, 11, 14, 15

  24. [24]

    Longsplat: On- line generalizable 3d gaussian splatting from long sequence images.arXiv preprint arXiv:2507.16144, 2025

    Guichen Huang, Ruoyu Wang, Xiangjun Gao, Che Sun, Yuwei Wu, Shenghua Gao, and Yunde Jia. Longsplat: On- line generalizable 3d gaussian splatting from long sequence images.arXiv preprint arXiv:2507.16144, 2025. 3

  25. [25]

    No pose at all: Self-supervised pose-free 3d gaussian splatting from sparse 21 views

    Ranran Huang and Krystian Mikolajczyk. No pose at all: Self-supervised pose-free 3d gaussian splatting from sparse 21 views. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 27947–27957, 2025. 2, 3, 11, 14, 15

  26. [26]

    Anysplat: Feed-forward 3d gaussian splatting from unconstrained views, 2025

    Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.arXiv preprint arXiv:2505.23716,

  27. [27]

    Geonerf: Generalizing nerf with geometry priors

    Mohammad Mahdi Johari, Yann Lepoittevin, and Franc ¸ois Fleuret. Geonerf: Generalizing nerf with geometry priors. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18365–18375, 2022. 3

  28. [28]

    Kim Jun-Seong, GeonU Kim, Kim Yu-Ji, Yu-Chiang Frank Wang, Jaesung Choe, and Tae-Hyun Oh. Dr. splat: Directly referring 3d gaussian splatting via direct language embed- ding registration. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 14137–14146, 2025. 5

  29. [29]

    Relaxing accurate initialization constraint for 3d gaussian splatting

    Jaewoo Jung, Jisang Han, Honggyu An, Jiwon Kang, Seonghoon Park, and Seungryong Kim. Relaxing accurate initialization constraint for 3d gaussian splatting. 2024. 5, 6, 9

  30. [30]

    3d gaussian splatting for real-time radiance field rendering.ACM Trans

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Trans. Graph., 42(4):139–1,

  31. [31]

    Seg4diff: Unveiling open-vocabulary segmentation in text-to-image diffusion transformers.arXiv preprint arXiv:2509.18096, 2025

    Chaehyun Kim, Heeseong Shin, Eunbeen Hong, Heeji Yoon, Anurag Arnab, Paul Hongsuck Seo, Sunghwan Hong, and Seungryong Kim. Seg4diff: Unveiling open-vocabulary seg- mentation in text-to-image diffusion transformers.arXiv preprint arXiv:2509.18096, 2025. 2

  32. [32]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980,

  33. [33]

    Segment any- thing

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer White- head, Alexander C Berg, Wan-Yen Lo, et al. Segment any- thing. InProceedings of the IEEE/CVF international confer- ence on computer vision, pages 4015–4026, 2023. 14

  34. [34]

    Cf3: Compact and fast 3d feature fields.arXiv preprint arXiv:2508.05254,

    Hyunjoon Lee, Joonkyu Min, and Jaesik Park. Cf3: Compact and fast 3d feature fields.arXiv preprint arXiv:2508.05254,

  35. [35]

    1, 2, 3, 5, 7, 8, 11, 12, 13

  36. [36]

    Ground- ing image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and J´erˆome Revaud. Ground- ing image matching in 3d with mast3r. InEuropean Confer- ence on Computer Vision, pages 71–91. Springer, 2024. 3, 6, 9, 11, 14

  37. [37]

    Language-driven Semantic Segmentation

    Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and Ren ´e Ranftl. Language-driven semantic seg- mentation.arXiv preprint arXiv:2201.03546, 2022. 6, 7, 8, 11, 12, 13

  38. [38]

    Langsplatv2: High- dimensional 3d language gaussian splatting with 450+ fps

    Wanhua Li, Yujie Zhao, Minghan Qin, Yang Liu, Yuanhao Cai, Chuang Gan, and Hanspeter Pfister. Langsplatv2: High- dimensional 3d language gaussian splatting with 450+ fps. arXiv preprint arXiv:2507.07136, 2025. 3, 12

  39. [39]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023. 14

  40. [40]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024. 14

  41. [41]

    Decoupled Weight Decay Regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017. 6

  42. [42]

    Distinctive image features from scale- invariant keypoints.International journal of computer vi- sion, 60(2):91–110, 2004

    David G Lowe. Distinctive image features from scale- invariant keypoints.International journal of computer vi- sion, 60(2):91–110, 2004. 12

  43. [43]

    Ludvig: Learning-free uplift- ing of 2d visual features to gaussian splatting scenes

    Juliette Marrie, Romain M ´en´egaux, Michael Arbel, Diane Larlus, and Julien Mairal. Ludvig: Learning-free uplift- ing of 2d visual features to gaussian splatting scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7440–7450, 2025. 2, 5

  44. [44]

    Evolsplat: Efficient volume-based gaussian splatting for urban view synthesis

    Sheng Miao, Jiaxin Huang, Dongfeng Bai, Xu Yan, Hongyu Zhou, Yue Wang, Bingbing Liu, Andreas Geiger, and Yiyi Liao. Evolsplat: Efficient volume-based gaussian splatting for urban view synthesis. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 11286– 11296, 2025. 3

  45. [45]

    Differentiable blocks world: Qualitative 3d decomposition by rendering primitives.Advances in Neu- ral Information Processing Systems, 36:5791–5807, 2023

    Tom Monnier, Jake Austin, Angjoo Kanazawa, Alexei Efros, and Mathieu Aubry. Differentiable blocks world: Qualitative 3d decomposition by rendering primitives.Advances in Neu- ral Information Processing Systems, 36:5791–5807, 2023. 2

  46. [46]

    Polyfit: Polygonal surface reconstruction from point clouds

    Liangliang Nan and Peter Wonka. Polyfit: Polygonal surface reconstruction from point clouds. InProceedings of the IEEE international conference on computer vision, pages 2353– 2361, 2017. 3

  47. [47]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023. 2, 6, 7, 9, 12, 14

  48. [48]

    Superquadrics revisited: Learning 3d shape pars- ing beyond cuboids

    Despoina Paschalidou, Ali Osman Ulusoy, and Andreas Geiger. Superquadrics revisited: Learning 3d shape pars- ing beyond cuboids. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 10344–10353, 2019. 2, 4

  49. [49]

    Learning unsupervised hierarchical part decomposition of 3d objects from a single rgb image

    Despoina Paschalidou, Luc Van Gool, and Andreas Geiger. Learning unsupervised hierarchical part decomposition of 3d objects from a single rgb image. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1060–1070, 2020. 2

  50. [50]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF inter- national conference on computer vision, pages 4195–4205,

  51. [51]

    Unidepth: Universal monocular metric depth estimation

    Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10106–10116, 2024. 12

  52. [52]

    Langsplat: 3d language gaussian splatting

    Minghan Qin, Wanhua Li, Jiawei Zhou, Haoqian Wang, and Hanspeter Pfister. Langsplat: 3d language gaussian splatting. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20051–20060, 2024. 3, 12 22

  53. [53]

    Learning transferable visual models from natural language supervi- sion

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 12

  54. [54]

    Superglue: Learning feature matching with graph neural networks

    Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4938–4947, 2020. 12

  55. [55]

    Structure- from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure- from-motion revisited. InProceedings of the IEEE con- ference on computer vision and pattern recognition, pages 4104–4113, 2016. 11

  56. [56]

    Distilled feature fields en- able few-shot language-guided manipulation.arXiv preprint arXiv:2308.07931, 2023

    William Shen, Ge Yang, Alan Yu, Jansen Wong, Leslie Pack Kaelbling, and Phillip Isola. Distilled feature fields en- able few-shot language-guided manipulation.arXiv preprint arXiv:2308.07931, 2023. 2

  57. [57]

    Spatialsplat: Efficient semantic 3d from sparse unposed images,

    Yu Sheng, Jiajun Deng, Xinran Zhang, Yu Zhang, Bei Hua, Yanyong Zhang, and Jianmin Ji. Spatialsplat: Efficient semantic 3d from sparse unposed images.arXiv preprint arXiv:2505.23044, 2025. 2

  58. [58]

    Mental rotation of three-dimensional objects.Science, 171(3972):701–703,

    Roger N Shepard and Jacqueline Metzler. Mental rotation of three-dimensional objects.Science, 171(3972):701–703,

  59. [59]

    Towards open-vocabulary semantic segmentation with- out semantic labels.Advances in Neural Information Pro- cessing Systems, 37:9153–9177, 2024

    Heeseong Shin, Chaehyun Kim, Sunghwan Hong, Seokju Cho, Anurag Arnab, Paul Hongsuck Seo, and Seungryong Kim. Towards open-vocabulary semantic segmentation with- out semantic labels.Advances in Neural Information Pro- cessing Systems, 37:9153–9177, 2024. 7

  60. [60]

    DINOv3

    Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025. 2, 6, 7, 8, 9, 10, 12

  61. [61]

    Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

    Brandon Smart, Chuanxia Zheng, Iro Laina, and Vic- tor Adrian Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs.arXiv preprint arXiv:2408.13912, 2024. 3, 11, 14, 15

  62. [62]

    The Replica Dataset: A Digital Replica of Indoor Spaces

    Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The replica dataset: A digital replica of indoor spaces.arXiv preprint arXiv:1906.05797,

  63. [63]

    Learning shape abstractions by as- sembling volumetric primitives

    Shubham Tulsiani, Hao Su, Leonidas J Guibas, Alexei A Efros, and Jitendra Malik. Learning shape abstractions by as- sembling volumetric primitives. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2635–2643, 2017. 2

  64. [64]

    Attention is all you need.Advances in neural information processing systems, 30, 2017

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko- reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017. 2, 3

  65. [65]

    Vggt: Vi- sual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Vi- sual geometry grounded transformer. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025. 2, 3, 4, 6, 7, 8, 9, 10, 11, 12, 14, 15

  66. [66]

    Ibr- net: Learning multi-view image-based rendering

    Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibr- net: Learning multi-view image-based rendering. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2021. 3

  67. [67]

    Dust3r: Geometric 3d vi- sion made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vi- sion made easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20697– 20709, 2024. 3

  68. [68]

    Zpressor: Bottleneck-aware compression for scalable feed-forward 3dgs,

    Weijie Wang, Donny Y Chen, Zeyu Zhang, Duochao Shi, Akide Liu, and Bohan Zhuang. Zpressor: Bottleneck-aware compression for scalable feed-forward 3dgs.arXiv preprint arXiv:2505.23734, 2025. 3

  69. [69]

    V olsplat: Rethinking feed-forward 3d gaussian splatting with voxel-aligned prediction,

    Weijie Wang, Yeqing Chen, Zeyu Zhang, Hengyu Liu, Haoxiao Wang, Zhiyuan Feng, Wenkang Qin, Zheng Zhu, Donny Y Chen, and Bohan Zhuang. V olsplat: Rethinking feed-forward 3d gaussian splatting with voxel-aligned pre- diction.arXiv preprint arXiv:2509.19297, 2025. 3

  70. [70]

    Freesplat: Generalizable 3d gaussian splatting towards free view synthesis of indoor scenes.Advances in Neural Information Processing Systems, 37:107326–107349, 2024

    Yunsong Wang, Tianxin Huang, Hanlin Chen, and Gim Hee Lee. Freesplat: Generalizable 3d gaussian splatting towards free view synthesis of indoor scenes.Advances in Neural Information Processing Systems, 37:107326–107349, 2024. 3

  71. [71]

    Anyup: Universal feature upsampling

    Thomas Wimmer, Prune Truong, Marie-Julie Rakotosaona, Michael Oechsle, Federico Tombari, Bernt Schiele, and Jan Eric Lenssen. Anyup: Universal feature upsampling. arXiv preprint arXiv:2510.12764, 2025. 8, 9, 12

  72. [72]

    Murf: multi-baseline radiance fields

    Haofei Xu, Anpei Chen, Yuedong Chen, Christos Sakaridis, Yulun Zhang, Marc Pollefeys, Andreas Geiger, and Fisher Yu. Murf: multi-baseline radiance fields. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20041–20050, 2024. 3

  73. [73]

    No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images

    Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images.arXiv preprint arXiv:2410.24207, 2024. 2, 3, 4, 5, 7, 9, 11, 14, 15

  74. [74]

    pixelnerf: Neural radiance fields from one or few images

    Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelnerf: Neural radiance fields from one or few images. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 4578–4587, 2021. 3

  75. [75]

    Improving 2d feature representations by 3d-aware fine-tuning

    Yuanwen Yue, Anurag Das, Francis Engelmann, Siyu Tang, and Jan Eric Lenssen. Improving 2d feature representations by 3d-aware fine-tuning. InEuropean Conference on Com- puter Vision, pages 57–74. Springer, 2024. 7, 9

  76. [76]

    Knaebel, K

    Karim Abou Zeid, Kadir Yilmaz, Daan de Geus, Alexander Hermans, David Adrian, Timm Linder, and Bastian Leibe. Dino in the room: Leveraging 2d foundation models for 3d segmentation.arXiv preprint arXiv:2503.18944, 2025. 2, 5

  77. [77]

    Transplat: Generalizable 3d gaussian splatting from sparse multi-view images with transformers

    Chuanrui Zhang, Yingshuang Zou, Zhuoling Li, Minmin Yi, and Haoqian Wang. Transplat: Generalizable 3d gaussian splatting from sparse multi-view images with transformers. InProceedings of the AAAI Conference on Artificial Intelli- gence, pages 9869–9877, 2025. 3 23

  78. [78]

    Shengjun Zhang, Xin Fei, Fangfu Liu, Haixu Song, and Yueqi Duan. Gaussian graph network: Learning efficient and generalizable gaussian representations from multi-view im- ages.Advances in Neural Information Processing Systems, 37:50361–50380, 2024. 3

  79. [79]

    Diffusion Transformers with Representation Autoencoders

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoen- coders.arXiv preprint arXiv:2510.11690, 2025. 14

  80. [80]

    Extract free dense labels from clip

    Chong Zhou, Chen Change Loy, and Bo Dai. Extract free dense labels from clip. InEuropean conference on computer vision, pages 696–712. Springer, 2022. 6, 7, 8, 12

Showing first 80 references.