pith. machine review for the scientific record.

arxiv: 2605.10239 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: 2 theorem links


AdaptSplat: Adapting Vision Foundation Models for Feed-Forward 3D Gaussian Splatting

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:25 UTC · model grok-4.3

classification 💻 cs.CV
keywords 3D Gaussian Splatting · feed-forward reconstruction · vision foundation models · lightweight adapter · high-frequency preservation · cross-domain generalization · novel view synthesis · adapter tuning

The pith

A single 1.5-million-parameter adapter added to vision foundation models enables superior feed-forward 3D Gaussian Splatting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard feed-forward 3D Gaussian Splatting pipelines lose high-frequency geometric detail and generalize poorly across domains because deep networks act as low-pass filters and 3D training data remains limited in scale. It shows that inserting one lightweight Frequency-Preserving Adapter into the generic image feature extraction → multi-view interaction → feature decoding pipeline is enough to recover those details. The adapter pulls direction-aware high-frequency priors from shallow layers of a pre-trained vision foundation model and blends them back in through positional encodings and residual modulation. If correct, this means researchers can achieve better surface and boundary accuracy without redesigning entire architectures or gathering larger 3D datasets.

Core claim

AdaptSplat demonstrates that without complex component engineering, introducing a single adapter of only 1.5M parameters into the generic architecture is sufficient to achieve superior performance. Specifically, the Frequency-Preserving Adapter extracts direction-aware high-frequency structural priors from the shallow features of a vision foundation model backbone and integrates them via high-frequency positional encodings and adaptive residual modulation, compensating for the high-frequency attenuation caused by over-smoothing in deep features and improving the fitting accuracy of Gaussian primitives on complex surfaces and sharp boundaries.

What carries the argument

The Frequency-Preserving Adapter (FPA), which extracts direction-aware high-frequency structural priors from shallow backbone features and fuses them into the generic pipeline through high-frequency positional encodings and adaptive residual modulation.
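The abstract names the ingredients but not the wiring. A minimal PyTorch sketch of what such an adapter could look like, under our own assumptions (the class name, the fixed Sobel-style directional filters, the 1×1 projections, and the sigmoid gate are all illustrative guesses, not the authors' implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencyPreservingAdapterSketch(nn.Module):
    """Hypothetical sketch of an FPA-style module, not the authors' code.

    Extracts direction-aware high-frequency structure from shallow VFM
    features and injects it into deep features as an additive positional
    encoding scaled by a learned residual gate."""

    def __init__(self, shallow_dim: int, deep_dim: int, hidden: int = 64):
        super().__init__()
        # Fixed directional high-pass filters (Sobel x/y), applied per channel.
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        kernels = torch.stack([sobel_x, sobel_x.t()]).unsqueeze(1)  # (2,1,3,3)
        self.register_buffer("hp_kernels", kernels)
        # Lightweight 1x1 projections keep the parameter count small.
        self.proj = nn.Conv2d(2 * shallow_dim, hidden, kernel_size=1)
        self.to_pe = nn.Conv2d(hidden, deep_dim, kernel_size=1)
        self.gate = nn.Sequential(
            nn.Conv2d(hidden, deep_dim, kernel_size=1), nn.Sigmoid()
        )

    def forward(self, shallow: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # shallow: (B, Cs, H, W) early-layer features from the frozen backbone.
        # deep:    (B, Cd, H', W') smoothed deep features entering the decoder.
        b, c, h, w = shallow.shape
        # Depthwise directional high-pass: two responses per input channel.
        hp = F.conv2d(
            shallow.reshape(b * c, 1, h, w), self.hp_kernels, padding=1
        ).reshape(b, 2 * c, h, w)
        prior = self.proj(hp)
        if prior.shape[-2:] != deep.shape[-2:]:
            prior = F.interpolate(prior, size=deep.shape[-2:], mode="bilinear")
        # High-frequency positional encoding + adaptive residual modulation.
        return deep + self.gate(prior) * self.to_pe(prior)
```

If the real FPA is shaped anything like this, nearly all of its parameters sit in the 1×1 projections, which would be consistent with a budget of roughly 1.5M parameters and with leaving the backbone and multi-view machinery untouched.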

If this is right

  • Gaussian primitives achieve higher fitting accuracy on complex surfaces and sharp boundaries.
  • Reconstruction quality reaches state-of-the-art levels across multiple standard benchmarks.
  • Cross-domain generalization improves without domain-specific fine-tuning or extra data.
  • The overall pipeline remains lightweight while delivering better fidelity than prior custom designs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Comparable shallow-feature adapters could improve other feed-forward 3D methods that currently rely on deep smoothed features.
  • Vision foundation models appear to hold under-used high-frequency geometric cues that become accessible with minimal added parameters.
  • The same integration strategy might be tested on larger or different foundation backbones to measure further gains in surface fidelity.
  • This pattern suggests a broader route for parameter-efficient transfer from 2D pre-training to 3D reconstruction tasks.

Load-bearing premise

High-frequency structural priors from shallow features of a vision foundation model can be seamlessly integrated into the 3DGS pipeline to compensate for low-pass filtering without needing extra domain-specific fine-tuning or more training data.

What would settle it

Training identical pipelines with and without the Frequency-Preserving Adapter on the same multi-domain benchmarks and checking whether the adapter produces consistent gains in high-frequency detail metrics and cross-domain accuracy; no gain or added artifacts would falsify the claim.
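The abstract does not pin down what a high-frequency detail metric would be. One plausible instantiation, hedged as our own choice rather than the paper's, is a spectral-band error between rendered and ground-truth views, computed identically for the with- and without-adapter models:

```python
import torch

def high_freq_error(pred: torch.Tensor, gt: torch.Tensor,
                    band_frac: float = 0.25) -> torch.Tensor:
    """Hypothetical metric: spectral magnitude error above a frequency band.

    pred, gt: (B, C, H, W) rendered and ground-truth views in [0, 1].
    Frequencies whose radial magnitude exceeds band_frac of Nyquist
    (0.5 cycles/pixel) count as 'high'; the threshold is our choice.
    Lower is better."""
    fy = torch.fft.fftfreq(pred.shape[-2], device=pred.device)
    fx = torch.fft.fftfreq(pred.shape[-1], device=pred.device)
    ry, rx = torch.meshgrid(fy, fx, indexing="ij")
    mask = (ry ** 2 + rx ** 2).sqrt() > band_frac * 0.5  # high-frequency band
    diff = torch.fft.fft2(pred) - torch.fft.fft2(gt)     # complex spectra
    return diff.abs()[..., mask].mean()
```

A consistent drop in this number with the adapter enabled, alongside stable PSNR/SSIM, would support the claim; no drop, or new spectral artifacts, would count against it.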

Figures

Figures reproduced from arXiv: 2605.10239 by Mingwei Xing, Xinliang Wang, Yifeng Shi.

Figure 2. Overview of AdaptSplat. Based on the generic feature extraction-interaction-decoding pipeline, AdaptSplat introduces a lightweight Frequency-Preserving Adapter (FPA, 1.5M parameters). FPA explicitly extracts high-frequency structural priors to combat the network's spectral bias. These priors are then injected into the Multi-view Transformer as frequency-guided positional encodings (PE) and into the DPT dec…
Figure 3. Qualitative comparison on DL3DV. AdaptSplat yields superior high-frequency fidelity and sharper geometric boundaries. Baseline methods exhibit blurring and structural degradation; conversely, AdaptSplat produces sharp boundaries and clear local details by preserving and explicitly incorporating high-frequency signals, which yields results that closely match the ground truth. Following the YoNoSplat [40] protocol (…)
Figure 5. Gaussian distribution visualization at boundaries.
read the original abstract

This work explores a simple yet powerful lightweight adapter design for feed-forward 3D Gaussian Splatting (3DGS). Existing methods typically apply complex, architecture-specific designs on top of the generic pipeline of image feature extraction $\rightarrow$ multi-view interaction $\rightarrow$ feature decoding. However, constrained by the scale bottleneck of 3D training data and the low-pass filtering effect of deep networks, these methods still fall short in cross-domain generalization and high-frequency geometric fidelity. To address these problems, we propose AdaptSplat, which demonstrates that without complex component engineering, introducing a single adapter of only 1.5M parameters into the generic architecture is sufficient to achieve superior performance. Specifically, we design a lightweight Frequency-Preserving Adapter (FPA) that extracts direction-aware high-frequency structural priors from the shallow features of a powerful vision foundation model backbone, and seamlessly integrates them into the generic pipeline via high-frequency positional encodings and adaptive residual modulation. This effectively compensates for the high-frequency attenuation caused by over-smoothing in deep features, improving the fitting accuracy of Gaussian primitives on complex surfaces and sharp boundaries. Extensive experiments demonstrate that AdaptSplat achieves state-of-the-art feed-forward reconstruction performance on multiple standard benchmarks, with stable generalization across domains. Code available at: https://github.com/xmw666/AdaptSplat.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes AdaptSplat, a feed-forward 3D Gaussian Splatting method that inserts a single lightweight Frequency-Preserving Adapter (FPA) of 1.5M parameters into the standard image-feature-extraction to multi-view-interaction to feature-decoding pipeline. The FPA extracts direction-aware high-frequency structural priors from shallow layers of a frozen vision foundation model and injects them via high-frequency positional encodings and adaptive residual modulation, claiming this compensates for low-pass filtering in deep features, yields SOTA reconstruction quality on complex surfaces and sharp boundaries, and provides stable cross-domain generalization.

Significance. If the claimed performance gains and generalization hold under rigorous verification, the result would be significant: it would show that a minimal, architecture-agnostic adapter suffices to overcome the data-scale and frequency-attenuation bottlenecks that have limited prior feed-forward 3DGS approaches, without requiring large 3D-specific datasets or bespoke multi-view modules. This could simplify deployment of high-fidelity 3D reconstruction systems.

major comments (2)
  1. [Abstract] The central mechanistic claim—that shallow VFM features supply 'direction-aware high-frequency structural priors' that compensate for deep-layer low-pass filtering—is load-bearing for attributing gains to the FPA design rather than to extra capacity or training. No Fourier analysis of feature maps, no ablation replacing VFM shallow features with random or 3D-specific alternatives, and no visualization of the extracted priors are described to confirm the presence of multi-view-consistent 3D geometry (as opposed to 2D texture/edge statistics). A sketch of such a spectral check follows this list.
  2. [Abstract] The assertions of 'state-of-the-art feed-forward reconstruction performance' and 'stable generalization across domains' are presented without any quantitative tables, baseline comparisons, ablation results on FPA components, or error analysis on high-frequency regions. This absence prevents evaluation of effect sizes and robustness, which are required to substantiate the 'single adapter of only 1.5M parameters is sufficient' thesis.
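For concreteness, the spectral check requested in major comment 1 could start as small as radially averaged power spectra of shallow versus deep feature maps from the frozen backbone. A sketch under that assumption (layer choice and binning are hypothetical, not taken from the paper):

```python
import torch

def radial_power_spectrum(feat: torch.Tensor, n_bins: int = 32) -> torch.Tensor:
    """Radially averaged log-power spectrum of feature maps (B, C, H, W).

    If deep layers act as low-pass filters, their curve should fall off
    faster at high radii than the shallow-layer curve."""
    h, w = feat.shape[-2:]
    power = torch.fft.fftshift(torch.fft.fft2(feat), dim=(-2, -1)).abs() ** 2
    fy = torch.fft.fftshift(torch.fft.fftfreq(h, device=feat.device))
    fx = torch.fft.fftshift(torch.fft.fftfreq(w, device=feat.device))
    ry, rx = torch.meshgrid(fy, fx, indexing="ij")
    radius = (ry ** 2 + rx ** 2).sqrt() / 0.5             # fraction of Nyquist
    bins = (radius.clamp(max=1.0) * (n_bins - 1)).long()  # (H, W) bin indices
    spec = torch.zeros(n_bins, device=feat.device)
    counts = torch.zeros(n_bins, device=feat.device)
    flat_p = power.mean(dim=(0, 1)).flatten()             # average over B and C
    spec.scatter_add_(0, bins.flatten(), flat_p)
    counts.scatter_add_(0, bins.flatten(), torch.ones_like(flat_p))
    return (spec / counts.clamp(min=1) + 1e-12).log()

# Hypothetical usage: a faster high-radius falloff for deep features than
# for shallow ones would support the low-pass-filtering premise.
# shallow_spec = radial_power_spectrum(feats_block2)
# deep_spec = radial_power_spectrum(feats_block24)
```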

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for their thorough review and valuable feedback, which helps us improve the clarity and rigor of our work. We respond to each major comment in detail below, indicating where revisions will be made to the manuscript.

read point-by-point responses
  1. Referee: [Abstract] The central mechanistic claim—that shallow VFM features supply 'direction-aware high-frequency structural priors' that compensate for deep-layer low-pass filtering—is load-bearing for attributing gains to the FPA design rather than to extra capacity or training. No Fourier analysis of feature maps, no ablation replacing VFM shallow features with random or 3D-specific alternatives, and no visualization of the extracted priors are described to confirm the presence of multi-view-consistent 3D geometry (as opposed to 2D texture/edge statistics).

    Authors: We agree that the mechanistic claim is important and that additional supporting analyses would enhance the paper. Although our experiments include component ablations that show the FPA's contribution beyond mere capacity (as the adapter is lightweight and the backbone is frozen), we did not include Fourier analysis or visualizations of the priors. In the revised manuscript, we will add visualizations of the direction-aware high-frequency features extracted from shallow VFM layers and their effect on the Gaussian splatting output. We will also conduct and report a Fourier analysis to demonstrate the preservation of high-frequency components. For the suggested ablation with random or 3D-specific alternatives, we will include a discussion noting that such controls would not test the hypothesis of leveraging pre-trained priors, but we can add a random feature baseline if space permits (a sketch of such a control follows these responses). revision: partial

  2. Referee: [Abstract] The assertions of 'state-of-the-art feed-forward reconstruction performance' and 'stable generalization across domains' are presented without any quantitative tables, baseline comparisons, ablation results on FPA components, or error analysis on high-frequency regions. This absence prevents evaluation of effect sizes and robustness, which are required to substantiate the 'single adapter of only 1.5M parameters is sufficient' thesis.

    Authors: We believe there may be a misunderstanding, as the full manuscript provides all the requested elements. Table 1 reports quantitative comparisons against multiple baselines on several benchmarks, demonstrating SOTA performance. Table 3 and Section 4.3 present ablations on the FPA components, including the impact of high-frequency positional encodings and residual modulation. Figure 8 includes error maps and analysis specifically on high-frequency regions such as sharp boundaries and complex surfaces. Cross-domain results are in Table 2. These substantiate the claims regarding the sufficiency of the 1.5M parameter adapter. We will add cross-references in the abstract to these sections in the revision. revision: no
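The random-feature control discussed in response 1 is cheap to express. A hedged sketch of the swap, using the DINO-style `get_intermediate_layers` accessor as an assumed interface (the actual codebase may expose features differently):

```python
import torch

def shallow_features(backbone, images: torch.Tensor,
                     random_control: bool = False) -> torch.Tensor:
    """Return shallow backbone features for the adapter, optionally replaced
    by moment-matched Gaussian noise. If gains survive the swap, the
    'pre-trained prior' explanation weakens; if they vanish, it strengthens.
    (Hypothetical helper; the early block index 2 is an assumption.)"""
    with torch.no_grad():
        feats = backbone.get_intermediate_layers(images, n=[2])[0]
    if random_control:
        feats = torch.randn_like(feats) * feats.std() + feats.mean()
    return feats
```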

Circularity Check

0 steps flagged

No circularity: empirical adapter design with external validation

full rationale

The paper proposes an empirical architecture (a lightweight FPA adapter of 1.5M parameters inserted into a generic 3DGS pipeline) whose performance claims rest on experimental benchmarks rather than any closed mathematical derivation. No equations, fitted parameters renamed as predictions, or self-definitional reductions appear in the abstract or the described method. The central mechanism—extracting high-frequency priors from shallow VFM features and injecting them via positional encodings and residual modulation—is presented as an architectural choice justified by observed low-pass filtering, not by any tautological input-output equivalence or load-bearing self-citation chain. The reported gains are therefore measured against external data rather than reduced to quantities defined inside the paper itself.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the assumption that shallow VFM features contain transferable high-frequency priors that can be injected without retraining the backbone or collecting new 3D data.

axioms (1)
  • domain assumption: Shallow layers of vision foundation models encode direction-aware high-frequency structural information useful for 3D geometry.
    Invoked to justify extraction from shallow features rather than deep layers.
invented entities (1)
  • Frequency-Preserving Adapter (FPA): no independent evidence
    purpose: Extract and integrate high-frequency priors into the 3DGS pipeline
    New module introduced by the paper; no independent evidence outside this work.

pith-pipeline@v0.9.0 · 5541 in / 1243 out tokens · 43813 ms · 2026-05-12T03:25:46.238373+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 4 internal anchors

  1. [1]

    J. T. Barron, B. Mildenhall, D. Verbin, P. P. Srinivasan, and P. Hedman. Mip-NeRF 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5470–5479, 2022.

  2. [2]

    D. Charatan, S. L. Li, A. Tagliasacchi, and V. Sitzmann. pixelSplat: 3D Gaussian splats from image pairs for scalable generalizable 3D reconstruction. In CVPR, 2024.

  3. [3]

    H. Chen, B. Shen, Y. Liu, R. Shi, L. Zhou, C. Z. Lin, J. Gu, H. Su, G. Wetzstein, and L. Guibas. 3D-Adapter: Geometry-consistent multi-view diffusion for high-quality 3D generation, 2024. URL https://arxiv.org/abs/2410.18974.

  4. [4]

    Y. Chen, H. Xu, C. Qian, and G. Zeng. MVSplat: Efficient 3D Gaussian splatting from sparse multi-view images. In ECCV, 2024.

  5. [5]

    Z. Chen, H. Tan, K. Zhang, S. Bi, F. Luan, Y. Hong, F. Li, and Z. Xu. Long-LRM: Long-sequence large reconstruction model for wide-coverage Gaussian splats. arXiv preprint arXiv:2410.12781, 2024.

  6. [6]

    A. Guédon and V. Lepetit. SuGaR: Surface-aligned Gaussian splatting for efficient 3D mesh reconstruction and high-quality mesh rendering. CVPR, 2024.

  7. [7]

    A. Hanson, A. Tu, V. Singla, M. Jayawardhana, M. Zwicker, and T. Goldstein. PUP 3D-GS: Principled uncertainty pruning for 3D Gaussian splatting. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5949–5958, 2025.

  8. [8]

    T. Huang, B. Dong, Y. Yang, et al. CLIP2Point: Transfer CLIP to point cloud classification with image-depth pre-training, 2022. URL https://arxiv.org/abs/2210.01055.

  9. [9]

    Z. Huang, Y. Guo, H. Wang, R. Yi, L. Ma, Y.-P. Cao, and L. Sheng. MV-Adapter: Multi-view consistent image generation made easy, 2024. URL https://arxiv.org/abs/2412.03632.

  10. [10]

    R. Itkin, N. Issachar, Y. Keypur, X. Chen, A. Chen, and S. Benaim. GlobalSplat: Efficient feed-forward 3D Gaussian splatting via global scene tokens. arXiv preprint arXiv:2604.15284, 2026.

  11. [11]

    H. Jeong, S. Lee, G. Kang, S. Yang, X. Sun, S. Nam, and E. Park. 2xplat: Two experts are better than one generalist. arXiv preprint arXiv:2603.21064, 2026.

  12. [12]

    H. Jia, L. Zhu, and N. Zhao. H3R: Hybrid multi-view correspondence for generalizable 3D reconstruction. arXiv preprint arXiv:2508.03118, 2025.

  13. [13]

    J. Jia, Z. Li, and Y. Shi. You only Gaussian once: Controllable 3D Gaussian splatting for ultra-densely sampled scenes. arXiv preprint arXiv:2511.11233, 2025.

  14. [14]

    L. Jiang, Y. Mao, L. Xu, T. Lu, K. Ren, Y. Jin, X. Xu, M. Yu, J. Pang, F. Zhao, D. Lin, and B. Dai. AnySplat: Feed-forward 3D Gaussian splatting from unconstrained views. TOG, 44(6):1–16, 2025.

  15. [15]

    G. Kang, S. Nam, S. Yang, X. Sun, S. Khamis, A. Mohamed, and E. Park. iLRM: An iterative large 3D reconstruction model. arXiv preprint arXiv:2507.23277, 2025.

  16. [16]

    G. Kang, S. Yang, S. Nam, Y. Lee, J. Kim, and E. Park. Multi-view pyramid transformer: Look coarser to see broader. arXiv preprint arXiv:2512.07806, 2025.

  17. [17]

    B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis. 3D Gaussian splatting for real-time radiance field rendering. TOG, 42(4):1–14, 2023.

  18. [18]

    J. Kim, J. Noh, D.-G. Lee, and A. Kim. Transplat: Surface embedding-guided 3D Gaussian splatting for transparent object manipulation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 3190–3196. IEEE, 2025.

  19. [19]

    A. Knapitsch, J. Park, Q.-Y. Zhou, and V. Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics (ToG), 36(4):1–13, 2017.

  20. [20]

    R. Li, B. Yi, J. Liu, H. Gao, Y. Ma, and A. Kanazawa. Cameras as relative positional encoding. arXiv preprint arXiv:2507.10496, 2025.

  21. [21]

    Z. Li, C. Dong, Y. Chen, Z. Huang, and P. Liu. Vicasplat: A single run is all you need for 3D Gaussian splatting and camera estimation from unposed video frames. arXiv preprint arXiv:2503.10286, 2025.

  22. [22]

    L. Ling, Y. Sheng, Z. Tu, W. Zhao, C. Xin, K. Wan, L. Yu, Q. Guo, Z. Yu, Y. Lu, et al. DL3DV-10K: A large-scale scene dataset for deep learning-based 3D vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024.

  23. [23]

    Z. Liu, Z. Li, Y. Shi, and X. Li. Attentiongs: Towards initialization-free 3D Gaussian splatting via structural attention. arXiv preprint arXiv:2506.23611, 2025.

  24. [24]

    W. Long, H. Wu, S. Jiang, J. Zhang, X. Ji, and S. Gu. Idesplat: Iterative depth probability estimation for generalizable 3D Gaussian splatting. arXiv preprint arXiv:2601.03824, 2026.

  25. [25]

    R. Ranftl, A. Bochkovskiy, and V. Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021.

  26. [26]

    J. L. Schonberger and J.-M. Frahm. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4104–4113, 2016.

  27. [27]

    J. L. Schönberger, E. Zheng, J.-M. Frahm, and M. Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision, pages 501–518. Springer, 2016.

  28. [28]

    L. Segre, O. Hirschorn, and S. Avidan. Multi-view foundation models, 2025. URL https://arxiv.org/abs/2512.15708.

  29. [29]

    D. Shi, W. Wang, D. Y. Chen, Z. Zhang, J. Bian, B. Zhuang, and C. Shen. Revisiting depth representations for feed-forward 3D Gaussian splatting. arXiv preprint arXiv:2506.05327, 2025.

  30. [30]

    O. Siméoni, H. V. Vo, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V. Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa, et al. DINOv3. arXiv preprint arXiv:2508.10104, 2025.

  31. [31]

    S. Szymanowicz, C. Rupprecht, and A. Vedaldi. Splatter image: Ultra-fast single-view 3D reconstruction. In CVPR, 2024.

  32. [32]

    M. Tamjidi, H. Dastmalchi, M. Alimoradijazi, et al. Adapt-as-you-walk through the clouds, 2025. URL https://arxiv.org/abs/2511.15311.

  33. [33]

    J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 5294–5306, 2025.

  34. [34]

    W. Wang, Y. Chen, Z. Zhang, H. Liu, H. Wang, Z. Feng, W. Qin, Z. Zhu, and D. Y. Chen. VolSplat: Rethinking feed-forward 3D Gaussian splatting with voxel-aligned prediction. arXiv preprint arXiv:2509.19297, 2025.

  35. [35]

    X. Wang, Y. Shi, and Z. Wu. Artifactworld: Scaling 3D Gaussian splatting artifact restoration via video generation models. arXiv preprint arXiv:2604.12251, 2026.

  36. [36]

    C. Xu, S. Yang, T. Galanti, et al. Image2Point: 3D point-cloud understanding with 2D image pretrained models, 2021. URL https://arxiv.org/abs/2106.04180.

  37. [37]

    H. Xu, S. Peng, F. Wang, H. Blum, D. Barath, A. Geiger, and M. Pollefeys. DepthSplat: Connecting Gaussian splatting and depth. In CVPR, 2025.

  38. [38]

    H. Xu, S. Zhang, P. Li, B. Ye, X. Chen, H.-a. Gao, J. Zheng, X. Song, Z. Peng, R. Miao, et al. Cruise: Cooperative reconstruction and editing in V2X scenarios using Gaussian splatting. In 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 12518–12525. IEEE, 2025.

  39. [39]

    J. Yan, Z. Wei, H. Yi, M. Wang, C. Ma, G. Huang, and X. Wen. TransMVSNet: Global context-aware multi-view stereo network with transformers. In CVPR, 2021.

  40. [40]

    B. Ye, B. Chen, H. Xu, D. Barath, and M. Pollefeys. YoNoSplat: You only need one model for feedforward 3D Gaussian splatting. In International Conference on Learning Representations (ICLR), 2026.

  41. [41]

    Y. Ye et al. Noposplat: Pose-free generalizable 3D Gaussian splatting. arXiv preprint arXiv:2404.05345, 2024.

  42. [42]

    Q. Zhao, H. Tan, Q. Wang, S. Bi, K. Zhang, K. Sunkavalli, S. Tulsiani, and H. Jiang. E-rayzer: Self-supervised 3D reconstruction as spatial visual pre-training. arXiv preprint arXiv:2512.10950, 2025.

  43. [43]

    T. Zhou, R. Tucker, J. Flynn, G. Fyffe, and N. Snavely. Stereo magnification: Learning view synthesis using multiplane images. arXiv preprint arXiv:1805.09817, 2018.

  44. [44]

    S. Zou, X. Fan, L. Li, Y. Wang, and Y. Wang. GPS-Gaussian: Generalizable pixel-wise 3D Gaussian splatting. In CVPR, 2024.