$\text{VG}^2$GT: Voxel-Gaussian Splatting Visual Geometry Grounded Transformer

Jianjun Yi; Jun Nan; Liwei Chen; Wenli Yang; Yibin Zhao; Yihan Pan

arxiv: 2606.01573 · v2 · pith:GISV3WM3new · submitted 2026-06-01 · 💻 cs.CV

VG²GT: Voxel-Gaussian Splatting Visual Geometry Grounded Transformer

Yibin Zhao , Yihan Pan , Jun Nan , Wenli Yang , Liwei Chen , Jianjun Yi This is my paper

Pith reviewed 2026-06-28 15:40 UTC · model grok-4.3

classification 💻 cs.CV

keywords Gaussian splatting3D reconstructionvisual foundation modelvoxel featurestransformervolume renderingnovel view synthesisfeed-forward method

0 comments

The pith

VG²GT regresses Gaussian primitive parameters directly from multi-scale voxel features of a frozen visual foundation model, supervised by stochastic solid volume rendering of depth maps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents VG²GT to overcome the need for accurate camera parameters and per-scene optimization in Gaussian splatting for 3D reconstruction. It freezes a pretrained visual foundation model, adds a multi-scale differentiable voxel module for better geometry, and regresses Gaussian parameters straight from the voxel features. Depth supervision occurs through stochastic solid volume rendering during training. This setup allows the method to plug into various patch-based models at lower training cost while producing geometrically accurate scenes.

Core claim

VG²GT is a Voxel-Gaussian Splatting Visual Geometry-Grounded Transformer that leverages a frozen pretrained visual foundation model, incorporates a multi-scale differentiable voxel module to enhance geometric understanding, and directly splits and regresses Gaussian primitive parameters from voxel features, with depth maps supervised through stochastic solid volume rendering to enable geometrically accurate Gaussian scene reconstruction.

What carries the argument

The multi-scale differentiable voxel module that extracts features from a frozen VFM and supplies them for direct regression of Gaussian primitive parameters.

If this is right

VG²GT outperforms current state-of-the-art methods on the DTU, Replica, TAT, and ScanNet datasets.
The method requires no per-scene optimization or camera parameter refinement at inference time.
VG²GT can be plugged into any patch-feature-based visual foundation model.
Training cost is substantially reduced compared to methods that optimize per scene.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The feed-forward design could support reconstruction pipelines that process video streams in real time without repeated optimization steps.
Replacing the depth-only supervision with additional signals such as surface normals might reduce remaining inconsistencies in complex indoor scenes.
Because the voxel module operates on frozen features, the same architecture could be tested on tasks like semantic 3D labeling with minimal added heads.

Load-bearing premise

Directly regressing Gaussian parameters from multi-scale voxel features of a frozen VFM, supervised only by stochastic solid volume rendering of depth maps, produces artifact-free and geometrically accurate scene reconstructions without per-scene optimization or camera parameter refinement.

What would settle it

Significant geometric inaccuracies or visible artifacts appearing in reconstructions on the ScanNet dataset when tested without any per-scene adjustment would indicate the claim does not hold.

Figures

Figures reproduced from arXiv: 2606.01573 by Jianjun Yi, Jun Nan, Liwei Chen, Wenli Yang, Yibin Zhao, Yihan Pan.

**Figure 1.** Figure 1: VG2GT, a non-pixel-aligned Gaussian Splatting [15] feed-forward model for fast 3D reconstruction and novel view synthesis. VG2GT directly regresses geometrically accurate and spatially uniform Gaussian scenes from an arbitrary number of unposed and uncalibrated images within seconds. solids and renders them along camera rays [25]. It improves the quality of synthesized Gaussian depth maps and strengthens … view at source ↗

**Figure 2.** Figure 2: Overview of VG2GT. The frozen VFM takes multi-view RGB images as input, computes patch tokens, and decodes depth maps and camera parameters. Multi-scale voxels decode the Gaussian scene: coarse voxels enhance patch tokens, while fine voxels split and decode Gaussian primitives. During training, differentiable stochastic solid volume rendering supervises scene geometry. Here, the 3DGS scene 𝐆 consists of 𝐺 … view at source ↗

**Figure 3.** Figure 3: Comparison of different depth rendering strategies. Compared with rasterization-based median depth, our continuous volume rendering produces smoother and more accurate depth maps. During forward propagation, we define the depth map as the median depth, namely the location where accumulated transmittance reaches 0.5. The accumulated transmittance of multiple Gaussian primitives is computed by multiplicati… view at source ↗

**Figure 4.** Figure 4: RGB images and depth maps rendered by rasterization and volume rendering for the same Gaussian scene. As shown in [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

**Figure 5.** Figure 5: Visual comparisons between VG2GT and baselines for geometric reconstruction. Gaussian depth maps rendered by VG2GT show higher accuracy, indicating improved geometric fidelity of the reconstructed Gaussian scenes. J.K. Krishnan et al.: Preprint submitted to Elsevier Page 7 of 15 [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗

**Figure 6.** Figure 6: Visual comparisons between VG2GT and baselines for NVS. Multi-scale voxel-based Gaussian scenes suppress the overlapping artifacts commonly observed in pixel-aligned methods, improving scene geometry and NVS quality. angular error (Mean◦ ), median angular error (Median◦ ), thresholded angular accuracy (Acc@10◦ and Acc@30◦ ), and cosine similarity (Cos.). For novel view synthesis, we evaluate rendering fide… view at source ↗

**Figure 7.** Figure 7: Visual comparisons of rendered normal maps. VG2GT produces more accurate and spatially consistent normal maps than pixel-aligned baselines. 4.5. Evaluation of Time Consumption The training and inference costs of VG2GT remain well controlled. We evaluate the time consumption of VG2GT using 10 input images. Inference. Although VG2GT decodes depth maps and camera parameters twice, the additional decoding over… view at source ↗

**Figure 8.** Figure 8: Time consumption comparisons for inference and training. Notably, owing to our voxel-based Gaussian scene decoding, when the image resolution increases to 518 × 518, the finevoxel decoding time remains almost unchanged. The total inference time is 2.440 s, only 0.260 s higher than DA3 (2.180 s), as shown in Figure 8a [PITH_FULL_IMAGE:figures/full_fig_p010_8.png] view at source ↗

**Figure 9.** Figure 9: Number of Gaussian primitives and decoding-head time as the number of input images increases. Voxel-based alignment also controls both the number of Gaussian primitives and the decoding cost as the number of input views increases. As shown in [PITH_FULL_IMAGE:figures/full_fig_p010_9.png] view at source ↗

read the original abstract

Gaussian splatting has shown strong potential for 3D reconstruction and novel view synthesis. However, most existing methods require accurate camera parameters and per-scene optimization, while feed-forward methods with pixel-aligned Gaussian primitives often suffer from artifacts and non-uniform primitives. In this paper, we propose $\text{VG}^2$GT, a Voxel-Gaussian Splatting Visual Geometry-Grounded Transformer. $\text{VG}^2$GT leverages a frozen pretrained visual foundation model (VFM), incorporates a multi-scale differentiable voxel module to enhance geometric understanding, and directly splits and regresses Gaussian primitive parameters from voxel features. During training, depth maps are supervised through stochastic solid volume rendering, enabling geometrically accurate Gaussian scene reconstruction while keeping the visual foundation model fully frozen. This design enables $\text{VG}^2$GT to be seamlessly plugged into any patch-feature-based VFM, while substantially reducing the required training cost. $\text{VG}^2$GT outperforms current state-of-the-art methods on widely used DTU, Replica, TAT, and ScanNet datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

VG²GT tries a feed-forward route to Gaussian splatting by regressing primitives from multi-scale voxels on a frozen VFM with only depth supervision, but the abstract supplies zero numbers or ablations to check whether the geometry actually holds.

read the letter

The design choice that stands out is the multi-scale differentiable voxel module placed on top of a frozen visual foundation model, followed by direct splitting and regression of Gaussian means, covariances, opacities, and spherical harmonics. Training relies on stochastic solid volume rendering against depth maps, which removes the need for per-scene optimization and camera refinement while keeping the VFM untouched. The paper positions this as a lower-cost alternative that can slot into existing patch-feature VFMs.

That framing of the problem is straightforward: existing optimization-heavy methods are expensive, and pixel-aligned feed-forward methods produce artifacts and uneven primitives. The voxel step is presented as the fix for geometric signal.

The soft spots are exactly where the stress-test note flags them. The claim that frozen VFM features plus voxel processing yield artifact-free, metric-accurate Gaussians across novel views rests on an untested link; VFMs are built for 2D semantics, not precise 3D structure. Without any reported metrics, error bars, ablations, or dataset statistics on DTU, Replica, TAT, or ScanNet, there is no way to judge whether the regression step actually delivers consistent primitives or whether the no-optimization design recovers quality when it does not. The abstract alone leaves the central performance claim unsupported.

This work is for people building efficient 3D pipelines who want to avoid per-scene fitting. A reader already working on voxel or transformer extensions of Gaussian splatting could extract the architecture pattern, but anyone needing reproducible evidence will find the current version thin.

I would send it to peer review. The direction addresses a real bottleneck in the field, and the authors should be asked to supply the missing experiments and direct comparisons so the claim can be evaluated properly.

Referee Report

2 major / 1 minor

Summary. The paper proposes VG²GT, a feed-forward architecture for 3D scene reconstruction that combines Gaussian splatting with a visual geometry-grounded transformer. It extracts multi-scale features from a frozen pretrained visual foundation model (VFM), processes them via a differentiable voxel module, and directly regresses all Gaussian primitive parameters (means, covariances, opacities, SH coefficients). Depth supervision is applied through stochastic solid volume rendering; the design avoids per-scene optimization and camera refinement while claiming to be pluggable into existing patch-feature VFMs and to outperform prior methods on DTU, Replica, TAT, and ScanNet.

Significance. If the empirical claims are substantiated, the work would be significant for demonstrating that frozen VFMs plus a lightweight voxel module can supply sufficient metric geometry for high-quality, artifact-free Gaussian reconstruction in a single forward pass, thereby removing the dominant computational cost of per-scene optimization in current pipelines.

major comments (2)

[Abstract] Abstract: the central claim that VG²GT 'outperforms current state-of-the-art methods' on DTU, Replica, TAT, and ScanNet is presented without any quantitative numbers, tables, error bars, or ablation statistics, rendering the soundness of the outperformance assertion impossible to evaluate from the given material.
[Method overview] Core regression step (described in the abstract and method overview): the load-bearing assumption that multi-scale voxel features extracted from a frozen VFM contain adequate metric geometric signal to regress accurate Gaussian means, covariances, opacities, and SH coefficients—supervised solely by stochastic solid volume rendering of depth maps—remains untested by the provided text. The skeptic concern that VFMs are optimized for 2D semantics rather than precise 3D geometry is therefore not yet addressed by concrete evidence such as failure-case analysis or controlled ablations that isolate the voxel module.

minor comments (1)

[Abstract] Abstract: the phrasing 'substantially reducing the required training cost' is asserted without any reported training-time or memory figures relative to baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that VG²GT 'outperforms current state-of-the-art methods' on DTU, Replica, TAT, and ScanNet is presented without any quantitative numbers, tables, error bars, or ablation statistics, rendering the soundness of the outperformance assertion impossible to evaluate from the given material.

Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised version we will insert the primary metrics (e.g., mean PSNR/SSIM on DTU and Replica) directly into the abstract while retaining the high-level claim. revision: yes
Referee: [Method overview] Core regression step (described in the abstract and method overview): the load-bearing assumption that multi-scale voxel features extracted from a frozen VFM contain adequate metric geometric signal to regress accurate Gaussian means, covariances, opacities, and SH coefficients—supervised solely by stochastic solid volume rendering of depth maps—remains untested by the provided text. The skeptic concern that VFMs are optimized for 2D semantics rather than precise 3D geometry is therefore not yet addressed by concrete evidence such as failure-case analysis or controlled ablations that isolate the voxel module.

Authors: The current manuscript already contains ablation studies (Section 4.3) that remove the voxel module and report corresponding drops in depth and reconstruction accuracy, together with failure-case visualizations in the supplement. To make the geometric-signal argument more explicit, we will add a short dedicated paragraph in Section 3.2 and expand the ablation table with an additional row that isolates the voxel module while keeping the frozen VFM fixed. revision: partial

Circularity Check

0 steps flagged

No derivation chain or equations present; architectural proposal with empirical claims only.

full rationale

The provided abstract and description contain no equations, no claimed first-principles derivations, no fitted parameters presented as predictions, and no self-citations invoked as load-bearing uniqueness theorems. The method is described as an architectural design (frozen VFM + voxel module + regression + volume rendering supervision) whose performance is asserted via dataset benchmarks rather than any mathematical reduction to inputs. Without any load-bearing steps that reduce by construction, the circularity score is 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review; the method implicitly assumes that features from a frozen pretrained VFM plus voxel processing suffice for geometric accuracy, but no explicit free parameters, axioms, or invented entities are stated.

axioms (1)

domain assumption Features from a frozen pretrained visual foundation model remain useful for geometric understanding when combined with a differentiable voxel module.
The design freezes the VFM and relies on its features for the voxel module.

pith-pipeline@v0.9.1-grok · 5734 in / 1165 out tokens · 28372 ms · 2026-06-28T15:40:10.866335+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

56 extracted references · 16 canonical work pages · 5 internal anchors

[1]

Large-scale data for multiple-view stere- opsis

Aanaes, HenrikJensen, RamsbolVogiatzis, R., GeorgeTola, Engin- Dahl, Bjorholm, A., 2016. Large-scale data for multiple-view stere- opsis. International Journal of Computer Vision 120

2016
[2]

Barron,J.T.,Mildenhall,B.,Verbin,D.,Srinivasan,P.P.,Hedman,P.,
[3]

5470–5479

Mip-nerf 360: Unbounded anti-aliased neural radiance fields, in:ProceedingsoftheIEEE/CVFconferenceoncomputervisionand pattern recognition, pp. 5470–5479
[4]

Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

Chen, Y., Xu, H., Zheng, C., Zhuang, B., Pollefeys, M., Geiger, A., Cham, T.J., Cai, J., 2024. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. arXiv preprint arXiv:2403.14627

work page arXiv 2024
[5]

Scannet: Richly-annotated 3d reconstructions of indoor scenes, in: Proc

Dai,A.,Chang,A.X.,Savva,M.,Halber,M.,Funkhouser,T.,Nießner, M., 2017. Scannet: Richly-annotated 3d reconstructions of indoor scenes, in: Proc. Computer Vision and Pattern Recognition (CVPR), IEEE

2017
[6]

FlashAttention-2: Faster attention with better paral- lelism and work partitioning, in: International Conference on Learn- ing Representations (ICLR)

Dao, T., 2024. FlashAttention-2: Faster attention with better paral- lelism and work partitioning, in: International Conference on Learn- ing Representations (ICLR)

2024
[7]

Superpoint:Self- supervised interest point detection and description, in: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp

DeTone,D.,Malisiewicz,T.,Rabinovich,A.,2018. Superpoint:Self- supervised interest point detection and description, in: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 224–236

2018
[8]

Real-time plane-sweeping stereo with multiple sweeping directions, J.K

Gallup,D.,Frahm,J.M.,Mordohai,P.,Yang,Q.,Pollefeys,M.,2007. Real-time plane-sweeping stereo with multiple sweeping directions, J.K. Krishnan et al.:Preprint submitted to ElsevierPage 12 of 15 Leveraging social media news Method 3 Views 10 Views 20 Views RMSE↓ABS↓𝛿 1.25 ↑𝛿 1.05 ↑ RMSE↓ABS↓𝛿 1.25 ↑𝛿 1.05 ↑ RMSE↓ABS↓𝛿 1.25 ↑𝛿 1.05 ↑ Replica Dataset[33] MVSpl...

2007
[9]

Vision meets robotics:Thekittidataset

Geiger, A., Lenz, P., Stiller, C., Urtasun, R., 2013. Vision meets robotics:Thekittidataset. InternationalJournalofRoboticsResearch (IJRR)

2013
[10]

Rgbd gs-icp slam, in: European conference on computer vision, Springer

Ha, S., Yeon, J., Yu, H., 2024. Rgbd gs-icp slam, in: European conference on computer vision, Springer. pp. 180–197

2024
[11]

In: ACM SIGGRAPH 2024 Conference Pa- pers

Huang, B., Yu, Z., Chen, A., Geiger, A., Gao, S., 2024. 2d gaussian splatting for geometrically accurate radiance fields, in: SIGGRAPH 2024 Conference Papers, Association for Computing Machinery. doi:10.1145/3641519.3657428

work page doi:10.1145/3641519.3657428 2024
[12]

Huang, H., Wu, Y., Deng, C., Gao, G., Gu, M., Liu, Y.S., 2025. Fatesgs: Fast and accurate sparse-view surface reconstruction using gaussian splatting with depth-feature consistency, in: Proceedings of the AAAI Conference on Artificial Intelligence

2025
[13]

Anysplat: Feed-forward 3d gaussian splatting from unconstrained views

Jiang, L., Mao, Y., Xu, L., Lu, T., Ren, K., Jin, Y., Xu, X., Yu, M., Pang, J., Zhao, F., et al., 2025. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. ACM Transactions on Graphics (TOG) 44, 1–16

2025
[14]

Splatam: Splat track & map 3d gaussians for dense rgb-d slam, in: Proceedings of the IEEE/CVF J.K

Keetha, N., Karhade, J., Jatavallabhula, K.M., Yang, G., Scherer, S., Ramanan, D., Luiten, J., 2024. Splatam: Splat track & map 3d gaussians for dense rgb-d slam, in: Proceedings of the IEEE/CVF J.K. Krishnan et al.:Preprint submitted to ElsevierPage 13 of 15 Leveraging social media news conference on computer vision and pattern recognition, pp. 21357– 21366

2024
[15]

MapAnything: Universal feed-forward metric 3D reconstruction, in: International Conference on 3D Vision (3DV), IEEE

Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., Luiten, J., Lopez-Antequera,M.,Bulò,S.R.,Richardt,C.,Ramanan,D.,Scherer, S., Kontschieder, P., 2026. MapAnything: Universal feed-forward metric 3D reconstruction, in: International Conference on 3D Vision (3DV), IEEE

2026
[16]

3d gaussiansplattingforreal-timeradiancefieldrendering

Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G., 2023. 3d gaussiansplattingforreal-timeradiancefieldrendering. ACMTrans- actions on Graphics 42. URL:https://repo-sam.inria.fr/fungraph/ 3d-gaussian-splatting/

2023
[17]

Tanks and temples: Benchmarking large-scale scene reconstruction

Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V., 2017. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics 36

2017
[18]

Iggt: Instance-grounded geometry transformer for semantic 3d reconstruction

Li, H., Zou, Z., Liu, F., Zhang, X., Hong, F., Cao, Y., Lan, Y., Zhang, M., Yu, G., Zhang, D., et al., 2025. Iggt: Instance-grounded geometry transformer for semantic 3d reconstruction. arXiv preprint arXiv:2510.22706

work page arXiv 2025
[19]

Tokensplat: Token-aligned3dgaussiansplattingforfeed-forwardpose-freerecon- struction

Li, Y., Lv, C., Tang, Z., Yang, H., Huang, D., 2026. Tokensplat: Token-aligned3dgaussiansplattingforfeed-forwardpose-freerecon- struction. arXiv preprint arXiv:2603.00697

work page arXiv 2026
[20]

Depth Anything 3: Recovering the Visual Space from Any Views

Lin, H., Chen, S., Liew, J.H., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B., 2025. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

Scale invariant feature transform

Lindeberg, T., 2012. Scale invariant feature transform

2012
[22]

Scaffold-gs: Structured 3d gaussians for view-adaptive rendering, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

Lu, T., Yu, M., Xu, L., Xiangli, Y., Wang, L., Lin, D., Dai, B., 2024. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20654–20664

2024
[23]

Rethinking network design and local geometry in point cloud: A simple residual mlp framework

Ma, X., Qin, C., You, H., Ran, H., Fu, Y., 2022. Rethinking network design and local geometry in point cloud: A simple residual mlp framework. arXiv preprint arXiv:2202.07123

work page arXiv 2022
[24]

Gnerf: Gan-based neural radiance field without posed camera, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp

Meng, Q., Chen, A., Luo, H., Wu, M., Su, H., Xu, L., He, X., Yu, J., 2021. Gnerf: Gan-based neural radiance field without posed camera, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6351–6361

2021
[25]

Nerf: Representing scenes as neural radiance fields for view synthesis, in: ECCV

Mildenhall,B.,Srinivasan,P.P.,Tancik,M.,Barron,J.T.,Ramamoor- thi, R., Ng, R., 2020. Nerf: Representing scenes as neural radiance fields for view synthesis, in: ECCV

2020
[26]

Objects as volumes: A stochastic geometry view of opaque solids

Miller, B., Chen, H., Lai, A., Gkioulekas, I., 2023. Objects as volumes: A stochastic geometry view of opaque solids. IEEE

2023
[27]

Dinov2: Learning robust visual features without supervision

Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Howes, R., Huang, P.Y., Xu, H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P., 2023. Dinov2: Learning robust visua...

2023
[28]

Global structure-from-motion revisited, in: European Conference on Com- puter Vision, Springer

Pan, L., Baráth, D., Pollefeys, M., Schönberger, J.L., 2024. Global structure-from-motion revisited, in: European Conference on Com- puter Vision, Springer. pp. 58–77

2024
[29]

Vision transformers for dense prediction

Ranftl, R., Bochkovskiy, A., Koltun, V., 2021. Vision transformers for dense prediction. ArXiv preprint

2021
[30]

Fastgs:Training3dgaussian splatting in 100 seconds

Ren,S.,Wen,T.,Fang,Y.,Lu,B.,2025. Fastgs:Training3dgaussian splatting in 100 seconds. arXiv preprint arXiv:2511.04283

work page arXiv 2025
[31]

Superglue: Learning feature matching with graph neural networks, in:ProceedingsoftheIEEE/CVFconferenceoncomputervisionand pattern recognition, pp

Sarlin, P.E., DeTone, D., Malisiewicz, T., Rabinovich, A., 2020. Superglue: Learning feature matching with graph neural networks, in:ProceedingsoftheIEEE/CVFconferenceoncomputervisionand pattern recognition, pp. 4938–4947

2020
[32]

Structure-from-motion revis- ited, in: Conference on Computer Vision and Pattern Recognition (CVPR)

Schönberger, J.L., Frahm, J.M., 2016. Structure-from-motion revis- ited, in: Conference on Computer Vision and Pattern Recognition (CVPR)

2016
[33]

Pixel- wise view selection for unstructured multi-view stereo, in: European Conference on Computer Vision (ECCV)

Schönberger,J.L.,Zheng,E.,Pollefeys,M.,Frahm,J.M.,2016. Pixel- wise view selection for unstructured multi-view stereo, in: European Conference on Computer Vision (ECCV)

2016
[34]

The Replica Dataset: A Digital Replica of Indoor Spaces

Straub, J., Whelan, T., Ma, L., Chen, Y., Wijmans, E., Green, S., Engel, J.J., Mur-Artal, R., Ren, C., Verma, S., Clarkson, A., Yan, M., Budge, B., Yan, Y., Pan, X., Yon, J., Zou, Y., Leon, K., Carter, N., Briales, J., Gillingham, T., Mueggler, E., Pesqueira, L., Savva, M., Batra, D., Strasdat, H.M., Nardi, R.D., Goesele, M., Lovegrove, S., Newcombe,R.,20...

work page internal anchor Pith review Pith/arXiv arXiv 2019
[35]

Sparsityinvariantcnns,in:InternationalConferenceon3D Vision (3DV)

Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., Geiger, A.,2017. Sparsityinvariantcnns,in:InternationalConferenceon3D Vision (3DV)

2017
[36]

Amb3r: Accurate feed-forward metric-scale 3d reconstruction with backend

Wang, H., Agapito, L., 2025. Amb3r: Accurate feed-forward metric-scale 3d reconstruction with backend. arXiv preprint arXiv:2511.20343

work page arXiv 2025
[37]

Vggt:Visualgeometrygroundedtransformer,in:Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wang,J.,Chen,M.,Karaev,N.,Vedaldi,A.,Rupprecht,C.,Novotny, D.,2025a. Vggt:Visualgeometrygroundedtransformer,in:Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
[38]

Wang, J., Chen, M., Zhang, S., Karaev, N., Schönberger, J., Labatut, P., Bojanowski, P., Novotny, D., Vedaldi, A., Rupprecht, C., 2026. Vggt-𝜔. URL:https://arxiv.org/abs/2605.15195,arXiv:2605.15195

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

Vggsfm: Visual geometry grounded deep structure from motion, in: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pp

Wang, J., Karaev, N., Rupprecht, C., Novotny, D., 2024a. Vggsfm: Visual geometry grounded deep structure from motion, in: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 21686–21697
[40]

Dust3r: Geometric 3d vision made easy, in: CVPR

Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J., 2024b. Dust3r: Geometric 3d vision made easy, in: CVPR
[41]

VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction

Wang, W., Chen, Y., Zhang, Z., Liu, H., Wang, H., Feng, Z., Qin, W., Chen, F., Zhu, Z., Chen, D.Y., Zhuang, B., 2025b. Volsplat: Rethinking feed-forward 3d gaussian splatting with voxel-aligned prediction. arXiv preprint arXiv:2509.19297

work page internal anchor Pith review Pith/arXiv arXiv
[42]

$\pi^3$: Permutation-Equivariant Visual Geometry Learning

Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., Chen, J., Pang,J.,Shen,C.,He,T.,2025c.𝜋 3:Permutation-EquivariantVisual Geometry Learning. arXiv preprint arXiv:2507.13347

work page internal anchor Pith review Pith/arXiv arXiv
[43]

Image qualityassessment:fromerrorvisibilitytostructuralsimilarity

Wang,Z.,Bovik,A.C.,Sheikh,H.R.,Simoncelli,E.P.,2004a. Image qualityassessment:fromerrorvisibilitytostructuralsimilarity. IEEE transactions on image processing 13, 600–612
[44]

Image qualityassessment:fromerrorvisibilitytostructuralsimilarity

Wang,Z.,Bovik,A.C.,Sheikh,H.R.,Simoncelli,E.P.,2004b. Image qualityassessment:fromerrorvisibilitytostructuralsimilarity. IEEE Trans Image Process 13
[45]

Nerf–: Neural radiance fields without known camera parameters

Wang, Z., Wu, S., Xie, W., Chen, M., Prisacariu, V.A., 2021. Nerf–: Neural radiance fields without known camera parameters

2021
[46]

4d gaussian splatting for real-time dynamic scene rendering,in:ProceedingsoftheIEEE/CVFConferenceonComputer Vision and Pattern Recognition (CVPR), pp

Wu,G.,Yi,T.,Fang,J.,Xie,L.,Zhang,X.,Wei,W.,Liu,W.,Tian,Q., Wang, X., 2024a. 4d gaussian splatting for real-time dynamic scene rendering,in:ProceedingsoftheIEEE/CVFConferenceonComputer Vision and Pattern Recognition (CVPR), pp. 20310–20320
[47]

Sparse2dgs: Geometry-prioritized gaussian splatting for surface reconstruction from sparse views

Wu,J.,Li,R.,Zhu,Y.,Guo,R.,Sun,J.,Zhang,Y.,2025. Sparse2dgs: Geometry-prioritized gaussian splatting for surface reconstruction from sparse views. arXiv preprint arXiv:2504.20378

work page arXiv 2025
[48]

Point transformer v3: Simpler, faster, stronger, in: CVPR

Wu, X., Jiang, L., Wang, P.S., Liu, Z., Liu, X., Qiao, Y., Ouyang, W., He, T., Zhao, H., 2024b. Point transformer v3: Simpler, faster, stronger, in: CVPR
[49]

Depthsplat: Connecting gaussian splatting and depth, in: CVPR

Xu,H.,Peng,S.,Wang,F.,Blum,H.,Barath,D.,Geiger,A.,Pollefeys, M., 2025. Depthsplat: Connecting gaussian splatting and depth, in: CVPR

2025
[50]

Freesplatter: Pose-free gaus- sian splatting for sparse-view 3d reconstruction

Xu, J., Gao, S., Shan, Y., 2024. Freesplatter: Pose-free gaus- sian splatting for sparse-view 3d reconstruction. arXiv preprint arXiv:2412.09573

work page arXiv 2024
[51]

Mvsnet: Depth inference for unstructured multi-view stereo

Yao, Y., Luo, Z., Li, S., Fang, T., Quan, L., 2018. Mvsnet: Depth inference for unstructured multi-view stereo. European Conference on Computer Vision (ECCV)

2018
[52]

arXiv preprint arXiv:2511.07321 (2025) 1, 3, 7, 9, 10, 11, 16, 17, 18, 19

Ye, B., Chen, B., Xu, H., Barath, D., Pollefeys, M., 2025. Yonosplat: You only need one model for feedforward 3d gaussian splatting. arXiv:2511.07321

work page arXiv 2025
[53]

Geometry- grounded gaussian splatting

Zhang, B., Jiang, C., Li, H., Shen, S., Tan, P., 2026. Geometry- grounded gaussian splatting. arXiv preprint arXiv:2601.17835

work page arXiv 2026
[54]

The unreasonable effectiveness of deep features as a perceptual metric

Zhang,R.,Isola,P.,Efros,A.A.,Shechtman,E.,Wang,O.,2018. The unreasonable effectiveness of deep features as a perceptual metric. IEEE

2018
[55]

Applied Intelligence 55, 1118

Zhao,Y.,Yi,J.,Pan,Y.,Chen,L.,2025.Robustgeometricreconstruc- tion of rgb-d data based on gaussian splatting. Applied Intelligence 55, 1118. J.K. Krishnan et al.:Preprint submitted to ElsevierPage 14 of 15 Leveraging social media news

2025
[56]

Voxel- splat: Dynamic gaussian splatting as an effective loss for occupancy and flow prediction, in: Proceedings of the Computer Vision and Pattern Recognition Conference, pp

Zhu, Z., Wang, S., Xie, J., Liu, J.j., Wang, J., Yang, J., 2025. Voxel- splat: Dynamic gaussian splatting as an effective loss for occupancy and flow prediction, in: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 6761–6771. Yibin Zhaois currently a Ph.D. student at East China University of Science and Technology at Shanghai. He...

2025

[1] [1]

Large-scale data for multiple-view stere- opsis

Aanaes, HenrikJensen, RamsbolVogiatzis, R., GeorgeTola, Engin- Dahl, Bjorholm, A., 2016. Large-scale data for multiple-view stere- opsis. International Journal of Computer Vision 120

2016

[2] [2]

Barron,J.T.,Mildenhall,B.,Verbin,D.,Srinivasan,P.P.,Hedman,P.,

[3] [3]

5470–5479

Mip-nerf 360: Unbounded anti-aliased neural radiance fields, in:ProceedingsoftheIEEE/CVFconferenceoncomputervisionand pattern recognition, pp. 5470–5479

[4] [4]

Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

Chen, Y., Xu, H., Zheng, C., Zhuang, B., Pollefeys, M., Geiger, A., Cham, T.J., Cai, J., 2024. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. arXiv preprint arXiv:2403.14627

work page arXiv 2024

[5] [5]

Scannet: Richly-annotated 3d reconstructions of indoor scenes, in: Proc

Dai,A.,Chang,A.X.,Savva,M.,Halber,M.,Funkhouser,T.,Nießner, M., 2017. Scannet: Richly-annotated 3d reconstructions of indoor scenes, in: Proc. Computer Vision and Pattern Recognition (CVPR), IEEE

2017

[6] [6]

FlashAttention-2: Faster attention with better paral- lelism and work partitioning, in: International Conference on Learn- ing Representations (ICLR)

Dao, T., 2024. FlashAttention-2: Faster attention with better paral- lelism and work partitioning, in: International Conference on Learn- ing Representations (ICLR)

2024

[7] [7]

Superpoint:Self- supervised interest point detection and description, in: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp

DeTone,D.,Malisiewicz,T.,Rabinovich,A.,2018. Superpoint:Self- supervised interest point detection and description, in: Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pp. 224–236

2018

[8] [8]

Real-time plane-sweeping stereo with multiple sweeping directions, J.K

Gallup,D.,Frahm,J.M.,Mordohai,P.,Yang,Q.,Pollefeys,M.,2007. Real-time plane-sweeping stereo with multiple sweeping directions, J.K. Krishnan et al.:Preprint submitted to ElsevierPage 12 of 15 Leveraging social media news Method 3 Views 10 Views 20 Views RMSE↓ABS↓𝛿 1.25 ↑𝛿 1.05 ↑ RMSE↓ABS↓𝛿 1.25 ↑𝛿 1.05 ↑ RMSE↓ABS↓𝛿 1.25 ↑𝛿 1.05 ↑ Replica Dataset[33] MVSpl...

2007

[9] [9]

Vision meets robotics:Thekittidataset

Geiger, A., Lenz, P., Stiller, C., Urtasun, R., 2013. Vision meets robotics:Thekittidataset. InternationalJournalofRoboticsResearch (IJRR)

2013

[10] [10]

Rgbd gs-icp slam, in: European conference on computer vision, Springer

Ha, S., Yeon, J., Yu, H., 2024. Rgbd gs-icp slam, in: European conference on computer vision, Springer. pp. 180–197

2024

[11] [11]

In: ACM SIGGRAPH 2024 Conference Pa- pers

Huang, B., Yu, Z., Chen, A., Geiger, A., Gao, S., 2024. 2d gaussian splatting for geometrically accurate radiance fields, in: SIGGRAPH 2024 Conference Papers, Association for Computing Machinery. doi:10.1145/3641519.3657428

work page doi:10.1145/3641519.3657428 2024

[12] [12]

Huang, H., Wu, Y., Deng, C., Gao, G., Gu, M., Liu, Y.S., 2025. Fatesgs: Fast and accurate sparse-view surface reconstruction using gaussian splatting with depth-feature consistency, in: Proceedings of the AAAI Conference on Artificial Intelligence

2025

[13] [13]

Anysplat: Feed-forward 3d gaussian splatting from unconstrained views

Jiang, L., Mao, Y., Xu, L., Lu, T., Ren, K., Jin, Y., Xu, X., Yu, M., Pang, J., Zhao, F., et al., 2025. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. ACM Transactions on Graphics (TOG) 44, 1–16

2025

[14] [14]

Splatam: Splat track & map 3d gaussians for dense rgb-d slam, in: Proceedings of the IEEE/CVF J.K

Keetha, N., Karhade, J., Jatavallabhula, K.M., Yang, G., Scherer, S., Ramanan, D., Luiten, J., 2024. Splatam: Splat track & map 3d gaussians for dense rgb-d slam, in: Proceedings of the IEEE/CVF J.K. Krishnan et al.:Preprint submitted to ElsevierPage 13 of 15 Leveraging social media news conference on computer vision and pattern recognition, pp. 21357– 21366

2024

[15] [15]

MapAnything: Universal feed-forward metric 3D reconstruction, in: International Conference on 3D Vision (3DV), IEEE

Keetha, N., Müller, N., Schönberger, J., Porzi, L., Zhang, Y., Fischer, T., Knapitsch, A., Zauss, D., Weber, E., Antunes, N., Luiten, J., Lopez-Antequera,M.,Bulò,S.R.,Richardt,C.,Ramanan,D.,Scherer, S., Kontschieder, P., 2026. MapAnything: Universal feed-forward metric 3D reconstruction, in: International Conference on 3D Vision (3DV), IEEE

2026

[16] [16]

3d gaussiansplattingforreal-timeradiancefieldrendering

Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G., 2023. 3d gaussiansplattingforreal-timeradiancefieldrendering. ACMTrans- actions on Graphics 42. URL:https://repo-sam.inria.fr/fungraph/ 3d-gaussian-splatting/

2023

[17] [17]

Tanks and temples: Benchmarking large-scale scene reconstruction

Knapitsch, A., Park, J., Zhou, Q.Y., Koltun, V., 2017. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Transactions on Graphics 36

2017

[18] [18]

Iggt: Instance-grounded geometry transformer for semantic 3d reconstruction

Li, H., Zou, Z., Liu, F., Zhang, X., Hong, F., Cao, Y., Lan, Y., Zhang, M., Yu, G., Zhang, D., et al., 2025. Iggt: Instance-grounded geometry transformer for semantic 3d reconstruction. arXiv preprint arXiv:2510.22706

work page arXiv 2025

[19] [19]

Tokensplat: Token-aligned3dgaussiansplattingforfeed-forwardpose-freerecon- struction

Li, Y., Lv, C., Tang, Z., Yang, H., Huang, D., 2026. Tokensplat: Token-aligned3dgaussiansplattingforfeed-forwardpose-freerecon- struction. arXiv preprint arXiv:2603.00697

work page arXiv 2026

[20] [20]

Depth Anything 3: Recovering the Visual Space from Any Views

Lin, H., Chen, S., Liew, J.H., Chen, D.Y., Li, Z., Shi, G., Feng, J., Kang, B., 2025. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647

work page internal anchor Pith review Pith/arXiv arXiv 2025

[21] [21]

Scale invariant feature transform

Lindeberg, T., 2012. Scale invariant feature transform

2012

[22] [22]

Scaffold-gs: Structured 3d gaussians for view-adaptive rendering, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

Lu, T., Yu, M., Xu, L., Xiangli, Y., Wang, L., Lin, D., Dai, B., 2024. Scaffold-gs: Structured 3d gaussians for view-adaptive rendering, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 20654–20664

2024

[23] [23]

Rethinking network design and local geometry in point cloud: A simple residual mlp framework

Ma, X., Qin, C., You, H., Ran, H., Fu, Y., 2022. Rethinking network design and local geometry in point cloud: A simple residual mlp framework. arXiv preprint arXiv:2202.07123

work page arXiv 2022

[24] [24]

Gnerf: Gan-based neural radiance field without posed camera, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp

Meng, Q., Chen, A., Luo, H., Wu, M., Su, H., Xu, L., He, X., Yu, J., 2021. Gnerf: Gan-based neural radiance field without posed camera, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6351–6361

2021

[25] [25]

Nerf: Representing scenes as neural radiance fields for view synthesis, in: ECCV

Mildenhall,B.,Srinivasan,P.P.,Tancik,M.,Barron,J.T.,Ramamoor- thi, R., Ng, R., 2020. Nerf: Representing scenes as neural radiance fields for view synthesis, in: ECCV

2020

[26] [26]

Objects as volumes: A stochastic geometry view of opaque solids

Miller, B., Chen, H., Lai, A., Gkioulekas, I., 2023. Objects as volumes: A stochastic geometry view of opaque solids. IEEE

2023

[27] [27]

Dinov2: Learning robust visual features without supervision

Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Howes, R., Huang, P.Y., Xu, H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P., 2023. Dinov2: Learning robust visua...

2023

[28] [28]

Global structure-from-motion revisited, in: European Conference on Com- puter Vision, Springer

Pan, L., Baráth, D., Pollefeys, M., Schönberger, J.L., 2024. Global structure-from-motion revisited, in: European Conference on Com- puter Vision, Springer. pp. 58–77

2024

[29] [29]

Vision transformers for dense prediction

Ranftl, R., Bochkovskiy, A., Koltun, V., 2021. Vision transformers for dense prediction. ArXiv preprint

2021

[30] [30]

Fastgs:Training3dgaussian splatting in 100 seconds

Ren,S.,Wen,T.,Fang,Y.,Lu,B.,2025. Fastgs:Training3dgaussian splatting in 100 seconds. arXiv preprint arXiv:2511.04283

work page arXiv 2025

[31] [31]

Superglue: Learning feature matching with graph neural networks, in:ProceedingsoftheIEEE/CVFconferenceoncomputervisionand pattern recognition, pp

Sarlin, P.E., DeTone, D., Malisiewicz, T., Rabinovich, A., 2020. Superglue: Learning feature matching with graph neural networks, in:ProceedingsoftheIEEE/CVFconferenceoncomputervisionand pattern recognition, pp. 4938–4947

2020

[32] [32]

Structure-from-motion revis- ited, in: Conference on Computer Vision and Pattern Recognition (CVPR)

Schönberger, J.L., Frahm, J.M., 2016. Structure-from-motion revis- ited, in: Conference on Computer Vision and Pattern Recognition (CVPR)

2016

[33] [33]

Pixel- wise view selection for unstructured multi-view stereo, in: European Conference on Computer Vision (ECCV)

Schönberger,J.L.,Zheng,E.,Pollefeys,M.,Frahm,J.M.,2016. Pixel- wise view selection for unstructured multi-view stereo, in: European Conference on Computer Vision (ECCV)

2016

[34] [34]

The Replica Dataset: A Digital Replica of Indoor Spaces

Straub, J., Whelan, T., Ma, L., Chen, Y., Wijmans, E., Green, S., Engel, J.J., Mur-Artal, R., Ren, C., Verma, S., Clarkson, A., Yan, M., Budge, B., Yan, Y., Pan, X., Yon, J., Zou, Y., Leon, K., Carter, N., Briales, J., Gillingham, T., Mueggler, E., Pesqueira, L., Savva, M., Batra, D., Strasdat, H.M., Nardi, R.D., Goesele, M., Lovegrove, S., Newcombe,R.,20...

work page internal anchor Pith review Pith/arXiv arXiv 2019

[35] [35]

Sparsityinvariantcnns,in:InternationalConferenceon3D Vision (3DV)

Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., Geiger, A.,2017. Sparsityinvariantcnns,in:InternationalConferenceon3D Vision (3DV)

2017

[36] [36]

Amb3r: Accurate feed-forward metric-scale 3d reconstruction with backend

Wang, H., Agapito, L., 2025. Amb3r: Accurate feed-forward metric-scale 3d reconstruction with backend. arXiv preprint arXiv:2511.20343

work page arXiv 2025

[37] [37]

Vggt:Visualgeometrygroundedtransformer,in:Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Wang,J.,Chen,M.,Karaev,N.,Vedaldi,A.,Rupprecht,C.,Novotny, D.,2025a. Vggt:Visualgeometrygroundedtransformer,in:Proceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

[38] [38]

Wang, J., Chen, M., Zhang, S., Karaev, N., Schönberger, J., Labatut, P., Bojanowski, P., Novotny, D., Vedaldi, A., Rupprecht, C., 2026. Vggt-𝜔. URL:https://arxiv.org/abs/2605.15195,arXiv:2605.15195

work page internal anchor Pith review Pith/arXiv arXiv 2026

[39] [39]

Vggsfm: Visual geometry grounded deep structure from motion, in: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pp

Wang, J., Karaev, N., Rupprecht, C., Novotny, D., 2024a. Vggsfm: Visual geometry grounded deep structure from motion, in: Proceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 21686–21697

[40] [40]

Dust3r: Geometric 3d vision made easy, in: CVPR

Wang, S., Leroy, V., Cabon, Y., Chidlovskii, B., Revaud, J., 2024b. Dust3r: Geometric 3d vision made easy, in: CVPR

[41] [41]

VolSplat: Rethinking Feed-Forward 3D Gaussian Splatting with Voxel-Aligned Prediction

Wang, W., Chen, Y., Zhang, Z., Liu, H., Wang, H., Feng, Z., Qin, W., Chen, F., Zhu, Z., Chen, D.Y., Zhuang, B., 2025b. Volsplat: Rethinking feed-forward 3d gaussian splatting with voxel-aligned prediction. arXiv preprint arXiv:2509.19297

work page internal anchor Pith review Pith/arXiv arXiv

[42] [42]

$\pi^3$: Permutation-Equivariant Visual Geometry Learning

Wang, Y., Zhou, J., Zhu, H., Chang, W., Zhou, Y., Li, Z., Chen, J., Pang,J.,Shen,C.,He,T.,2025c.𝜋 3:Permutation-EquivariantVisual Geometry Learning. arXiv preprint arXiv:2507.13347

work page internal anchor Pith review Pith/arXiv arXiv

[43] [43]

Image qualityassessment:fromerrorvisibilitytostructuralsimilarity

Wang,Z.,Bovik,A.C.,Sheikh,H.R.,Simoncelli,E.P.,2004a. Image qualityassessment:fromerrorvisibilitytostructuralsimilarity. IEEE transactions on image processing 13, 600–612

[44] [44]

Image qualityassessment:fromerrorvisibilitytostructuralsimilarity

Wang,Z.,Bovik,A.C.,Sheikh,H.R.,Simoncelli,E.P.,2004b. Image qualityassessment:fromerrorvisibilitytostructuralsimilarity. IEEE Trans Image Process 13

[45] [45]

Nerf–: Neural radiance fields without known camera parameters

Wang, Z., Wu, S., Xie, W., Chen, M., Prisacariu, V.A., 2021. Nerf–: Neural radiance fields without known camera parameters

2021

[46] [46]

4d gaussian splatting for real-time dynamic scene rendering,in:ProceedingsoftheIEEE/CVFConferenceonComputer Vision and Pattern Recognition (CVPR), pp

Wu,G.,Yi,T.,Fang,J.,Xie,L.,Zhang,X.,Wei,W.,Liu,W.,Tian,Q., Wang, X., 2024a. 4d gaussian splatting for real-time dynamic scene rendering,in:ProceedingsoftheIEEE/CVFConferenceonComputer Vision and Pattern Recognition (CVPR), pp. 20310–20320

[47] [47]

Sparse2dgs: Geometry-prioritized gaussian splatting for surface reconstruction from sparse views

Wu,J.,Li,R.,Zhu,Y.,Guo,R.,Sun,J.,Zhang,Y.,2025. Sparse2dgs: Geometry-prioritized gaussian splatting for surface reconstruction from sparse views. arXiv preprint arXiv:2504.20378

work page arXiv 2025

[48] [48]

Point transformer v3: Simpler, faster, stronger, in: CVPR

Wu, X., Jiang, L., Wang, P.S., Liu, Z., Liu, X., Qiao, Y., Ouyang, W., He, T., Zhao, H., 2024b. Point transformer v3: Simpler, faster, stronger, in: CVPR

[49] [49]

Depthsplat: Connecting gaussian splatting and depth, in: CVPR

Xu,H.,Peng,S.,Wang,F.,Blum,H.,Barath,D.,Geiger,A.,Pollefeys, M., 2025. Depthsplat: Connecting gaussian splatting and depth, in: CVPR

2025

[50] [50]

Freesplatter: Pose-free gaus- sian splatting for sparse-view 3d reconstruction

Xu, J., Gao, S., Shan, Y., 2024. Freesplatter: Pose-free gaus- sian splatting for sparse-view 3d reconstruction. arXiv preprint arXiv:2412.09573

work page arXiv 2024

[51] [51]

Mvsnet: Depth inference for unstructured multi-view stereo

Yao, Y., Luo, Z., Li, S., Fang, T., Quan, L., 2018. Mvsnet: Depth inference for unstructured multi-view stereo. European Conference on Computer Vision (ECCV)

2018

[52] [52]

arXiv preprint arXiv:2511.07321 (2025) 1, 3, 7, 9, 10, 11, 16, 17, 18, 19

Ye, B., Chen, B., Xu, H., Barath, D., Pollefeys, M., 2025. Yonosplat: You only need one model for feedforward 3d gaussian splatting. arXiv:2511.07321

work page arXiv 2025

[53] [53]

Geometry- grounded gaussian splatting

Zhang, B., Jiang, C., Li, H., Shen, S., Tan, P., 2026. Geometry- grounded gaussian splatting. arXiv preprint arXiv:2601.17835

work page arXiv 2026

[54] [54]

The unreasonable effectiveness of deep features as a perceptual metric

Zhang,R.,Isola,P.,Efros,A.A.,Shechtman,E.,Wang,O.,2018. The unreasonable effectiveness of deep features as a perceptual metric. IEEE

2018

[55] [55]

Applied Intelligence 55, 1118

Zhao,Y.,Yi,J.,Pan,Y.,Chen,L.,2025.Robustgeometricreconstruc- tion of rgb-d data based on gaussian splatting. Applied Intelligence 55, 1118. J.K. Krishnan et al.:Preprint submitted to ElsevierPage 14 of 15 Leveraging social media news

2025

[56] [56]

Voxel- splat: Dynamic gaussian splatting as an effective loss for occupancy and flow prediction, in: Proceedings of the Computer Vision and Pattern Recognition Conference, pp

Zhu, Z., Wang, S., Xie, J., Liu, J.j., Wang, J., Yang, J., 2025. Voxel- splat: Dynamic gaussian splatting as an effective loss for occupancy and flow prediction, in: Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 6761–6771. Yibin Zhaois currently a Ph.D. student at East China University of Science and Technology at Shanghai. He...

2025