Wild3R: Feed-Forward 3D Gaussian Splatting from Unconstrained Sparse Photo Collection

Kaede Shiohara; Takashi Otonari; Toshihiko Yamasaki; Yuto Furutani

arxiv: 2606.11894 · v2 · pith:PZVEFC7Enew · submitted 2026-06-10 · 💻 cs.CV

Pith reviewed 2026-06-27 10:11 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D Gaussian Splatting · feed-forward reconstruction · sparse photo collections · transient removal · appearance consistency · WildCity dataset · unconstrained 3D reconstruction

The pith

Wild3R produces 3D Gaussian splats from unconstrained sparse photos without per-scene optimization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Wild3R, a feed-forward model for 3D Gaussian Splatting that processes real-world photo collections containing varying illumination and transient objects. Standard 3DGS requires slow per-scene optimization, and prior feed-forward methods break down under these conditions. The authors address the data shortage by introducing the WildCity dataset, which spans 200 scenes and 170 lighting conditions, includes transient objects, and totals 337,500 images. Training on this data lets the model enforce appearance consistency across viewpoints, conditioned on reference views, while suppressing transients. The reported results are better than other feed-forward baselines and competitive with per-scene optimization methods.

Core claim

Wild3R is a feed-forward network for 3D Gaussian Splatting that, when trained on the WildCity dataset, learns to generate consistent scene representations from sparse unconstrained photos by conditioning on reference views and removing transient content.

What carries the argument

The WildCity dataset of 200 scenes with 170 lighting conditions and transients, used to train a feed-forward model that enforces appearance consistency across viewpoints while removing transient objects.
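
As a quick sanity check on the stated dataset size, the headline numbers can be divided out. This is plain arithmetic on the figures quoted above; it assumes nothing about how images are distributed across scenes, which the text here does not specify.

```python
# Headline WildCity numbers quoted in the abstract.
scenes = 200
lighting_conditions = 170
total_images = 337_500

# Average images per scene; a non-integer average means the images
# cannot be spread evenly across all scenes.
per_scene = total_images / scenes
print(per_scene)                            # 1687.5

# Remainders when dividing the total by each factor.
print(total_images % scenes)                # 100
print(total_images % lighting_conditions)   # 50

# Size of a hypothetical full scenes x lighting grid, for comparison.
print(scenes * lighting_conditions)         # 34000
```

Because 337,500 is divisible by neither 200 nor 170, the total is not a simple uniform product of scenes, lighting conditions, and views. How images are allocated per scene is one of the protocol details the referee report below asks for.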

If this is right

The method outperforms existing feed-forward 3DGS approaches on real-world sparse collections.
Results are competitive with traditional per-scene optimization methods.
The approach removes the need for time-consuming optimization when reconstructing from casual photos.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Casual smartphone photos could become sufficient input for usable 3D models without expert capture or long compute.
The same training strategy of conditioning on references while suppressing transients may transfer to other 3D representations.
Larger or more diverse scene collections could further reduce remaining gaps with per-scene methods.

Load-bearing premise

The WildCity dataset supplies enough variety in viewpoints, illuminations, and transients at sufficient scale to train a model that generalizes to other real-world photo collections.

What would settle it

A controlled test on a new collection of photos with lighting or transient patterns absent from WildCity where the model produces view-inconsistent appearances or retains moving objects.

Figures

Figures reproduced from arXiv: 2606.11894 by Kaede Shiohara, Takashi Otonari, Toshihiko Yamasaki, Yuto Furutani.

**Figure 2.** PSNR vs. Reconstruction Time. The horizontal axis represents the reconstruction time in log scale to generate 3D Gaussians from input images. The vertical axis indicates the corresponding PSNR of each model on the Photo Tourism dataset [31] using 16 context views. Star markers denote methods that operate without ground truth camera parameters, while circle markers represent those that require them. 3D Gau…

**Figure 3.** WildCity Dataset Creation Pipeline. (i) First, we collect 3D assets from SceneCity, a Blender add-on, and from Sketchfab (Sec. 3.1). (ii) Then, we generate scenes using these assets (Sec. 3.2). (iii) Next, we render the scenes from multiple viewpoints and multiple lighting conditions using the HDRI maps (Sec. 3.3). (iv) Finally, we add transient objects to the rendered images using Gemini (Sec. 3.4).

**Figure 4.** Scene Examples from WildCity Dataset. WildCity dataset contains images of various 3D assets captured under different viewpoints and illuminations. The added transient objects are highlighted with red dashed boxes. We also provide corresponding depth maps and masks indicating the sky regions.

**Figure 5.** Learning Appearance Consistency and Transient-free Geometry. During training, Wild3R takes multi-view images with random view-dependent illuminations and transient objects, and is constrained to predict transient-free 3D Gaussian primitives, depth maps, and camera poses. The illumination of the reconstructed scene is conditioned on the reference view.

**Figure 6.** Qualitative Comparison with 16 Context Views on the Photo Tourism Dataset.

**Figure 7.** Qualitative Ablation Study with 16 Context Views. (a) Our full model enables transient object removal while the variant fails. (b) Our full model is more consistent on fine-grained geometries than the model trained without Sketchfab assets.

**Figure 8.** Qualitative Validation of Adding Transient Objects. We show randomly selected examples from the WildCity dataset before and after adding transient objects via a text-driven image editing model [35], alongside their corresponding pixel-wise difference heatmaps.

**Figure 9.** Additional Qualitative Comparison on the Photo Tourism Dataset.

**Figure 10.** Qualitative Comparison on the NeRF-OSR Dataset.

…since its depth maps are not provided, we generated pseudo ground truth depth maps from the ground truth camera parameters and images using COLMAP [29]. These datasets inherently limit model generalization: first, both lack transient objects; second, LightCity provides insufficient scene diversity; finally, DTU is object-centric, lacking scene…

| Training Dataset | 4 views PSNR↑ | 4 views SSIM↑ | 4 views LPIPS↓ | 16 views PSNR↑ | 16 views SSIM↑ | 16 views LPIPS↓ | 64 views PSNR↑ | 64 views SSIM↑ | 64 views LPIPS↓ |
|---|---|---|---|---|---|---|---|---|---|
| None | 11.25 | 0.320 | 0.593 | 13.72 | 0.377 | 0.546 | 14.88 | 0.417 | 0.512 |
| DTU [11] | 11.29 | 0.299 | 0.629 | 12.79 | 0.352 | 0.599 | 13.51 | 0.391 | 0.585 |
| LightCity [38] | 9.69 | 0.288 | 0.667 | 11.17 | 0.309 | 0.676 | 12.33 | 0.340 | 0.690 |
| WildCity (Ours) | 13.04 | 0.370 | 0.556 | 15.87 | 0.435 | … | … | … | … |
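
The PSNR values reported for Figure 2 and in the training-dataset comparison follow the standard definition, PSNR = 10 · log10(MAX² / MSE). The block below is a minimal sketch of that formula, assuming images are flat lists of floats in [0, 1]. The function name and the toy data are illustrative and are not taken from the paper's code.

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel lists."""
    if len(pred) != len(target):
        raise ValueError("images must have the same number of pixels")
    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

# Toy 4-pixel example: two pixels are off by 0.1, so MSE = 0.005.
target = [0.0, 0.5, 1.0, 0.5]
pred = [0.1, 0.5, 0.9, 0.5]
print(round(psnr(pred, target), 2))  # 23.01
```

Higher is better. Since PSNR is logarithmic in the error, a gap of about 2 dB, as between some rows of the comparison, corresponds to roughly a 1.6x difference in mean squared error.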

Original abstract

Feed-forward 3D Gaussian Splatting (3DGS) removes the need for time-consuming per-scene optimization required by traditional 3DGS. However, existing feed-forward approaches struggle with real-world photo collections that include diverse lighting conditions and transient objects. In this paper, we present Wild3R, a feed-forward approach for unconstrained sparse photo collections. The main bottleneck is the lack of training data that provides multiple viewpoints, a variety of illuminations, and transient variations necessary for learning robust scene representations. To address this, we introduce the WildCity dataset, which comprises 200 scenes, 170 lighting conditions, and transient objects, resulting in 337,500 images in total. By leveraging the dataset, our model learns appearance consistency across viewpoints conditioned on reference views, while removing transient content. Extensive experiments demonstrate that our method outperforms existing feed-forward approaches and achieves results competitive with prior per-scene optimization-based methods.
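
One way to picture the training setup described in the abstract and in the Figure 5 caption: each input view gets its own randomly chosen lighting and may contain transients, while the supervision target is the transient-free scene under the lighting of a designated reference view. The sketch below only builds such a sample specification. The function name, field names, and transient probability are hypothetical and are not taken from the paper.

```python
import random

def make_training_sample(num_views, num_lightings, rng, transient_prob=0.5):
    """Build an illustrative (hypothetical) training sample specification.

    Inputs: every view draws an independent lighting index and a transient flag.
    Targets: all views share the reference view's lighting and have no transients.
    """
    inputs = [
        {
            "view": v,
            "lighting": rng.randrange(num_lightings),
            "has_transients": rng.random() < transient_prob,
        }
        for v in range(num_views)
    ]
    # Pick one input view whose lighting defines the target appearance.
    reference = rng.randrange(num_views)
    ref_lighting = inputs[reference]["lighting"]
    targets = [
        {"view": v, "lighting": ref_lighting, "has_transients": False}
        for v in range(num_views)
    ]
    return {"inputs": inputs, "reference_view": reference, "targets": targets}

sample = make_training_sample(num_views=4, num_lightings=170, rng=random.Random(0))
print(sample["reference_view"], sample["targets"][0])
```

The point of the structure is that inputs and targets disagree on purpose: the model sees mixed lighting and transients but is scored against a single, clean appearance.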

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's core move is introducing the WildCity dataset to train a feed-forward 3DGS model that handles lighting shifts and transients in casual photo sets, but the claims hinge on whether that dataset actually supplies independent variation in viewpoint, lighting, and transients for the supervision to work.

The main thing to know is that Wild3R pairs a new large-scale dataset with a feed-forward architecture meant to replace per-scene optimization for 3D Gaussian Splatting on unconstrained images. The dataset covers 200 scenes and 170 lighting conditions, plus transient objects, for a total of 337,500 images, and the model is conditioned on reference views to enforce appearance consistency while dropping transient content.

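What stands out as new is the explicit focus on building training data that mixes viewpoint, illumination, and transient changes at this scale; prior feed-forward 3DGS work has mostly trained on cleaner data without such variation. Note that, going by the Figure 3 caption, WildCity is itself synthetic: scenes are assembled from SceneCity and Sketchfab assets, rendered under HDRI lighting, and given transient objects through image editing with Gemini. The paper does a reasonable job naming the practical bottleneck (existing methods fail on real photo collections) and positions the dataset as the fix that lets the model learn the desired invariances.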

The soft spot is the load-bearing assumption that the dataset's scenes give cleanly separable factors for supervision. If lighting changes are entangled with transients or viewpoints in the data generation pipeline, or if there is no reliable signal for what counts as transient versus permanent, the training objective cannot isolate the effects the abstract claims. The abstract states outperformance over other feed-forward baselines and competitiveness with optimization methods, but without the actual tables, ablations, or training details it is impossible to judge whether those numbers reflect genuine generalization or dataset-specific fitting. The stress-test note about independent variation is worth checking directly in the methods section.
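
The entanglement concern can be made concrete as a coverage check over image metadata: if every (scene, lighting) pair appears both with and without transients, transient content is not confounded with lighting. The block below is a minimal sketch, assuming hypothetical metadata records with `scene`, `lighting`, and `transient` fields. No such schema is given in the paper text shown here.

```python
from collections import defaultdict

def confounded_cells(records):
    """Return (scene, lighting) pairs that lack either the clean or the
    transient variant, i.e. cells where the two factors are entangled."""
    seen = defaultdict(set)
    for r in records:
        seen[(r["scene"], r["lighting"])].add(bool(r["transient"]))
    return sorted(cell for cell, flags in seen.items() if flags != {True, False})

# Hypothetical metadata: the last record has no clean counterpart.
records = [
    {"scene": 0, "lighting": 3, "transient": False},
    {"scene": 0, "lighting": 3, "transient": True},
    {"scene": 0, "lighting": 7, "transient": True},
]
print(confounded_cells(records))  # [(0, 7)]
```

An empty result would indicate full pairing at the level of these two factors. It says nothing about whether viewpoints are also balanced, which would need the same check with a view index added to the key.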

This is aimed at people working on practical 3D reconstruction pipelines for AR, robotics, or casual capture. A reader already following feed-forward 3DGS papers would get concrete value from the dataset release and the conditioning approach if the experiments are reproducible. It is worth sending to peer review because the problem is real, the dataset is a tangible addition, and the architecture is a straightforward extension; any referee would focus on verifying the data collection protocol and the quantitative support for the transient-removal claim.

Referee Report

1 major / 1 minor

Summary. The paper presents Wild3R, a feed-forward method for 3D Gaussian Splatting from unconstrained sparse photo collections containing diverse lighting and transient objects. It introduces the WildCity dataset (200 scenes, 170 lighting conditions, transients, 337,500 images) to enable learning of appearance consistency across viewpoints, conditioned on reference views, and transient removal. It claims outperformance over existing feed-forward approaches and competitiveness with per-scene optimization methods.

Significance. If the central claims hold with rigorous validation, the work would advance practical feed-forward 3D reconstruction by addressing real-world variations without per-scene optimization, potentially broadening applicability of 3DGS to casual photo collections.

major comments (1)

[Abstract] The claim that WildCity supplies independent multi-view, lighting, and transient variations sufficient for isolating appearance consistency and transient removal in a feed-forward objective is load-bearing for the outperformance and generalization assertions. Yet the provided text gives no details on the data collection protocol, the independence of factors, or the supervision signals for transient removal; without these, the training cannot be shown to disentangle the factors as required.

minor comments (1)

The abstract references 'extensive experiments' demonstrating outperformance but provides no metrics, tables, baselines, or ablation details to allow assessment of whether reported gains are supported by data or affected by post-hoc choices.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and for identifying the need for greater clarity in the abstract regarding the WildCity dataset. We address the comment point by point below.

Referee: [Abstract] The claim that WildCity supplies independent multi-view, lighting, and transient variations sufficient for isolating appearance consistency and transient removal in a feed-forward objective is load-bearing for the outperformance and generalization assertions. Yet the provided text gives no details on the data collection protocol, the independence of factors, or the supervision signals for transient removal; without these, the training cannot be shown to disentangle the factors as required.

Authors: We agree that the abstract is concise and does not itself supply the requested protocol details. The full manuscript (Section 3) describes the WildCity construction: scenes are assembled from 3D assets (SceneCity and Sketchfab), rendered from multiple viewpoints under multiple lighting conditions using HDRI maps, and then edited with a text-driven image editing model (Gemini) to add transient objects. Because the same scene geometry is rendered across viewpoints and lighting conditions, the three factors can be varied separately. Transient supervision comes from paired images of identical viewpoint and lighting that differ only in the presence of transients, with the pre-edit render serving as the transient-free counterpart. We will revise the abstract to add one sentence summarizing this design, thereby making the load-bearing claim more self-contained while remaining within length constraints. revision: yes
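
Paired images that share viewpoint and lighting and differ only in transient content lend themselves to a per-pixel difference map, similar in spirit to the difference heatmaps mentioned in the Figure 8 caption. The sketch below thresholds absolute differences on tiny grayscale grids. The threshold value and function name are illustrative assumptions, not the authors' procedure.

```python
def transient_mask(clean, edited, threshold=0.1):
    """Mark pixels whose absolute difference exceeds the threshold.

    Both images are lists of rows with values in [0, 1].
    """
    if len(clean) != len(edited) or any(
        len(a) != len(b) for a, b in zip(clean, edited)
    ):
        raise ValueError("images must have identical shape")
    return [
        [abs(c - e) > threshold for c, e in zip(row_c, row_e)]
        for row_c, row_e in zip(clean, edited)
    ]

# Toy 2x3 pair: the middle column changes strongly, one pixel changes slightly.
clean = [[0.2, 0.2, 0.2], [0.2, 0.2, 0.2]]
edited = [[0.2, 0.9, 0.2], [0.2, 0.8, 0.25]]
mask = transient_mask(clean, edited)
print(mask)  # [[False, True, False], [False, True, False]]
```

The small change in the last pixel stays below the threshold, which illustrates why an editing model that also shifts global brightness would blur the boundary between transient and permanent content.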

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external dataset and standard training

The paper introduces the WildCity dataset as newly constructed data (200 scenes, 170 lighting conditions, transients) to train the feed-forward model for appearance consistency and transient removal. Performance claims rest on experiments comparing against other methods, with no quoted equations or steps showing a result defined in terms of itself, a fitted parameter renamed as prediction, or a load-bearing self-citation chain. The central premise (learning from multi-view, illumination, and transient variations) is supported by a dataset built independently of the model rather than reducing to internal definitions or prior self-work by construction. This matches the default case of a self-contained paper with no circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations or training details; therefore no free parameters, axioms, or invented entities can be identified.

Reference graph

Works this paper leans on

54 extracted references · 5 canonical work pages · 4 internal anchors

[1] M. L. Antequera, P. Gargallo, M. Hofinger, S. R. Bulo, Y. Kuang, and P. Kontschieder. Mapillary planet-scale depth dataset. In ECCV, 2020
[2] Y. Cabon, N. Murray, and M. Humenberger. Virtual kitti 2. arXiv preprint arXiv:2001.10773, 2020
[3] X. Chen, Q. Zhang, X. Li, Y. Chen, Y. Feng, X. Wang, and J. Wang. Hallucinated neural radiance fields in the wild. In CVPR, 2022
[4] A. Couturier. SceneCity. https://www.cgchan.com/
[5] H. Dahmani, M. Bennehar, N. Piasco, L. Roldao, and D. Tsishkou. Swag: Splatting in the wild images with appearance-conditioned gaussians. In ECCV, 2024
[6] A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In CVPR, 2017
[7] M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. Objaverse: A universe of annotated 3d objects. In CVPR, 2023
[8] S. Fridovich-Keil, G. Meanti, F. R. Warburg, B. Recht, and A. Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. In CVPR, 2023
[9] K. Greff, F. Belletti, L. Beyer, C. Doersch, Y. Du, D. Duckworth, D. J. Fleet, D. Gnanapragasam, F. Golemo, C. Herrmann, T. Kipf, A. Kundu, D. Lagun, I. Laradji, H.-T. (Derek) Liu, H. Meyer, Y. Miao, D. Nowrouzezahrai, C. Oztireli, E. Pot, N. Radwan, D. Rebain, S. Sabour, M. S. M. Sajjadi, M. Sela, V. Sitzmann, A. Stone, D. Sun, S. Vora, Z. Wang, T. Wu... 2022
[10] P.-H. Huang, K. Matzen, J. Kopf, N. Ahuja, and J.-B. Huang. Deepmvs: Learning multi-view stereopsis. In CVPR, 2018
[11] R. Jensen, A. Dahl, G. Vogiatzis, E. Tola, and H. Aanæs. Large scale multi-view stereopsis evaluation. In CVPR, 2014
[12] L. Jiang, Y. Mao, L. Xu, T. Lu, K. Ren, Y. Jin, X. Xu, M. Yu, J. Pang, F. Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. TOG, 2025
[13] K. Kassab, A. Schnepf, J.-Y. Franceschi, L. Caraffa, J. Mary, and V. Gouet-Brunet. Refinedfields: Radiance fields refinement for planar scene representations. TMLR, 2025
[14] B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering. TOG, 2023
[15] J. Kulhanek, S. Peng, Z. Kukelova, M. Pollefeys, and T. Sattler. WildGaussians: 3D gaussian splatting in the wild. In NeurIPS, 2024
[16] C. Li, Z. Shi, Y. Lu, W. He, and X. Xu. Robust neural rendering in the wild with asymmetric dual 3d gaussian splatting. In NeurIPS, 2025
[17] Z. Li and N. Snavely. Megadepth: Learning single-view depth prediction from internet photos. In CVPR, 2018
[18] H. Lin, S. Chen, J. H. Liew, D. Y. Chen, Z. Li, Y. Zhao, S. Peng, H. Guo, X. Zhou, G. Shi, J. Feng, and B. Kang. Depth anything 3: Recovering the visual space from any views. In ICLR, 2026
[19] L. Ling, Y. Sheng, Z. Tu, W. Zhao, C. Xin, K. Wan, L. Yu, Q. Guo, Z. Yu, Y. Lu, X. Li, X. Sun, R. Ashok, A. Mukherjee, H. Kang, X. Kong, G. Hua, T. Zhang, B. Benes, and A. Bera. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In CVPR, 2024
[20] I. Liu, L. Chen, Z. Fu, L. Wu, H. Jin, Z. Li, C. M. R. Wong, Y. Xu, R. Ramamoorthi, Z. Xu, and H. Su. Openillumination: A multi-illumination dataset for inverse rendering evaluation on real objects. In NeurIPS, 2023
[21] R. Martin-Brualla, N. Radwan, M. S. M. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. In CVPR, 2021
[22] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020
[23] A. Moreau, R. Shaw, M. Nazarczuk, J. Shin, T. Tanay, Z. Zhang, S. Xu, and E. Pérez-Pellitero. Off the grid: Detection of primitives for feed-forward 3d gaussian splatting. In CVPR, 2026
[24] X. Pan, N. Charron, Y. Yang, S. Peters, T. Whelan, C. Kong, O. Parkhi, R. Newcombe, and C. Y. Ren. Aria digital twin: A new benchmark dataset for egocentric 3d machine perception. In ICCV, 2023
[25] F. Pei, J. Bai, X. Feng, Z. Bi, K. Zhou, and H. Wu. Opensubstance: A high-quality measured dataset of multi-view and -lighting images and shapes. In ICCV, 2025
[26] J. Reizenstein, R. Shapovalov, P. Henzler, L. Sbordone, P. Labatut, and D. Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In ICCV, 2021
[27] M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In ICCV, 2021
[28] V. Rudnev, M. Elgharib, W. Smith, L. Liu, V. Golyanik, and C. Theobalt. Nerf for outdoor scene relighting. In ECCV, 2022
[29] J. L. Schönberger and J.-M. Frahm. Structure-from-Motion Revisited. In CVPR, 2016
[30] B. Smart, C. Zheng, I. Laina, and V. A. Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs. arXiv preprint arXiv:2408.13912, 2024
[31] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3d. TOG, 2006
[32] J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, A. Clarkson, M. Yan, B. Budge, Y. Yan, X. Pan, J. Yon, Y. Zou, K. Leon, N. Carter, J. Briales, T. Gillingham, E. Mueggler, L. Pesqueira, M. Savva, D. Batra, H. M. Strasdat, R. D. Nardi, M. Goesele, S. Lovegrove, and R. Newcombe. The Replica dataset: A digital replica of indoor spaces...
[33] J. Sun, X. Chen, Q. Wang, Z. Li, H. Averbuch-Elor, X. Zhou, and N. Snavely. Neural 3d reconstruction in the wild. In SIGGRAPH, 2022
[34] A. Szot, A. Clegg, E. Undersander, E. Wijmans, Y. Zhao, J. Turner, N. Maestre, M. Mukadam, D. Chaplot, O. Maksymets, A. Gokaslan, V. Vondrus, S. Dharur, F. Meier, W. Galuba, A. Chang, Z. Kira, V. Koltun, J. Malik, M. Savva, and D. Batra. Habitat 2.0: Training home assistants to rearrange their habitat. In NeurIPS, 2021
[35] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
[36] M. Toschi, R. De Matteo, R. Spezialetti, D. De Gregorio, L. Di Stefano, and S. Salti. Relight my nerf: A dataset for novel view synthesis and relighting of real world objects. In CVPR, 2023
[37] J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. Vggt: Visual geometry grounded transformer. In CVPR, 2025
[38] J. Wang, Q. Hu, C. Bao, Y. Zhu, H. Bao, Z. Cui, and G. Zhang. Lightcity: An urban dataset for outdoor inverse rendering and reconstruction under multi-illumination conditions. In ICCV, 2025
[39] S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024
[40] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004
[41] L. Wu and T. Zhang. Wildsplatting: Unposed incremental 3d gaussian splatting reconstruction in the wild. In VRCAI, 2025
[42] H. Xia, Y. Fu, S. Liu, and X. Wang. Rgbd objects in the wild: Scaling real-world 3d object learning from rgb-d videos. In CVPR, 2024
[43] J. Xu, Y. Mei, and V. M. Patel. Wild-gs: Real-time novel view synthesis from unconstrained photo collections. In NeurIPS, 2024
[44] Y. Yang, S. Zhang, Z. Huang, Y. Zhang, and M. Tan. Cross-ray neural radiance fields for novel-view synthesis from unconstrained image collections. In ICCV, 2023
[45] Y. Yao, Z. Luo, S. Li, J. Zhang, Y. Ren, L. Zhou, T. Fang, and L. Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In CVPR, 2020
[46] B. Ye, B. Chen, H. Xu, D. Barath, and M. Pollefeys. Yonosplat: You only need one model for feedforward 3d gaussian splatting. In ICLR, 2026
[47] D. Zhang, C. Wang, W. Wang, P. Li, M. Qin, and H. Wang. Gaussian in the wild: 3d gaussian splatting for unconstrained image collections. In ECCV, 2024
[48] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018
[49] S. Zhang, J. Wang, Y. Xu, N. Xue, C. Rupprecht, X. Zhou, Y. Shen, and G. Wetzstein. Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. In CVPR, 2025
[50] X. Zhang, X. Zheng, Y. Yin, T. Zhao, K. Tang, M. B. Mi, Z. Xu, and D. Z. Chen. Anchorsplat: Feed-forward 3d gaussian splatting with 3d geometric priors. In CVPR, 2026
[51] Y. Zheng, A. W. Harley, B. Shen, G. Wetzstein, and L. J. Guibas. Pointodyssey: A large-scale synthetic dataset for long-term point tracking. In ICCV, 2023
[52] C. Ziwen, H. Tan, K. Zhang, S. Bi, F. Luan, Y. Hong, L. Fuxin, and Z. Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. In ICCV, 2025

A Implementation Details

Scene and Camera Setup. During the manual selection of buildings for scene generation, we generally maintained a distance of at least 20 meters between the...
If there are roads where people and cars can be placed, we use: “Please add t_p, t_v, t_s, t_b, while maintaining the geometry and lightness/darkness.” If not, we use: “Please add t̃_b, while maintaining the geometry and lightness/darkness.” Here the words t_p, t_v, t_s, t_b, and t̃_b are sampled from T_p, T_v, T_s, T_b, and T̃_b as in Table 4, respectively. We classify each image into the two types above, using another variant of Gemini (gemini-3-flash-preview) with a predefined prompt q, as shown in Table 4. Vali...
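
The two prompt templates quoted above can be assembled as in the sketch below. The template wording follows the extracted text. The word lists standing in for T_p, T_v, T_s, T_b, and T̃_b are placeholders, since Table 4 is not reproduced here, and the `has_road` flag is assumed to come from the separate image classification step.

```python
import random

# Placeholder vocabularies (hypothetical); the real sets are in the paper's Table 4.
T_P = ["<t_p word>"]
T_V = ["<t_v word>"]
T_S = ["<t_s word>"]
T_B = ["<t_b word>"]
T_B_TILDE = ["<t~_b word>"]

SUFFIX = ", while maintaining the geometry and lightness/darkness."

def build_prompt(has_road, rng):
    """Return an editing prompt following one of the two quoted templates."""
    if has_road:
        # Road case: one word sampled from each of the four sets.
        words = [rng.choice(T_P), rng.choice(T_V), rng.choice(T_S), rng.choice(T_B)]
        return "Please add " + ", ".join(words) + SUFFIX
    # No-road case: a single word from the alternative set.
    return "Please add " + rng.choice(T_B_TILDE) + SUFFIX

rng = random.Random(0)
print(build_prompt(True, rng))
print(build_prompt(False, rng))
```

Keeping the instruction suffix fixed and varying only the inserted words matches the stated intent of changing transient content while leaving geometry and brightness alone.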

[1] [1]

M. L. Antequera, P. Gargallo, M. Hofinger, S. R. Bulo, Y . Kuang, and P. Kontschieder. Mapillary planet-scale depth dataset. InECCV, 2020

2020

[2] [2]

Virtual KITTI 2

Y . Cabon, N. Murray, and M. Humenberger. Virtual kitti 2.arXiv preprint arXiv:2001.10773, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[3] [3]

X. Chen, Q. Zhang, X. Li, Y . Chen, Y . Feng, X. Wang, and J. Wang. Hallucinated neural radiance fields in the wild. InCVPR, 2022

2022

[4] [4]

Couturier

A. Couturier. SceneCity.https://www.cgchan.com/

[5] [5]

Dahmani, M

H. Dahmani, M. Bennehar, N. Piasco, L. Roldao, and D. Tsishkou. Swag: Splatting in the wild images with appearance-conditioned gaussians. InECCV, 2024

2024

[6] [6]

A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner. Scannet: Richly- annotated 3d reconstructions of indoor scenes. InCVPR, 2017

2017

[7] [7]

Deitke, D

M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi. Objaverse: A universe of annotated 3d objects. InCVPR, 2023

2023

[8] [8]

Fridovich-Keil, G

S. Fridovich-Keil, G. Meanti, F. R. Warburg, B. Recht, and A. Kanazawa. K-planes: Explicit radiance fields in space, time, and appearance. InCVPR, 2023

2023

[9] [9]

Greff, F

K. Greff, F. Belletti, L. Beyer, C. Doersch, Y . Du, D. Duckworth, D. J. Fleet, D. Gnanapragasam, F. Golemo, C. Herrmann, T. Kipf, A. Kundu, D. Lagun, I. Laradji, H.-T. (Derek)Liu, H. Meyer, Y . Miao, D. Nowrouzezahrai, C. Oztireli, E. Pot, N. Radwan, D. Rebain, S. Sabour, M. S. M. Sajjadi, M. Sela, V . Sitzmann, A. Stone, D. Sun, S. V ora, Z. Wang, T. Wu...

2022

[10] [10]

Huang, K

P.-H. Huang, K. Matzen, J. Kopf, N. Ahuja, and J.-B. Huang. Deepmvs: Learning multi-view stereopsis. InCVPR, 2018

2018

[11] [11]

Jensen, A

R. Jensen, A. Dahl, G. V ogiatzis, E. Tola, and H. Aanæs. Large scale multi-view stereopsis evaluation. InCVPR, 2014

2014

[12] [12]

Jiang, Y

L. Jiang, Y . Mao, L. Xu, T. Lu, K. Ren, Y . Jin, X. Xu, M. Yu, J. Pang, F. Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.TOG, 2025

2025

[13] [13]

Kassab, A

K. Kassab, A. Schnepf, J.-Y . Franceschi, L. Caraffa, J. Mary, and V . Gouet-Brunet. Refinedfields: Radiance fields refinement for planar scene representations.TMLR, 2025

2025

[14] [14]

Kerbl, G

B. Kerbl, G. Kopanas, T. Leimkühler, and G. Drettakis. 3d gaussian splatting for real-time radiance field rendering. InTOG, 2023

2023

[15] [15]

Kulhanek, S

J. Kulhanek, S. Peng, Z. Kukelova, M. Pollefeys, and T. Sattler. WildGaussians: 3D gaussian splatting in the wild. InNeurIPS, 2024

2024

[16] [16]

C. Li, Z. Shi, Y . Lu, W. He, and X. Xu. Robust neural rendering in the wild with asymmetric dual 3d gaussian splatting. InNeurIPS, 2025

2025

[17] [17]

Li and N

Z. Li and N. Snavely. Megadepth: Learning single-view depth prediction from internet photos. InCVPR, 2018

2018

[18] [18]

H. Lin, S. Chen, J. H. Liew, D. Y . Chen, Z. Li, Y . Zhao, S. Peng, H. Guo, X. Zhou, G. Shi, J. Feng, and B. Kang. Depth anything 3: Recovering the visual space from any views. InICLR, 2026. 10

2026

[19] [19]

L. Ling, Y . Sheng, Z. Tu, W. Zhao, C. Xin, K. Wan, L. Yu, Q. Guo, Z. Yu, Y . Lu, X. Li, X. Sun, R. Ashok, A. Mukherjee, H. Kang, X. Kong, G. Hua, T. Zhang, B. Benes, and A. Bera. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. InCVPR, 2024

2024

[20] [20]

I. Liu, L. Chen, Z. Fu, L. Wu, H. Jin, Z. Li, C. M. R. Wong, Y . Xu, R. Ramamoorthi, Z. Xu, and H. Su. Openillumination: A multi-illumination dataset for inverse rendering evaluation on real objects. InNeurIPS, 2023

2023

[21] [21]

Martin-Brualla, N

R. Martin-Brualla, N. Radwan, M. S. M. Sajjadi, J. T. Barron, A. Dosovitskiy, and D. Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. InCVPR, 2021

2021

[22] [22]

Mildenhall, P

B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. InECCV, 2020

2020

[23] [23]

Moreau, R

A. Moreau, R. Shaw, M. Nazarczuk, J. Shin, T. Tanay, Z. Zhang, S. Xu, and E. Pérez-Pellitero. Off the grid: Detection of primitives for feed-forward 3d gaussian splatting. InCVPR, 2026

2026

[24] [24]

X. Pan, N. Charron, Y . Yang, S. Peters, T. Whelan, C. Kong, O. Parkhi, R. Newcombe, and C. Y . Ren. Aria digital twin: A new benchmark dataset for egocentric 3d machine perception. In ICCV, 2023

2023

[25] [25]

F. Pei, J. Bai, X. Feng, Z. Bi, K. Zhou, and H. Wu. Opensubstance: A high-quality measured dataset of multi-view and -lighting images and shapes. InICCV, 2025

2025

[26] [26]

Reizenstein, R

J. Reizenstein, R. Shapovalov, P. Henzler, L. Sbordone, P. Labatut, and D. Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In ICCV, 2021

2021

[27] [27]

Roberts, J

M. Roberts, J. Ramapuram, A. Ranjan, A. Kumar, M. A. Bautista, N. Paczan, R. Webb, and J. M. Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. InICCV, 2021

2021

[28] [28]

Rudnev, M

V . Rudnev, M. Elgharib, W. Smith, L. Liu, V . Golyanik, and C. Theobalt. Nerf for outdoor scene relighting. InECCV, 2022

2022

[29] [29]

J. L. Schönberger and J.-M. Frahm. Structure-from-Motion Revisited. InCVPR, 2016

2016

[30] B. Smart, C. Zheng, I. Laina, and V. A. Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs. arXiv preprint arXiv:2408.13912, 2024.

[31] N. Snavely, S. M. Seitz, and R. Szeliski. Photo tourism: exploring photo collections in 3d. TOG, 2006.

[32] J. Straub, T. Whelan, L. Ma, Y. Chen, E. Wijmans, S. Green, J. J. Engel, R. Mur-Artal, C. Ren, S. Verma, A. Clarkson, M. Yan, B. Budge, Y. Yan, X. Pan, J. Yon, Y. Zou, K. Leon, N. Carter, J. Briales, T. Gillingham, E. Mueggler, L. Pesqueira, M. Savva, D. Batra, H. M. Strasdat, R. D. Nardi, M. Goesele, S. Lovegrove, and R. Newcombe. The replica dataset: A digital replica of indoor spaces...

[33] J. Sun, X. Chen, Q. Wang, Z. Li, H. Averbuch-Elor, X. Zhou, and N. Snavely. Neural 3d reconstruction in the wild. In SIGGRAPH, 2022.

[34] A. Szot, A. Clegg, E. Undersander, E. Wijmans, Y. Zhao, J. Turner, N. Maestre, M. Mukadam, D. Chaplot, O. Maksymets, A. Gokaslan, V. Vondrus, S. Dharur, F. Meier, W. Galuba, A. Chang, Z. Kira, V. Koltun, J. Malik, M. Savva, and D. Batra. Habitat 2.0: Training home assistants to rearrange their habitat. In NeurIPS, 2021.

[35] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

[36] M. Toschi, R. De Matteo, R. Spezialetti, D. De Gregorio, L. Di Stefano, and S. Salti. Relight my nerf: A dataset for novel view synthesis and relighting of real world objects. In CVPR, 2023.

[37] J. Wang, M. Chen, N. Karaev, A. Vedaldi, C. Rupprecht, and D. Novotny. Vggt: Visual geometry grounded transformer. In CVPR, 2025.

[38] J. Wang, Q. Hu, C. Bao, Y. Zhu, H. Bao, Z. Cui, and G. Zhang. Lightcity: An urban dataset for outdoor inverse rendering and reconstruction under multi-illumination conditions. In ICCV, 2025.

[39] S. Wang, V. Leroy, Y. Cabon, B. Chidlovskii, and J. Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024.

[40] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. Image quality assessment: from error visibility to structural similarity. TIP, 2004.

[41] L. Wu and T. Zhang. Wildsplatting: Unposed incremental 3d gaussian splatting reconstruction in the wild. In VRCAI, 2025.

[42] H. Xia, Y. Fu, S. Liu, and X. Wang. Rgbd objects in the wild: Scaling real-world 3d object learning from rgb-d videos. In CVPR, 2024.

[43] J. Xu, Y. Mei, and V. M. Patel. Wild-gs: Real-time novel view synthesis from unconstrained photo collections. In NeurIPS, 2024.

[44] Y. Yang, S. Zhang, Z. Huang, Y. Zhang, and M. Tan. Cross-ray neural radiance fields for novel-view synthesis from unconstrained image collections. In ICCV, 2023.

[45] Y. Yao, Z. Luo, S. Li, J. Zhang, Y. Ren, L. Zhou, T. Fang, and L. Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In CVPR, 2020.

[46] B. Ye, B. Chen, H. Xu, D. Barath, and M. Pollefeys. Yonosplat: You only need one model for feedforward 3d gaussian splatting. In ICLR, 2026.

[47] D. Zhang, C. Wang, W. Wang, P. Li, M. Qin, and H. Wang. Gaussian in the wild: 3d gaussian splatting for unconstrained image collections. In ECCV, 2024.

[48] R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.

[49] S. Zhang, J. Wang, Y. Xu, N. Xue, C. Rupprecht, X. Zhou, Y. Shen, and G. Wetzstein. Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. In CVPR, 2025.

[50] X. Zhang, X. Zheng, Y. Yin, T. Zhao, K. Tang, M. B. Mi, Z. Xu, and D. Z. Chen. Anchorsplat: Feed-forward 3d gaussian splatting with 3d geometric priors. In CVPR, 2026.

[51] Y. Zheng, A. W. Harley, B. Shen, G. Wetzstein, and L. J. Guibas. Pointodyssey: A large-scale synthetic dataset for long-term point tracking. In ICCV, 2023.

[52] C. Ziwen, H. Tan, K. Zhang, S. Bi, F. Luan, Y. Hong, L. Fuxin, and Z. Xu. Long-lrm: Long-sequence large reconstruction model for wide-coverage gaussian splats. In ICCV, 2025.

A Implementation Details

Scene and Camera Setup. During the manual selection of buildings for scene generation, we generally maintained a distance of at least 20 meters between the...

If there are roads where people and cars can be placed, we use: “Please add $t_p$, $t_v$, $t_s$, $t_b$, while maintaining the geometry and lightness/darkness.”

If not, we use: “Please add $\tilde{t}_b$, while maintaining the geometry and lightness/darkness.” Here the words $t_p$, $t_v$, $t_s$, $t_b$, and $\tilde{t}_b$ are sampled from $T_p$, $T_v$, $T_s$, $T_b$, and $\tilde{T}_b$, respectively, as listed in Table 4. We classify each image into one of the two types above using another variant of Gemini (gemini-3-flash-preview) with a predefined prompt $q$, as shown in Table 4. Vali...
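The two prompt templates above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word lists are hypothetical placeholders (the actual sets are given in Table 4 and are not reproduced here), the function name `build_edit_prompt` is invented for this sketch, and the `has_roads` flag stands in for the road/no-road classification that the text attributes to a Gemini variant.

```python
import random

# Hypothetical placeholder vocabularies. The real word sets are those of
# Table 4 in the paper; these entries are assumptions for illustration only.
T_P = ["word_p1", "word_p2"]
T_V = ["word_v1", "word_v2"]
T_S = ["word_s1", "word_s2"]
T_B = ["word_b1", "word_b2"]
T_B_TILDE = ["word_bt1", "word_bt2"]

SUFFIX = "while maintaining the geometry and lightness/darkness."


def build_edit_prompt(has_roads: bool, rng: random.Random) -> str:
    """Build the image-editing prompt for one image.

    has_roads is assumed to come from a separate classifier; here it is
    just a boolean supplied by the caller.
    """
    if has_roads:
        # Template 1: one word sampled from each of the four sets.
        words = [rng.choice(T_P), rng.choice(T_V), rng.choice(T_S), rng.choice(T_B)]
    else:
        # Template 2: a single word sampled from the alternative set.
        words = [rng.choice(T_B_TILDE)]
    return f"Please add {', '.join(words)}, {SUFFIX}"


if __name__ == "__main__":
    rng = random.Random(0)
    print(build_edit_prompt(True, rng))
    print(build_edit_prompt(False, rng))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible per image, which is a design choice of this sketch rather than something the text specifies.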
