arxiv: 2604.21182 · v1 · submitted 2026-04-23 · 💻 cs.CV

Recognition: unknown

WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images

Yuki Fujimura , Takahiro Kushida , Kazuya Kitano , Takuya Funatomi , Yasuhiro Mukaigawa

Authors on Pith no claims yet

Pith reviewed 2026-05-09 22:53 UTC · model grok-4.3

classification 💻 cs.CV

keywords 3D Gaussian Splattingfeed-forward reconstructionappearance controlunconstrained imagespose-free reconstructionnovel view synthesislighting variationreal-time rendering

0 comments

The pith

A feed-forward model reconstructs 3D Gaussians from sparse unconstrained images with unknown cameras and varying lighting in under one second.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a model trained on casual photo collections that predicts 3D scene elements directly from a few input views. It simultaneously learns embeddings that capture appearance differences so the same underlying geometry can be rendered under new lighting conditions. This removes the usual requirements for known camera positions, consistent illumination, and slow per-scene optimization. A reader would care because everyday photographs rarely meet those constraints, yet many applications need quick 3D models that can still be adjusted after capture.

Core claim

WildSplatter is a feed-forward 3D Gaussian Splatting model trained on unconstrained photo collections that jointly learns 3D Gaussians and appearance embeddings conditioned on input images, enabling reconstruction from sparse views in under one second and flexible modulation of Gaussian colors to represent lighting variations.

What carries the argument

Joint learning of 3D Gaussians and appearance embeddings conditioned on input images, which modulates per-Gaussian colors to handle appearance changes.

If this is right

Scene reconstruction becomes possible from everyday photo collections without requiring calibrated rigs or controlled lighting.
Appearance can be edited after reconstruction by changing the conditioning embedding rather than re-optimizing the entire model.
Real-time novel-view synthesis is available immediately after the single forward pass instead of minutes or hours of iterative fitting.
Pose-free pipelines can now handle real-world datasets that contain illumination changes, outperforming prior methods that assume consistent lighting.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same conditioning mechanism could be extended to video frames to produce temporally coherent 3D models of moving scenes.
Appearance embeddings might serve as a compact way to store multiple lighting states of one geometry, reducing storage for AR content.
Combining the feed-forward reconstruction with downstream tasks such as object insertion or relighting would create end-to-end pipelines that start from casual photos.

Load-bearing premise

A network trained on unconstrained photo collections will generalize to new scenes whose camera parameters and lighting lie outside the training distribution while keeping appearance control accurate.

What would settle it

Reconstructing a held-out scene captured under lighting or camera angles markedly different from the training set and checking whether appearance edits produce color-consistent renderings without introducing artifacts or geometry drift.

Figures

Figures reproduced from arXiv: 2604.21182 by Kazuya Kitano, Takahiro Kushida, Takuya Funatomi, Yasuhiro Mukaigawa, Yuki Fujimura.

**Figure 2.** Figure 2: Overview of the proposed method. Given sparse-view context images captured under [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Examples of training samples. The first two rows show context images captured under [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative comparison for novel-view synthesis on the NeRF-OSR dataset [22]. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Appearance interpolation. The first and third rows show rendered images from novel views [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: (a) t-SNE visualization [28] of learned appearance embeddings, where each point corresponds to an image and colors indicate K-means clusters. (b) Representative images for each cluster, showing that images within the same cluster share similar appearance characteristics. Context view s Rendere d Referenc e [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗

**Figure 7.** Figure 7: Cross-dataset appearance control. The geometry of 3D Gaussians estimated from context [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗

**Figure 8.** Figure 8: From left to right: context images, estimated opacity maps for the context views, target [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

read the original abstract

We propose WildSplatter, a feed-forward 3D Gaussian Splatting (3DGS) model for unconstrained images with unknown camera parameters and varying lighting conditions. 3DGS is an effective scene representation that enables high-quality, real-time rendering; however, it typically requires iterative optimization and multi-view images captured under consistent lighting with known camera parameters. WildSplatter is trained on unconstrained photo collections and jointly learns 3D Gaussians and appearance embeddings conditioned on input images. This design enables flexible modulation of Gaussian colors to represent significant variations in lighting and appearance. Our method reconstructs 3D Gaussians from sparse input views in under one second, while also enabling appearance control under diverse lighting conditions. Experimental results demonstrate that our approach outperforms existing pose-free 3DGS methods on challenging real-world datasets with varying illumination.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

WildSplatter gives a fast feed-forward path to 3D Gaussians from casual photos with some lighting tweaks via embeddings, but the control may not separate lighting from scene-specific patterns.

read the letter

WildSplatter is a feed-forward network that takes a few unconstrained photos, predicts 3D Gaussians directly, and adds appearance embeddings that let you adjust colors for different lighting after the fact. The reconstruction runs in under a second and claims to beat prior pose-free 3DGS methods on real datasets with illumination changes. That speed and the move away from per-scene optimization are the concrete advances here. Training jointly on photo collections with varying light and unknown cameras is a reasonable practical choice, and the embeddings conditioned on the inputs give a mechanism for post-hoc modulation without needing explicit lighting models or consistent capture conditions. The experiments on challenging real-world sets with varying illumination show measurable gains, which is worth noting. The soft spot sits in the appearance embeddings themselves. Because the model trains end-to-end on collections where lighting tends to correlate strongly with each scene, nothing forces the embeddings to learn a general lighting factor separate from geometry or albedo. If they mostly capture training-set correlations, the reported control and outperformance may not transfer to new scenes or truly novel lighting. The abstract and results do not include clear ablations on out-of-distribution lighting or explicit disentanglement checks, so the generalization claim rests on the assumption that the learned modulation is robust rather than memorized. This paper is for people building quick 3D pipelines from phone or casual captures in graphics and vision. Readers who need fast reconstruction and are willing to test the lighting control on their own data will find the most value. It deserves a serious referee because the speed claim is testable and the core architecture is grounded enough to generate useful feedback, even if the lighting part needs tighter validation.

Referee Report

2 major / 2 minor

Summary. The paper proposes WildSplatter, a feed-forward 3D Gaussian Splatting model trained on unconstrained photo collections. It jointly predicts 3D Gaussians and appearance embeddings conditioned on input images with unknown camera parameters and varying lighting. This enables reconstruction from sparse views in under one second and flexible modulation of Gaussian colors for appearance control under diverse lighting conditions. Experiments claim outperformance over existing pose-free 3DGS methods on challenging real-world datasets with illumination variations.

Significance. If the claims hold, the work would be significant for practical 3D reconstruction from casual, unconstrained imagery, removing the need for controlled capture setups or per-scene optimization. The feed-forward design and reported sub-second inference are notable strengths, as is the attempt to handle appearance variation without explicit lighting supervision. No machine-checked proofs or parameter-free derivations are present, but the empirical focus on real-world varying-illumination datasets provides a concrete testbed.

major comments (2)

[§3.2] §3.2 (Appearance Embedding Module): The architecture conditions appearance embeddings solely on input images without explicit regularization, adversarial losses, or cycle-consistency terms to enforce disentanglement from geometry and albedo. This directly bears on the central claim of generalizable lighting modulation, as the skeptic concern notes that end-to-end training on per-scene illumination correlations in unconstrained collections risks the embeddings memorizing dataset-specific patterns rather than learning transferable modulation.
[§4.2] §4.2 (Quantitative Evaluation on Varying-Illumination Datasets): The reported outperformance over pose-free baselines is presented, but the evaluation protocol for appearance control (e.g., how novel lighting conditions are synthesized or measured via PSNR/SSIM on held-out illuminations) is not detailed enough to rule out overfitting; if the test scenes share lighting statistics with training, the generalization claim is not load-bearingly supported.

minor comments (2)

[Figure 3] Figure 3 and §4.1: The qualitative examples of appearance control would benefit from side-by-side comparison with ground-truth novel lighting captures rather than only relit renderings.
[§3.1] Notation in §3.1: The definition of the appearance embedding vector a_i should explicitly state its dimensionality and how it is injected into the Gaussian color prediction (e.g., via concatenation or FiLM).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications and indicating where revisions will be made to improve the paper.

read point-by-point responses

Referee: [§3.2] §3.2 (Appearance Embedding Module): The architecture conditions appearance embeddings solely on input images without explicit regularization, adversarial losses, or cycle-consistency terms to enforce disentanglement from geometry and albedo. This directly bears on the central claim of generalizable lighting modulation, as the skeptic concern notes that end-to-end training on per-scene illumination correlations in unconstrained collections risks the embeddings memorizing dataset-specific patterns rather than learning transferable modulation.

Authors: We acknowledge the absence of explicit regularization mechanisms such as adversarial or cycle-consistency losses in the appearance embedding module. The embeddings are predicted directly from input images and modulate Gaussian colors via a lightweight network, with the entire model trained end-to-end using only photometric reconstruction losses on diverse unconstrained photo collections. This setup encourages the embeddings to capture residual appearance factors after geometry is accounted for by the Gaussian prediction branch. Our experiments demonstrate that the learned embeddings generalize to novel scenes and enable plausible lighting modulation on held-out data, suggesting they capture transferable rather than purely memorized patterns. Nevertheless, we agree that further analysis would strengthen the disentanglement claim, and we will add visualizations of the embedding space along with ablations on modulation behavior in the revised manuscript. revision: partial
Referee: [§4.2] §4.2 (Quantitative Evaluation on Varying-Illumination Datasets): The reported outperformance over pose-free baselines is presented, but the evaluation protocol for appearance control (e.g., how novel lighting conditions are synthesized or measured via PSNR/SSIM on held-out illuminations) is not detailed enough to rule out overfitting; if the test scenes share lighting statistics with training, the generalization claim is not load-bearingly supported.

Authors: We thank the referee for highlighting the need for greater detail in the evaluation protocol. In the revised version of §4.2, we will explicitly describe the appearance control evaluation: novel lighting conditions are synthesized by extracting appearance embeddings from reference images captured under different illuminations (distinct from the input views) and applying them to modulate the colors of the predicted 3D Gaussians. Quantitative metrics (PSNR/SSIM) are then computed on held-out test views. The dataset splits ensure that training and test scenes come from separate capture sessions with non-overlapping lighting statistics, as verified by per-scene illumination histograms. We will also report additional cross-dataset results to further support generalization. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical feed-forward model with external validation

full rationale

The paper describes a neural architecture for feed-forward 3D Gaussian reconstruction and appearance embedding from unconstrained images. All performance claims rest on end-to-end training followed by quantitative evaluation on held-out real-world datasets with varying illumination. No mathematical derivation, uniqueness theorem, or parameter-fitting step is presented that reduces by construction to the training inputs; the architecture is a standard encoder-decoder design whose outputs are tested against independent benchmarks rather than being tautological with the loss or data statistics.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Insufficient information in the abstract to identify any free parameters, axioms, or invented entities. The method likely uses standard deep learning components for 3D reconstruction.

pith-pipeline@v0.9.0 · 5461 in / 1380 out tokens · 42206 ms · 2026-05-09T22:53:10.768465+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

38 extracted references · 6 canonical work pages · 3 internal anchors

[1]

Wildcat3d: Appearance-aware multi-view diffusion in the wild

Morris Alper, David Novotny, Filippos Kokkinos, Hadar Averbuch-Elor, and Tom Monnier. Wildcat3d: Appearance-aware multi-view diffusion in the wild. InNeurIPS, 2025

2025
[2]

pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. InCVPR, pages 19457–19467, 2024

2024
[3]

Hallucinated neural radiance fields in the wild

Xingyu Chen, Qi Zhang, Xiaoyu Li, Yue Chen, Ying Feng, Xuan Wang, and Jue Wang. Hallucinated neural radiance fields in the wild. InCVPR, pages 12943–12952, 2022

2022
[4]

Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. InECCV, pages 370–386, 2024

2024
[5]

Mvsplat360: Feed-forward 360 scene synthesis from sparse views

Yuedong Chen, Chuanxia Zheng, Haofei Xu, Bohan Zhuang, Andrea Vedaldi, Tat-Jen Cham, and Jianfei Cai. Mvsplat360: Feed-forward 360 scene synthesis from sparse views. InNeurIPS, 2024

2024
[6]

Swag: Splatting in the wild images with appearance-conditioned gaussians

Hiba Dahmani, Moussab Bennehar, Nathan Piasco, Luis Roldão, and Dzmitry Tsishkou. Swag: Splatting in the wild images with appearance-conditioned gaussians. InECCV, page 325–340, 2024

2024
[7]

Ufv-splatter: Pose-free feed-forward 3d gaussian splatting adapted to unfavorable views.arXiv preprint arXiv:2507.22342, 2025

Yuki Fujimura, Takahiro Kushida, Kazuya Kitano, Takuya Funatomi, and Yasuhiro Mukaigawa. Ufv-splatter: Pose-free feed-forward 3d gaussian splatting adapted to unfavorable views.arXiv preprint arXiv:2507.22342, 2025

work page arXiv 2025
[8]

Cam- bridge University Press, 2003

Richard Hartley and Andrew Zisserman.Multiple View Geometry in Computer Vision. Cam- bridge University Press, 2003

2003
[9]

Pf3plat: Pose-free feed-forward 3d gaussian splatting

Sunghwan Hong, Jaewoo Jung, Heeseong Shin, Jisang Han, Jiaolong Yang, Chong Luo, and Seungryong Kim. Pf3plat: Pose-free feed-forward 3d gaussian splatting. InICML, 2025

2025
[10]

2d gaussian splatting for geometrically accurate radiance fields

Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2d gaussian splatting for geometrically accurate radiance fields. InACM SIGGRAPH, 2024

2024
[11]

No pose at all: Self-supervised pose-free 3d gaussian splatting from sparse views

Ranran Huang and Krystian Mikolajczyk. No pose at all: Self-supervised pose-free 3d gaussian splatting from sparse views. InICCV, pages 27947–27957, 2025

2025
[12]

Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.ACM TOG, 44(6):1–16, 2025

Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views.ACM TOG, 44(6):1–16, 2025

2025
[13]

Lumigauss: Relightable gaussian splatting in the wild

Joanna Kaleta, Kacper Kania, Tomasz Trzcinski, and Marek Kowalski. Lumigauss: Relightable gaussian splatting in the wild. InWACV, pages 1–10, 2025

2025
[14]

3d gaussian splatting for real-time radiance field rendering.ACM TOG, 42(4), 2023

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkuehler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM TOG, 42(4), 2023

2023
[15]

Wildgaussians: 3d gaussian splatting in the wild

Jonas Kulhanek, Songyou Peng, Zuzana Kukelova, Marc Pollefeys, and Torsten Sattler. Wildgaussians: 3d gaussian splatting in the wild. InNeurIPS, 2024

2024
[16]

Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang

Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y . Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views. InICLR, 2026

2026
[17]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. InICLR, 2019

2019
[18]

Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth. Nerf in the wild: Neural radiance fields for unconstrained photo collections. InCVPR, pages 7210–7219, 2021. 10

2021
[19]

Srinivasan, Matthew Tancik, Jonathan T

Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. InECCV, 2020

2020
[20]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick La...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[21]

Vision transformers for dense prediction

René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. InICCV, 2021

2021
[22]

Nerf for outdoor scene relighting

Viktor Rudnev, Mohamed Elgharib, William Smith, Lingjie Liu, Vladislav Golyanik, and Christian Theobalt. Nerf for outdoor scene relighting. InECCV, 2022

2022
[23]

Schonberger and Jan-Michael Frahm

Johannes L. Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. InCVPR, 2016

2016
[24]

GLU Variants Improve Transformer

Noam Shazeer. Glu variants improve transformer.arXiv preprint arXiv:2002.05202, 2020

work page internal anchor Pith review arXiv 2002
[25]

Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs.arXiv preprint arXiv:2408.13912, 2024

Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs. InarXiv preprint arXiv:2408.13912, 2024

work page arXiv 2024
[26]

RoFormer: Enhanced Transformer with Rotary Position Embedding

Jianlin Su, Yu Lu, Shengfeng Pan, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding.arXiv preprint arXiv:2104.09864, 2021

work page internal anchor Pith review arXiv 2021
[27]

Megascenes: Scene-level view synthesis at scale

Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely. Megascenes: Scene-level view synthesis at scale. InECCV, 2024

2024
[28]

Visualizing data using t-sne.Journal of Machine Learning Research, 9(86):2579–2605, 2008

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of Machine Learning Research, 9(86):2579–2605, 2008

2008
[29]

Grounding image matching in 3d with mast3r, 2024

Jérôme Revaud Vincent Leroy, Yohann Cabon. Grounding image matching in 3d with mast3r. InarXiv preprint arXiv:2406.09756, 2024

work page arXiv 2024
[30]

Vggt: Visual geometry grounded transformer

Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InCVPR, pages 5294–5306, 2025

2025
[31]

Dust3r: Geometric 3d vision made easy

Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. InCVPR, pages 20697–20709, 2024

2024
[32]

Jiacong Xu, Yiqun Mei, and Vishal M. Patel. Wild-gs: Real-time novel view synthesis from unconstrained photo collections. InNeurIPS, 2024

2024
[33]

Freesplatter: Pose-free gaussian splatting for sparse- view 3d reconstruction

Jiale Xu, Shenghua Gao, and Ying Shan. Freesplatter: Pose-free gaussian splatting for sparse- view 3d reconstruction. InICCV, pages 25442–25452, 2025

2025
[34]

Cross-ray neural radiance fields for novel-view synthesis from unconstrained image collections

Yifan Yang, Shuhai Zhang, Zixiong Huang, Yubing Zhang, and Mingkui Tan. Cross-ray neural radiance fields for novel-view synthesis from unconstrained image collections. InICCV, pages 15901–15911, 2023

2023
[35]

No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images

Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images. InICLR, 2025

2025
[36]

Gaussian in the wild: 3d gaussian splatting for unconstrained image collections

Dongbin Zhang, Chuming Wang, Weitao Wang, Peihao Li, Minghan Qin, and Haoqian Wang. Gaussian in the wild: 3d gaussian splatting for unconstrained image collections. InECCV, pages 341–359, 2024

2024
[37]

Efros, Eli Shechtman, and Oliver Wang

Richard Zhang, Phillip Isola, Alexei A. Efros, Eli Shechtman, and Oliver Wang. The unreason- able effectiveness of deep features as a perceptual metric. InCVPR, 2018. 11

2018
[38]

Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views

Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. InCVPR, pages 21936–21947, June 2025. 12 Appendix A Training dataset We use the MegaScenes dataset [27], which consists of Internet photos o...

2025