pith. the verified trust layer for science.

arxiv: 2506.09885 · v2 · submitted 2025-06-11 · 💻 cs.CV

The Less You Depend, The More You Learn: Synthesizing Novel Views from Sparse, Unposed Images with Minimal 3D Knowledge

Pith reviewed 2026-05-19 09:28 UTC · model grok-4.3

classification 💻 cs.CV
keywords novel view synthesis · feed-forward networks · unposed images · implicit 3D learning · data scaling · sparse views · minimal priors

The pith

Novel view synthesis methods relying on less explicit 3D knowledge improve faster with more data and eventually outperform pose-dependent approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that in novel view synthesis, approaches that depend less on explicit 3D knowledge, such as handcrafted representations (NeRF, 3DGS) or camera poses, improve faster as training data grows. This trend eventually allows them to surpass methods built around strong 3D priors. The authors use this insight to build a system that generates novel views from sparse, unposed images by learning 3D understanding implicitly from 2D data alone. This data-centric design removes the need for Structure-from-Motion or handcrafted representations at both training and test time. A reader would care because it points to a scalable path forward as image datasets continue to grow.

Core claim

The authors discover that the performance of novel view synthesis methods requiring less 3D knowledge accelerates more as training data increases, eventually outperforming 3D knowledge-driven counterparts. They term this 'the less you depend, the more you learn.' Building on this, they design a feed-forward framework that eliminates dependence on explicit scene structure and pose annotations, learning implicit 3D awareness directly from vast quantities of 2D images without any pose information for training or inference.

What carries the argument

A feed-forward novel view synthesis network designed to learn implicit 3D structure from unposed 2D images without explicit scene representations or camera poses.

If this is right

  • The new framework achieves state-of-the-art performance on novel view synthesis tasks.
  • It outperforms even methods that rely on posed training data.
  • Performance gains accelerate with larger training sets for low-dependence methods.
  • This validates shifting design focus toward data-centric paradigms in 3D vision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar scaling advantages might appear in other vision tasks like depth estimation or 3D reconstruction if explicit priors are minimized.
  • The approach could enable training on much larger, noisier internet image collections without pose estimation.
  • One testable extension is applying the method to dynamic scenes or video data where poses are hard to obtain.

Load-bearing premise

The observed advantage of low-3D-knowledge methods will continue when the new architecture is scaled to much larger data volumes without introducing unmeasured errors from missing explicit structure.

What would settle it

Training the proposed model on increasingly large datasets and checking whether its novel view synthesis accuracy continues to exceed that of pose-based methods or degrades on scenes requiring precise geometric constraints.
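
A minimal sketch of how such a check could be set up, assuming the metric is PSNR and that each method's score is roughly linear in the logarithm of the number of training scenes. The numbers are made-up placeholders, not results from the paper, and `fit_log_linear` and `crossover_scenes` are hypothetical helper names.

```python
import numpy as np

def fit_log_linear(num_scenes, scores):
    """Fit score ~ slope * log10(num_scenes) + intercept by least squares."""
    x = np.log10(np.asarray(num_scenes, dtype=float))
    slope, intercept = np.polyfit(x, np.asarray(scores, dtype=float), deg=1)
    return slope, intercept

def crossover_scenes(fit_a, fit_b):
    """Scene count at which two fitted lines intersect, or None if parallel."""
    (sa, ia), (sb, ib) = fit_a, fit_b
    if np.isclose(sa, sb):
        return None
    return 10 ** ((ib - ia) / (sa - sb))

# Placeholder values for illustration only (NOT taken from the paper).
scenes = [1_000, 10_000, 100_000]
pose_based = [20.0, 21.0, 22.0]  # gains 1 dB per tenfold increase in data
pose_free = [18.0, 20.0, 22.0]   # gains 2 dB per tenfold increase in data

fit_posed = fit_log_linear(scenes, pose_based)
fit_free = fit_log_linear(scenes, pose_free)
print("slopes (dB per decade):", fit_posed[0], fit_free[0])
print("estimated crossover at ~%.0f scenes" % crossover_scenes(fit_free, fit_posed))
```

A steeper fitted slope for the pose-free method, together with a crossover that lies inside the measured data range, would support the claim. A crossover that appears only by extrapolation would not.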

Figures

Figures reproduced from arXiv: 2506.09885 by Baoquan Chen, Haoru Wang, Kai Ye, Minghan Qin, Wenzheng Chen, Yangyan Li.

Figure 1. We categorize the task into three settings based on pose availability: the posed setting, where both input and target poses are provided; the posed-target setting, where only the target pose is available; and the unposed setting, where only images are provided, without any pose information. [Diagram labels: Input Views, Recon. Module, Render Module, Scene Repr., Input Poses, Target View, Target Pose, Training, Input Views, La…]
Figure 2. Scalability Overview. We first choose the RealEstate10K benchmark [55] for our comparisons, which is one of the largest open-source datasets for generalizable novel view synthesis, containing multi-view imagery from over 70K scenes. We construct subsets of the RealEstate10K dataset at four different scales (little, medium, large, and full, as shown in …
Figure 3. Scalability Comparison on Different Levels of 3D Inductive Bias.
Figure 4. Intuitive Explanation. In the posed-target setting, both NoPoSplat and PT-LVSM fail to infer the correct spatial structure when trained with 1K scenes, resulting in artifacts at the bottom right of the target views. While the bias-driven NoPoSplat consistently makes mistakes, PT-LVSM improves significantly when training data scales up from 1K to 66K scenes, eventually outperforming NoPoSplat. Intuitively, when data is scarce, me…
Figure 5. Scalability of LVSM and PT-LVSM. Intuitively, posed settings provide more information, so for the same input views, a method that can utilize pose information (i.e., a posed-setting method) is expected to achieve a higher performance upper bound than an unposed-setting method. However, our comparisons suggest that this gap can be mitigated by rich training data. We find that PT-LVSM generally exhibit…
Figure 6. UP-LVSM Overview.
Figure 7. Qualitative View Synthesis Comparisons.
Figure 8. Cross-view attention weights, where regions with higher attention correspond to the same area in 3D space. The visualization demonstrates our method learns excellent 3D spatial correspondence, which is essential for camera prediction and novel view synthesis. We further conduct ablation studies by comparing similarity of DINOv2 [27] features (dinov2_vitb14_reg4_pretrain.pth) across views to confi…
Figure 9. Camera Control. (a) The t-SNE [38] visualization shows the learned latent space aligns with the ground-truth space through a twisted domain transformation. (b) A linear mapping can help convert an input camera sequence into the latent space, facilitating explicit camera control.
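
One way to read panel (b) is as a least-squares fit from camera parameters to latent pose codes. The sketch below is not the authors' implementation: the data are synthetic, the map includes a bias term (an assumption), and the names `fit_linear_mapper`, `apply_mapper`, `cam_dim`, and `latent_dim` are hypothetical.

```python
import numpy as np

def fit_linear_mapper(cams, latents):
    """Least-squares fit of latents ~ [cams, 1] @ W (linear map plus bias)."""
    ones = np.ones((cams.shape[0], 1))
    X = np.hstack([cams, ones])                # (N, cam_dim + 1)
    W, *_ = np.linalg.lstsq(X, latents, rcond=None)
    return W                                   # (cam_dim + 1, latent_dim)

def apply_mapper(W, cams):
    """Map camera parameters into the latent space with the fitted W."""
    ones = np.ones((cams.shape[0], 1))
    return np.hstack([cams, ones]) @ W

rng = np.random.default_rng(0)
cam_dim, latent_dim, n = 7, 16, 200            # assumed sizes: quaternion (4) + translation (3)
cams = rng.normal(size=(n, cam_dim))           # synthetic camera parameters
true_W = rng.normal(size=(cam_dim + 1, latent_dim))
latents = apply_mapper(true_W, cams)           # synthetic targets, exactly affine in cams

W = fit_linear_mapper(cams, latents)
print(np.allclose(W, true_W))
```

On real latent codes the fit would be approximate, since the caption describes the alignment as a twisted transformation and not an exactly linear one.
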
Figure 10. Scalability Comparison of Implicit Methods.
Figure 11. [Caption not recovered; panel labels: w/o Mapper, w/ Mapper.]
Figure 12. Performance of Methods Trained with Noisy Poses. Different levels of Gaussian noise (σ² = 0.001, 0.01, 0.1) were added to the rotation (in quaternion form) and translation components of the poses in training data. While UP-LVSM remains agnostic to noisy poses, LVSM experiences significant degradation with increasing noise levels, exhibiting sensitivity even to small amounts of noise (0.001).
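
A minimal sketch of the perturbation described in the Figure 12 caption, assuming each pose is a unit quaternion plus a translation vector and the noise is i.i.d. zero-mean Gaussian with variance σ² per component. Renormalizing the quaternion afterwards and the function name `perturb_pose` are assumptions; the caption does not specify them.

```python
import numpy as np

def perturb_pose(quat, trans, sigma2, rng):
    """Add zero-mean Gaussian noise with variance sigma2 to a pose.

    quat: (4,) unit quaternion; trans: (3,) translation.
    Returns the noisy quaternion (renormalized to unit length) and translation.
    """
    std = np.sqrt(sigma2)
    noisy_q = quat + rng.normal(0.0, std, size=4)
    noisy_q = noisy_q / np.linalg.norm(noisy_q)  # assumed step: keep a valid rotation
    noisy_t = trans + rng.normal(0.0, std, size=3)
    return noisy_q, noisy_t

rng = np.random.default_rng(0)
identity_q = np.array([1.0, 0.0, 0.0, 0.0])
zero_t = np.zeros(3)
for sigma2 in (0.001, 0.01, 0.1):                # noise levels named in the caption
    q, t = perturb_pose(identity_q, zero_t, sigma2, rng)
    print(sigma2, np.round(q, 3), np.round(t, 3))
```
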
Original abstract

Recent advances in feed-forward Novel View Synthesis (NVS) have led to a divergence between two design philosophies: bias-driven methods, which rely on explicit 3D knowledge, such as handcrafted 3D representations (e.g., NeRF and 3DGS) and camera poses annotated by Structure-from-Motion algorithms, and data-centric methods, which learn to understand 3D structure implicitly from large-scale imagery data. This raises a fundamental question: which paradigm is more scalable in an era of ever-increasing data availability? In this work, we conduct a comprehensive analysis of existing methods and uncover a critical trend that the performance of methods requiring less 3D knowledge accelerates more as training data increases, eventually outperforming their 3D knowledge-driven counterparts, which we term "the less you depend, the more you learn." Guided by this finding, we design a feed-forward NVS framework that removes both explicit scene structure and pose annotation reliance. By eliminating these dependencies, our method leverages great scalability, learning implicit 3D awareness directly from vast quantities of 2D images, without any pose information for training or inference. Extensive experiments demonstrate that our model achieves state-of-the-art NVS performance, even outperforming methods relying on posed training data. The results validate not only the effectiveness of our data-centric paradigm but also the power of our scalability finding as a guiding principle.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes existing novel view synthesis (NVS) methods and identifies a scaling trend: approaches relying on less explicit 3D knowledge (no poses, no handcrafted representations) improve faster with more training data and eventually surpass bias-driven methods that use SfM poses or explicit 3D structures such as NeRF/3DGS. Guided by this observation, the authors introduce a feed-forward NVS architecture trained and tested without any pose or scene-structure supervision, claiming state-of-the-art performance on standard benchmarks even when compared to pose-supervised baselines.

Significance. If the reported scaling advantage is robust and the new pose-free model continues to improve at larger scales, the work would provide both an empirical principle and a practical architecture that reduces dependence on costly 3D annotations, potentially enabling NVS on massive unposed image collections. The manuscript does not offer reproducible code or parameter-free derivations; its strength lies in the empirical trend analysis and the architectural simplification.

major comments (2)
  1. The central scaling claim (performance acceleration driven by reduced 3D dependence) rests on a comparison of prior methods whose groups differ systematically in backbone (CNN vs. transformer), parameter count, optimization schedule, and data curation. Without a controlled ablation that holds architecture and capacity fixed while varying only the amount of explicit 3D input, it remains possible that the steeper curves reflect recent architectural progress rather than the 'less you depend' principle. This directly affects the justification for designing the new pose-free model.
  2. The manuscript states that the new model is trained without pose supervision and still outperforms posed baselines, yet no quantitative breakdown is given for how much of the gain comes from the architecture versus the larger effective training set enabled by dropping pose requirements. A direct comparison on identical data with and without pose inputs would isolate the contribution.
minor comments (2)
  1. Figure captions and axis labels in the scaling plots should explicitly state the exact metric (e.g., PSNR, SSIM) and the precise definition of 'training data volume' used for each curve.
  2. The abstract claims 'state-of-the-art NVS performance' without specifying the evaluation protocol (number of input views, scene categories, or whether test poses are known at inference). Adding these details would improve clarity.
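
For reference, PSNR, one of the metrics named in the first minor comment, has a standard definition short enough to state in code. The sketch assumes images are float arrays scaled to [0, 1]; the function name `psnr` is illustrative, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(max_val^2 / MSE)."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")                    # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

target = np.zeros((4, 4, 3))
pred = np.full((4, 4, 3), 0.1)                 # constant error of 0.1, so MSE is 0.01
print(psnr(pred, target))                      # about 20 dB, up to floating-point rounding
```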

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating planned revisions to strengthen the manuscript where appropriate.

Point-by-point responses
  1. Referee: The central scaling claim (performance acceleration driven by reduced 3D dependence) rests on a comparison of prior methods whose groups differ systematically in backbone (CNN vs. transformer), parameter count, optimization schedule, and data curation. Without a controlled ablation that holds architecture and capacity fixed while varying only the amount of explicit 3D input, it remains possible that the steeper curves reflect recent architectural progress rather than the 'less you depend' principle. This directly affects the justification for designing the new pose-free model.

    Authors: We acknowledge that the literature methods aggregated in our scaling analysis differ in backbone, capacity, and training details, which could introduce confounding factors. The trend we report is nevertheless consistent across multiple independent works, providing empirical motivation for the data-centric direction. In the revision we will add explicit discussion of these potential confounders and their possible influence on the observed curves. The design of the new pose-free model is justified primarily by its own empirical results on standard benchmarks without pose supervision; we will clarify that the literature analysis serves as guiding observation rather than conclusive causal proof. revision: partial

  2. Referee: The manuscript states that the new model is trained without pose supervision and still outperforms posed baselines, yet no quantitative breakdown is given for how much of the gain comes from the architecture versus the larger effective training set enabled by dropping pose requirements. A direct comparison on identical data with and without pose inputs would isolate the contribution.

    Authors: We agree that isolating the contribution of architecture versus expanded data scale would be valuable. Because our architecture is specifically engineered for unposed inputs, constructing an otherwise identical posed variant is non-trivial; however, we will add scaling experiments on matched data subsets in the revision to quantify the benefit of the larger effective training set made possible by removing pose requirements. We will also discuss the practical trade-offs involved in such a comparison. revision: yes

Circularity Check

0 steps flagged

Empirical scaling analysis of prior methods and independent new-model validation contain no circular reductions

full rationale

The paper derives its guiding principle from a comparative performance analysis of existing NVS methods (NeRF/3DGS/pose-based vs. implicit feed-forward) as data volume grows; this is an external empirical observation on published results rather than any self-referential fit, definition, or self-citation chain. The subsequent architecture is explicitly constructed to remove pose and structure inputs and is evaluated as a fresh test of the observed trend. No equations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided abstract or described derivation; the chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on the abstract alone, the central claim rests on the unverified assumption that the scaling trend generalizes to the authors' architecture and that implicit learning from 2D images alone suffices for high-quality NVS. No explicit free parameters, axioms, or invented entities are identifiable from the provided text.



Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 5 internal anchors

  1. [1]

    Mip-nerf 360: Unbounded anti-aliased neural radiance fields

    Jonathan T Barron, Ben Mildenhall, Dor Verbin, Pratul P Srinivasan, and Peter Hedman. Mip-nerf 360: Unbounded anti-aliased neural radiance fields. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5470–5479, 2022

  2. [2]

    Scene coordinate reconstruction: Posing of image collections via incremental learning of a relocalizer

    Eric Brachmann, Jamie Wynn, Shuai Chen, Tommaso Cavallari, Áron Monszpart, Daniyar Turmukhambetov, and Victor Adrian Prisacariu. Scene coordinate reconstruction: Posing of image collections via incremental learning of a relocalizer. In European Conference on Computer Vision, pages 421–440. Springer, 2024

  3. [3]

    Matterport3D: Learning from RGB-D Data in Indoor Environments

    Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niessner, Manolis Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments. arXiv preprint arXiv:1709.06158, 2017

  4. [4]

    ShapeNet: An Information-Rich 3D Model Repository

    Angel X Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, et al. Shapenet: An information-rich 3d model repository. arXiv preprint arXiv:1512.03012, 2015

  5. [5]

    pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

    David Charatan, Sizhe Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In CVPR, 2024

  6. [6]

    Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

    Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. arXiv preprint arXiv:2403.14627, 2024

  7. [7]

    Abo: Dataset and benchmarks for real-world 3d object understanding

    Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F Yago Vicente, Thomas Dideriksen, Himanshu Arora, et al. Abo: Dataset and benchmarks for real-world 3d object understanding. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21126–21136, 2022

  8. [8]

    Scannet: Richly-annotated 3d reconstructions of indoor scenes

    Angela Dai, Angel X Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

  9. [9]

    Objaverse-xl: A universe of 10m+ 3d objects

    Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, et al. Objaverse-xl: A universe of 10m+ 3d objects. Advances in Neural Information Processing Systems, 36:35799–35813, 2023

  10. [10]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  11. [11]

    Google scanned objects: A high-quality dataset of 3d scanned household items

    Laura Downs, Anthony Francis, Nate Koenig, Brandon Kinman, Ryan Hickman, Krista Reymann, Thomas B McHugh, and Vincent Vanhoucke. Google scanned objects: A high-quality dataset of 3d scanned household items. In 2022 International Conference on Robotics and Automation (ICRA), pages 2553–2560. IEEE, 2022

  12. [12]

    Learning to render novel views from wide-baseline stereo pairs

    Yilun Du, Cameron Smith, Ayush Tewari, and Vincent Sitzmann. Learning to render novel views from wide-baseline stereo pairs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  13. [13]

    Learning to render novel views from wide-baseline stereo pairs

    Yilun Du, Cameron Smith, Ayush Tewari, and Vincent Sitzmann. Learning to render novel views from wide-baseline stereo pairs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4970–4980, 2023

  14. [14]

    Pose-free generalizable rendering transformer

    Zhiwen Fan, Panwang Pan, Peihao Wang, Yifan Jiang, Hanwen Jiang, Dejia Xu, Zehao Zhu, Dilin Wang, and Zhangyang Wang. Pose-free generalizable rendering transformer. arXiv preprint arXiv:2310.03704, 2023

  15. [15]

    Instantsplat: Unbounded sparse-view pose-free gaussian splatting in 40 seconds

    Zhiwen Fan, Wenyan Cong, Kairun Wen, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, et al. Instantsplat: Unbounded sparse-view pose-free gaussian splatting in 40 seconds. arXiv preprint arXiv:2403.20309, 2024

  16. [16]

    Query-key normalization for transformers

    Alex Henry, Prudhvi Raj Dachapally, Shubham Pawar, and Yuxuan Chen. Query-key normalization for transformers. arXiv preprint arXiv:2010.04245, 2020

  17. [17]

    Rayzer: A self-supervised large view synthesis model, 2025

    Hanwen Jiang, Hao Tan, Peng Wang, Haian Jin, Yue Zhao, Sai Bi, Kai Zhang, Fujun Luan, Kalyan Sunkavalli, Qixing Huang, and Georgios Pavlakos. Rayzer: A self-supervised large view synthesis model, 2025

  18. [18]

    Lvsm: A large view synthesis model with minimal 3d inductive bias

    Haian Jin, Hanwen Jiang, Hao Tan, Kai Zhang, Sai Bi, Tianyuan Zhang, Fujun Luan, Noah Snavely, and Zexiang Xu. Lvsm: A large view synthesis model with minimal 3d inductive bias. In The Thirteenth International Conference on Learning Representations, 2025

  19. [19]

    Perceptual losses for real-time style transfer and super-resolution

    Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 694–711. Springer, 2016

  20. [20]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023

  21. [21]

    Megadepth: Learning single-view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2041–2050, 2018

  22. [22]

    Openrooms: An open framework for photorealistic indoor scene datasets

    Zhengqin Li, Ting-Wei Yu, Shen Sang, Sarah Wang, Meng Song, Yuhan Liu, Yu-Ying Yeh, Rui Zhu, Nitesh Gundavarapu, Jia Shi, et al. Openrooms: An open framework for photorealistic indoor scene datasets. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 7190–7199, 2021

  23. [23]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22160–22169, 2024

  24. [24]

    Infinite nature: Perpetual view generation of natural scenes from a single image

    Andrew Liu, Richard Tucker, Varun Jampani, Ameesh Makadia, Noah Snavely, and Angjoo Kanazawa. Infinite nature: Perpetual view generation of natural scenes from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14458–14467, 2021

  25. [25]

    Matrix3d: Large photogrammetry model all-in-one

    Yuanxun Lu, Jingyang Zhang, Tian Fang, Jean-Daniel Nahmias, Yanghai Tsin, Long Quan, Xun Cao, Yao Yao, and Shiwei Li. Matrix3d: Large photogrammetry model all-in-one. In CVPR, 2025

  26. [26]

    NeRF: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In The European Conference on Computer Vision (ECCV), 2020

  27. [27]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  28. [28]

    Julius Plucker. Xvii. on a new geometry of space. Philosophical Transactions of the Royal Society of London, 155:725–791, 1865

  29. [29]

    Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction

    Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10901–10911, 2021

  30. [30]

    Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding

    Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10912–10922, 2021

  31. [31]

    Geometry-free view synthesis: Transformers and no 3d priors

    Robin Rombach, Patrick Esser, and Björn Ommer. Geometry-free view synthesis: Transformers and no 3d priors. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 14356–14366, 2021

  32. [32]

    Mehdi S. M. Sajjadi, Henning Meyer, Etienne Pot, Urs Bergmann, Klaus Greff, Noha Radwan, Suhani Vora, Mario Lucic, Daniel Duckworth, Alexey Dosovitskiy, Jakob Uszkoreit, Thomas Funkhouser, and Andrea Tagliasacchi. Scene Representation Transformer: Geometry-Free Novel View Synthesis Through Set-Latent Scene Representations. CVPR, 2022

  33. [33]

    Mehdi S. M. Sajjadi, Aravindh Mahendran, Thomas Kipf, Etienne Pot, Daniel Duckworth, Mario Lučić, and Klaus Greff. RUST: Latent Neural Scene Representations from Unposed Imagery. CVPR, 2023

  34. [34]

    Structure-from-motion revisited

    Johannes L Schonberger and Jan-Michael Frahm. Structure-from-motion revisited. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4104–4113, 2016

  35. [35]

    Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

    Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs. arXiv preprint arXiv:2408.13912, 2024

  36. [36]

    Generalizable patch-based neural rendering

    Mohammed Suhail, Carlos Esteves, Leonid Sigal, and Ameesh Makadia. Generalizable patch-based neural rendering. In European Conference on Computer Vision, pages 156–174. Springer, 2022

  37. [37]

    Megascenes: Scene-level view synthesis at scale

    Joseph Tung, Gene Chou, Ruojin Cai, Guandao Yang, Kai Zhang, Gordon Wetzstein, Bharath Hariharan, and Noah Snavely. Megascenes: Scene-level view synthesis at scale. In ECCV, 2024

  38. [38]

    Visualizing data using t-sne

    Laurens Van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. Journal of machine learning research, 9(11), 2008

  39. [39]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  40. [40]

    Vggsfm: Visual geometry grounded deep structure from motion

    Jianyuan Wang, Nikita Karaev, Christian Rupprecht, and David Novotny. Vggsfm: Visual geometry grounded deep structure from motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21686–21697, 2024

  41. [41]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

  42. [42]

    Ibrnet: Learning multi-view image-based rendering

    Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul P Srinivasan, Howard Zhou, Jonathan T Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. Ibrnet: Learning multi-view image-based rendering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4690–4699, 2021

  43. [43]

    Dust3r: Geometric 3d vision made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vision made easy. In CVPR, 2024

  44. [44]

    NeRF−−: Neural radiance fields without known camera parameters

    Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. NeRF−−: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064, 2021

  45. [45]

    latentsplat: Autoencoding variational gaussians for fast generalizable 3d reconstruction

    Christopher Wewer, Kevin Raj, Eddy Ilg, Bernt Schiele, and Jan Eric Lenssen. latentsplat: Autoencoding variational gaussians for fast generalizable 3d reconstruction. In European Conference on Computer Vision, pages 456–473. Springer, 2024

  46. [46]

    Murf: multi-baseline radiance fields

    Haofei Xu, Anpei Chen, Yuedong Chen, Christos Sakaridis, Yulun Zhang, Marc Pollefeys, Andreas Geiger, and Fisher Yu. Murf: multi-baseline radiance fields. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 20041–20050, 2024

  47. [47]

    Blendedmvs: A large-scale dataset for generalized multi-view stereo networks

    Yao Yao, Zixin Luo, Shiwei Li, Jingyang Zhang, Yufan Ren, Lei Zhou, Tian Fang, and Long Quan. Blendedmvs: A large-scale dataset for generalized multi-view stereo networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1790–1799, 2020

  48. [48]

    No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images

    Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No pose, no problem: Surprisingly simple 3d gaussian splats from sparse unposed images. In The Thirteenth International Conference on Learning Representations, 2025

  49. [49]

    Scannet++: A high-fidelity dataset of 3d indoor scenes

    Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12–22, 2023

  50. [50]

    pixelNeRF: Neural radiance fields from one or few images

    Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural radiance fields from one or few images. In CVPR, 2021

  51. [51]

    Mvimgnet: A large-scale dataset of multi-view images

    Xianggang Yu, Mutian Xu, Yidan Zhang, Haolin Liu, Chongjie Ye, Yushuang Wu, Zizheng Yan, Chenming Zhu, Zhangyang Xiong, Tianyou Liang, et al. Mvimgnet: A large-scale dataset of multi-view images. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9150–9161, 2023

  52. [52]

    Mip-splatting: Alias-free 3d gaussian splatting

    Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splatting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 19447–19456, 2024

  53. [53]

    Gs-lrm: Large reconstruction model for 3d gaussian splatting

    Kai Zhang, Sai Bi, Hao Tan, Yuanbo Xiangli, Nanxuan Zhao, Kalyan Sunkavalli, and Zexiang Xu. Gs-lrm: Large reconstruction model for 3d gaussian splatting. European Conference on Computer Vision, 2024

  54. [54]

    Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views

    Shangzhan Zhang, Jianyuan Wang, Yinghao Xu, Nan Xue, Christian Rupprecht, Xiaowei Zhou, Yujun Shen, and Gordon Wetzstein. Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. arXiv preprint arXiv:2502.12138, 2025

  55. [55]

    Stereo magnification: learning view synthesis using multiplane images

    Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: learning view synthesis using multiplane images. ACM Trans. Graph., 37(4), 2018