pith. machine review for the scientific record.

arxiv: 2408.13912 · v2 · submitted 2024-08-25 · 💻 cs.CV · cs.LG

Recognition: no theorem link

Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 01:11 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords Gaussian Splatting · 3D Reconstruction · Novel View Synthesis · Uncalibrated Images · Stereo Pairs · Feed-forward Network · Zero-shot Generalization

The pith

Splatt3R turns any uncalibrated stereo image pair into a 3D Gaussian splat without camera parameters or depth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Splatt3R is a feed-forward network that reconstructs scenes as 3D Gaussian splats from two uncalibrated natural images. It builds on a 3D geometry foundation model, MASt3R, by adding the attributes needed for Gaussian primitives, and it trains in two stages: first matching the 3D point-cloud geometry, then optimizing for novel view synthesis with a loss-masking strategy for extrapolated viewpoints. This sequence avoids the local minima that arise when training splats directly from limited stereo views. The result is a model trained on indoor scans that generalizes well to outdoor and in-the-wild photos, producing splats that support real-time rendering after a fast feed-forward reconstruction.

Core claim

Given two uncalibrated images, Splatt3R predicts a set of 3D Gaussians whose positions, colors, and other attributes allow both accurate geometry reconstruction and high-quality novel view synthesis. The method first optimizes a geometry loss on the point cloud and later switches to a rendering loss, with masking on extrapolated regions, to achieve this from stereo pairs alone.

What carries the argument

Two-stage training that first optimizes only the 3D point cloud geometry loss before applying the novel view synthesis objective, together with a loss masking strategy for extrapolated viewpoints.
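
A minimal sketch of what that schedule could look like, written as a PyTorch-style training step. The batch keys, the DUSt3R/MASt3R-style confidence-weighted geometry loss, and the visibility mask are illustrative assumptions, not the paper's code:

    import torch
    import torch.nn.functional as F

    def two_stage_step(model, batch, optimizer, epoch, geometry_epochs):
        """One training step under the staged schedule (illustrative sketch)."""
        # Assumed model output: per-pixel 3D points, confidences, and a render
        # of the target view from the predicted Gaussians.
        pred = model(batch)
        if epoch < geometry_epochs:
            # Stage 1: supervise only 3D geometry, weighted by predicted
            # confidence (a DUSt3R/MASt3R-style regression loss).
            err = (pred["points"] - batch["gt_points"]).norm(dim=-1)
            loss = (pred["conf"] * err - 0.2 * torch.log(pred["conf"])).mean()
        else:
            # Stage 2: switch to the novel-view-synthesis objective, restricted
            # by a mask to pixels plausibly visible from the input pair.
            mask = batch["visibility_mask"].float()
            per_pixel = F.mse_loss(pred["render"], batch["gt_view"], reduction="none")
            loss = (mask * per_pixel).sum() / mask.sum().clamp(min=1)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()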

If this is right

  • Scenes can be reconstructed and rendered in real time from casual uncalibrated photo pairs.
  • 3D modeling becomes possible in environments where acquiring camera poses or depth is impractical.
  • The approach supports generalization from controlled training data to diverse natural images.
  • Reconstruction runs at 4 frames per second for 512×512 images, with real-time splat rendering afterward.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the method to sequences of more than two images could improve consistency and reduce artifacts in complex scenes.
  • Integration with mobile devices might enable instant 3D capture from everyday photography without special equipment.
  • Similar staged training could benefit other 3D representation learning tasks that suffer from local minima in direct optimization.

Load-bearing premise

The assumption that beginning with geometry optimization and then moving to appearance optimization, plus the masking, reliably sidesteps the local minima encountered in direct Gaussian splat training from stereo pairs.

What would settle it

Running Splatt3R on a dataset of uncalibrated image pairs with ground-truth novel views from significantly different angles and measuring whether the rendered images match the ground truth within a small error margin.
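
That test is straightforward to operationalize. Below is a minimal sketch, assuming hypothetical reconstruct and render callables standing in for Splatt3R's forward pass and splat rasterizer, and scoring with scikit-image's PSNR and SSIM:

    import numpy as np
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    def evaluate(pairs, targets, reconstruct, render):
        """Score rendered novel views against ground truth (illustrative)."""
        psnrs, ssims = [], []
        for (img_a, img_b), (camera, gt) in zip(pairs, targets):
            splats = reconstruct(img_a, img_b)   # feed-forward, no poses or depth supplied
            pred = render(splats, camera)        # held-out viewpoint, ideally wide baseline
            psnrs.append(peak_signal_noise_ratio(gt, pred, data_range=1.0))
            ssims.append(structural_similarity(gt, pred, channel_axis=-1, data_range=1.0))
        return float(np.mean(psnrs)), float(np.mean(ssims))

The discriminating condition is performance on wide-baseline, extrapolated target views, not just interpolated ones.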

read the original abstract

In this paper, we introduce Splatt3R, a pose-free, feed-forward method for in-the-wild 3D reconstruction and novel view synthesis from stereo pairs. Given uncalibrated natural images, Splatt3R can predict 3D Gaussian Splats without requiring any camera parameters or depth information. For generalizability, we build Splatt3R upon a "foundation" 3D geometry reconstruction method, MASt3R, by extending it to deal with both 3D structure and appearance. Specifically, unlike the original MASt3R which reconstructs only 3D point clouds, we predict the additional Gaussian attributes required to construct a Gaussian primitive for each point. Hence, unlike other novel view synthesis methods, Splatt3R is first trained by optimizing the 3D point cloud's geometry loss, and then a novel view synthesis objective. By doing this, we avoid the local minima present in training 3D Gaussian Splats from stereo views. We also propose a novel loss masking strategy that we empirically find is critical for strong performance on extrapolated viewpoints. We train Splatt3R on the ScanNet++ dataset and demonstrate excellent generalisation to uncalibrated, in-the-wild images. Splatt3R can reconstruct scenes at 4FPS at 512 x 512 resolution, and the resultant splats can be rendered in real-time.
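
The extension the abstract describes is concrete enough to sketch: for each predicted point, a small head outputs the remaining parameters of a Gaussian primitive. The head layout below (a 1×1 convolution over decoder features, first-order spherical harmonics) is an assumption for illustration, not the paper's exact architecture:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GaussianHead(nn.Module):
        """Predicts per-pixel Gaussian attributes from decoder features (sketch)."""
        def __init__(self, feat_dim, sh_degree=1):
            super().__init__()
            self.n_sh = 3 * (sh_degree + 1) ** 2        # RGB spherical-harmonic coefficients
            self.out = nn.Conv2d(feat_dim, 3 + 4 + 1 + self.n_sh, kernel_size=1)

        def forward(self, feats):
            x = self.out(feats)
            log_scale, quat, opacity, sh = torch.split(x, [3, 4, 1, self.n_sh], dim=1)
            return {
                "scale": log_scale.exp(),               # positive anisotropic scales
                "rotation": F.normalize(quat, dim=1),   # unit quaternion per pixel
                "opacity": torch.sigmoid(opacity),      # in (0, 1)
                "sh": sh,                               # view-dependent appearance
            }

Together with the MASt3R point map, these outputs fully parameterize one Gaussian primitive per pixel.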

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Splatt3R, a feed-forward, pose-free method that predicts 3D Gaussian splats directly from uncalibrated stereo image pairs. It extends the MASt3R foundation model to output per-point Gaussian attributes (scales, rotations, opacities, and appearance coefficients) in addition to 3D points. Training follows a two-stage schedule: first optimizing a 3D point-cloud geometry loss inherited from MASt3R, then switching to a novel-view-synthesis objective, together with a proposed loss-masking strategy. The model is trained on ScanNet++ and claims strong generalization to in-the-wild images, with reconstruction at 4 FPS (512×512) and real-time splat rendering.

Significance. If the results hold, the work would be significant for enabling calibration-free, feed-forward Gaussian splatting and novel-view synthesis from casual stereo pairs. Leveraging a foundation model plus staged training could simplify pipelines that currently rely on SfM or multi-view optimization, while the reported speed supports practical deployment. Explicit credit is due for the reproducible integration with MASt3R and the focus on in-the-wild generalization.

major comments (2)
  1. [§3.2] Training Procedure: The central claim that the two-stage schedule (geometry loss followed by NVS objective) plus loss masking reliably avoids the local minima that arise in direct Gaussian-splat optimization from stereo views is load-bearing for the method. No ablation is reported that compares staged training against joint optimization of all Gaussian attributes or against MASt3R features alone; quantitative metrics (PSNR/SSIM/LPIPS on extrapolated views) for these variants are required to substantiate the claim.
  2. [§4.3] Ablations and Masking: The loss-masking strategy is described as 'empirically critical' for performance on extrapolated viewpoints, yet the manuscript provides no isolated quantitative ablation (e.g., with/without masking on the same backbone) or details on mask computation. This omission weakens the ability to attribute gains specifically to the proposed masking rather than to the base MASt3R geometry.
minor comments (2)
  1. [Abstract] The abstract states 'excellent generalisation' without citing any numerical metrics or baseline comparisons; adding a brief quantitative statement would improve clarity.
  2. [§3.1] Notation for the additional Gaussian attributes (scale, rotation, opacity, appearance) should be introduced once in §3.1 and used consistently thereafter to avoid ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of Splatt3R for calibration-free Gaussian splatting. We respond point-by-point to the major comments below.

read point-by-point responses
  1. Referee: [§3.2] Training Procedure: The central claim that the two-stage schedule (geometry loss followed by NVS objective) plus loss masking reliably avoids the local minima that arise in direct Gaussian-splat optimization from stereo views is load-bearing for the method. No ablation is reported that compares staged training against joint optimization of all Gaussian attributes or against MASt3R features alone; quantitative metrics (PSNR/SSIM/LPIPS on extrapolated views) for these variants are required to substantiate the claim.

    Authors: We agree that an explicit ablation is needed to substantiate the benefit of the two-stage schedule. In the revised manuscript we will add a quantitative comparison on the same evaluation protocol, reporting PSNR, SSIM and LPIPS on extrapolated views for three variants: (1) joint optimization of all Gaussian attributes from the first epoch, (2) MASt3R geometry features without the novel-view-synthesis stage, and (3) the proposed two-stage procedure. We expect the staged approach to show clear gains because early geometry stabilization prevents the optimizer from settling into poor local minima once appearance attributes are introduced. revision: yes

  2. Referee: [§4.3] Ablations and Masking: The loss-masking strategy is described as 'empirically critical' for performance on extrapolated viewpoints, yet the manuscript provides no isolated quantitative ablation (e.g., with/without masking on the same backbone) or details on mask computation. This omission weakens the ability to attribute gains specifically to the proposed masking rather than to the base MASt3R geometry.

    Authors: We acknowledge the omission. In the revision we will expand the description of the loss-masking strategy with the exact computation (thresholding on MASt3R per-point confidence combined with cross-view geometric consistency checks) and add an isolated ablation that trains the identical backbone with and without masking. The table will report PSNR/SSIM/LPIPS on extrapolated views so that the contribution of masking can be isolated from the base MASt3R geometry. revision: yes
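
To make that description concrete (bearing in mind the rebuttal above is simulated), here is one plausible construction of such a mask: keep only confident source geometry, project it into the target view, and apply the rendering loss only on covered pixels. The threshold value and the pinhole-projection interface are assumptions:

    import torch

    def extrapolation_mask(points, conf, K, T_wc, hw, conf_thresh=1.5):
        """Target-view pixels covered by confident source geometry (sketch).

        points: (N, 3) world points predicted from the stereo pair; conf: (N,)
        K: (3, 3) target intrinsics; T_wc: (4, 4) world-to-camera extrinsics.
        """
        H, W = hw
        pts = points[conf > conf_thresh]                   # confidence thresholding
        cam = (T_wc[:3, :3] @ pts.T + T_wc[:3, 3:]).T      # world -> target camera frame
        z = cam[:, 2]
        uv = (K @ (cam / z.clamp(min=1e-6).unsqueeze(-1)).T).T[:, :2]  # pinhole projection
        ok = (z > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        mask = torch.zeros(H, W, dtype=torch.bool)
        u, v = uv[ok].long().unbind(-1)
        mask[v, u] = True                                  # loss applied only where True
        return mask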

Circularity Check

0 steps flagged

No significant circularity: extension of external MASt3R with empirical staged training

full rationale

The derivation relies on an external foundation model (MASt3R) for base geometry reconstruction and a public dataset (ScanNet++). The core extension—predicting per-point Gaussian attributes (scales, rotations, opacities, appearance) and applying a two-stage optimization (first 3D point cloud geometry loss, then novel-view synthesis objective) plus loss masking—is presented as an empirical design choice to avoid local minima, not as a mathematical reduction or self-definitional fit. No equations equate outputs to inputs by construction, no uniqueness theorem is imported from self-citations, and no ansatz is smuggled via prior author work. Generalization claims rest on reported experiments rather than forced predictions. This is a standard non-circular extension of prior independent work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that MASt3R already provides reliable geometry and that adding Gaussian attributes plus staged training plus masking will produce view-consistent splats; no new physical entities are introduced.

axioms (2)
  • domain assumption MASt3R produces sufficiently accurate 3D point clouds from uncalibrated pairs to serve as a stable base for subsequent Gaussian attribute prediction.
    Invoked when the paper states it builds upon MASt3R by extending it to deal with both 3D structure and appearance.
  • ad hoc to paper Optimizing geometry loss first followed by novel-view-synthesis loss avoids the local minima that arise when training Gaussian splats directly from stereo views.
    Explicitly stated as the reason for the two-stage training procedure.

pith-pipeline@v0.9.0 · 5561 in / 1464 out tokens · 67050 ms · 2026-05-17T01:11:07.424871+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views

    cs.CV 2026-05 unverdicted novelty 8.0

    GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.

  2. ConFixGS: Learning to Fix Feedforward 3D Gaussian Splatting with Confidence-Aware Diffusion Priors in Driving Scenes

    cs.CV 2026-05 unverdicted novelty 7.0

    ConFixGS repairs feedforward 3D Gaussian Splatting with confidence-aware diffusion priors, delivering up to 3.68 dB PSNR gains and halved FID scores on Waymo, nuScenes, and KITTI novel view synthesis tasks.

  3. SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis

    cs.CV 2026-05 unverdicted novelty 7.0

    SplatWeaver dynamically allocates Gaussian primitives via cardinality experts and pixel-level routing guided by high-frequency cues for improved generalizable novel view synthesis.

  4. Ground4D: Spatially-Grounded Feedforward 4D Reconstruction for Unstructured Off-Road Scenes

    cs.CV 2026-05 unverdicted novelty 7.0

    Ground4D resolves temporal conflicts in feedforward 4D Gaussian reconstruction for off-road scenes via voxel-grounded temporal aggregation with intra-voxel softmax and surface normal regularization, outperforming prio...

  5. WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images

    cs.CV 2026-04 unverdicted novelty 7.0

    WildSplatter jointly learns 3D Gaussians and appearance embeddings from unconstrained photo collections to enable fast feed-forward reconstruction and flexible lighting control in 3D Gaussian Splatting.

  6. Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction

    cs.CV 2026-04 unverdicted novelty 7.0

    Free-Range Gaussians uses flow matching over Gaussian parameters to predict non-grid-aligned 3D Gaussians from multi-view images, enabling synthesis of plausible content in unobserved regions with fewer primitives tha...

  7. 3AM: 3egment Anything with Geometric Consistency in Videos

    cs.CV 2026-01 unverdicted novelty 7.0

    3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.

  8. MODEST: Multi-Optics Depth-of-Field Stereo Dataset

    cs.CV 2025-11 accept novelty 7.0

    MODEST provides the first large-scale high-resolution stereo DSLR dataset with systematic variation of focal length and aperture to support research on real-world optical effects in depth estimation.

  9. FluSplat: Sparse-View 3D Editing without Test-Time Optimization

    cs.CV 2026-04 unverdicted novelty 6.0

    FluSplat trains a model with geometric alignment constraints on multi-view edits to produce consistent 3D scene edits from sparse views in a single forward pass without test-time optimization.

  10. Geometric Context Transformer for Streaming 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...

  11. Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

    cs.CV 2026-04 unverdicted novelty 6.0

    The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...

  12. LiveStre4m: Feed-Forward Live Streaming of Novel Views from Unposed Multi-View Video

    cs.CV 2026-04 unverdicted novelty 6.0

    LiveStre4m delivers real-time novel-view video streaming from unposed multi-view inputs via a multi-view vision transformer, diffusion-transformer interpolation, and a learned camera pose predictor.

  13. DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass

    cs.CV 2025-12 unverdicted novelty 6.0

    DePT3R performs joint dense point tracking and 3D reconstruction of dynamic scenes from multiple unposed images using a single neural network forward pass.

  14. C3G: Learning Compact 3D Representations with 2K Gaussians

    cs.CV 2025-12 unverdicted novelty 6.0

    C3G creates compact 3D Gaussian representations with 2K points by guiding placement via learnable tokens that aggregate multi-view features through attention, yielding better efficiency and performance than dense methods.

  15. Depth Anything 3: Recovering the Visual Space from Any Views

    cs.CV 2025-11 unverdicted novelty 6.0

    DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.

  16. Streaming 4D Visual Geometry Transformer

    cs.CV 2025-07 unverdicted novelty 6.0

    A causal transformer with key-value caching and distillation from a bidirectional VGGT model enables efficient online 4D geometry reconstruction from videos.

  17. ReorgGS: Equivalent Distribution Reorganization for 3D Gaussian Splatting

    cs.CV 2026-05 unverdicted novelty 5.0

    ReorgGS reorganizes the Gaussian distribution in converged 3DGS models by resampling centers and covariances to reduce parameterization degeneration and enable better subsequent optimization.

  18. Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images

    cs.CV 2026-04 unverdicted novelty 5.0

    UniSplat learns consistent 3D geometry, appearance, and semantics from unposed images using dual masking, progressive Gaussian splatting, and recalibration to align predictions across tasks.

  19. VGGT-SLAM++

    cs.CV 2026-04 unverdicted novelty 4.0

    VGGT-SLAM++ improves on prior transformer SLAM by adding dense DEM submap graphs and high-cadence local optimization, achieving SOTA accuracy with reduced drift and bounded memory on benchmarks.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 19 Pith papers

  1. [1] Edward H. Adelson and James R. Bergen. The plenoptic function and the elements of early vision. MIT Press, 1991.
  2. [2] Stephen T. Barnard and Martin A. Fischler. Computational stereo. ACM Computing Surveys (CSUR), 1982.
  3. [3] Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Mip-NeRF 360: Unbounded anti-aliased neural radiance fields. In CVPR, 2022.
  4. [4] Jia-Wang Bian, Wenjing Bian, Victor Adrian Prisacariu, and Philip Torr. PoRF: Pose residual field for accurate neural surface reconstruction. In ICLR, 2023.
  5. [5] Wenjing Bian, Zirui Wang, Kejie Li, Jia-Wang Bian, and Victor Adrian Prisacariu. NoPe-NeRF: Optimising neural radiance field with no pose prior. In CVPR, 2023.
  6. [6] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. In CVPR, 2018.
  7. [7] David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelSplat: 3D Gaussian splats from image pairs for scalable generalizable 3D reconstruction. In CVPR, 2024.
  8. [8] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo. In ICCV, 2021.
  9. [9] Yu Chen and Gim Hee Lee. DBARF: Deep bundle-adjusting generalizable neural radiance fields. In CVPR, 2023.
  10. [10] Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. MVSplat: Efficient 3D Gaussian splatting from sparse multi-view images. In ECCV, 2024.
  11. [11] Cheng Chi, Qingjie Wang, Tianyu Hao, Peng Guo, and Xin Yang. Feature-level collaboration: Joint unsupervised learning of optical flow, stereo depth and camera motion. In CVPR, 2021.
  12. [12] Julian Chibane, Aayush Bansal, Verica Lazova, and Gerard Pons-Moll. Stereo radiance fields (SRF): Learning view synthesis for sparse views of novel scenes. In CVPR, 2021.
  13. [13] Amaël Delaunoy and Marc Pollefeys. Photometric bundle adjustment for dense multi-view 3D modeling. In CVPR, 2014.
  14. [14] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-supervised interest point detection and description. In CVPRW, 2018.
  15. [15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  16. [16] Yilun Du, Cameron Smith, Ayush Tewari, and Vincent Sitzmann. Learning to render novel views from wide-baseline stereo pairs. In CVPR, 2023.
  17. [17] Ravi Garg, Vijay Kumar B. G., Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In ECCV, 2016.
  18. [18] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017.
  19. [19] Steven J. Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F. Cohen. The lumigraph. In Computer Graphics and Interactive Techniques, 1996.
  20. [20] Chris Harris, Mike Stephens, et al. A combined corner and edge detector. In Alvey Vision Conference, 1988.
  21. [21] Richard Hartley and Frederik Schaffalitzky. L∞ minimization in geometric reconstruction problems. In CVPR, 2004.
  22. [22] Richard I. Hartley and Peter Sturm. Triangulation. Computer Vision and Image Understanding, 1997.
  23. [23] Richard I. Hartley, Rajiv Gupta, and Tom Chang. Stereo from uncalibrated cameras. In CVPR, 1992.
  24. [24] Hiroshi Ishikawa and Davi Geiger. Occlusions, discontinuities, and epipolar lines in stereo. In ECCV, 1998.
  25. [25] Yoonwoo Jeong, Seokjun Ahn, Christopher Choy, Anima Anandkumar, Minsu Cho, and Jaesik Park. Self-calibrating neural radiance fields. In ICCV, 2021.
  26. [26] Hanwen Jiang, Zhenyu Jiang, Yue Zhao, and Qixing Huang. LEAP: Liberate sparse-view 3D modeling from camera poses. In ICLR, 2023.
  27. [27] Mohammad Mahdi Johari, Yann Lepoittevin, and François Fleuret. GeoNeRF: Generalizing NeRF with geometry priors. In CVPR, 2022.
  28. [28] Takeo Kanade, Atsushi Yoshida, Kazuo Oda, Hiroshi Kano, and Masaya Tanaka. A stereo machine for video-rate dense depth mapping and its new applications. In CVPR, 1996.
  29. [29] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM ToG, 2023.
  30. [30] Haechan Lee, Wonjoon Jin, Seung-Hwan Baek, and Sunghyun Cho. Generalizable novel-view synthesis using a stereo camera. In CVPR, 2024.
  31. [31] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R. arXiv preprint arXiv:2406.09756, 2024.
  32. [32] Marc Levoy and Pat Hanrahan. Light field rendering. In SIGGRAPH, 1996.
  33. [33] Hao Li, Yuanyuan Gao, Dingwen Zhang, Chenming Wu, Yalun Dai, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, and Junwei Han. GGRt: Towards generalizable 3D Gaussians without pose priors in real-time. In ECCV, 2024.
  34. [34] Yaokun Li, Chao Gou, and Guang Tan. Taming uncertainty in sparse-view generalizable NeRF via indirect diffusion guidance. arXiv preprint arXiv:2402.01217, 2024.
  35. [35] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. BARF: Bundle-adjusting neural radiance fields. In ICCV, 2021.
  36. [36] Tianqi Liu, Guangcong Wang, Shoukang Hu, Liao Shen, Xinyi Ye, Yuhang Zang, Zhiguo Cao, Wei Li, and Ziwei Liu. Fast generalizable Gaussian splatting reconstruction from multi-view stereo. In ECCV, 2024.
  37. [37] Yuan Liu, Sida Peng, Lingjie Liu, Qianqian Wang, Peng Wang, Christian Theobalt, Xiaowei Zhou, and Wenping Wang. Neural rays for occlusion-aware image-based rendering. In CVPR, 2022.
  38. [38] Xiaoxiao Long, Cheng Lin, Peng Wang, Taku Komura, and Wenping Wang. SparseNeuS: Fast generalizable neural surface reconstruction from sparse views. In ECCV, 2022.
  39. [39] David G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
  40. [40] Quan-Tuan Luong and Olivier D. Faugeras. The fundamental matrix: Theory, algorithms, and stability analysis. IJCV, 1996.
  41. [41] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  42. [42] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. In SIGGRAPH, 2022.
  43. [43] Zhangkai Ni, Peiqi Yang, Wenhan Yang, Hanli Wang, Lin Ma, and Sam Kwong. ColNeRF: Collaboration for generalizable sparse input neural radiance field. In AAAI, 2024.
  44. [44] René Ranftl and Vladlen Koltun. Deep fundamental matrix estimation. In ECCV, 2018.
  45. [45] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, 2021.
  46. [46] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In ICCV, 2021.
  47. [47] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3D-structure-aware neural scene representations. In NeurIPS, 2019.
  48. [48] Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering. In NeurIPS, 2021.
  49. [49] Cameron Smith, Yilun Du, Ayush Tewari, and Vincent Sitzmann. FlowCam: Training generalizable 3D radiance fields without camera poses via pixel-aligned scene flow. In NeurIPS, 2023.
  50. [50] Mohammed Suhail, Carlos Esteves, Leonid Sigal, and Ameesh Makadia. Generalizable patch-based neural rendering. In ECCV, 2022.
  51. [51] Stanislaw Szymanowicz, Eldar Insafutdinov, Chuanxia Zheng, Dylan Campbell, João F. Henriques, Christian Rupprecht, and Andrea Vedaldi. Flash3D: Feed-forward generalisable 3D scene reconstruction from a single image. arXiv preprint arXiv:2406.04343, 2024.
  52. [52] Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Splatter image: Ultra-fast single-view 3D reconstruction. In CVPR, 2024.
  53. [53] Miroslav Trajković and Mark Hedley. Fast corner detection. Image and Vision Computing, 1998.
  54. [54] Prune Truong, Marie-Julie Rakotosaona, Fabian Manhardt, and Federico Tombari. SPARF: Neural radiance fields from sparse and noisy poses. In CVPR, 2023.
  55. [55] Jianyuan Wang, Yiran Zhong, Yuchao Dai, Stan Birchfield, Kaihao Zhang, Nikolai Smolyanskiy, and Hongdong Li. Deep two-view structure-from-motion revisited. In CVPR, 2021.
  56. [56] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. IBRNet: Learning multi-view image-based rendering. In CVPR, 2021.
  57. [57] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jérôme Revaud. DUSt3R: Geometric 3D vision made easy. In CVPR, 2024.
  58. [58] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. NeRF--: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064, 2021.
  59. [59] Christopher Wewer, Kevin Raj, Eddy Ilg, Bernt Schiele, and Jan Eric Lenssen. latentSplat: Autoencoding variational Gaussians for fast generalizable 3D reconstruction. In ECCV, 2024.
  60. [60] Oliver J. Woodford and Edward Rosten. Large scale photometric bundle adjustment. In BMVC, 2020.
  61. [61] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In ICCV, 2023.
  62. [62] Zhichao Yin and Jianping Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In CVPR, 2018.
  63. [63] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural radiance fields from one or few images. In CVPR, 2021.
  64. [64] Jure Zbontar and Yann LeCun. Computing the stereo matching cost with a convolutional neural network. In CVPR, 2015.
  65. [65] Huangying Zhan, Ravi Garg, Chamara Saroj Weerasekera, Kejie Li, Harsh Agarwal, and Ian Reid. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In CVPR, 2018.
  66. [66] Feihu Zhang, Victor Prisacariu, Ruigang Yang, and Philip H. S. Torr. GA-Net: Guided aggregation net for end-to-end stereo matching. In CVPR, 2019.
  67. [67] Zhengyou Zhang, Rachid Deriche, Olivier Faugeras, and Quang-Tuan Luong. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence, 1995.
  68. [68] Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, and Yebin Liu. GPS-Gaussian: Generalizable pixel-wise 3D Gaussian splatting for real-time human novel view synthesis. In CVPR, 2024.
  69. [69] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. In SIGGRAPH, 2018.