pith. machine review for the scientific record.

arxiv: 2408.13912 · v2 · submitted 2024-08-25 · 💻 cs.CV · cs.LG

Recognition: no theorem link

Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 01:11 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords Gaussian Splatting · 3D Reconstruction · Novel View Synthesis · Uncalibrated Images · Stereo Pairs · Feed-forward Network · Zero-shot Generalization

The pith

Splatt3R turns any uncalibrated stereo image pair into a 3D Gaussian splat without camera parameters or depth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Splatt3R is a feed-forward network that reconstructs scenes as 3D Gaussian splats from two uncalibrated natural images. It builds on a 3D geometry foundation model, MASt3R, by adding the attributes needed for Gaussian primitives, and it trains in two stages: first matching the 3D point-cloud geometry, then optimizing for novel view synthesis with a loss-masking strategy for extrapolated viewpoints. This sequence avoids the local minima that arise when training splats directly from limited stereo views. The result is a model trained on indoor scans that generalizes well to outdoor and in-the-wild photos, producing splats that support real-time rendering after a fast feed-forward reconstruction.

Core claim

Given two uncalibrated images, Splatt3R predicts a set of 3D Gaussians whose positions, colors, and other attributes allow both accurate geometry reconstruction and high-quality novel view synthesis. The method first optimizes a geometry loss on the point cloud and later switches to a rendering loss, with masking on extrapolated regions, to achieve this from stereo pairs alone.

What carries the argument

Two-stage training that first optimizes only the 3D point cloud geometry loss before applying the novel view synthesis objective, together with a loss masking strategy for extrapolated viewpoints.
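
A minimal sketch of what that schedule could look like, written as a PyTorch-style training step. The batch keys, the DUSt3R/MASt3R-style confidence-weighted geometry loss, and the visibility mask are illustrative assumptions, not the paper's code:

    import torch
    import torch.nn.functional as F

    def two_stage_step(model, batch, optimizer, epoch, geometry_epochs):
        """One training step under the staged schedule (illustrative sketch)."""
        # Assumed model output: per-pixel 3D points, confidences, and a render
        # of the target view from the predicted Gaussians.
        pred = model(batch)
        if epoch < geometry_epochs:
            # Stage 1: supervise only 3D geometry, weighted by predicted
            # confidence (a DUSt3R/MASt3R-style regression loss).
            err = (pred["points"] - batch["gt_points"]).norm(dim=-1)
            loss = (pred["conf"] * err - 0.2 * torch.log(pred["conf"])).mean()
        else:
            # Stage 2: switch to the novel-view-synthesis objective, restricted
            # by a mask to pixels plausibly visible from the input pair.
            mask = batch["visibility_mask"].float()
            per_pixel = F.mse_loss(pred["render"], batch["gt_view"], reduction="none")
            loss = (mask * per_pixel).sum() / mask.sum().clamp(min=1)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()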

If this is right

  • Scenes can be reconstructed and rendered in real time from casual uncalibrated photo pairs.
  • 3D modeling becomes possible in environments where acquiring camera poses or depth is impractical.
  • The approach supports generalization from controlled training data to diverse natural images.
  • Reconstruction runs at 4 frames per second for 512×512 images, with real-time splat rendering afterward.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending the method to sequences of more than two images could improve consistency and reduce artifacts in complex scenes.
  • Integration with mobile devices might enable instant 3D capture from everyday photography without special equipment.
  • Similar staged training could benefit other 3D representation learning tasks that suffer from local minima in direct optimization.

Load-bearing premise

The assumption that beginning with geometry optimization and then moving to appearance optimization, plus the masking, reliably sidesteps the local minima encountered in direct Gaussian splat training from stereo pairs.

What would settle it

Running Splatt3R on a dataset of uncalibrated image pairs with ground-truth novel views from significantly different angles and measuring whether the rendered images match the ground truth within a small error margin.
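
That test is straightforward to operationalize. Below is a minimal sketch, assuming hypothetical reconstruct and render callables standing in for Splatt3R's forward pass and splat rasterizer, and scoring with scikit-image's PSNR and SSIM:

    import numpy as np
    from skimage.metrics import peak_signal_noise_ratio, structural_similarity

    def evaluate(pairs, targets, reconstruct, render):
        """Score rendered novel views against ground truth (illustrative)."""
        psnrs, ssims = [], []
        for (img_a, img_b), (camera, gt) in zip(pairs, targets):
            splats = reconstruct(img_a, img_b)   # feed-forward, no poses or depth supplied
            pred = render(splats, camera)        # held-out viewpoint, ideally wide baseline
            psnrs.append(peak_signal_noise_ratio(gt, pred, data_range=1.0))
            ssims.append(structural_similarity(gt, pred, channel_axis=-1, data_range=1.0))
        return float(np.mean(psnrs)), float(np.mean(ssims))

The discriminating condition is performance on wide-baseline, extrapolated target views, not just interpolated ones.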

read the original abstract

In this paper, we introduce Splatt3R, a pose-free, feed-forward method for in-the-wild 3D reconstruction and novel view synthesis from stereo pairs. Given uncalibrated natural images, Splatt3R can predict 3D Gaussian Splats without requiring any camera parameters or depth information. For generalizability, we build Splatt3R upon a "foundation" 3D geometry reconstruction method, MASt3R, by extending it to deal with both 3D structure and appearance. Specifically, unlike the original MASt3R which reconstructs only 3D point clouds, we predict the additional Gaussian attributes required to construct a Gaussian primitive for each point. Hence, unlike other novel view synthesis methods, Splatt3R is first trained by optimizing the 3D point cloud's geometry loss, and then a novel view synthesis objective. By doing this, we avoid the local minima present in training 3D Gaussian Splats from stereo views. We also propose a novel loss masking strategy that we empirically find is critical for strong performance on extrapolated viewpoints. We train Splatt3R on the ScanNet++ dataset and demonstrate excellent generalisation to uncalibrated, in-the-wild images. Splatt3R can reconstruct scenes at 4FPS at 512 x 512 resolution, and the resultant splats can be rendered in real-time.
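
The extension the abstract describes is concrete enough to sketch: for each predicted point, a small head outputs the remaining parameters of a Gaussian primitive. The head layout below (a 1×1 convolution over decoder features, first-order spherical harmonics) is an assumption for illustration, not the paper's exact architecture:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GaussianHead(nn.Module):
        """Predicts per-pixel Gaussian attributes from decoder features (sketch)."""
        def __init__(self, feat_dim, sh_degree=1):
            super().__init__()
            self.n_sh = 3 * (sh_degree + 1) ** 2        # RGB spherical-harmonic coefficients
            self.out = nn.Conv2d(feat_dim, 3 + 4 + 1 + self.n_sh, kernel_size=1)

        def forward(self, feats):
            x = self.out(feats)
            log_scale, quat, opacity, sh = torch.split(x, [3, 4, 1, self.n_sh], dim=1)
            return {
                "scale": log_scale.exp(),               # positive anisotropic scales
                "rotation": F.normalize(quat, dim=1),   # unit quaternion per pixel
                "opacity": torch.sigmoid(opacity),      # in (0, 1)
                "sh": sh,                               # view-dependent appearance
            }

Together with the MASt3R point map, these outputs fully parameterize one Gaussian primitive per pixel.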

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Splatt3R, a feed-forward, pose-free method that predicts 3D Gaussian splats directly from uncalibrated stereo image pairs. It extends the MASt3R foundation model to output per-point Gaussian attributes (scales, rotations, opacities, and appearance coefficients) in addition to 3D points. Training follows a two-stage schedule: first optimizing a 3D point-cloud geometry loss inherited from MASt3R, then switching to a novel-view-synthesis objective, together with a proposed loss-masking strategy. The model is trained on ScanNet++ and claims strong generalization to in-the-wild images, with reconstruction at 4 FPS (512×512) and real-time splat rendering.

Significance. If the results hold, the work would be significant for enabling calibration-free, feed-forward Gaussian splatting and novel-view synthesis from casual stereo pairs. Leveraging a foundation model plus staged training could simplify pipelines that currently rely on SfM or multi-view optimization, while the reported speed supports practical deployment. Explicit credit is due for the reproducible integration with MASt3R and the focus on in-the-wild generalization.

major comments (2)
  1. [§3.2] Training Procedure: The central claim that the two-stage schedule (geometry loss followed by NVS objective) plus loss masking reliably avoids the local minima that arise in direct Gaussian-splat optimization from stereo views is load-bearing for the method. No ablation is reported that compares staged training against joint optimization of all Gaussian attributes or against MASt3R features alone; quantitative metrics (PSNR/SSIM/LPIPS on extrapolated views) for these variants are required to substantiate the claim.
  2. [§4.3] Ablations and Masking: The loss-masking strategy is described as 'empirically critical' for performance on extrapolated viewpoints, yet the manuscript provides no isolated quantitative ablation (e.g., with/without masking on the same backbone) or details on mask computation. This omission weakens the ability to attribute gains specifically to the proposed masking rather than to the base MASt3R geometry.
minor comments (2)
  1. [Abstract] The abstract states 'excellent generalisation' without citing any numerical metrics or baseline comparisons; adding a brief quantitative statement would improve clarity.
  2. [§3.1] Notation for the additional Gaussian attributes (scale, rotation, opacity, appearance) should be introduced once in §3.1 and used consistently thereafter to avoid ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of Splatt3R for calibration-free Gaussian splatting. We respond point-by-point to the major comments below.

read point-by-point responses
  1. Referee: [§3.2] Training Procedure: The central claim that the two-stage schedule (geometry loss followed by NVS objective) plus loss masking reliably avoids the local minima that arise in direct Gaussian-splat optimization from stereo views is load-bearing for the method. No ablation is reported that compares staged training against joint optimization of all Gaussian attributes or against MASt3R features alone; quantitative metrics (PSNR/SSIM/LPIPS on extrapolated views) for these variants are required to substantiate the claim.

    Authors: We agree that an explicit ablation is needed to substantiate the benefit of the two-stage schedule. In the revised manuscript we will add a quantitative comparison on the same evaluation protocol, reporting PSNR, SSIM and LPIPS on extrapolated views for three variants: (1) joint optimization of all Gaussian attributes from the first epoch, (2) MASt3R geometry features without the novel-view-synthesis stage, and (3) the proposed two-stage procedure. We expect the staged approach to show clear gains because early geometry stabilization prevents the optimizer from settling into poor local minima once appearance attributes are introduced. revision: yes

  2. Referee: [§4.3] Ablations and Masking: The loss-masking strategy is described as 'empirically critical' for performance on extrapolated viewpoints, yet the manuscript provides no isolated quantitative ablation (e.g., with/without masking on the same backbone) or details on mask computation. This omission weakens the ability to attribute gains specifically to the proposed masking rather than to the base MASt3R geometry.

    Authors: We acknowledge the omission. In the revision we will expand the description of the loss-masking strategy with the exact computation (thresholding on MASt3R per-point confidence combined with cross-view geometric consistency checks) and add an isolated ablation that trains the identical backbone with and without masking. The table will report PSNR/SSIM/LPIPS on extrapolated views so that the contribution of masking can be isolated from the base MASt3R geometry. revision: yes
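
To make that description concrete (bearing in mind the rebuttal above is simulated), here is one plausible construction of such a mask: keep only confident source geometry, project it into the target view, and apply the rendering loss only on covered pixels. The threshold value and the pinhole-projection interface are assumptions:

    import torch

    def extrapolation_mask(points, conf, K, T_wc, hw, conf_thresh=1.5):
        """Target-view pixels covered by confident source geometry (sketch).

        points: (N, 3) world points predicted from the stereo pair; conf: (N,)
        K: (3, 3) target intrinsics; T_wc: (4, 4) world-to-camera extrinsics.
        """
        H, W = hw
        pts = points[conf > conf_thresh]                   # confidence thresholding
        cam = (T_wc[:3, :3] @ pts.T + T_wc[:3, 3:]).T      # world -> target camera frame
        z = cam[:, 2]
        uv = (K @ (cam / z.clamp(min=1e-6).unsqueeze(-1)).T).T[:, :2]  # pinhole projection
        ok = (z > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
        mask = torch.zeros(H, W, dtype=torch.bool)
        u, v = uv[ok].long().unbind(-1)
        mask[v, u] = True                                  # loss applied only where True
        return mask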

Circularity Check

0 steps flagged

No significant circularity: extension of external MASt3R with empirical staged training

full rationale

The derivation relies on an external foundation model (MASt3R) for base geometry reconstruction and a public dataset (ScanNet++). The core extension—predicting per-point Gaussian attributes (scales, rotations, opacities, appearance) and applying a two-stage optimization (first 3D point cloud geometry loss, then novel-view synthesis objective) plus loss masking—is presented as an empirical design choice to avoid local minima, not as a mathematical reduction or self-definitional fit. No equations equate outputs to inputs by construction, no uniqueness theorem is imported from self-citations, and no ansatz is smuggled via prior author work. Generalization claims rest on reported experiments rather than forced predictions. This is a standard non-circular extension of prior independent work.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that MASt3R already provides reliable geometry and that adding Gaussian attributes plus staged training plus masking will produce view-consistent splats; no new physical entities are introduced.

axioms (2)
  • domain assumption MASt3R produces sufficiently accurate 3D point clouds from uncalibrated pairs to serve as a stable base for subsequent Gaussian attribute prediction.
    Invoked when the paper states it builds upon MASt3R by extending it to deal with both 3D structure and appearance.
  • ad hoc to paper Optimizing geometry loss first followed by novel-view-synthesis loss avoids the local minima that arise when training Gaussian splats directly from stereo views.
    Explicitly stated as the reason for the two-stage training procedure.

pith-pipeline@v0.9.0 · 5561 in / 1464 out tokens · 67050 ms · 2026-05-17T01:11:07.424871+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Mind the Gap: Geometrically Accurate Generative Reconstruction from Disjoint Views

    cs.CV 2026-05 unverdicted novelty 8.0

    GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.

  2. ConFixGS: Learning to Fix Feedforward 3D Gaussian Splatting with Confidence-Aware Diffusion Priors in Driving Scenes

    cs.CV 2026-05 unverdicted novelty 7.0

    ConFixGS repairs feedforward 3D Gaussian Splatting with confidence-aware diffusion priors, delivering up to 3.68 dB PSNR gains and halved FID scores on Waymo, nuScenes, and KITTI novel view synthesis tasks.

  3. SplatWeaver: Learning to Allocate Gaussian Primitives for Generalizable Novel View Synthesis

    cs.CV 2026-05 unverdicted novelty 7.0

    SplatWeaver dynamically allocates Gaussian primitives via cardinality experts and pixel-level routing guided by high-frequency cues for improved generalizable novel view synthesis.

  4. Ground4D: Spatially-Grounded Feedforward 4D Reconstruction for Unstructured Off-Road Scenes

    cs.CV 2026-05 unverdicted novelty 7.0

    Ground4D resolves temporal conflicts in feedforward 4D Gaussian reconstruction for off-road scenes via voxel-grounded temporal aggregation with intra-voxel softmax and surface normal regularization, outperforming prio...

  5. WildSplatter: Feed-forward 3D Gaussian Splatting with Appearance Control from Unconstrained Images

    cs.CV 2026-04 unverdicted novelty 7.0

    WildSplatter jointly learns 3D Gaussians and appearance embeddings from unconstrained photo collections to enable fast feed-forward reconstruction and flexible lighting control in 3D Gaussian Splatting.

  6. Free-Range Gaussians: Non-Grid-Aligned Generative 3D Gaussian Reconstruction

    cs.CV 2026-04 unverdicted novelty 7.0

    Free-Range Gaussians uses flow matching over Gaussian parameters to predict non-grid-aligned 3D Gaussians from multi-view images, enabling synthesis of plausible content in unobserved regions with fewer primitives tha...

  7. 3AM: 3egment Anything with Geometric Consistency in Videos

    cs.CV 2026-01 unverdicted novelty 7.0

    3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.

  8. MODEST: Multi-Optics Depth-of-Field Stereo Dataset

    cs.CV 2025-11 accept novelty 7.0

    MODEST provides the first large-scale high-resolution stereo DSLR dataset with systematic variation of focal length and aperture to support research on real-world optical effects in depth estimation.

  9. FluSplat: Sparse-View 3D Editing without Test-Time Optimization

    cs.CV 2026-04 unverdicted novelty 6.0

    FluSplat trains a model with geometric alignment constraints on multi-view edits to produce consistent 3D scene edits from sparse views in a single forward pass without test-time optimization.

  10. Geometric Context Transformer for Streaming 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...

  11. Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

    cs.CV 2026-04 unverdicted novelty 6.0

    The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...

  12. LiveStre4m: Feed-Forward Live Streaming of Novel Views from Unposed Multi-View Video

    cs.CV 2026-04 unverdicted novelty 6.0

    LiveStre4m delivers real-time novel-view video streaming from unposed multi-view inputs via a multi-view vision transformer, diffusion-transformer interpolation, and a learned camera pose predictor.

  13. DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass

    cs.CV 2025-12 unverdicted novelty 6.0

    DePT3R performs joint dense point tracking and 3D reconstruction of dynamic scenes from multiple unposed images using a single neural network forward pass.

  14. C3G: Learning Compact 3D Representations with 2K Gaussians

    cs.CV 2025-12 unverdicted novelty 6.0

    C3G creates compact 3D Gaussian representations with 2K points by guiding placement via learnable tokens that aggregate multi-view features through attention, yielding better efficiency and performance than dense methods.

  15. Depth Anything 3: Recovering the Visual Space from Any Views

    cs.CV 2025-11 unverdicted novelty 6.0

    DA3 recovers consistent visual geometry from arbitrary views via a vanilla DINO transformer and depth-ray target, setting new SOTA on a visual geometry benchmark while outperforming DA2 on monocular depth.

  16. Streaming 4D Visual Geometry Transformer

    cs.CV 2025-07 unverdicted novelty 6.0

    A causal transformer with key-value caching and distillation from a bidirectional VGGT model enables efficient online 4D geometry reconstruction from videos.

  17. ReorgGS: Equivalent Distribution Reorganization for 3D Gaussian Splatting

    cs.CV 2026-05 unverdicted novelty 5.0

    ReorgGS reorganizes the Gaussian distribution in converged 3DGS models by resampling centers and covariances to reduce parameterization degeneration and enable better subsequent optimization.

  18. Learning 3D Representations for Spatial Intelligence from Unposed Multi-View Images

    cs.CV 2026-04 unverdicted novelty 5.0

    UniSplat learns consistent 3D geometry, appearance, and semantics from unposed images using dual masking, progressive Gaussian splatting, and recalibration to align predictions across tasks.

  19. VGGT-SLAM++

    cs.CV 2026-04 unverdicted novelty 4.0

    VGGT-SLAM++ improves on prior transformer SLAM by adding dense DEM submap graphs and high-cadence local optimization, achieving SOTA accuracy with reduced drift and bounded memory on benchmarks.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 19 Pith papers

  1. [1] Edward H. Adelson and James R. Bergen. The plenoptic function and the elements of early vision. MIT Press, 1991.
  2. [2] Stephen T. Barnard and Martin A. Fischler. Computational stereo. ACM Computing Surveys (CSUR), 1982.
  3. [3] Jonathan T. Barron, Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, and Peter Hedman. Mip-NeRF 360: Unbounded anti-aliased neural radiance fields. In CVPR, 2022.
  4. [4] Jia-Wang Bian, Wenjing Bian, Victor Adrian Prisacariu, and Philip Torr. PoRF: Pose residual field for accurate neural surface reconstruction. In ICLR, 2023.
  5. [5] Wenjing Bian, Zirui Wang, Kejie Li, Jia-Wang Bian, and Victor Adrian Prisacariu. NoPe-NeRF: Optimising neural radiance field with no pose prior. In CVPR, 2023.
  6. [6] Jia-Ren Chang and Yong-Sheng Chen. Pyramid stereo matching network. In CVPR, 2018.
  7. [7] David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelSplat: 3D Gaussian splats from image pairs for scalable generalizable 3D reconstruction. In CVPR, 2024.
  8. [8] Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. MVSNeRF: Fast generalizable radiance field reconstruction from multi-view stereo. In ICCV, 2021.
  9. [9] Yu Chen and Gim Hee Lee. DBARF: Deep bundle-adjusting generalizable neural radiance fields. In CVPR, 2023.
  10. [10] Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. MVSplat: Efficient 3D Gaussian splatting from sparse multi-view images. In ECCV, 2024.
  11. [11] Cheng Chi, Qingjie Wang, Tianyu Hao, Peng Guo, and Xin Yang. Feature-level collaboration: Joint unsupervised learning of optical flow, stereo depth and camera motion. In CVPR, 2021.
  12. [12] Julian Chibane, Aayush Bansal, Verica Lazova, and Gerard Pons-Moll. Stereo radiance fields (SRF): Learning view synthesis for sparse views of novel scenes. In CVPR, 2021.
  13. [13] Amaël Delaunoy and Marc Pollefeys. Photometric bundle adjustment for dense multi-view 3D modeling. In CVPR, 2014.
  14. [14] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. SuperPoint: Self-supervised interest point detection and description. In CVPRW, 2018.
  15. [15] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
  16. [16] Yilun Du, Cameron Smith, Ayush Tewari, and Vincent Sitzmann. Learning to render novel views from wide-baseline stereo pairs. In CVPR, 2023.
  17. [17] Ravi Garg, Vijay Kumar B. G., Gustavo Carneiro, and Ian Reid. Unsupervised CNN for single view depth estimation: Geometry to the rescue. In ECCV, 2016.
  18. [18] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In CVPR, 2017.
  19. [19] Steven J. Gortler, Radek Grzeszczuk, Richard Szeliski, and Michael F. Cohen. The lumigraph. In Computer Graphics and Interactive Techniques, 1996.
  20. [20] Chris Harris, Mike Stephens, et al. A combined corner and edge detector. In Alvey Vision Conference, 1988.
  21. [21] Richard Hartley and Frederik Schaffalitzky. L∞ minimization in geometric reconstruction problems. In CVPR, 2004.
  22. [22] Richard I. Hartley and Peter Sturm. Triangulation. Computer Vision and Image Understanding, 1997.
  23. [23] Richard I. Hartley, Rajiv Gupta, and Tom Chang. Stereo from uncalibrated cameras. In CVPR, 1992.
  24. [24] Hiroshi Ishikawa and Davi Geiger. Occlusions, discontinuities, and epipolar lines in stereo. In ECCV, 1998.
  25. [25] Yoonwoo Jeong, Seokjun Ahn, Christopher Choy, Anima Anandkumar, Minsu Cho, and Jaesik Park. Self-calibrating neural radiance fields. In ICCV, 2021.
  26. [26] Hanwen Jiang, Zhenyu Jiang, Yue Zhao, and Qixing Huang. LEAP: Liberate sparse-view 3D modeling from camera poses. In ICLR, 2023.
  27. [27] Mohammad Mahdi Johari, Yann Lepoittevin, and François Fleuret. GeoNeRF: Generalizing NeRF with geometry priors. In CVPR, 2022.
  28. [28] Takeo Kanade, Atsushi Yoshida, Kazuo Oda, Hiroshi Kano, and Masaya Tanaka. A stereo machine for video-rate dense depth mapping and its new applications. In CVPR, 1996.
  29. [29] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM ToG, 2023.
  30. [30] Haechan Lee, Wonjoon Jin, Seung-Hwan Baek, and Sunghyun Cho. Generalizable novel-view synthesis using a stereo camera. In CVPR, 2024.
  31. [31] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R. arXiv preprint arXiv:2406.09756, 2024.
  32. [32] Marc Levoy and Pat Hanrahan. Light field rendering. In SIGGRAPH, 1996.
  33. [33] Hao Li, Yuanyuan Gao, Dingwen Zhang, Chenming Wu, Yalun Dai, Chen Zhao, Haocheng Feng, Errui Ding, Jingdong Wang, and Junwei Han. GGRt: Towards generalizable 3D Gaussians without pose priors in real-time. In ECCV, 2024.
  34. [34] Yaokun Li, Chao Gou, and Guang Tan. Taming uncertainty in sparse-view generalizable NeRF via indirect diffusion guidance. arXiv preprint arXiv:2402.01217, 2024.
  35. [35] Chen-Hsuan Lin, Wei-Chiu Ma, Antonio Torralba, and Simon Lucey. BARF: Bundle-adjusting neural radiance fields. In ICCV, 2021.
  36. [36] Tianqi Liu, Guangcong Wang, Shoukang Hu, Liao Shen, Xinyi Ye, Yuhang Zang, Zhiguo Cao, Wei Li, and Ziwei Liu. Fast generalizable Gaussian splatting reconstruction from multi-view stereo. In ECCV, 2024.
  37. [37] Yuan Liu, Sida Peng, Lingjie Liu, Qianqian Wang, Peng Wang, Christian Theobalt, Xiaowei Zhou, and Wenping Wang. Neural rays for occlusion-aware image-based rendering. In CVPR, 2022.
  38. [38] Xiaoxiao Long, Cheng Lin, Peng Wang, Taku Komura, and Wenping Wang. SparseNeuS: Fast generalizable neural surface reconstruction from sparse views. In ECCV, 2022.
  39. [39] David G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.
  40. [40] Quan-Tuan Luong and Olivier D. Faugeras. The fundamental matrix: Theory, algorithms, and stability analysis. IJCV, 1996.
  41. [41] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
  42. [42] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. In SIGGRAPH, 2022.
  43. [43] Zhangkai Ni, Peiqi Yang, Wenhan Yang, Hanli Wang, Lin Ma, and Sam Kwong. ColNeRF: Collaboration for generalizable sparse input neural radiance field. In AAAI, 2024.
  44. [44] René Ranftl and Vladlen Koltun. Deep fundamental matrix estimation. In ECCV, 2018.
  45. [45] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, 2021.
  46. [46] Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3D: Large-scale learning and evaluation of real-life 3D category reconstruction. In ICCV, 2021.
  47. [47] Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3D-structure-aware neural scene representations. In NeurIPS, 2019.
  48. [48] Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering. In NeurIPS, 2021.
  49. [49] Cameron Smith, Yilun Du, Ayush Tewari, and Vincent Sitzmann. FlowCam: Training generalizable 3D radiance fields without camera poses via pixel-aligned scene flow. In NeurIPS, 2023.
  50. [50] Mohammed Suhail, Carlos Esteves, Leonid Sigal, and Ameesh Makadia. Generalizable patch-based neural rendering. In ECCV, 2022.
  51. [51] Stanislaw Szymanowicz, Eldar Insafutdinov, Chuanxia Zheng, Dylan Campbell, João F. Henriques, Christian Rupprecht, and Andrea Vedaldi. Flash3D: Feed-forward generalisable 3D scene reconstruction from a single image. arXiv preprint arXiv:2406.04343, 2024.
  52. [52] Stanislaw Szymanowicz, Christian Rupprecht, and Andrea Vedaldi. Splatter image: Ultra-fast single-view 3D reconstruction. In CVPR, 2024.
  53. [53] Miroslav Trajković and Mark Hedley. Fast corner detection. Image and Vision Computing, 1998.
  54. [54] Prune Truong, Marie-Julie Rakotosaona, Fabian Manhardt, and Federico Tombari. SPARF: Neural radiance fields from sparse and noisy poses. In CVPR, 2023.
  55. [55] Jianyuan Wang, Yiran Zhong, Yuchao Dai, Stan Birchfield, Kaihao Zhang, Nikolai Smolyanskiy, and Hongdong Li. Deep two-view structure-from-motion revisited. In CVPR, 2021.
  56. [56] Qianqian Wang, Zhicheng Wang, Kyle Genova, Pratul Srinivasan, Howard Zhou, Jonathan T. Barron, Ricardo Martin-Brualla, Noah Snavely, and Thomas Funkhouser. IBRNet: Learning multi-view image-based rendering. In CVPR, 2021.
  57. [57] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jérôme Revaud. DUSt3R: Geometric 3D vision made easy. In CVPR, 2024.
  58. [58] Zirui Wang, Shangzhe Wu, Weidi Xie, Min Chen, and Victor Adrian Prisacariu. NeRF--: Neural radiance fields without known camera parameters. arXiv preprint arXiv:2102.07064, 2021.
  59. [59] Christopher Wewer, Kevin Raj, Eddy Ilg, Bernt Schiele, and Jan Eric Lenssen. latentSplat: Autoencoding variational Gaussians for fast generalizable 3D reconstruction. In ECCV, 2024.
  60. [60] Oliver J. Woodford and Edward Rosten. Large scale photometric bundle adjustment. In BMVC, 2020.
  61. [61] Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. ScanNet++: A high-fidelity dataset of 3D indoor scenes. In ICCV, 2023.
  62. [62] Zhichao Yin and Jianping Shi. GeoNet: Unsupervised learning of dense depth, optical flow and camera pose. In CVPR, 2018.
  63. [63] Alex Yu, Vickie Ye, Matthew Tancik, and Angjoo Kanazawa. pixelNeRF: Neural radiance fields from one or few images. In CVPR, 2021.
  64. [64] Jure Zbontar and Yann LeCun. Computing the stereo matching cost with a convolutional neural network. In CVPR, 2015.
  65. [65] Huangying Zhan, Ravi Garg, Chamara Saroj Weerasekera, Kejie Li, Harsh Agarwal, and Ian Reid. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In CVPR, 2018.
  66. [66] Feihu Zhang, Victor Prisacariu, Ruigang Yang, and Philip H. S. Torr. GA-Net: Guided aggregation net for end-to-end stereo matching. In CVPR, 2019.
  67. [67] Zhengyou Zhang, Rachid Deriche, Olivier Faugeras, and Quang-Tuan Luong. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence, 1995.
  68. [68] Shunyuan Zheng, Boyao Zhou, Ruizhi Shao, Boning Liu, Shengping Zhang, Liqiang Nie, and Yebin Liu. GPS-Gaussian: Generalizable pixel-wise 3D Gaussian splatting for real-time human novel view synthesis. In CVPR, 2024.
  69. [69] Tinghui Zhou, Richard Tucker, John Flynn, Graham Fyffe, and Noah Snavely. Stereo magnification: Learning view synthesis using multiplane images. In SIGGRAPH, 2018.