GLADOS reconstructs 3D geometry from disjoint views by generating intermediate perspectives, performing robust coarse alignment that tolerates generative inconsistencies, and iteratively expanding context for consistency.
hub Canonical reference
Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
In this paper, we introduce Splatt3R, a pose-free, feed-forward method for in-the-wild 3D reconstruction and novel view synthesis from stereo pairs. Given uncalibrated natural images, Splatt3R can predict 3D Gaussian Splats without requiring any camera parameters or depth information. For generalizability, we build Splatt3R upon a ``foundation'' 3D geometry reconstruction method, MASt3R, by extending it to deal with both 3D structure and appearance. Specifically, unlike the original MASt3R which reconstructs only 3D point clouds, we predict the additional Gaussian attributes required to construct a Gaussian primitive for each point. Hence, unlike other novel view synthesis methods, Splatt3R is first trained by optimizing the 3D point cloud's geometry loss, and then a novel view synthesis objective. By doing this, we avoid the local minima present in training 3D Gaussian Splats from stereo views. We also propose a novel loss masking strategy that we empirically find is critical for strong performance on extrapolated viewpoints. We train Splatt3R on the ScanNet++ dataset and demonstrate excellent generalisation to uncalibrated, in-the-wild images. Splatt3R can reconstruct scenes at 4FPS at 512 x 512 resolution, and the resultant splats can be rendered in real-time.
hub tools
citation-role summary
citation-polarity summary
roles
background 6polarities
background 6representative citing papers
ZipSplat uses multi-view token extraction followed by k-means clustering and attention to decode compact scene tokens into unconstrained 3D Gaussians, achieving SOTA pose-free results with ~6x fewer primitives.
ArtSplat is the first feed-forward framework for articulated 3D Gaussian Splatting that reconstructs geometry and joints from sparse multi-state uncalibrated views in one pass.
NoPo4D is the first feed-forward system for dynamic 4D Gaussian splatting from unposed multi-view videos, using velocity decomposition supervised by optical flow and a bidirectional motion encoder.
ConFixGS repairs feedforward 3D Gaussian Splatting with confidence-aware diffusion priors, delivering up to 3.68 dB PSNR gains and halved FID scores on Waymo, nuScenes, and KITTI novel view synthesis tasks.
SplatWeaver uses cardinality Gaussian experts and pixel-level routing to dynamically allocate varying numbers of Gaussian primitives for generalizable novel view synthesis.
Ground4D resolves temporal conflicts in feedforward 4D Gaussian reconstruction for off-road scenes via voxel-grounded temporal aggregation with intra-voxel softmax and surface normal regularization, outperforming prior methods on ORAD-3D and RELLIS-3D while generalizing zero-shot.
WildSplatter jointly learns 3D Gaussians and appearance embeddings from unconstrained photo collections to enable fast feed-forward reconstruction and flexible lighting control in 3D Gaussian Splatting.
Free-Range Gaussians uses flow matching over Gaussian parameters to predict non-grid-aligned 3D Gaussians from multi-view images, enabling synthesis of plausible content in unobserved regions with fewer primitives than grid-aligned methods.
3AM integrates MUSt3R 3D features into SAM2 via a Feature Merger and FOV-aware sampling to deliver geometry-consistent video object segmentation from RGB alone, with large gains on wide-baseline datasets.
MODEST provides the first large-scale high-resolution stereo DSLR dataset with systematic variation of focal length and aperture to support research on real-world optical effects in depth estimation.
VGGT-SLAM aligns VGGT submaps via SL(4) manifold optimization of 15-DoF homographies to enable consistent dense RGB SLAM on long uncalibrated monocular videos.
A plug-and-play view partitioning scheme using combinatorial graph partitioning on visual dissimilarity and approximated spatial dispersion makes VGGT scalable to large view collections with gains in pose estimation, depth prediction, and reconstruction.
A feed-forward framework learns instance-structured 3D token groups from unposed multi-view images via differentiable rendering, enabling native object-level segmentation, editing, and retrieval without 3D supervision.
StructSplat introduces a structured 3D Gaussian splatting framework that performs feed-forward reconstruction from uncalibrated sparse views using pixel-aligned features, semantic priors, and camera alignment.
Wild3R is a feed-forward 3D Gaussian Splatting model trained on the new WildCity dataset of 200 scenes with 170 lighting conditions and transients to handle unconstrained sparse photo collections.
Robust Dreamer uses Latent Gaussian Memory anchored to diffusion latents and Deviation Learning with a Dynamic Deviation Archive to reduce drift in long-horizon action-controlled image-to-video generation, reporting SOTA results on ScanNet, DL3DV, and OmniWorldGame.
TriSplat predicts oriented triangle primitives from images in one forward pass to produce simulation-ready 3D meshes with competitive rendering quality.
A feed-forward model aligns ground and satellite features to predict Gaussian splats for improved novel-view synthesis on georeferenced outdoor scenes.
DeG models 3D Gaussians via learned octree density and uses VecSeq Sobol re-indexing to turn set generation into sequence modeling, claiming SOTA quality in single-image-to-3D.
FluSplat trains a model with geometric alignment constraints on multi-view edits to produce consistent 3D scene edits from sparse views in a single forward pass without test-time optimization.
LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20 FPS over sequences longer than 10,000 frames.
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temporal-aware modeling.
LiveStre4m delivers real-time novel-view video streaming from unposed multi-view inputs via a multi-view vision transformer, diffusion-transformer interpolation, and a learned camera pose predictor.
citing papers explorer
-
MODEST: Multi-Optics Depth-of-Field Stereo Dataset
MODEST provides the first large-scale high-resolution stereo DSLR dataset with systematic variation of focal length and aperture to support research on real-world optical effects in depth estimation.