pith. machine review for the scientific record.

arxiv: 2511.10647 · v1 · submitted 2025-11-13 · 💻 cs.CV

Recognition: 3 theorem links

Depth Anything 3: Recovering the Visual Space from Any Views

Pith reviewed 2026-05-11 02:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords depth estimation, any-view geometry, transformer backbone, teacher-student distillation, camera pose estimation, visual rendering, spatially consistent reconstruction, monocular depth

The pith

A plain transformer with a single depth-ray target recovers consistent geometry from arbitrary views.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Depth Anything 3 as a model that predicts spatially consistent 3D geometry from any number of visual inputs, whether or not camera poses are known. It argues that a minimal setup suffices: a vanilla transformer backbone paired with one depth-ray prediction target, trained through a teacher-student paradigm on public academic datasets. This produces detail and generalization on par with Depth Anything 2 while setting a new state of the art on a new benchmark spanning camera pose estimation, any-view geometry, and visual rendering. Against the prior leader, VGGT, the reported average gains are 44.3 percent in camera pose accuracy and 25.1 percent in geometric accuracy; the model also outperforms Depth Anything 2 on monocular depth estimation.

Core claim

Depth Anything 3 uses a single plain transformer encoder and a singular depth-ray prediction target to recover spatially consistent geometry from an arbitrary number of views with or without known poses. Trained via teacher-student distillation exclusively on public academic datasets, the model matches the detail and generalization of Depth Anything 2 while establishing new state-of-the-art results across camera pose estimation, any-view geometry, and visual rendering on a dedicated benchmark.

What carries the argument

A teacher-student training paradigm applied to a vanilla transformer encoder that uses one depth-ray prediction target in place of multi-task heads.

If this is right

  • The model produces consistent geometry outputs from any number of input views.
  • It operates without requiring known camera poses as input.
  • It sets new performance records on pose estimation, geometric reconstruction, and rendering tasks.
  • All training uses only publicly available academic datasets.
  • It improves monocular depth estimation accuracy relative to Depth Anything 2.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark introduced here could serve as a standard testbed for future any-view geometry methods.
  • Success with a single prediction target may indicate that separate heads for pose and depth are often redundant.
  • The approach could extend to video sequences by treating frames as arbitrary views.

Load-bearing premise

A single unmodified transformer encoder plus a singular depth-ray prediction target, trained via teacher-student distillation on public datasets, suffices to produce spatially consistent geometry from arbitrary views without known poses.

What would settle it

A collection of test scenes with many arbitrary views where the model's predicted geometry shows measurable inconsistencies or lower accuracy than models that rely on explicit pose inputs or specialized multi-view fusion modules.

Original abstract

We present Depth Anything 3 (DA3), a model that predicts spatially consistent geometry from an arbitrary number of visual inputs, with or without known camera poses. In pursuit of minimal modeling, DA3 yields two key insights: a single plain transformer (e.g., vanilla DINO encoder) is sufficient as a backbone without architectural specialization, and a singular depth-ray prediction target obviates the need for complex multi-task learning. Through our teacher-student training paradigm, the model achieves a level of detail and generalization on par with Depth Anything 2 (DA2). We establish a new visual geometry benchmark covering camera pose estimation, any-view geometry and visual rendering. On this benchmark, DA3 sets a new state-of-the-art across all tasks, surpassing prior SOTA VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy. Moreover, it outperforms DA2 in monocular depth estimation. All models are trained exclusively on public academic datasets.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents Depth Anything 3 (DA3), a model that recovers spatially consistent 3D geometry from an arbitrary number of input views (with or without known poses) using only a plain transformer encoder (e.g., vanilla DINO) and a single depth-ray prediction target. It is trained via a teacher-student paradigm exclusively on public academic datasets, claims detail and generalization on par with Depth Anything 2 (DA2) while outperforming it on monocular depth estimation, and introduces a new visual geometry benchmark covering camera pose estimation, any-view geometry, and visual rendering. On this benchmark it reports SOTA results across all tasks, surpassing VGGT by an average of 44.3% in camera pose accuracy and 25.1% in geometric accuracy.

Significance. If the central claims hold under rigorous verification, the result would be significant: it would demonstrate that minimal architectural choices (plain encoder + singular depth-ray target) plus distillation suffice for cross-view geometric consistency, reducing the need for specialized multi-view modules or explicit pose/alignment losses. The new benchmark spanning multiple tasks could become a useful standard for evaluating visual geometry models. Credit is due for the emphasis on public-data-only training and the attempt at architectural simplification.

major comments (3)
  1. [Abstract and §4 (results)] The reported 44.3% average improvement in camera pose accuracy and 25.1% in geometric accuracy over VGGT are presented without error bars, per-task standard deviations, statistical significance tests, or ablation tables isolating the contribution of the depth-ray target versus data correlations; this is load-bearing for the SOTA claim given the empirical benchmark.
  2. [§3 (method)] The teacher-student paradigm is described as using only per-view depth rays from a monocular teacher with no explicit cross-view alignment, pose, or consistency losses; it is unclear how the student discovers spatially consistent geometry from arbitrary views, raising the risk that benchmark gains arise from scene correlations in public multi-view datasets rather than the proposed minimal representation.
  3. [Benchmark definition (likely §4 or appendix)] Details on how the new visual geometry benchmark constructs ground truth for arbitrary views, defines the three tasks, ensures no train-test leakage, and handles pose-free inference are insufficient to reproduce or independently validate the cross-task SOTA margins.
minor comments (2)
  1. [§3] Notation for the depth-ray target and its relation to standard depth or ray representations should be clarified with an equation or diagram in §3 to aid reproducibility.
  2. The manuscript would benefit from an explicit limitations paragraph discussing failure cases (e.g., highly dynamic scenes or extreme viewpoint changes) where the plain encoder plus depth-ray may break down.
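
As an illustration of the kind of notation the first minor comment asks for, the sketch below shows one common way a per-pixel depth and a per-pixel ray can be combined into a 3D point. This is an assumption for illustration only, not the paper's definition: the convention point = origin + depth * direction, the function name `points_from_depth_rays`, and the example values are all hypothetical.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def points_from_depth_rays(
    depths: Sequence[float],
    origins: Sequence[Vec3],
    directions: Sequence[Vec3],
) -> List[Vec3]:
    """Back-project per-pixel depths along per-pixel rays.

    Assumed convention (not taken from the paper):
        point = origin + depth * direction
    """
    if not (len(depths) == len(origins) == len(directions)):
        raise ValueError("depths, origins and directions must have equal length")
    points: List[Vec3] = []
    for depth, origin, direction in zip(depths, origins, directions):
        # Move `depth` units from the ray origin along the ray direction.
        x, y, z = (o + depth * d for o, d in zip(origin, direction))
        points.append((x, y, z))
    return points


# Two hypothetical pixels: one ray starting at the world origin,
# one starting at a camera shifted along the x axis.
example_points = points_from_depth_rays(
    depths=[2.0, 1.0],
    origins=[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    directions=[(0.0, 0.0, 1.0), (0.0, 1.0, 0.0)],
)
```

Under this assumed convention, rays from different views that are expressed in one shared frame yield points in that same frame, which is one way a single target could carry both depth and camera information. Whether DA3 defines its target this way would need to be checked against §3 of the paper.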

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, providing clarifications on our approach and outlining planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §4 (results)] The reported 44.3% average improvement in camera pose accuracy and 25.1% in geometric accuracy over VGGT are presented without error bars, per-task standard deviations, statistical significance tests, or ablation tables isolating the contribution of the depth-ray target versus data correlations; this is load-bearing for the SOTA claim given the empirical benchmark.

    Authors: We agree that error bars, per-task standard deviations, and ablations would provide stronger support for the SOTA claims. In the revised manuscript we will report per-task standard deviations computed across multiple evaluation seeds, include an ablation study that isolates the depth-ray target from other factors, and add statistical significance testing (e.g., paired t-tests) on the key metrics. We note that the large reported margins make the overall ranking robust, but we will incorporate these elements to address the concern directly. revision: partial

  2. Referee: [§3 (method)] The teacher-student paradigm is described as using only per-view depth rays from a monocular teacher with no explicit cross-view alignment, pose, or consistency losses; it is unclear how the student discovers spatially consistent geometry from arbitrary views, raising the risk that benchmark gains arise from scene correlations in public multi-view datasets rather than the proposed minimal representation.

    Authors: The student processes an arbitrary number of views jointly through the shared vanilla DINO transformer while predicting depth rays; this joint encoding enables the model to discover cross-view geometric consistency implicitly from the multi-view training distribution, even though supervision remains per-view. To address the possibility of scene-specific correlations, the revision will include additional experiments on held-out scenes and cross-dataset generalization that demonstrate the consistency generalizes beyond training-scene statistics. revision: yes

  3. Referee: [Benchmark definition (likely §4 or appendix)] Details on how the new visual geometry benchmark constructs ground truth for arbitrary views, defines the three tasks, ensures no train-test leakage, and handles pose-free inference are insufficient to reproduce or independently validate the cross-task SOTA margins.

    Authors: We acknowledge that additional detail is required for full reproducibility. The revised manuscript will expand Section 4 and the appendix with explicit descriptions of ground-truth construction for arbitrary views, precise definitions and metrics for the three tasks, the train-test split procedure used to prevent leakage, and the exact protocol for pose-free inference, including pseudocode for the evaluation pipeline. revision: yes
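
The paired t-test mentioned in the first response can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' evaluation code: the function name `paired_t_statistic` and the per-scene scores are hypothetical, and the sketch returns only the test statistic and degrees of freedom.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def paired_t_statistic(a: Sequence[float], b: Sequence[float]) -> Tuple[float, int]:
    """Return (t, degrees_of_freedom) for paired samples a and b.

    Each index is assumed to be the same scene evaluated by two methods.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("paired differences have zero variance")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-scene accuracy scores for two methods (made-up numbers).
method_a = [0.80, 0.75, 0.90, 0.85]
method_b = [0.70, 0.70, 0.80, 0.80]
t_value, dof = paired_t_statistic(method_a, method_b)
```

Turning the statistic into a p-value requires the t-distribution with `dof` degrees of freedom; a library routine such as `scipy.stats.ttest_rel` reports both. Pairing by scene matters here because it removes scene-to-scene difficulty from the comparison, which an average margin alone does not.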

Circularity Check

0 steps flagged

No circularity: purely empirical claims with no derivations or self-referential predictions

The paper describes an empirical teacher-student training setup using a plain DINO encoder and depth-ray target on public datasets, evaluated via benchmark comparisons (e.g., 44.3% pose accuracy gain over VGGT). No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Performance claims rest on external benchmark results rather than any quantity defined in terms of itself. This is the expected non-finding for an architecture paper without mathematical modeling.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard deep-learning assumptions that a transformer can learn consistent geometry via distillation and that public academic datasets are sufficient for the reported generalization.

free parameters (1)
  • training hyperparameters
    Standard neural-network training choices (learning rate, batch size, etc.) that are not enumerated in the abstract.
axioms (1)
  • domain assumption Teacher-student distillation on public datasets yields spatially consistent multi-view geometry.
    Invoked to justify that the plain transformer plus single target suffices.

pith-pipeline@v0.9.0 · 5495 in / 1218 out tokens · 58023 ms · 2026-05-11T02:03:23.535596+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. TrackCraft3R: Repurposing Video Diffusion Transformers for Dense 3D Tracking

    cs.CV 2026-05 unverdicted novelty 8.0

    TrackCraft3R is the first method to repurpose a video diffusion transformer as a feed-forward dense 3D tracker via dual-latent representations and temporal RoPE alignment, achieving SOTA performance with lower compute.

  2. Revisiting Photometric Ambiguity for Accurate Gaussian-Splatting Surface Reconstruction

    cs.CV 2026-05 unverdicted novelty 7.0

    AmbiSuR adds intrinsic photometric disambiguation and a self-indication module to Gaussian Splatting to resolve ambiguities and improve surface reconstruction accuracy.

  3. PointForward: Feedforward Driving Reconstruction through Point-Aligned Representations

    cs.CV 2026-05 unverdicted novelty 7.0

    PointForward uses sparse world-space 3D queries and scene graphs to deliver consistent single-pass reconstruction of dynamic driving scenes via point-aligned representations.

  4. Seeing Across Skies and Streets: Feedforward 3D Reconstruction from Satellite, Drone, and Ground Images

    cs.CV 2026-05 unverdicted novelty 7.0

    Cross3R performs feed-forward 3D reconstruction and 6-DoF pose estimation from any combination of satellite, UAV, and ground images, outperforming baselines on a new 278K-image tri-view dataset.

  5. Mix3R: Mixing Feed-forward Reconstruction and Generative 3D Priors for Joint Multi-view Aligned 3D Reconstruction and Pose Estimation

    cs.CV 2026-05 unverdicted novelty 7.0

    Mix3R mixes feed-forward reconstruction and generative 3D priors via Mixture-of-Transformers and overlap-based attention bias to achieve better-aligned 3D shapes and more accurate poses than either approach alone.

  6. AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision

    cs.CV 2026-04 conditional novelty 7.0

    AirZoo is a new large-scale synthetic dataset for aerial 3D vision that improves state-of-the-art models on image retrieval, cross-view matching, and 3D reconstruction when used for fine-tuning.

  7. Face Anything: 4D Face Reconstruction from Any Image Sequence

    cs.CV 2026-04 unverdicted novelty 7.0

    A single transformer model jointly predicts depth and normalized canonical coordinates to deliver state-of-the-art 4D facial geometry and tracking with 3x lower correspondence error and 16% better depth accuracy.

  8. URoPE: Universal Relative Position Embedding across Geometric Spaces

    cs.CV 2026-04 unverdicted novelty 7.0

    URoPE is a parameter-free relative position embedding for transformers that works across arbitrary geometric spaces by ray sampling and projection, yielding consistent gains on novel view synthesis, 3D detection, trac...

  9. View-Consistent 3D Scene Editing via Dual-Path Structural Correspondense and Semantic Continuity

    cs.CV 2026-04 unverdicted novelty 7.0

    A dual-path consistency framework for text-driven 3D scene editing that models cross-view dependencies via structural correspondence and semantic continuity, trained on a newly constructed paired multi-view dataset.

  10. GlobalSplat: Efficient Feed-Forward 3D Gaussian Splatting via Global Scene Tokens

    cs.CV 2026-04 unverdicted novelty 7.0

    GlobalSplat achieves competitive novel-view synthesis on RealEstate10K and ACID using only 16K Gaussians via global scene tokens and coarse-to-fine training, with a 4MB footprint and under 78ms inference.

  11. Any 3D Scene is Worth 1K Tokens: 3D-Grounded Representation for Scene Generation at Scale

    cs.CV 2026-04 unverdicted novelty 7.0

    A 3D-grounded autoencoder and diffusion transformer allow direct generation of 3D scenes in an implicit latent space using a fixed 1K-token representation for arbitrary views and resolutions.

  12. EgoFun3D: Modeling Interactive Objects from Egocentric Videos using Function Templates

    cs.CV 2026-04 unverdicted novelty 7.0

    EgoFun3D creates a new task, 271-video dataset, and pipeline using function templates to model interactive 3D objects from egocentric videos for simulation.

  13. LuMon: A Comprehensive Benchmark and Development Suite with Novel Datasets for Lunar Monocular Depth Estimation

    cs.CV 2026-04 unverdicted novelty 7.0

    A new benchmark with real lunar stereo ground truth and analog data shows that sim-to-real fine-tuned monocular depth models achieve large in-domain gains but minimal generalization to actual lunar images.

  14. TAIHRI: Task-Aware 3D Human Keypoints Localization for Close-Range Human-Robot Interaction

    cs.CV 2026-04 unverdicted novelty 7.0

    TAIHRI is the first task-aware VLM for close-range HRI that localizes metric-scale 3D coordinates of critical keypoints by quantizing space and performing 2D keypoint reasoning via next-token prediction.

  15. MoRight: Motion Control Done Right

    cs.CV 2026-04 unverdicted novelty 7.0

    MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...

  16. SEM-ROVER: Semantic Voxel-Guided Diffusion for Large-Scale Driving Scene Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    SEM-ROVER generates large multiview-consistent 3D urban driving scenes via semantic-conditioned diffusion on Σ-Voxfield voxel grids with progressive outpainting and deferred rendering.

  17. AnyImageNav: Any-View Geometry for Precise Last-Meter Image-Goal Navigation

    cs.RO 2026-04 unverdicted novelty 7.0

    AnyImageNav uses a semantic-to-geometric cascade with 3D multi-view foundation models to recover precise 6-DoF poses from goal images, achieving 0.27m position error and state-of-the-art success rates on Gibson and HM...

  18. MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    cs.CV 2025-09 unverdicted novelty 7.0

    MapAnything is a unified feed-forward transformer that regresses metric 3D scene geometry and cameras from images using a factored representation of depth maps, ray maps, poses, and scale.

  19. GeoQuery: Geometry-Query Diffusion for Sparse-View Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    GeoQuery replaces corrupted rendering features with geometry-aligned proxy queries and restricts cross-view attention to local windows, enabling robust diffusion-based refinement under extreme view sparsity.

  20. GuidedVLA: Specifying Task-Relevant Factors via Plug-and-Play Action Attention Specialization

    cs.RO 2026-05 unverdicted novelty 6.0

    GuidedVLA improves VLA success rates by manually supervising separate attention heads in the action decoder with auxiliary signals for task-relevant factors.

  21. UniFixer: A Universal Reference-Guided Fixer for Diffusion-Based View Synthesis

    cs.CV 2026-05 unverdicted novelty 6.0

    UniFixer is a universal reference-guided framework that fixes spatial, temporal, and backbone-related degradations in diffusion-based view synthesis via coarse-to-fine modules and achieves zero-shot SOTA results on no...

  22. Focusable Monocular Depth Estimation

    cs.CV 2026-05 unverdicted novelty 6.0

    FocusDepth is a prompt-conditioned framework that fuses SAM3 features into Depth Anything models via Multi-Scale Spatial-Aligned Fusion to improve target-region depth accuracy on the new FDE-Bench.

  23. Pixal3D: Pixel-Aligned 3D Generation from Images

    cs.CV 2026-05 unverdicted novelty 6.0

    Pixal3D performs pixel-aligned 3D generation from images via back-projected multi-scale feature volumes, achieving fidelity close to reconstruction while supporting multi-view and scene synthesis.

  24. GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

    cs.CV 2026-05 unverdicted novelty 6.0

    GemDepth predicts inter-frame camera poses to inject geometric embeddings into a spatio-temporal transformer, yielding state-of-the-art 3D-consistent video depth.

  25. GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

    cs.CV 2026-05 unverdicted novelty 6.0

    GemDepth achieves improved 3D-consistent video depth by embedding predicted inter-frame camera poses into a network with an Alternating Spatio-Temporal Transformer for better spatial precision and temporal coherence.

  26. GemDepth: Geometry-Embedded Features for 3D-Consistent Video Depth

    cs.CV 2026-05 unverdicted novelty 6.0

    GemDepth embeds predicted camera poses into a spatio-temporal transformer to achieve state-of-the-art 3D-consistent video depth estimation.

  27. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...

  28. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.

  29. Geometric 4D Stitching for Grounded 4D Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Geometric 4D Stitching explicitly complements missing geometric regions in 4D generated scenes with grounded stitches to achieve consistent 4D representations in under 10 minutes on a single GPU.

  30. Sat3R: Satellite DSM Reconstruction via RPC-Aware Depth Fine-tuning

    cs.CV 2026-05 unverdicted novelty 6.0

    Sat3R adapts Depth Anything V2 via RPC-aware metric depth fine-tuning to deliver satellite DSM reconstruction with 38% lower MAE than zero-shot baselines and over 300x speedup versus optimization methods.

  31. Spark3R: Asymmetric Token Reduction Makes Fast Feed-Forward 3D Reconstruction

    cs.CV 2026-05 unverdicted novelty 6.0

    Asymmetric token reduction, with distinct merging for queries and pruning for key-values plus layer-wise adaptation, delivers up to 28x speedup on 1000-frame 3D reconstruction inputs while preserving competitive quality.

  32. AnchorD: Metric Grounding of Monocular Depth Using Factor Graphs

    cs.RO 2026-05 unverdicted novelty 6.0

    AnchorD anchors monocular depth priors in metric sensor data via patch-wise affine alignment using factor graph optimization, improving accuracy on non-Lambertian objects and introducing a new benchmark dataset with d...

  33. 3D-ReGen: A Unified 3D Geometry Regeneration Framework

    cs.CV 2026-04 unverdicted novelty 6.0

    3D-ReGen is a conditioned 3D regenerator using VecSet that learns a regeneration prior from unlabeled 3D datasets via self-supervised tasks and achieves state-of-the-art results on controllable 3D geometry tasks.

  34. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...

  35. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

  36. Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.

  37. SS3D: End2End Self-Supervised 3D from Web Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    SS3D pretrains an end-to-end 3D estimator on filtered YouTube-8M videos via SfM self-supervision, achieving improved zero-shot transfer and fine-tuning over prior baselines.

  38. SS3D: End2End Self-Supervised 3D from Web Videos

    cs.CV 2026-04 unverdicted novelty 6.0

    SS3D pretrains an end-to-end feed-forward 3D estimator on filtered YouTube-8M videos via SfM self-supervision, MVS filtering, and expert distillation, delivering stronger zero-shot transfer and fine-tuning than prior ...

  39. MLG-Stereo: ViT Based Stereo Matching with Multi-Stage Local-Global Enhancement

    cs.CV 2026-04 unverdicted novelty 6.0

    MLG-Stereo adds multi-granularity feature extraction, local-global cost volumes, and guided recurrent refinement to ViT stereo matching, yielding competitive results on Middlebury, KITTI-2015, and strong results on KI...

  40. Image Generators are Generalist Vision Learners

    cs.CV 2026-04 conditional novelty 6.0

    Image generation pretraining builds generalist vision models that reach SOTA on 2D and 3D perception tasks by reframing them as RGB image outputs.

  41. Image Generators are Generalist Vision Learners

    cs.CV 2026-04 unverdicted novelty 6.0

    Image generation pretraining produces generalist vision models that reframe perception tasks as image synthesis and reach SOTA results on segmentation, depth estimation, and other 2D/3D tasks.

  42. FurnSet: Exploiting Repeats for 3D Scene Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    FurnSet improves single-view 3D scene reconstruction by using per-object CLS tokens and set-aware self-attention to group and jointly reconstruct repeated object instances, with added scene-object conditioning and lay...

  43. Enhancing Glass Surface Reconstruction via Depth Prior for Robot Navigation

    cs.RO 2026-04 unverdicted novelty 6.0

    A training-free RANSAC-based fusion of depth foundation model priors with sensor data recovers accurate metric depth on glass, supported by a new GlassRecon RGB-D dataset with derived ground truth.

  44. VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

    cs.CV 2026-04 unverdicted novelty 6.0

    VEFX-Bench releases a large human-labeled video editing dataset, a multi-dimensional reward model, and a standardized benchmark that better matches human judgments than generic evaluators.

  45. Geometric Context Transformer for Streaming 3D Reconstruction

    cs.CV 2026-04 unverdicted novelty 6.0

    LingBot-Map is a streaming 3D reconstruction model built on a geometric context transformer that combines anchor context, pose-reference window, and trajectory memory to deliver accurate, drift-resistant results at 20...

  46. Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective

    cs.CV 2026-04 unverdicted novelty 6.0

    The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...

  47. Lyra 2.0: Explorable Generative 3D Worlds

    cs.CV 2026-04 unverdicted novelty 6.0

    Lyra 2.0 produces persistent 3D-consistent video sequences for large explorable worlds by using per-frame geometry for information routing and self-augmented training to correct temporal drift.

  48. Grasp in Gaussians: Fast Monocular Reconstruction of Dynamic Hand-Object Interactions

    cs.CV 2026-04 unverdicted novelty 6.0

    GraG reconstructs dynamic 3D hand-object interactions from monocular video 6.4x faster than prior work by using compact Sum-of-Gaussians tracking initialized from large models and refined with 2D losses.

  49. Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

    cs.CV 2026-04 unverdicted novelty 6.0

    A video diffusion model learns a joint distribution over videos and camera trajectories by representing cameras as pixel-aligned ray encodings (raxels) denoised jointly with video frames via decoupled attention.

  50. Self-Improving 4D Perception via Self-Distillation

    cs.CV 2026-04 unverdicted novelty 6.0

    SelfEvo enables pretrained 4D perception models to self-improve on unlabeled videos via self-distillation, delivering up to 36.5% relative gains in video depth estimation and 20.1% in camera estimation across eight be...

  51. PhyEdit: Towards Real-World Object Manipulation via Physically-Grounded Image Editing

    cs.CV 2026-04 unverdicted novelty 6.0

    PhyEdit improves physical accuracy in image object manipulation by using explicit geometric simulation as 3D-aware guidance combined with joint 2D-3D supervision.

  52. LSRM: High-Fidelity Object-Centric Reconstruction via Scaled Context Windows

    cs.CV 2026-04 conditional novelty 6.0

    LSRM scales transformer context windows with native sparse attention and geometric routing to deliver high-fidelity feed-forward 3D reconstruction and inverse rendering that approaches dense optimization quality.

  53. LoMa: Local Feature Matching Revisited

    cs.CV 2026-04 unverdicted novelty 6.0

    Scaling data, model size, and compute for local feature matching produces large performance gains on challenging benchmarks and a new manually annotated HardMatch dataset.

  54. Memory Over Maps: 3D Object Localization Without Reconstruction

    cs.RO 2026-03 unverdicted novelty 6.0

    A map-free localization method stores posed RGB-D keyframes, retrieves and re-ranks them with a VLM, then fuses sparse depth for on-demand 3D target estimates, matching reconstruction-based performance on navigation b...

  55. EponaV2: Driving World Model with Comprehensive Future Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

  56. Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence

    cs.CV 2026-05 unverdicted novelty 5.0

    Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.

  57. Syn4D: A Multiview Synthetic 4D Dataset

    cs.CV 2026-05 unverdicted novelty 5.0

    Syn4D is a new multiview synthetic 4D dataset supplying dense ground-truth annotations for dynamic scene reconstruction, tracking, and human pose estimation.

  58. SS3D: End2End Self-Supervised 3D from Web Videos

    cs.CV 2026-04 unverdicted novelty 5.0

    SS3D scales SfM-based self-supervision to ~100M frames from YouTube-8M using a multi-view signal proxy for filtering and a two-stage training schedule, achieving strong zero-shot transfer and better fine-tuning than p...

  59. Context Unrolling in Omni Models

    cs.CV 2026-04 unverdicted novelty 5.0

    Omni is a multimodal model whose native training on diverse data types enables context unrolling, allowing explicit reasoning across modalities to better approximate shared knowledge and improve downstream performance.

  60. MonoEM-GS: Monocular Expectation-Maximization Gaussian Splatting SLAM

    cs.RO 2026-04 unverdicted novelty 5.0

    MonoEM-GS stabilizes view-dependent geometry from foundation models inside a global Gaussian Splatting representation via EM and adds multi-modal features for in-place open-set segmentation.

Reference graph

Works this paper leans on

127 extracted references · 127 canonical work pages · cited by 59 Pith papers · 5 internal anchors

  1. [1]

    Large-scale data for multiple-view stereopsis

    Henrik Aanæs, Rasmus Ramsbøl Jensen, George Vogiatzis, Engin Tola, and Anders Bjorholm Dahl. Large-scale data for multiple-view stereopsis. Int. J. Comput. Vis., 120(2):153–168, 2016

  2. [2]

    Yousset I Abdel-Aziz, Hauck Michael Karara, and Michael Hauck. Direct linear transformation from comparator coordinates into object space coordinates in close-range photogrammetry. Photogrammetric engineering & remote sensing, 81(2):103–107, 2015

  3. [3]

    Map-free visual relocalization: Metric pose relative to a single image

    Eduardo Arnold, Jamie Wynn, Sara Vicente, Guillermo Garcia-Hernando, Aron Monszpart, Victor Prisacariu, Daniyar Turmukhambetov, and Eric Brachmann. Map-free visual relocalization: Metric pose relative to a single image. In European Conference on Computer Vision, pages 690–708. Springer, 2022

  4. [4]

    Perception of three-dimensional shape specified by optic flow by 8-week-old infants

    Martha E Arterberry and Albert Yonas. Perception of three-dimensional shape specified by optic flow by 8-week-old infants. Perception & Psychophysics, 62(3):550–556, 2000

  5. [5]

    Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data

    Gilad Baruch, Zhuoyuan Chen, Afshin Dehghan, Tal Dimry, Yuri Feigin, Peter Fu, Thomas Gebauer, Brandon Joffe, Daniel Kurz, Arik Schwartz, et al. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data. Adv. Neural Inform. Process. Syst., 2021

  6. [6]

    Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

    Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. arXiv preprint arXiv:2410.02073, 2024

  7. [7]

    Depth pro: Sharp monocular metric depth in less than a second

    Alexey Bochkovskiy, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan Richter, and Vladlen Koltun. Depth pro: Sharp monocular metric depth in less than a second. In Int. Conf. Learn. Represent., 2025

  8. [8]

    Unstructured lumigraph rendering

    Chris Buehler, Michael Bosse, Leonard McMillan, Steven Gortler, and Michael Cohen. Unstructured lumigraph rendering. In Proceedings of the 28th annual conference on Computer graphics and interactive techniques, pages 425–432, 2001

  9. [9]

    Virtual kitti 2, 2020

    Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual kitti 2, 2020

  10. [10]

    Must3r: Multi-view network for stereo 3d reconstruction

    Yohann Cabon, Lucas Stoffl, Leonid Antsfeld, Gabriela Csurka, Boris Chidlovskii, Jerome Revaud, and Vincent Leroy. Must3r: Multi-view network for stereo 3d reconstruction. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1050–1060, 2025

  11. [11]

    pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction

    David Charatan, Sizhe Lester Li, Andrea Tagliasacchi, and Vincent Sitzmann. pixelsplat: 3d gaussian splats from image pairs for scalable generalizable 3d reconstruction. In IEEE Conf. Comput. Vis. Pattern Recog., pages 19457–19467, 2024

  12. [12]

    Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo

    Anpei Chen, Zexiang Xu, Fuqiang Zhao, Xiaoshuai Zhang, Fanbo Xiang, Jingyi Yu, and Hao Su. Mvsnerf: Fast generalizable radiance field reconstruction from multi-view stereo. In IEEE Conf. Comput. Vis. Pattern Recog., pages 14124–14133, 2021

  13. [13]

    Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images

    Yuedong Chen, Haofei Xu, Chuanxia Zheng, Bohan Zhuang, Marc Pollefeys, Andreas Geiger, Tat-Jen Cham, and Jianfei Cai. Mvsplat: Efficient 3d gaussian splatting from sparse multi-view images. In Eur. Conf. Comput. Vis., pages 370–386. Springer, 2024

  14. [14]

    Explicit correspondence matching for generalizable neural radiance fields

    Yuedong Chen, Haofei Xu, Qianyi Wu, Chuanxia Zheng, Tat-Jen Cham, and Jianfei Cai. Explicit correspondence matching for generalizable neural radiance fields. IEEE Trans. Pattern Anal. Mach. Intell., 2025

  15. [15]

    Diml/cvl rgb-d dataset: 2m rgb-d images of natural indoor and outdoor scenes

    Jaehoon Cho, Dongbo Min, Youngjung Kim, and Kwanghoon Sohn. Diml/cvl rgb-d dataset: 2m rgb-d images of natural indoor and outdoor scenes. arXiv preprint arXiv:2110.11590, 2021

  16. [16]

    The cityscapes dataset for semantic urban scene understanding

    Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016

  17. [17]

    Objaverse: A universe of annotated 3d objects

    Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13142–13153, 2023

  18. [18]

    Sail-recon: Large sfm by augmenting scene regression with localization

    Junyuan Deng, Heng Li, Tao Xie, Weiqiang Ren, Qian Zhang, Ping Tan, and Xiaoyang Guo. Sail-recon: Large sfm by augmenting scene regression with localization. arXiv preprint arXiv:2508.17972, 2025

  19. [19]

    Vggt-long: Chunk it, loop it, align it – pushing vggt’s limits on kilometer-scale long rgb sequences

    Kai Deng, Zexin Ti, Jiawei Xu, Jian Yang, and Jin Xie. Vggt-long: Chunk it, loop it, align it – pushing vggt’s limits on kilometer-scale long rgb sequences. arXiv preprint arXiv:2507.11539, 2025

  20. [20]

    Superpoint: Self-supervised interest point detection and description

    Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In CVPR workshops, pages 224–236, 2018

  21. [21]

    Drone australia gliding ep025: Sydney views | opera house, harbour bridge & hyde park | dji mavic 4k

    MTS Drones. Drone australia gliding ep025: Sydney views | opera house, harbour bridge & hyde park | dji mavic 4k. https://www.youtube.com/watch?v=qbgKDaGraTA, 2024. Accessed: Sep. 25, 2025. Used under YouTube Standard License

  22. [22]

    D2-net: A trainable cnn for joint description and detection of local features

    Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Pollefeys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2-net: A trainable cnn for joint description and detection of local features. In IEEE Conf. Comput. Vis. Pattern Recog., pages 8092–8101, 2019

  23. [23]

    Depth map prediction from a single image using a multi-scale deep network

    David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Adv. Neural Inform. Process. Syst., 2014

  24. [24]

    Dsec: A stereo event camera dataset for driving scenarios

    Mathias Gehrig, Willem Aarents, Daniel Gehrig, and Davide Scaramuzza. Dsec: A stereo event camera dataset for driving scenarios. IEEE Robotics and Automation Letters, 6(3):4947–4954, 2021

  25. [26]

    Vision meets robotics: The kitti dataset

    Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The kitti dataset. The international journal of robotics research, 32(11):1231–1237, 2013

  26. [27]

    Online training of stereo self-calibration using monocular depth estimation

    Yotam Gil, Shay Elmalem, Harel Haim, Emanuel Marom, and Raja Giryes. Online training of stereo self-calibration using monocular depth estimation. IEEE Transactions on Computational Imaging, 7:812–823, 2021

  27. [28]

    Radiant foam: Real-time differentiable ray tracing

    Shrisudhan Govindarajan, Daniel Rebain, Kwang Moo Yi, and Andrea Tagliasacchi. Radiant foam: Real-time differentiable ray tracing. In Int. Conf. Comput. Vis., 2025

  28. [29]

    Sugar: Surface-aligned gaussian splatting for efficient 3d mesh reconstruction and high-quality mesh rendering

    Antoine Guédon and Vincent Lepetit. Sugar: Surface-aligned gaussian splatting for efficient 3d mesh reconstruction and high-quality mesh rendering. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5354–5363, 2024

  29. [30]

    3d packing for self-supervised monocular depth estimation

    Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3d packing for self-supervised monocular depth estimation. In IEEE Conf. Comput. Vis. Pattern Recog., 2020

  30. [31]

    Multi-view reconstruction via sfm-guided monocular depth estimation

    Haoyu Guo, He Zhu, Sida Peng, Haotong Lin, Yunzhi Yan, Tao Xie, Wenguan Wang, Xiaowei Zhou, and Hujun Bao. Multi-view reconstruction via sfm-guided monocular depth estimation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 5272–5282, 2025

  31. [32]

    All for one, and one for all: Urbansyn dataset, the third musketeer of synthetic driving scenes

    Jose L. Gómez, Manuel Silva, Antonio Seoane, Agnés Borràs, Mario Noriega, German Ros, Jose A. Iglesias-Guitian, and Antonio M. López. All for one, and one for all: Urbansyn dataset, the third musketeer of synthetic driving scenes. Neurocomputing, 637:130038, 2025. ISSN 0925-2312. doi: https://doi.org/10.1016/j.neucom.2025.130038. URL https://www.sciencedir...

  32. [33]

    Detector-free structure from motion

    Xingyi He, Jiaming Sun, Yifan Wang, Sida Peng, Qixing Huang, Hujun Bao, and Xiaowei Zhou. Detector-free structure from motion. In IEEE Conf. Comput. Vis. Pattern Recog., pages 21594–21603, 2024

  33. [34]

    Plenoptic modeling and rendering from image sequences taken by a hand-held camera

    Benno Heigl, Reinhard Koch, Marc Pollefeys, Joachim Denzler, and Luc Van Gool. Plenoptic modeling and rendering from image sequences taken by a hand-held camera. In Mustererkennung 1999: 21. DAGM-Symposium Bonn, 15.–17. September 1999, pages 94–101. Springer, 1999

  34. [35]

    Lrm: Large reconstruction model for single image to 3d

    Yicong Hong, Kai Zhang, Jiuxiang Gu, Sai Bi, Yang Zhou, Difan Liu, Feng Liu, Kalyan Sunkavalli, Trung Bui, and Hao Tan. Lrm: Large reconstruction model for single image to 3d. In Int. Conf. Learn. Represent., 2024

  35. [36]

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

  36. [37]

    Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation

    Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, and Shaojie Shen. Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation. TPAMI, 2024

  37. [38]

    Deepmvs: Learning multi-view stereopsis

    Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. Deepmvs: Learning multi-view stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018

  38. [39]

    Pow3r: Empowering unconstrained 3d reconstruction with camera and scene priors

    Wonbong Jang, Philippe Weinzaepfel, Vincent Leroy, Lourdes Agapito, and Jerome Revaud. Pow3r: Empowering unconstrained 3d reconstruction with camera and scene priors. In IEEE Conf. Comput. Vis. Pattern Recog., pages 1071–1081, 2025

  39. [40]

    Megasynth: Scaling up 3d scene reconstruction with synthesized data

    Hanwen Jiang, Zexiang Xu, Desai Xie, Ziwen Chen, Haian Jin, Fujun Luan, Zhixin Shu, Kai Zhang, Sai Bi, Xin Sun, et al. Megasynth: Scaling up 3d scene reconstruction with synthesized data. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 16441–16452, 2025

  40. [41]

    Anysplat: Feed-forward 3d gaussian splatting from unconstrained views

    Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. Anysplat: Feed-forward 3d gaussian splatting from unconstrained views. ACM Trans. Graph., 2025

  41. [42]

    Repurposing diffusion-based image generators for monocular depth estimation

    Bingxin Ke, Anton Obukhov, Shengyu Huang, Nando Metzger, Rodrigo Caye Daudt, and Konrad Schindler. Repurposing diffusion-based image generators for monocular depth estimation. In CVPR, pages 9492–9502, 2024

  42. [43]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Universal feed-forward metric 3D reconstruction, 202...

  43. [44]

    3d gaussian splatting for real-time radiance field rendering

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering. ACM Trans. Graph., 42(4):139–1, 2023

  44. [46]

    Tanks and temples: Benchmarking large-scale scene reconstruction

    Arno Knapitsch, Jaesik Park, Qian-Yi Zhou, and Vladlen Koltun. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Trans. Graph., 36(4):1–13, 2017

  45. [47]

    Eden: Multimodal synthetic dataset of enclosed garden scenes

    Hoang-An Le, Thomas Mensink, Partha Das, Sezer Karaoglu, and Theo Gevers. Eden: Multimodal synthetic dataset of enclosed garden scenes. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1579–1589, 2021

  46. [48]

    Grounding image matching in 3d with mast3r

    Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3d with mast3r. In Eur. Conf. Comput. Vis., pages 71–91. Springer, 2024

  47. [49]

    Light field rendering

    Marc Levoy and Pat Hanrahan. Light field rendering. ACM Trans. Graph., 1996

  48. [50]

    Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond

    Yixuan Li, Lihan Jiang, Linning Xu, Yuanbo Xiangli, Zhenzhi Wang, Dahua Lin, and Bo Dai. Matrixcity: A large-scale city dataset for city-scale neural rendering and beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3205–3215, 2023

  49. [51]

    Megadepth: Learning single-view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2041–2050, 2018

  50. [52]

    Efficient neural radiance fields for interactive free-viewpoint video

    Haotong Lin, Sida Peng, Zhen Xu, Yunzhi Yan, Qing Shuai, Hujun Bao, and Xiaowei Zhou. Efficient neural radiance fields for interactive free-viewpoint video. In SIGGRAPH Asia 2022 Conference Papers, pages 1–9, 2022

  51. [53]

    Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision

    Lu Ling, Yichen Sheng, Zhi Tu, Wentian Zhao, Cheng Xin, Kun Wan, Lantao Yu, Qianyu Guo, Zixun Yu, Yawen Lu, et al. Dl3dv-10k: A large-scale scene dataset for deep learning-based 3d vision. In IEEE Conf. Comput. Vis. Pattern Recog., pages 22160–22169, 2024

  52. [54]

    Vggt-slam: Dense rgb slam optimized on the sl(4) manifold

    Dominic Maggio, Hyungtae Lim, and Luca Carlone. Vggt-slam: Dense rgb slam optimized on the sl(4) manifold. arXiv preprint arXiv:2505.12549, 2025

  53. [55]

    John McCormac, Ankur Handa, Stefan Leutenegger, and Andrew J Davison. Scenenet rgb-d: Can 5m synthetic images beat generic imagenet pre-training on indoor segmentation? In Proceedings of the IEEE International Conference on Computer Vision, pages 2678–2687, 2017

  54. [56]

    Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo

    Lukas Mehl, Jenny Schmalfuss, Azin Jahedi, Yaroslava Nalivayko, and Andrés Bruhn. Spring: A high-resolution high-detail dataset and benchmark for scene flow, optical flow and stereo. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4981–4991, 2023

  55. [57]

    Nerf: Representing scenes as neural radiance fields for view synthesis

    Ben Mildenhall, Pratul P Srinivasan, Matthew Tancik, Jonathan T Barron, Ravi Ramamoorthi, and Ren Ng. Nerf: Representing scenes as neural radiance fields for view synthesis. In Eur. Conf. Comput. Vis., 2020

  56. [58]

    Orb-slam: A versatile and accurate monocular slam system

    Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. Orb-slam: A versatile and accurate monocular slam system. IEEE transactions on robotics, 31(5):1147–1163, 2015

  57. [59]

    Mast3r-slam: Real-time dense slam with 3d reconstruction priors

    Riku Murai, Eric Dexheimer, and Andrew J Davison. Mast3r-slam: Real-time dense slam with 3d reconstruction priors. In IEEE Conf. Comput. Vis. Pattern Recog., pages 16695–16705, 2025

  58. [60]

    3d ken burns effect from a single image

    Simon Niklaus, Long Mai, Jimei Yang, and Feng Liu. 3d ken burns effect from a single image. ACM Transactions on Graphics (ToG), 38(6):1–15, 2019

  59. [61]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023

  60. [62]

    Global structure-from-motion revisited

    Linfei Pan, Dániel Baráth, Marc Pollefeys, and Johannes L Schönberger. Global structure-from-motion revisited. In Eur. Conf. Comput. Vis., pages 58–77. Springer, 2024

  61. [63]

    Aria digital twin: A new benchmark dataset for egocentric 3d machine perception

    Xiaqing Pan, Nicholas Charron, Yongqian Yang, Scott Peters, Thomas Whelan, Chen Kong, Omkar Parkhi, Richard Newcombe, and Yuheng Carl Ren. Aria digital twin: A new benchmark dataset for egocentric 3d machine perception. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 20133–20143, 2023

  62. [64]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  63. [65]

    Unidepth: Universal monocular metric depth estimation

    Luigi Piccinelli, Yung-Hsu Yang, Christos Sakaridis, Mattia Segu, Siyuan Li, Luc Van Gool, and Fisher Yu. Unidepth: Universal monocular metric depth estimation. In IEEE Conf. Comput. Vis. Pattern Recog., pages 10106–10116, 2024

  64. [66]

    Unidepthv2: Universal monocular metric depth estimation made simpler

    Luigi Piccinelli, Christos Sakaridis, Yung-Hsu Yang, Mattia Segu, Siyuan Li, Wim Abbeloos, and Luc Van Gool. Unidepthv2: Universal monocular metric depth estimation made simpler. arXiv preprint arXiv:2502.20110, 2025

  65. [67]

    Vision transformers for dense prediction

    René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Int. Conf. Comput. Vis., pages 12179–12188, 2021

  66. [68]

    Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction

    Jeremy Reizenstein, Roman Shapovalov, Philipp Henzler, Luca Sbordone, Patrick Labatut, and David Novotny. Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction. In Int. Conf. Comput. Vis., pages 10901–10911, 2021

  67. [69]

    Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding

    Mike Roberts, Jason Ramapuram, Anurag Ranjan, Atulit Kumar, Miguel Angel Bautista, Nathan Paczan, Russ Webb, and Joshua M Susskind. Hypersim: A photorealistic synthetic dataset for holistic indoor scene understanding. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10912–10922, 2021

  68. [70]

    Structure-from-motion revisited

    Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE Conf. Comput. Vis. Pattern Recog., 2016

  69. [71]

    Pixelwise view selection for unstructured multi-view stereo

    Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In Eur. Conf. Comput. Vis., 2016

  70. [72]

    A multi-view stereo benchmark with high-resolution images and multi-camera videos

    Thomas Schops, Johannes L Schonberger, Silvano Galliani, Torsten Sattler, Konrad Schindler, Marc Pollefeys, and Andreas Geiger. A multi-view stereo benchmark with high-resolution images and multi-camera videos. In IEEE Conf. Comput. Vis. Pattern Recog., pages 3260–3269, 2017

  71. [73]

    A comparison and evaluation of multi-view stereo reconstruction algorithms

    Steven M Seitz, Brian Curless, James Diebel, Daniel Scharstein, and Richard Szeliski. A comparison and evaluation of multi-view stereo reconstruction algorithms. In IEEE Conf. Comput. Vis. Pattern Recog., volume 1, pages 519–528. IEEE, 2006

  72. [74]

    Scene coordinate regression forests for camera relocalization in rgb-d images

    Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in rgb-d images. In IEEE Conf. Comput. Vis. Pattern Recog., pages 2930–2937, 2013

  73. [75]

    Indoor segmentation and support inference from rgbd images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Eur. Conf. Comput. Vis., pages 746–760. Springer, 2012

  74. [76]

    Indoor segmentation and support inference from rgbd images

    Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from rgbd images. In Eur. Conf. Comput. Vis., pages 746–760. Springer, 2012

  75. [77]

    Scene representation networks: Continuous 3d-structure-aware neural scene representations

    Vincent Sitzmann, Michael Zollhöfer, and Gordon Wetzstein. Scene representation networks: Continuous 3d-structure-aware neural scene representations. Adv. Neural Inform. Process. Syst., 32, 2019

  76. [78]

    Light field networks: Neural scene representations with single-evaluation rendering

    Vincent Sitzmann, Semon Rezchikov, Bill Freeman, Josh Tenenbaum, and Fredo Durand. Light field networks: Neural scene representations with single-evaluation rendering. Adv. Neural Inform. Process. Syst., 34:19313–19325, 2021

  77. [79]

    Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs

    Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. Splatt3r: Zero-shot gaussian splatting from uncalibrated image pairs. arXiv preprint arXiv:2408.13912, 2024

  78. [80]

    Photo tourism: exploring photo collections in 3d

    Noah Snavely, Steven M Seitz, and Richard Szeliski. Photo tourism: exploring photo collections in 3d. ACM Trans. Graph., pages 835–846, 2006

  79. [81]

    Sun rgb-d: A rgb-d scene understanding benchmark suite

    Shuran Song, Samuel P Lichtenberg, and Jianxiong Xiao. Sun rgb-d: A rgb-d scene understanding benchmark suite. In IEEE Conf. Comput. Vis. Pattern Recog., pages 567–576, 2015

  80. [83]

    The Replica Dataset: A Digital Replica of Indoor Spaces

    Julian Straub, Thomas Whelan, Lingni Ma, Yufan Chen, Erik Wijmans, Simon Green, Jakob J Engel, Raul Mur-Artal, Carl Ren, Shobhit Verma, et al. The replica dataset: A digital replica of indoor spaces.arXiv preprint arXiv:1906.05797, 2019

Showing first 80 references.