Recognition: 2 theorem links
MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion
Pith reviewed 2026-05-15 14:36 UTC · model grok-4.3
The pith
A pointmap estimator fine-tuned on limited dynamic video data can estimate geometry in moving scenes without explicit motion modeling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MonST3R directly estimates per-timestep pointmaps from dynamic scenes by fine-tuning an existing pointmap model on several dynamic posed video datasets with depth labels, enabling it to handle motion and deformation without any explicit motion representation or multi-stage decomposition.
What carries the argument
Per-timestep pointmap output, which supplies an independent 3D point cloud for every video frame to serve as the geometry representation.
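The representation can be made concrete with a small sketch (hypothetical shapes and helper names; the paper's actual model is a fine-tuned transformer, not a back-projection). Each frame t yields an H×W×3 pointmap in camera coordinates, and per-frame depth falls out as the z channel:

```python
import numpy as np

def toy_pointmap(depth, fx, fy, cx, cy):
    """Back-project a depth map into an H x W x 3 pointmap in camera
    coordinates. A stand-in for the network's per-frame output."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return np.stack([x, y, depth], axis=-1)

# One independent pointmap per timestep -- no explicit motion model.
video_depth = [np.full((4, 4), z) for z in (1.0, 1.5, 2.0)]
pointmaps = [toy_pointmap(d, fx=2.0, fy=2.0, cx=1.5, cy=1.5)
             for d in video_depth]

# Depth is recovered directly as the z channel of each pointmap.
assert all(np.allclose(pm[..., 2], d)
           for pm, d in zip(pointmaps, video_depth))
```

The point of the sketch is the data layout: because every timestep carries its own 3D point cloud, moving and deforming objects need no separate flow or motion stage.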
If this is right
- Video depth estimation becomes more robust because the single-stage pointmap prediction avoids compounding errors from separate depth and flow stages.
- Camera pose estimation in dynamic scenes improves in both accuracy and speed by operating directly on the per-frame geometry output.
- Primarily feed-forward 4D reconstruction from video becomes feasible without requiring global optimization or explicit temporal modeling.
Where Pith is reading between the lines
- The same fine-tuning strategy could support real-time video geometry if the underlying model is distilled or quantized for lower latency.
- Integration with generative video models might allow the pointmaps to guide synthesis of missing or occluded geometry across frames.
- Performance on fluid or highly non-rigid motion could be tested by adding synthetic datasets with controlled deformation parameters.
Load-bearing premise
The fine-tuning data of dynamic posed videos with depth labels is sufficient for the model to generalize to arbitrary motions and deformations outside the training distribution.
What would settle it
Evaluate the model on a held-out set of videos containing motion patterns and object deformations absent from the fine-tuning datasets and measure whether depth accuracy or pose estimation error rises sharply compared with prior multi-stage methods.
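One way to run that check, sketched with illustrative stand-in predictions (not numbers from the paper) and a standard depth metric, scale-aligned absolute relative error:

```python
import numpy as np

def abs_rel(pred, gt, eps=1e-6):
    """Median-scale-aligned absolute relative depth error."""
    scale = np.median(gt) / max(np.median(pred), eps)
    return float(np.mean(np.abs(pred * scale - gt) / np.maximum(gt, eps)))

# Hypothetical stand-ins: small errors on in-distribution motions,
# larger errors on held-out motion patterns.
rng = np.random.default_rng(0)
gt = rng.uniform(1.0, 5.0, size=64)
pred_in_dist = gt * (1 + 0.05 * rng.standard_normal(64))
pred_held_out = gt * (1 + 0.30 * rng.standard_normal(64))

err_in = abs_rel(pred_in_dist, gt)
err_out = abs_rel(pred_held_out, gt)
# A sharp rise on held-out motions would point to memorization of
# dataset-specific motion patterns rather than a general geometry prior.
```

The decisive quantity is the gap `err_out - err_in` on motions absent from the fine-tuning data, compared against the same gap for prior multi-stage methods.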
read the original abstract
Estimating geometry from dynamic scenes, where objects move and deform over time, remains a core challenge in computer vision. Current approaches often rely on multi-stage pipelines or global optimizations that decompose the problem into subtasks, like depth and flow, leading to complex systems prone to errors. In this paper, we present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes. Our key insight is that by simply estimating a pointmap for each timestep, we can effectively adapt DUST3R's representation, previously only used for static scenes, to dynamic scenes. However, this approach presents a significant challenge: the scarcity of suitable training data, namely dynamic, posed videos with depth labels. Despite this, we show that by posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics, even without an explicit motion representation. Based on this, we introduce new optimizations for several downstream video-specific tasks and demonstrate strong performance on video depth and camera pose estimation, outperforming prior work in terms of robustness and efficiency. Moreover, MonST3R shows promising results for primarily feed-forward 4D reconstruction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MonST3R, a fine-tuned adaptation of the DUSt3R pointmap estimator for dynamic scenes. By predicting independent per-timestep pointmaps from posed video frames with depth supervision, the method implicitly accommodates object motion and deformation without explicit optical flow or 4D motion representations. The central contribution is an empirical demonstration that fine-tuning on a limited set of dynamic posed video+depth datasets suffices to extend static-scene geometry estimation to dynamic cases, yielding improved robustness on video depth estimation, camera pose recovery, and feed-forward 4D reconstruction relative to prior multi-stage pipelines.
Significance. If the reported generalization holds, the work would offer a notably simpler alternative to existing dynamic geometry pipelines that decompose the problem into separate depth, flow, and optimization stages. The approach's strength lies in its minimal architectural change and avoidance of hand-crafted motion models; however, its dependence on empirical fine-tuning rather than a parameter-free derivation limits the strength of the theoretical claim.
major comments (3)
- [§5] §5 (Experiments): Results are reported primarily on in-distribution sequences drawn from the same small set of sources used for fine-tuning. No cross-dataset zero-shot evaluation on novel non-rigid deformations outside the training distribution is presented, leaving open whether performance stems from memorization of dataset-specific motion patterns rather than a general geometry-first dynamic prior.
- [§4] §4 (Training details) and §5: The manuscript acknowledges data scarcity yet provides no ablation that isolates the contribution of data selection strategy versus fine-tuning hyperparameters, nor any quantitative measure (e.g., performance drop under distribution shift) to support the claim that limited data suffices for arbitrary motion handling.
- [§5] Abstract and §5: The claim of outperforming prior work is stated without accompanying tables showing baseline implementations, error bars, or statistical significance; the absence of these details in the reported metrics undermines verification of the robustness and efficiency advantages.
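The third comment's request for error bars can be met with a paired bootstrap over per-sequence errors; a minimal sketch with illustrative numbers (not values from the paper):

```python
import numpy as np

def paired_bootstrap(errs_a, errs_b, n_boot=2000, seed=0):
    """Fraction of paired bootstrap resamples in which method A beats
    method B on mean per-sequence error. Near 1.0 = robust win."""
    rng = np.random.default_rng(seed)
    errs_a, errs_b = np.asarray(errs_a), np.asarray(errs_b)
    idx = rng.integers(0, len(errs_a), size=(n_boot, len(errs_a)))
    wins = errs_a[idx].mean(axis=1) < errs_b[idx].mean(axis=1)
    return float(wins.mean())

# Illustrative per-sequence depth errors for two hypothetical methods.
method_a = [0.10, 0.12, 0.09, 0.11, 0.13, 0.10]
method_b = [0.14, 0.13, 0.15, 0.12, 0.16, 0.14]
win_rate = paired_bootstrap(method_a, method_b)
# win_rate == 1.0 here, since every paired difference favors A.
```

Resampling sequences jointly (the same indices for both methods) keeps the comparison paired, which is the appropriate design when both methods are evaluated on the same test sequences.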
minor comments (2)
- [§3] Notation for per-timestep pointmaps is introduced without an explicit equation linking the dynamic case to the original DUST3R static formulation; adding a short derivation or reference to the base model would improve clarity.
- [Figure 4] Figure captions for qualitative 4D reconstruction results do not indicate which sequences are held-out versus training-distribution, making it difficult to assess generalization from visuals alone.
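The first minor comment asks for an equation linking the dynamic case to the static formulation. A minimal way to state that link, with notation assumed to follow the DUSt3R convention of expressing both pointmaps in the first camera's coordinate frame:

```latex
% Static DUSt3R: an image pair (I^1, I^2) of one rigid scene maps to two
% pointmaps, both expressed in camera 1's frame:
f\bigl(I^{1}, I^{2}\bigr) = \bigl(X^{1,1},\, X^{2,1}\bigr),
  \qquad X^{\,\cdot,1} \in \mathbb{R}^{H \times W \times 3}

% MonST3R (dynamic): frames at times t and t' are treated as the pair,
% so each timestep receives its own pointmap in frame t's coordinates:
f\bigl(I^{t}, I^{t'}\bigr) = \bigl(X^{t,t},\, X^{t',t}\bigr)
```

Under this reading, motion and deformation are absorbed into the per-timestep pointmaps themselves rather than modeled by any additional term.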
Simulated Author's Rebuttal
We thank the referee for their thoughtful comments and suggestions. We provide point-by-point responses below and indicate the changes we will make in the revised manuscript.
read point-by-point responses
-
Referee: [§5] §5 (Experiments): Results are reported primarily on in-distribution sequences drawn from the same small set of sources used for fine-tuning. No cross-dataset zero-shot evaluation on novel non-rigid deformations outside the training distribution is presented, leaving open whether performance stems from memorization of dataset-specific motion patterns rather than a general geometry-first dynamic prior.
Authors: We acknowledge that our current evaluations are primarily on sequences from the training data distributions. To address this, we will include additional zero-shot evaluations on out-of-distribution dynamic scenes with novel deformations in the revised manuscript. This will help demonstrate that the model learns a general geometry prior rather than memorizing specific patterns. revision: yes
-
Referee: [§4] §4 (Training details) and §5: The manuscript acknowledges data scarcity yet provides no ablation that isolates the contribution of data selection strategy versus fine-tuning hyperparameters, nor any quantitative measure (e.g., performance drop under distribution shift) to support the claim that limited data suffices for arbitrary motion handling.
Authors: We agree that further ablations would strengthen the paper. In the revision, we plan to add ablations isolating the effects of data selection and fine-tuning hyperparameters. Additionally, we will report performance drops under distribution shifts to quantify the generalization from limited data. revision: yes
-
Referee: [§5] Abstract and §5: The claim of outperforming prior work is stated without accompanying tables showing baseline implementations, error bars, or statistical significance; the absence of these details in the reported metrics undermines verification of the robustness and efficiency advantages.
Authors: We will revise the abstract and Section 5 to include more detailed comparison tables with baseline implementations, error bars, and statistical significance tests where applicable. This will provide better verification of our claims regarding robustness and efficiency. revision: yes
Circularity Check
Empirical fine-tuning on external dynamic datasets; no derivation reduces to self-defined inputs
full rationale
The paper presents MonST3R as a fine-tuning of the pre-trained DUST3R pointmap estimator on external posed dynamic video+depth datasets. No mathematical derivation chain exists that reduces predictions to quantities defined in terms of the model's own fitted parameters. The central claim is supported by experimental results on training and test splits from those datasets rather than by self-referential equations or load-bearing self-citations that forbid alternatives. This matches the default expectation of no significant circularity for an empirical adaptation paper.
Axiom & Free-Parameter Ledger
free parameters (1)
- fine-tuning hyperparameters and data selection strategy
axioms (1)
- domain assumption Per-timestep pointmaps are sufficient to capture geometry in the presence of motion without an explicit motion representation
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean — washburn_uniqueness_aczel (unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
by simply estimating a pointmap for each timestep, we can effectively adapt DUSt3R's representation... fine-tuning on this limited data
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean — absolute_floor_iff_bare_distinguishability (unclear)
unclear: relation between the paper passage and the cited Recognition theorem.
global optimization... L_align + w_smooth · L_smooth + w_flow · L_flow
What do these tags mean?
- matches — The paper's claim is directly supported by a theorem in the formal canon.
- supports — The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends — The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses — The paper appears to rely on the theorem as machinery.
- contradicts — The paper's claim conflicts with a theorem or certificate in the canon.
- unclear — Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 23 Pith papers
-
Ray-Aware Pointer Memory with Adaptive Updates for Streaming 3D Reconstruction
Ray-aware pointers that track both location and viewing direction enable adaptive retain-or-replace memory updates for more stable streaming 3D reconstruction.
-
Ground4D: Spatially-Grounded Feedforward 4D Reconstruction for Unstructured Off-Road Scenes
Ground4D resolves temporal conflicts in feedforward 4D Gaussian reconstruction for off-road scenes via voxel-grounded temporal aggregation with intra-voxel softmax and surface normal regularization, outperforming prio...
-
AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision
AirZoo is a new large-scale synthetic dataset for aerial 3D vision that improves state-of-the-art models on image retrieval, cross-view matching, and 3D reconstruction when used for fine-tuning.
-
Holo360D: A Large-Scale Real-World Dataset with Continuous Trajectories for Advancing Panoramic 3D Reconstruction and Beyond
Holo360D is the first large-scale dataset providing continuous panoramic sequences with accurately aligned high-completeness depth maps and meshes for training panoramic 3D reconstruction models.
-
Learning 3D Reconstruction with Priors in Test Time
Test-time constrained optimization incorporates priors into pre-trained multiview transformers via self-supervised losses and penalty terms to improve 3D reconstruction accuracy.
-
STAC: Plug-and-Play Spatio-Temporal Aware Cache Compression for Streaming 3D Reconstruction
STAC compresses KV caches in streaming 3D reconstruction transformers via temporal token preservation with decayed attention, spatial voxel compression, and chunked multi-frame optimization, delivering 10x memory redu...
-
ZipMap: Linear-Time Stateful 3D Reconstruction via Test-Time Training
ZipMap achieves linear-time bidirectional 3D reconstruction by zipping image collections into a compact stateful representation via test-time training layers.
-
$\pi^3$: Permutation-Equivariant Visual Geometry Learning
π³ is a feed-forward network with full permutation equivariance that outputs affine-invariant poses and scale-invariant local point maps without reference frames, reaching state-of-the-art on camera pose, depth, and d...
-
CoGE: Sim-to-Real Online Geometric Estimation for Monocular Colonoscopy
CoGE achieves state-of-the-art monocular geometric estimation in colonoscopy by training solely on simulated data via an illumination-aware Retinex-based module and a wavelet-based structure-aware module.
-
Attention Itself Could Retrieve: RetrieveVGGT: Training-Free Long Context Streaming 3D Reconstruction via Query-Key Similarity Retrieval
RetrieveVGGT enables constant-memory long-context streaming 3D reconstruction by retrieving relevant frames via query-key similarities in VGGT's first attention layer, outperforming StreamVGGT and others.
-
RigidFormer: Learning Rigid Dynamics using Transformers
RigidFormer learns mesh-free rigid dynamics from point clouds using object-centric anchors, Anchor-Vertex Pooling, Anchor-based RoPE, and differentiable Kabsch alignment to enforce rigidity.
-
Sat3R: Satellite DSM Reconstruction via RPC-Aware Depth Fine-tuning
Sat3R adapts Depth Anything V2 via RPC-aware metric depth fine-tuning to deliver satellite DSM reconstruction with 38% lower MAE than zero-shot baselines and over 300x speedup versus optimization methods.
-
Long-tail Internet photo reconstruction
Finetuning 3D foundation models on simulated sparse subsets from MegaDepth-X produces robust reconstructions from extremely sparse, noisy internet photos while preserving performance on dense benchmarks.
-
Vista4D: Video Reshooting with 4D Point Clouds
Vista4D re-synthesizes dynamic videos from new viewpoints by grounding them in a 4D point cloud built with static segmentation and multiview training.
-
Feed-Forward 3D Scene Modeling: A Problem-Driven Perspective
The paper proposes a problem-driven taxonomy for feed-forward 3D scene modeling that groups methods by five core challenges: feature enhancement, geometry awareness, model efficiency, augmentation strategies, and temp...
-
Self-Improving 4D Perception via Self-Distillation
SelfEvo enables pretrained 4D perception models to self-improve on unlabeled videos via self-distillation, delivering up to 36.5% relative gains in video depth estimation and 20.1% in camera estimation across eight be...
-
SceneScribe-1M: A Large-Scale Video Dataset with Comprehensive Geometric and Semantic Annotations
SceneScribe-1M is a new dataset of 1 million videos with semantic text, camera parameters, dense depth, and consistent 3D point tracks to support monocular depth estimation, scene reconstruction, point tracking, and t...
-
OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer
OVGGT achieves constant O(1) memory and compute for streaming 3D geometry reconstruction by using FFN-residual-based KV cache compression and dynamic anchor protection, matching state-of-the-art accuracy on long sequences.
-
Streaming 4D Visual Geometry Transformer
A causal transformer with key-value caching and distillation from a bidirectional VGGT model enables efficient online 4D geometry reconstruction from videos.
-
WildPose: A Unified Framework for Robust Pose Estimation in the Wild
WildPose unifies feedforward 3D features from MASt3R with differentiable bundle adjustment for robust monocular pose estimation across dynamic, static, and low-ego-motion scenes.
-
LychSim: A Controllable and Interactive Simulation Framework for Vision Research
LychSim introduces a controllable simulation platform on Unreal Engine 5 with Python API, procedural generation, and LLM integration for vision research tasks.
-
DINO_4D: Semantic-Aware 4D Reconstruction
DINO_4D uses frozen DINOv3 features to inject semantic awareness into 4D dynamic scene reconstruction, improving tracking accuracy and completeness on benchmarks while preserving O(T) complexity.