{"paper":{"title":"MonST3R: A Simple Approach for Estimating Geometry in the Presence of Motion","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A pointmap estimator fine-tuned on limited dynamic video data can estimate geometry in moving scenes without explicit motion modeling.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Charles Herrmann, Deqing Sun, Forrester Cole, Junhwa Hur, Junyi Zhang, Ming-Hsuan Yang, Trevor Darrell, Varun Jampani","submitted_at":"2024-10-04T18:00:07Z","abstract_excerpt":"Estimating geometry from dynamic scenes, where objects move and deform over time, remains a core challenge in computer vision. Current approaches often rely on multi-stage pipelines or global optimizations that decompose the problem into subtasks, like depth and flow, leading to complex systems prone to errors. In this paper, we present Motion DUSt3R (MonST3R), a novel geometry-first approach that directly estimates per-timestep geometry from dynamic scenes. Our key insight is that by simply estimating a pointmap for each timestep, we can effectively adapt DUST3R's representation, previously o"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"By posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics, even without an explicit motion representation.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"Suitable dynamic posed videos with depth labels exist in sufficient quantity and quality to allow fine-tuning to generalize to arbitrary motion and deformation.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"By fine-tuning DUST3R to output per-timestep pointmaps on scarce dynamic video datasets, MonST3R achieves stronger video depth and pose estimation without explicit motion modeling.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A pointmap estimator fine-tuned on limited dynamic video data can estimate geometry in moving scenes without explicit motion modeling.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"7cc5b618e5ab3b0499c32f868a933f5d546ae2adf40daaa35cfd5ae0a16015ec"},"source":{"id":"2410.03825","kind":"arxiv","version":2},"verdict":{"id":"55b4dfb0-921a-4c88-b11d-a004d94f9b98","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-15T14:36:24.653161Z","strongest_claim":"By posing the problem as a fine-tuning task, identifying several suitable datasets, and strategically training the model on this limited data, we can surprisingly enable the model to handle dynamics, even without an explicit motion representation.","one_line_summary":"By fine-tuning DUST3R to output per-timestep pointmaps on scarce dynamic video datasets, MonST3R achieves stronger video depth and pose estimation without explicit motion modeling.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"Suitable dynamic posed videos with depth labels exist in sufficient quantity and quality to allow fine-tuning to generalize to arbitrary motion and deformation.","pith_extraction_headline":"A pointmap estimator fine-tuned on limited dynamic video data can estimate geometry in moving scenes without explicit motion modeling."},"references":{"count":169,"sample":[{"doi":"","year":null,"title":"Scaling Learning Algorithms Towards","work_id":"bb2761cc-98d0-411b-92f6-803773d64460","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"and Osindero, Simon and Teh, Yee Whye , journal =","work_id":"0a5921e3-ac4e-46f1-85ae-866119a87be0","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2016,"title":"Deep learning , author=. 2016 , publisher=","work_id":"cf0899e0-53ee-4591-aae4-f38fa5ac12ad","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Repurposing diffusion-based image generators for monocular depth estimation , author=","work_id":"5161405f-a02d-444d-b9c8-5673b4c5d9bc","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Wang, Wenshan and Hu, Yaoyu and Scherer, Sebastian , booktitle=CoRL, pages=. Tartan","work_id":"b8109ebf-41b7-472b-81bc-337d4d69facd","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":169,"snapshot_sha256":"01953506d9b3e2115280417af4393463ee2ac28f24b7c8388828a4e815a02cd6","internal_anchors":3},"formal_canon":{"evidence_count":2,"snapshot_sha256":"9b8a0dba0acbc64f0ac1ce52181a5ee36208037b2ca1c01e23840bc164be249c"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}