Recognition: 1 theorem link · Lean Theorem
DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass
Pith reviewed 2026-05-16 21:52 UTC · model grok-4.3
The pith
DePT3R performs dense point tracking and 3D reconstruction of dynamic scenes in one forward pass without camera poses.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DePT3R extracts deep spatio-temporal features with a powerful backbone and regresses pixel-wise maps with dense prediction heads to jointly perform dense point tracking and 3D reconstruction of dynamic scenes from multiple images in a single forward pass, without requiring camera poses.
What carries the argument
Backbone network for spatio-temporal feature extraction paired with dense prediction heads that regress tracking and reconstruction maps; a minimal sketch of this design follows below.
Load-bearing premise
The backbone network extracts spatio-temporal features sufficient for accurate simultaneous regression of tracking and reconstruction maps across challenging dynamic scenes.
What would settle it
A dynamic-scene benchmark on which the method produces lower-accuracy tracks or reconstructions than pose-aware alternatives in fast-motion or heavy-occlusion cases.
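To make that load-bearing design concrete, here is a minimal sketch of a shared-backbone, dual-head multi-task model: one forward pass over unposed frames yields both a per-pixel pointmap and per-pixel track offsets. Every module, shape, and name here is an illustrative assumption, not the DePT3R implementation, which uses a far heavier VGGT-style aggregator and DPT prediction heads.

```python
# Minimal sketch (PyTorch) of the joint design described above: a shared
# feature extractor feeding two dense heads, evaluated in one forward pass.
# All architecture choices below are illustrative assumptions, not DePT3R.
import torch
import torch.nn as nn


class DenseHead(nn.Module):
    """Regresses a per-pixel map (e.g., an XYZ pointmap or 2D track offsets)."""

    def __init__(self, feat_dim: int, out_channels: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_dim, feat_dim // 2, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim // 2, out_channels, kernel_size=1),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats)


class JointTrackRecon(nn.Module):
    """Shared features feed a pointmap head and a tracking head in one pass."""

    def __init__(self, feat_dim: int = 128):
        super().__init__()
        # Stand-in for a heavy spatio-temporal backbone (e.g., a transformer
        # aggregating across frames); here just a per-frame conv encoder.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, feat_dim, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )
        self.pointmap_head = DenseHead(feat_dim, 4)  # XYZ + confidence
        self.track_head = DenseHead(feat_dim, 2)     # per-pixel 2D offsets

    def forward(self, frames: torch.Tensor):
        # frames: (B, T, 3, H, W); unposed and, per the paper's claim, unordered.
        b, t, c, h, w = frames.shape
        feats = self.backbone(frames.reshape(b * t, c, h, w))
        pointmaps = self.pointmap_head(feats).reshape(b, t, 4, h, w)
        tracks = self.track_head(feats).reshape(b, t, 2, h, w)
        return pointmaps, tracks


if __name__ == "__main__":
    model = JointTrackRecon()
    pm, tr = model(torch.randn(1, 4, 3, 64, 64))
    print(pm.shape, tr.shape)  # (1, 4, 4, 64, 64) and (1, 4, 2, 64, 64)
```

The load-bearing premise above is precisely whether a single shared feature tensor carries enough information for both heads at once.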
Original abstract
Current methods for dense 3D point tracking in dynamic scenes typically rely on pairwise processing, require known camera poses, or assume temporal ordering of input frames, thereby constraining their flexibility and applicability. Additionally, recent advances have successfully enabled efficient 3D reconstruction from large-scale, unposed image collections, underscoring opportunities for unified approaches to dynamic scene understanding. Motivated by this, we propose DePT3R, a novel framework that simultaneously performs dense point tracking and 3D reconstruction of dynamic scenes from multiple images in a single forward pass. This multi-task learning is achieved by extracting deep spatio-temporal features with a powerful backbone and regressing pixel-wise maps with dense prediction heads. Crucially, DePT3R operates without requiring camera poses, substantially enhancing its adaptability and efficiency, especially important in dynamic environments with rapid changes. We validate DePT3R on several challenging benchmarks involving dynamic scenes, demonstrating strong performance and significant improvements in memory efficiency over existing state-of-the-art methods. Data and codes are available via the open repository: https://github.com/StructuresComp/DePT3R
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DePT3R, a framework that jointly performs dense point tracking and 3D reconstruction of dynamic scenes from multiple unposed images in a single forward pass. It extracts spatio-temporal features via a backbone network and regresses pixel-wise maps using dense prediction heads, claiming strong benchmark performance and memory efficiency gains without requiring camera poses or temporal ordering.
Significance. If the joint single-pass regression holds, the work could meaningfully advance dynamic scene understanding by unifying tasks typically addressed separately, with practical benefits for efficiency in video and robotics applications.
major comments (2)
- [§3.2] The backbone feature extractor is presented as sufficient to resolve motion-structure ambiguities for both long-range tracking and metric 3D regression without poses, but no capacity analysis, feature visualization, or ablation on shared representation quality is provided; this directly bears on whether the claimed unification is achieved or whether one task is merely approximated.
- [§4.2, Table 3] Performance is reported as strong, yet no quantitative breakdown of per-task error (tracking vs. reconstruction) or cross-task interference is shown, leaving open whether the single forward pass maintains accuracy on both outputs simultaneously in challenging dynamic cases.
minor comments (2)
- [§3.3] The method section would benefit from an explicit loss formulation for the 3D head to clarify how supervision works without camera poses; an illustrative sketch follows this list.
- [Figure 1] The overview figure could label the input frame count and output map resolutions for immediate clarity.
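As an illustration of what the requested formulation might look like (an assumption modeled on the DUSt3R/VGGT family that DePT3R builds on, not the paper's actual loss): with $\hat{X}_i$ the predicted 3D point at pixel $i$, $X_i$ its ground truth, $C_i$ a predicted confidence, and $\alpha > 0$ a regularization weight,

$$\mathcal{L}_{\text{3D}} = \sum_{i \in \mathcal{P}} \left( C_i \,\bigl\lVert \hat{X}_i - X_i \bigr\rVert_2 - \alpha \log C_i \right),$$

where $\mathcal{P}$ is the set of valid pixels and both pointmaps are scale-normalized (e.g., by their mean distance to the origin), which is how this family of models supervises geometry without camera poses.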
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major comment point by point below.
Point-by-point responses
- Referee [§3.2]: The backbone feature extractor is presented as sufficient to resolve motion-structure ambiguities for both long-range tracking and metric 3D regression without poses, but no capacity analysis, feature visualization, or ablation on shared representation quality is provided; this directly bears on whether the claimed unification is achieved or whether one task is merely approximated.
Authors: We agree that additional analysis of the shared backbone would help substantiate the unification. In the revised manuscript we will add backbone capacity analysis, feature visualizations illustrating how spatio-temporal features jointly capture motion and structure, and an ablation on shared representation quality to demonstrate that both tasks are supported without one being approximated. revision: yes
- Referee [§4.2, Table 3]: Performance is reported as strong, yet no quantitative breakdown of per-task error (tracking vs. reconstruction) or cross-task interference is shown, leaving open whether the single forward pass maintains accuracy on both outputs simultaneously in challenging dynamic cases.
Authors: We will incorporate a quantitative per-task error breakdown (tracking versus reconstruction) and an analysis of cross-task interference in the revision. This will include targeted experiments on challenging dynamic scenes to confirm that the single forward pass maintains accuracy on both outputs. revision: yes
Circularity Check
No circularity: standard multi-task regression from backbone features
full rationale
The paper proposes DePT3R as a neural architecture that extracts spatio-temporal features via a backbone network and regresses pixel-wise maps for tracking and 3D reconstruction using dense heads, all in one forward pass without poses. This is a conventional multi-task learning setup with no self-definitional equations, no fitted parameters renamed as predictions, and no load-bearing self-citations that reduce the central claim to its own inputs. Validation occurs via empirical benchmarks against external data rather than any closed derivation loop, and the approach does not invoke uniqueness theorems or ansatzes from prior author work to force its outputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Deep spatio-temporal features extracted by the backbone are sufficient to regress accurate pixel-wise tracking and reconstruction maps.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Tagged unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
Paper passage: "DePT3R jointly produces dense point tracks and reconstructs dynamic scenes from a sequence of RGB inputs with a single forward pass... leveraging VGGT... global aggregator module... DPT heads"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Yohann Cabon, Naila Murray, and Martin Humenberger. Virtual KITTI 2. arXiv preprint arXiv:2001.10773, 2020.
- [2] Shen Chen, Jiale Zhou, and Lei Li. Dense point clouds matter: Dust-GS for scene reconstruction from sparse viewpoints. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025.
- [3] Zequn Chen, Jiezhi Yang, and Heng Yang. PreF3R: Pose-free feed-forward 3D Gaussian splatting from variable-length image sequence. arXiv preprint arXiv:2411.16877, 2024.
- [4] Carl Doersch, Ankush Gupta, Larisa Markeeva, Adria Recasens, Lucas Smaira, Yusuf Aytar, Joao Carreira, Andrew Zisserman, and Yi Yang. TAP-Vid: A benchmark for tracking any point in a video. Advances in Neural Information Processing Systems, 2022.
- [5] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
- [6] Zhiwen Fan, Wenyan Cong, Kairun Wen, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, et al. InstantSplat: Unbounded sparse-view pose-free Gaussian splatting in 40 seconds. arXiv preprint arXiv:2403.20309, 2024.
- [7] Haiwen Feng, Junyi Zhang, Qianqian Wang, Yufei Ye, Pengcheng Yu, Michael J. Black, Trevor Darrell, and Angjoo Kanazawa. St4RTrack: Simultaneous 4D reconstruction and tracking in the world. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.
- [8] Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, et al. Kubric: A scalable dataset generator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
- [9] Adam W Harley, Zhaoyuan Fang, and Katerina Fragkiadaki. Particle video revisited: Tracking through occlusions using point trajectories. In European Conference on Computer Vision, 2022.
- [10] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press.
- [11] Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Ka Chun Cheung, Hongwei Qin, Jifeng Dai, and Hongsheng Li. FlowFormer: A transformer architecture for optical flow. In European Conference on Computer Vision, 2022.
- [12] Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, and Aleksander Holynski. Stereo4D: Learning how things move in 3D from internet stereo videos. arXiv preprint arXiv:2412.09621, 2024.
- [13] Hanbyul Joo, Hao Liu, Lei Tan, Lin Gui, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. Panoptic Studio: A massively multiview system for social motion capture. In Proceedings of the IEEE International Conference on Computer Vision, 2015.
- [14] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. DynamicStereo: Consistent dynamic depth from stereo videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
- [15] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. CoTracker: It is better to track together. In European Conference on Computer Vision, 2024.
- [16] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 2023.
- [17] Skanda Koppula, Ignacio Rocco, Yi Yang, Joe Heyward, João Carreira, Andrew Zisserman, Gabriel Brostow, and Carl Doersch. TAPVid-3D: A benchmark for tracking any point in 3D. In NeurIPS, 2024.
- [18] Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3D with MASt3R. In European Conference on Computer Vision (ECCV), 2024.
- [19] Yiqing Liang, Abhishek Badki, Hang Su, James Tompkin, and Orazio Gallo. Zero-shot monocular scene flow estimation in the wild. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.
- [20] Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duckworth. NeRF in the Wild: Neural radiance fields for unconstrained photo collections. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021.
- [21] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Transactions on Graphics (TOG), 2022.
- [22] Tuan Duc Ngo, Peiye Zhuang, Evangelos Kalogerakis, Chuang Gan, Sergey Tulyakov, Hsin-Ying Lee, and Chaoyang Wang. DELTA: Dense efficient long-range 3D tracking for any video. In The Thirteenth International Conference on Learning Representations, 2025.
- [23] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Jegou, Julien Mairal, Patr... 2024.
- [24] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021.
- [25] Brandon Smart, Chuanxia Zheng, Iro Laina, and Victor Adrian Prisacariu. Splatt3R: Zero-shot Gaussian splatting from uncalibrated image pairs. arXiv preprint arXiv:2408.13912, 2024.
- [26] Seungwon Song, Hyungtae Lim, Alex Junho Lee, and Hyun Myung. DynaVINS: A visual-inertial SLAM for dynamic environments. IEEE Robotics and Automation Letters, 2022.
- [27] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012.
- [28] Zachary Teed and Jia Deng. RAFT: Recurrent all-pairs field transforms for optical flow. In European Conference on Computer Vision, 2020.
- [29] Zachary Teed and Jia Deng. RAFT-3D: Scene flow using rigid-motion embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021.
- [30] Tuan-Anh Vu, Duc-Thanh Nguyen, Binh-Son Hua, Quang-Hieu Pham, and Sai-Kit Yeung. RFNet-4D: Joint object reconstruction and flow estimation from 4D point clouds. In ECCV, 2022.
- [31] Tuan-Anh Vu, Duc-Thanh Nguyen, Binh-Son Hua, Quang-Hieu Pham, and Sai-Kit Yeung. RFNet-4D++: Joint object reconstruction and flow estimation from 4D point clouds with cross-attention spatio-temporal features. Preprint, Research Square, DOI: 10.21203/rs.3.rs-4390361/v1, 2024.
- [32] Chuhua Wang, Md Alimoor Reza, Vibhas Vats, Yingnan Ju, Nikhil Thakurdesai, Yuchen Wang, David J Crandall, Soonheung Jung, and Jeongil Seo. Deep learning-based 3D reconstruction from multiple images: A survey. Neurocomputing.
- [33] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [34] Qianqian Wang, Yen-Yu Chang, Ruojin Cai, Zhengqi Li, Bharath Hariharan, Aleksander Holynski, and Noah Snavely. Tracking everything everywhere all at once. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
- [35] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3D perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.
- [36] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- [37] Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Sebastian Scherer. TartanAir: A dataset to push the limits of visual SLAM. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.
- [38] Yuxi Xiao, Qianqian Wang, Shangzhan Zhang, Nan Xue, Sida Peng, Yujun Shen, and Xiaowei Zhou. SpatialTracker: Tracking any 2D pixels in 3D space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
- [39] Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Yuri Makarov, Bingyi Kang, Xing Zhu, Hujun Bao, Yujun Shen, and Xiaowei Zhou. SpatialTrackerV2: Advancing 3D point tracking with explicit camera motion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6726–6737, 2025.
- [40] Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference, 2025.
- [41] Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No pose, no problem: Surprisingly simple 3D Gaussian splats from sparse unposed images. In The Thirteenth International Conference on Learning Representations, 2025.
- [42] Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-Splatting: Alias-free 3D Gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
- [43] Bowei Zhang, Lei Ke, Adam W Harley, and Katerina Fragkiadaki. TAPIP3D: Tracking any point in persistent 3D geometry. arXiv preprint arXiv:2504.14717, 2025.
- [44] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3R: A simple approach for estimating geometry in the presence of motion. In The Thirteenth International Conference on Learning Representations, 2025.
- [45] Songyan Zhang, Yongtao Ge, Jinyuan Tian, Guangkai Xu, Hao Chen, Chen Lv, and Chunhua Shen. POMATO: Marrying pointmap matching with temporal motion for dynamic 3D reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025.
- [46] Yang Zheng, Adam W Harley, Bokui Shen, Gordon Wetzstein, and Leonidas J Guibas. PointOdyssey: A large-scale synthetic dataset for long-term point tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023.
- [47] Jie Zhu, Bo Peng, Wanqing Li, Haifeng Shen, Zhe Zhang, and Jianjun Lei. Multi-view stereo with transformer. arXiv preprint arXiv:2112.00336, 2021.
discussion (0)