pith. machine review for the scientific record. sign in

arxiv: 2512.13122 · v2 · submitted 2025-12-15 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link

· Lean Theorem

DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass

Authors on Pith no claims yet

Pith reviewed 2026-05-16 21:52 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords dense point tracking3D reconstructiondynamic scenessingle forward passunposed imagesspatio-temporal featuresdense prediction heads
0
0 comments X

The pith

DePT3R performs dense point tracking and 3D reconstruction of dynamic scenes in one forward pass without camera poses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

DePT3R is a framework that takes multiple images of a dynamic scene and outputs both dense point tracks and a 3D reconstruction at the same time. It achieves this through a backbone network that pulls out spatio-temporal features, followed by dense prediction heads that regress pixel-wise maps for each task. The approach works in a single network pass and skips the need for known camera poses or assumptions about frame ordering. A reader would care because current methods often split the work into pairwise steps or demand extra pose data, which limits speed and flexibility in real moving environments. Combining the tasks this way points toward simpler, more efficient pipelines for video understanding.

Core claim

DePT3R extracts deep spatio-temporal features with a powerful backbone and regresses pixel-wise maps with dense prediction heads to jointly perform dense point tracking and 3D reconstruction of dynamic scenes from multiple images in a single forward pass, without requiring camera poses.

What carries the argument

Backbone network for spatio-temporal feature extraction paired with dense prediction heads that regress tracking and reconstruction maps.

Load-bearing premise

The backbone network extracts spatio-temporal features sufficient for accurate simultaneous regression of tracking and reconstruction maps across challenging dynamic scenes.

What would settle it

A dynamic scene benchmark test where the method produces lower-accuracy tracks or reconstructions than pose-aware alternatives on fast motion or heavy occlusion cases.

Figures

Figures reproduced from arXiv: 2512.13122 by Tuan-Anh Vu, Vivek Alumootil.

Figure 1
Figure 1. Figure 1: DePT3R achieves robust dense point tracking and reconstruction accuracy across unposed sequences while requiring less memory usage, highlighting the effectiveness of our approach for long-range, dynamic scenes. Abstract Current methods for dense 3D point tracking in dynamic scenes typically rely on pairwise processing, require known camera poses, or assume a temporal ordering to input frames, constraining … view at source ↗
Figure 2
Figure 2. Figure 2: Our proposed DePT3R framework. Our model first tokenizes each input frame with DINOv2 and augments every token with a global intrinsic embedding. A learnable query embedding is added to the tokens corresponding to the query frame. The alternating frame-wise and global self-attention blocks process the tokens. A dedicated camera head predicts both intrinsics and extrinsics, while DPT heads produce point map… view at source ↗
Figure 3
Figure 3. Figure 3: 3D Reconstruction and Point Tracking on the Stereo4D Dataset. Despite supervising point tracking on a small collection of unrealistic datasets with mostly minimal scene motion, our method exhibits strong generalization to real-world scenes [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative ablation study on the Panoptic Studio dataset. The 3D trajectories of scene points are projected onto the last frame. Without the intrinsic embedding, the model can identify the direction of motion but struggles to accurately place it and gauge its magnitude, resulting in significant errors in point trajectory estimation. 10 2 10 3 10 4 10 5 Number of Query Points 0 1 × 10 4 2 × 10 4 3 × 10 4 4… view at source ↗
Figure 5
Figure 5. Figure 5: GPU memory usage comparison between Spatial￾TrackerV2, VGGT and our DePT3R method across varying num￾bers of query points. SpatialTrackerV2 and VGGT exhibit a rapid increase in GPU memory consumption, exhausting the 48 GB memory limit at just 40k and 22.5k query points, respectively. In contrast, DePT3R efficiently handles 268k query points, requiring only 12 GB of memory. All experiments were performed on… view at source ↗
read the original abstract

Current methods for dense 3D point tracking in dynamic scenes typically rely on pairwise processing, require known camera poses, or assume temporal ordering of input frames, thereby constraining their flexibility and applicability. Additionally, recent advances have successfully enabled efficient 3D reconstruction from large-scale, unposed image collections, underscoring opportunities for unified approaches to dynamic scene understanding. Motivated by this, we propose DePT3R, a novel framework that simultaneously performs dense point tracking and 3D reconstruction of dynamic scenes from multiple images in a single forward pass. This multi-task learning is achieved by extracting deep spatio-temporal features with a powerful backbone and regressing pixel-wise maps with dense prediction heads. Crucially, DePT3R operates without requiring camera poses, substantially enhancing its adaptability and efficiency, especially important in dynamic environments with rapid changes. We validate DePT3R on several challenging benchmarks involving dynamic scenes, demonstrating strong performance and significant improvements in memory efficiency over existing state-of-the-art methods. Data and codes are available via the open repository: https://github.com/StructuresComp/DePT3R

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DePT3R, a framework that jointly performs dense point tracking and 3D reconstruction of dynamic scenes from multiple unposed images in a single forward pass. It extracts spatio-temporal features via a backbone network and regresses pixel-wise maps using dense prediction heads, claiming strong benchmark performance and memory efficiency gains without requiring camera poses or temporal ordering.

Significance. If the joint single-pass regression holds, the work could meaningfully advance dynamic scene understanding by unifying tasks typically addressed separately, with practical benefits for efficiency in video and robotics applications.

major comments (2)
  1. [§3.2] §3.2: The backbone feature extractor is presented as sufficient to resolve motion-structure ambiguities for both long-range tracking and metric 3D regression without poses, but no capacity analysis, feature visualization, or ablation on shared representation quality is provided; this directly bears on whether the claimed unification is achieved or if one task is approximated.
  2. [§4.2] §4.2, Table 3: Performance is reported as strong, yet no quantitative breakdown of per-task error (tracking vs. reconstruction) or cross-task interference is shown, leaving open whether the single forward pass maintains accuracy on both outputs simultaneously in challenging dynamic cases.
minor comments (2)
  1. [§3.3] The method section would benefit from an explicit loss formulation equation for the 3D head to clarify supervision without poses.
  2. [Figure 1] Figure 1 overview could label input frame count and output map resolutions for immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment point by point below.

read point-by-point responses
  1. Referee: [§3.2] §3.2: The backbone feature extractor is presented as sufficient to resolve motion-structure ambiguities for both long-range tracking and metric 3D regression without poses, but no capacity analysis, feature visualization, or ablation on shared representation quality is provided; this directly bears on whether the claimed unification is achieved or if one task is approximated.

    Authors: We agree that additional analysis of the shared backbone would help substantiate the unification. In the revised manuscript we will add backbone capacity analysis, feature visualizations illustrating how spatio-temporal features jointly capture motion and structure, and an ablation on shared representation quality to demonstrate that both tasks are supported without one being approximated. revision: yes

  2. Referee: [§4.2] §4.2, Table 3: Performance is reported as strong, yet no quantitative breakdown of per-task error (tracking vs. reconstruction) or cross-task interference is shown, leaving open whether the single forward pass maintains accuracy on both outputs simultaneously in challenging dynamic cases.

    Authors: We will incorporate a quantitative per-task error breakdown (tracking versus reconstruction) and an analysis of cross-task interference in the revision. This will include targeted experiments on challenging dynamic scenes to confirm that the single forward pass maintains accuracy on both outputs. revision: yes

Circularity Check

0 steps flagged

No circularity: standard multi-task regression from backbone features

full rationale

The paper proposes DePT3R as a neural architecture that extracts spatio-temporal features via a backbone network and regresses pixel-wise maps for tracking and 3D reconstruction using dense heads, all in one forward pass without poses. This is a conventional multi-task learning setup with no self-definitional equations, no fitted parameters renamed as predictions, and no load-bearing self-citations that reduce the central claim to its own inputs. Validation occurs via empirical benchmarks rather than any closed derivation loop. The approach is self-contained against external data and does not invoke uniqueness theorems or ansatzes from prior author work to force its outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that a standard backbone can produce features adequate for both tasks; no free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Deep spatio-temporal features extracted by the backbone are sufficient to regress accurate pixel-wise tracking and reconstruction maps
    Invoked to justify the multi-task regression heads operating on unposed inputs.

pith-pipeline@v0.9.0 · 5500 in / 1030 out tokens · 25348 ms · 2026-05-16T21:52:02.768035+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 2 internal anchors

  1. [1]

    Virtual KITTI 2

    Yohann Cabon, Naila Murray, and Martin Humenberger. Vir- tual kitti 2.arXiv preprint arXiv:2001.10773, 2020. 6

  2. [2]

    Dense point clouds matter: Dust-gs for scene reconstruction from sparse view- points

    Shen Chen, Jiale Zhou, and Lei Li. Dense point clouds matter: Dust-gs for scene reconstruction from sparse view- points. InICASSP 2025-2025 IEEE International Confer- ence on Acoustics, Speech and Signal Processing (ICASSP),

  3. [3]

    Pref3r: Pose- free feed-forward 3d gaussian splatting from variable-length image sequence.arXiv preprint arXiv:2411.16877, 2024

    Zequn Chen, Jiezhi Yang, and Heng Yang. Pref3r: Pose- free feed-forward 3d gaussian splatting from variable-length image sequence.arXiv preprint arXiv:2411.16877, 2024. 3

  4. [4]

    Tap-vid: A benchmark for track- ing any point in a video.Advances in Neural Information Processing Systems, 2022

    Carl Doersch, Ankush Gupta, Larisa Markeeva, Adria Re- casens, Lucas Smaira, Yusuf Aytar, Joao Carreira, Andrew Zisserman, and Yi Yang. Tap-vid: A benchmark for track- ing any point in a video.Advances in Neural Information Processing Systems, 2022. 3

  5. [5]

    Flownet: Learning optical flow with convolutional networks

    Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van Der Smagt, Daniel Cremers, and Thomas Brox. Flownet: Learning optical flow with convolutional networks. InPro- ceedings of the IEEE International Conference on Computer Vision, 2015. 3

  6. [6]

    Instantsplat: Un- bounded sparse-view pose-free gaussian splatting in 40 sec- onds.arXiv preprint arXiv:2403.20309, 2024

    Zhiwen Fan, Wenyan Cong, Kairun Wen, Kevin Wang, Jian Zhang, Xinghao Ding, Danfei Xu, Boris Ivanovic, Marco Pavone, Georgios Pavlakos, et al. Instantsplat: Un- bounded sparse-view pose-free gaussian splatting in 40 sec- onds.arXiv preprint arXiv:2403.20309, 2024. 3

  7. [7]

    Black, Trevor Darrell, and Angjoo Kanazawa

    Haiwen Feng, Junyi Zhang, Qianqian Wang, Yufei Ye, Pengcheng Yu, Michael J. Black, Trevor Darrell, and Angjoo Kanazawa. St4rtrack: Simultaneous 4d reconstruction and tracking in the world. InProceedings of the IEEE/CVF In- ternational Conference on Computer Vision, 2025. 3, 4, 5, 6, 7

  8. [8]

    Kubric: A scalable dataset generator

    Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J Fleet, Dan Gnanapra- gasam, Florian Golemo, Charles Herrmann, et al. Kubric: A scalable dataset generator. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,

  9. [9]

    Particle video revisited: Tracking through occlusions using point trajectories

    Adam W Harley, Zhaoyuan Fang, and Katerina Fragkiadaki. Particle video revisited: Tracking through occlusions using point trajectories. InEuropean Conference on Computer Vi- sion, 2022. 3

  10. [10]

    Cambridge university press,

    Richard Hartley and Andrew Zisserman.Multiple view ge- ometry in computer vision. Cambridge university press,

  11. [11]

    Flowformer: A transformer architecture for optical flow

    Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Ka Chun Cheung, Hongwei Qin, Jifeng Dai, and Hongsheng Li. Flowformer: A transformer architecture for optical flow. InEuropean Conference on Computer Vision, 2022. 3

  12. [12]

    Stereo4d: Learning how things move in 3d from internet stereo videos.arXiv preprint arXiv:2412.09621, 2024

    Linyi Jin, Richard Tucker, Zhengqi Li, David Fouhey, Noah Snavely, and Aleksander Holynski. Stereo4d: Learning how things move in 3d from internet stereo videos.arXiv preprint arXiv:2412.09621, 2024. 4, 6

  13. [13]

    Panoptic studio: A massively multiview system for social motion capture

    Hanbyul Joo, Hao Liu, Lei Tan, Lin Gui, Bart Nabbe, Iain Matthews, Takeo Kanade, Shohei Nobuhara, and Yaser Sheikh. Panoptic studio: A massively multiview system for social motion capture. InProceedings of the IEEE Interna- tional Conference on Computer Vision, 2015. 6

  14. [14]

    Dy- namicstereo: Consistent dynamic depth from stereo videos

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Dy- namicstereo: Consistent dynamic depth from stereo videos. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 6

  15. [15]

    Co- tracker: It is better to track together

    Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Co- tracker: It is better to track together. InEuropean Conference on Computer Vision, 2024. 3

  16. [16]

    3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics,

    Bernhard Kerbl, Georgios Kopanas, Thomas Leimk ¨uhler, and George Drettakis. 3d gaussian splatting for real-time radiance field rendering.ACM Transactions on Graphics,

  17. [17]

    Tapvid-3d: A benchmark for tracking any point in 3d

    Skanda Koppula, Ignacio Rocco, Yi Yang, Joe Heyward, Jo˜ao Carreira, Andrew Zisserman, Gabriel Brostow, and Carl Doersch. Tapvid-3d: A benchmark for tracking any point in 3d. InNeurIPS, 2024. 6

  18. [18]

    Ground- ing image matching in 3d with MASt3R

    Vincent Leroy, Yohann Cabon, and Jerome Revaud. Ground- ing image matching in 3d with MASt3R. InEuropean Con- ference on Computer Vision (ECCV), 2024. 3, 6, 7

  19. [19]

    Zero-shot monocular scene flow estima- tion in the wild

    Yiqing Liang, Abhishek Badki, Hang Su, James Tompkin, and Orazio Gallo. Zero-shot monocular scene flow estima- tion in the wild. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025. 5

  20. [20]

    Ricardo Martin-Brualla, Noha Radwan, Mehdi S. M. Sajjadi, Jonathan T. Barron, Alexey Dosovitskiy, and Daniel Duck- worth. NeRF in the Wild: Neural Radiance Fields for Uncon- strained Photo Collections. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 2

  21. [21]

    Instant neural graphics primitives with a mul- tiresolution hash encoding.ACM Transactions on Graphics (TOG), 2022

    Thomas M ¨uller, Alex Evans, Christoph Schied, and Alexan- der Keller. Instant neural graphics primitives with a mul- tiresolution hash encoding.ACM Transactions on Graphics (TOG), 2022. 2

  22. [22]

    DELTA: DENSE EFFICIENT LONG- RANGE 3d TRACKING FOR ANY VIDEO

    Tuan Duc Ngo, Peiye Zhuang, Evangelos Kalogerakis, Chuang Gan, Sergey Tulyakov, Hsin-Ying Lee, and Chaoyang Wang. DELTA: DENSE EFFICIENT LONG- RANGE 3d TRACKING FOR ANY VIDEO. InThe Thir- teenth International Conference on Learning Representa- tions, 2025. 4

  23. [23]

    Maxime Oquab, Timoth ´ee Darcet, Th´eo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Je- gou, Julien Mairal, Patr...

  24. [24]

    Vi- sion transformers for dense prediction

    Ren ´e Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion transformers for dense prediction. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, 2021. 4

  25. [25]

    Splatt3R: Zero-shot Gaussian Splatting from Uncalibrated Image Pairs

    Brandon Smart, Chuanxia Zheng, Iro Laina, and Vic- tor Adrian Prisacariu. Splatt3r: Zero-shot gaussian 10 splatting from uncalibrated image pairs.arXiv preprint arXiv:2408.13912, 2024. 3

  26. [26]

    Dynavins: A visual-inertial slam for dynamic en- vironments.IEEE Robotics and Automation Letters, 2022

    Seungwon Song, Hyungtae Lim, Alex Junho Lee, and Hyun Myung. Dynavins: A visual-inertial slam for dynamic en- vironments.IEEE Robotics and Automation Letters, 2022. 2

  27. [27]

    A benchmark for the evalua- tion of rgb-d slam systems

    J ¨urgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evalua- tion of rgb-d slam systems. In2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012. 6

  28. [28]

    Raft: Recurrent all-pairs field transforms for optical flow

    Zachary Teed and Jia Deng. Raft: Recurrent all-pairs field transforms for optical flow. InEuropean Conference on Computer Vision, 2020. 3

  29. [29]

    Raft-3d: Scene flow using rigid- motion embeddings

    Zachary Teed and Jia Deng. Raft-3d: Scene flow using rigid- motion embeddings. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, 2021. 3

  30. [30]

    Rfnet-4d: Joint object re- construction and flow estimation from 4d point clouds

    Tuan-Anh Vu, Duc-Thanh Nguyen, Binh-Son Hua, Quang- Hieu Pham, and Sai-Kit Yeung. Rfnet-4d: Joint object re- construction and flow estimation from 4d point clouds. In ECCV, 2022. 2

  31. [31]

    Rfnet-4d++: Joint ob- ject reconstruction and flow estimation from 4d point clouds with cross-attention spatio-temporal features

    Tuan-Anh Vu, Duc-Thanh Nguyen, Binh-Son Hua, Quang- Hieu Pham, and Sai-Kit Yeung. Rfnet-4d++: Joint ob- ject reconstruction and flow estimation from 4d point clouds with cross-attention spatio-temporal features. InPREPRINT available at Research Square, DOI: 10.21203/rs.3.rs- 4390361/v1, 2024. 2

  32. [32]

    Deep learning-based 3d recon- struction from multiple images: A survey.Neurocomputing,

    Chuhua Wang, Md Alimoor Reza, Vibhas Vats, Yingnan Ju, Nikhil Thakurdesai, Yuchen Wang, David J Crandall, Soon- heung Jung, and Jeongil Seo. Deep learning-based 3d recon- struction from multiple images: A survey.Neurocomputing,

  33. [33]

    Vggt: Visual geometry grounded transformer

    Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. Vggt: Visual geometry grounded transformer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. 2, 3, 4, 5, 6, 7

  34. [34]

    Tracking everything everywhere all at once

    Qianqian Wang, Yen-Yu Chang, Ruojin Cai, Zhengqi Li, Bharath Hariharan, Aleksander Holynski, and Noah Snavely. Tracking everything everywhere all at once. InProceedings of the IEEE/CVF International Conference on Computer Vi- sion, 2023. 3

  35. [35]

    Continuous 3d per- ception model with persistent state

    Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A Efros, and Angjoo Kanazawa. Continuous 3d per- ception model with persistent state. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025. 3

  36. [36]

    Dust3r: Geometric 3d vi- sion made easy

    Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. Dust3r: Geometric 3d vi- sion made easy. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2, 3, 4, 6, 7

  37. [37]

    Tartanair: A dataset to push the limits of visual slam

    Wenshan Wang, Delong Zhu, Xiangwei Wang, Yaoyu Hu, Yuheng Qiu, Chen Wang, Yafei Hu, Ashish Kapoor, and Se- bastian Scherer. Tartanair: A dataset to push the limits of visual slam. In2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020. 6

  38. [38]

    Spatialtracker: Tracking any 2d pixels in 3d space

    Yuxi Xiao, Qianqian Wang, Shangzhan Zhang, Nan Xue, Sida Peng, Yujun Shen, and Xiaowei Zhou. Spatialtracker: Tracking any 2d pixels in 3d space. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024. 3, 6, 7

  39. [39]

    Spatialtrackerv2: Advancing 3d point tracking with explicit camera motion

    Yuxi Xiao, Jianyuan Wang, Nan Xue, Nikita Karaev, Yuri Makarov, Bingyi Kang, Xing Zhu, Hujun Bao, Yujun Shen, and Xiaowei Zhou. Spatialtrackerv2: Advancing 3d point tracking with explicit camera motion. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 6726–6737, 2025. 8

  40. [40]

    Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass

    Jianing Yang, Alexander Sax, Kevin J Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. InProceedings of the Computer Vision and Pattern Recognition Conference, 2025. 2, 3

  41. [41]

    No pose, no prob- lem: Surprisingly simple 3d gaussian splats from sparse un- posed images

    Botao Ye, Sifei Liu, Haofei Xu, Xueting Li, Marc Pollefeys, Ming-Hsuan Yang, and Songyou Peng. No pose, no prob- lem: Surprisingly simple 3d gaussian splats from sparse un- posed images. InThe Thirteenth International Conference on Learning Representations, 2025. 5

  42. [42]

    Mip-splatting: Alias-free 3d gaussian splat- ting

    Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-splatting: Alias-free 3d gaussian splat- ting. InProceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition (CVPR), 2024. 2

  43. [43]

    Tapip3d: Tracking any point in persistent 3d geome- try.arXiv preprint arXiv:2504.14717, 2025

    Bowei Zhang, Lei Ke, Adam W Harley, and Katerina Fragki- adaki. Tapip3d: Tracking any point in persistent 3d geome- try.arXiv preprint arXiv:2504.14717, 2025. 4

  44. [44]

    MonST3r: A simple approach for estimating geometry in the presence of motion

    Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jam- pani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming- Hsuan Yang. MonST3r: A simple approach for estimating geometry in the presence of motion. InThe Thirteenth Inter- national Conference on Learning Representations, 2025. 3, 6, 7

  45. [45]

    Pomato: Marrying pointmap matching with temporal motion for dynamic 3d re- construction

    Songyan Zhang, Yongtao Ge, Jinyuan Tian, Guangkai Xu, Hao Chen, Chen Lv, and Chunhua Shen. Pomato: Marrying pointmap matching with temporal motion for dynamic 3d re- construction. InProceedings of the IEEE/CVF International Conference on Computer Vision, 2025. 4

  46. [46]

    Pointodyssey: A large-scale synthetic dataset for long-term point tracking

    Yang Zheng, Adam W Harley, Bokui Shen, Gordon Wet- zstein, and Leonidas J Guibas. Pointodyssey: A large-scale synthetic dataset for long-term point tracking. InProceed- ings of the IEEE/CVF International Conference on Com- puter Vision, 2023. 6

  47. [47]

    Multi-view stereo with transformer.arXiv preprint arXiv:2112.00336, 2021

    Jie Zhu, Bo Peng, Wanqing Li, Haifeng Shen, Zhe Zhang, and Jianjun Lei. Multi-view stereo with transformer.arXiv preprint arXiv:2112.00336, 2021. 3 11