pith. sign in

arxiv: 2605.17303 · v1 · pith:XFPJH7Z6new · submitted 2026-05-17 · 💻 cs.CV

LongDPM: Overlap-Aware 4D Reconstruction from Long Monocular Videos

Pith reviewed 2026-05-20 14:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords 4D reconstructionlong monocular videodynamic scene reconstructionoverlap registrationdense trackingcamera pose estimationmonocular dynamic reconstruction
0
0 comments X

The pith

LongDPM reconstructs consistent dynamic 3D scenes from long monocular videos by aligning overlapping short chunks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tackles recovering dynamic 3D geometry, camera motion, and correspondences from extended monocular videos, where prior methods either limit themselves to short clips or fail to produce dense reconstructions. LongDPM splits the input into overlapping chunks so that local inference stays memory-bounded. These chunks are then aligned by confidence-weighted registration that uses static-aware overlap abstraction to merge their coordinate systems. Dynamic object identities are matched and their trajectories fused across boundaries, producing a single coherent 4D sequence. The resulting system reports lower dense tracking error than V-DPM on multiple synthetic benchmarks and the lowest camera-pose error on TUM-dynamics.

Core claim

LongDPM is an overlap-aware framework that recovers dynamic 3D scenes from long monocular videos by processing them in overlapping chunks, connecting the chunk-local coordinate systems through confidence-weighted registration with static-aware overlap abstraction, and associating dynamic identities across chunk boundaries to fuse matched trajectories into coherent long-range 3D motion.

What carries the argument

The overlap-aware registration step that performs confidence-weighted alignment of static-aware abstractions extracted from chunk overlaps to link local coordinate systems.

If this is right

  • Dense tracking endpoint error is reduced relative to V-DPM on PointOdyssey, Kubric-F, and Kubric-G.
  • Camera-pose absolute trajectory error reaches its lowest reported value on TUM-dynamics.
  • Arbitrary-length videos can be processed while keeping peak memory fixed by the chosen chunk length.
  • Dynamic object trajectories remain coherent across the entire sequence after cross-boundary fusion.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same chunk-overlap strategy could be tested on real-world robotics sequences to check whether static scene anchors continue to stabilize alignment under uncontrolled lighting.
  • Extending the static-aware abstraction to include slowly moving background elements might further reduce drift in crowded scenes.
  • The approach implies that long-term 4D consistency can be achieved without a single global optimization pass if local registrations are sufficiently reliable.

Load-bearing premise

The method assumes that confidence-weighted registration of overlapping chunks using static-aware abstraction can reliably connect local coordinate systems without accumulating large errors in scenes with significant dynamic motion or changing lighting.

What would settle it

Large accumulated drift in reconstructed 3D trajectories or camera poses on a long video containing rapid object motion and varying illumination would indicate that the chunk-linking step fails to maintain consistency.

Figures

Figures reproduced from arXiv: 2605.17303 by Chao Yang, Chenyi Xu, Fangli Guan, Jianhui Zhang, Liqi Yan, Pan Li, Yihao Wu.

Figure 1
Figure 1. Figure 1: LongDPM for long-range dynamic reconstruction. (a) Chunk-wise long-sequence 3D reconstruction works well for static videos, but fails to maintain motion consistency in dynamic scenes. (b) Short-context 4D reconstructors recover dynamic point maps within local clips, yet are difficult to scale directly to long videos. (c) LongDPM bridges this gap with dynamic￾aware chunk-wise alignment, enabling long-range … view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the LongDPM pipeline. A long monocular video is divided into overlapping chunks for bounded-memory [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Memory and runtime scalability as the number of [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Cross-chunk reconstruction comparison. Base-window inference reconstructs video chunks independently and produces inconsistent geometry and camera poses across chunks. Overlap registration reduces this misalignment by aligning adjacent chunks through shared frames, but can still be unstable under dynamic motion and weak overlap. LongDPM uses static-aware overlap abstraction and dynamic association to recov… view at source ↗
read the original abstract

Recovering a dynamic 3D scene from a long monocular video is crucial for dense geometry, camera motion, and temporal correspondence to remain consistent in a shared coordinate system. Existing methods face two key challenges: (1) feed-forward reconstruction models provide accurate local predictions but are limited to short clips, and (2) long-range trackers preserve correspondences without producing dense sequence-level reconstruction. This paper presents LongDPM, a novel overlap-aware framework for scalable long-range monocular dynamic reconstruction. First, LongDPM processes long videos in overlapping chunks, keeping inference memory bounded by the chunk length. Second, it connects chunk-local coordinate systems through confidence-weighted registration with static-aware overlap abstraction. Third, it associates dynamic identities across chunk boundaries and fuses matched trajectories to recover coherent long-range 3D motion. Experimental results demonstrate that LongDPM achieves superior long-range reconstruction and tracking performance, reducing dense tracking EPE over V-DPM on PointOdyssey, Kubric-F, and Kubric-G, while obtaining the best TUM-dynamics ATE for camera pose estimation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper presents LongDPM, an overlap-aware framework for scalable 4D reconstruction from long monocular videos. It processes videos in overlapping chunks to bound memory, connects local coordinate systems via confidence-weighted registration using static-aware overlap abstraction, associates dynamic identities across boundaries, and fuses trajectories for coherent long-range 3D motion and camera poses. Experiments report reduced dense tracking EPE versus V-DPM on PointOdyssey, Kubric-F, and Kubric-G, plus best TUM-dynamics ATE.

Significance. If the central claims hold, the work is significant for enabling practical long-range dynamic scene reconstruction without prohibitive memory costs. It usefully combines feed-forward local models with overlap-based global alignment, addressing a key scalability gap in monocular 4D vision. The static-aware abstraction and trajectory fusion ideas are practical and could influence downstream tasks in robotics and AR.

major comments (2)
  1. [Methods (overlap registration)] Methods section on static-aware overlap abstraction and confidence-weighted registration: the central claim of drift-free global alignment rests on the abstraction correctly separating static structure from dynamic content in overlaps. No quantitative analysis or failure-case ablation is provided for sequences where dynamic objects occupy large portions of the overlap (as in many PointOdyssey and Kubric clips), leaving open the risk that mislabeled points bias the rigid registration and accumulate across chunk boundaries.
  2. [Experiments] Experiments, Table reporting EPE and ATE: performance gains are stated without error bars, multiple runs, or statistical tests. This makes it difficult to judge whether the reported reductions in dense tracking EPE and TUM-dynamics ATE are robust or could be explained by favorable chunk overlaps or lighting conditions.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly state the typical chunk length and overlap ratio used in experiments, as these directly affect the claimed memory bound and registration reliability.
  2. [Methods] Notation for the confidence weighting in the registration step is introduced without a clear equation reference or pseudocode, making the exact fusion procedure harder to reproduce.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful comments and positive evaluation of our work's significance. We address each major comment below, providing clarifications and describing planned revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: Methods section on static-aware overlap abstraction and confidence-weighted registration: the central claim of drift-free global alignment rests on the abstraction correctly separating static structure from dynamic content in overlaps. No quantitative analysis or failure-case ablation is provided for sequences where dynamic objects occupy large portions of the overlap (as in many PointOdyssey and Kubric clips), leaving open the risk that mislabeled points bias the rigid registration and accumulate across chunk boundaries.

    Authors: We agree that additional quantitative analysis of the static-aware overlap abstraction would strengthen the claims, especially for overlaps with high dynamic content. In the revised manuscript we will add an ablation measuring static-point identification accuracy as a function of dynamic ratio in overlaps (using ground-truth labels available in PointOdyssey and Kubric), together with selected failure-case visualizations. We will also show that the confidence weighting already limits the influence of mislabeled dynamic points on the rigid transform, thereby reducing drift accumulation. revision: yes

  2. Referee: Experiments, Table reporting EPE and ATE: performance gains are stated without error bars, multiple runs, or statistical tests. This makes it difficult to judge whether the reported reductions in dense tracking EPE and TUM-dynamics ATE are robust or could be explained by favorable chunk overlaps or lighting conditions.

    Authors: We acknowledge that variability measures improve interpretability. Because the pipeline is deterministic given fixed model weights and input, repeated random-seed runs are not meaningful. In the revision we will report per-sequence standard deviations as error bars on the aggregate EPE and ATE figures and will add a simple paired statistical test (Wilcoxon signed-rank) across sequences to quantify the consistency of the observed improvements. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation chain remains self-contained

full rationale

The paper presents LongDPM as an overlap-aware pipeline that splits long videos into chunks, performs local reconstruction, then registers chunks using confidence-weighted static-aware abstraction and fuses trajectories. No equations, fitted parameters, or self-citations are shown that would reduce the reported EPE or ATE gains to definitions or inputs by construction. Performance is evaluated on external datasets (PointOdyssey, Kubric, TUM-dynamics) against baselines such as V-DPM, allowing independent verification outside any internal normalization or renaming. The central claims therefore rest on empirical comparison rather than tautological reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; all technical details remain at the level of high-level method description.

pith-pipeline@v0.9.0 · 5736 in / 1120 out tokens · 38705 ms · 2026-05-20T14:30:56.397022+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 11 internal anchors

  1. [1]

    Alumootil, V.; and Vu, T.-A. 2025. DePT3R: Joint Dense Point Tracking and 3D Reconstruction of Dynamic Scenes in a Single Forward Pass. arXiv preprint arXiv:2512.13122

  2. [2]

    J.; Wulff, J.; Stanley, G

    Butler, D. J.; Wulff, J.; Stanley, G. B.; and Black, M. J. 2012. A naturalistic open source movie for optical flow evaluation. In European conference on computer vision, 611--625. Springer

  3. [3]

    Geometric Context Transformer for Streaming 3D Reconstruction

    Chen, L.-Z.; Gao, J.; Chen, Y.; Cheng, K. L.; Sun, Y.; Hu, L.; Xue, N.; Zhu, X.; Shen, Y.; Yao, Y.; et al. 2026. Geometric Context Transformer for Streaming 3D Reconstruction. arXiv preprint arXiv:2604.14141

  4. [4]

    Chen, X.; Chen, Y.; Xiu, Y.; Geiger, A.; and Chen, A. 2025 a . Ttt3r: 3d reconstruction as test-time training. arXiv preprint arXiv:2509.26645

  5. [5]

    Chen, Z.; Qin, M.; Yuan, T.; Liu, Z.; and Zhao, H. 2025 b . Long3r: Long sequence streaming 3d reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 5273--5284

  6. [6]

    Cho, S.; Huang, J.; Nam, J.; An, H.; Kim, S.; and Lee, J.-Y. 2024. Local all-pair correspondence for point tracking. In European conference on computer vision, 306--325. Springer

  7. [7]

    Deng, K.; Ti, Z.; Xu, J.; Yang, J.; and Xie, J. 2025. VGGT-Long: Chunk it, Loop it, Align it--Pushing VGGT's Limits on Kilometer-scale Long RGB Sequences. arXiv preprint arXiv:2507.16443

  8. [8]

    Doersch, C.; Luc, P.; Yang, Y.; Gokay, D.; Koppula, S.; Gupta, A.; Heyward, J.; Rocco, I.; Goroshin, R.; Carreira, J.; et al. 2024. Bootstap: Bootstrapped training for tracking-any-point. In Proceedings of the Asian Conference on Computer Vision, 3257--3274

  9. [9]

    Doersch, C.; Yang, Y.; Vecerik, M.; Gokay, D.; Gupta, A.; Aytar, Y.; Carreira, J.; and Zisserman, A. 2023. Tapir: Tracking any point with per-frame initialization and temporal refinement. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10061--10072

  10. [10]

    Elflein, S.; Li, R.; Agostinho, S.; Gojcic, Z.; Leal-Taix \'e , L.; Zhou, Q.; and Osep, A. 2026. VGG-T ^3 : Offline Feed-Forward 3D Reconstruction at Scale. arXiv preprint arXiv:2602.23361

  11. [11]

    Elflein, S.; Zhou, Q.; and Leal-Taix \'e , L. 2025. Light3r-sfm: Towards feed-forward structure-from-motion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16774--16784

  12. [12]

    J.; Darrell, T.; and Kanazawa, A

    Feng, H.; Zhang, J.; Wang, Q.; Ye, Y.; Yu, P.; Black, M. J.; Darrell, T.; and Kanazawa, A. 2025. St4rtrack: Simultaneous 4d reconstruction and tracking in the world. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 8503--8513

  13. [13]

    J.; Gnanapragasam, D.; Golemo, F.; Herrmann, C.; Kipf, T.; Kundu, A.; Lagun, D.; Laradji, I.; Liu, H.-T

    Greff, K.; Belletti, F.; Beyer, L.; Doersch, C.; Du, Y.; Duckworth, D.; Fleet, D. J.; Gnanapragasam, D.; Golemo, F.; Herrmann, C.; Kipf, T.; Kundu, A.; Lagun, D.; Laradji, I.; Liu, H.-T. D.; Meyer, H.; Miao, Y.; Nowrouzezahrai, D.; Oztireli, C.; Pot, E.; Radwan, N.; Rebain, D.; Sabour, S.; Sajjadi, M. S. M.; Sela, M.; Sitzmann, V.; Stone, A.; Sun, D.; Vor...

  14. [14]

    Han, J.; An, H.; Jung, J.; Narihira, T.; Seo, J.; Fukuda, K.; Kim, C.; Hong, S.; Mitsufuji, Y.; and Kim, S. 2026. Enhancing 3D Reconstruction for Dynamic Scenes. Advances in Neural Information Processing Systems, 38: 1210--1234

  15. [15]

    W.; Fang, Z.; and Fragkiadaki, K

    Harley, A. W.; Fang, Z.; and Fragkiadaki, K. 2022. Particle Video Revisited: Tracking Through Occlusions Using Point Trajectories. In ECCV

  16. [16]

    Karaev, N.; Makarov, Y.; Wang, J.; Neverova, N.; Vedaldi, A.; and Rupprecht, C. 2025. Cotracker3: Simpler and better point tracking by pseudo-labelling real videos. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6013--6022

  17. [17]

    Karaev, N.; Rocco, I.; Graham, B.; Neverova, N.; Vedaldi, A.; and Rupprecht, C. 2024. Cotracker: It is better to track together. In European conference on computer vision, 18--35. Springer

  18. [18]

    MapAnything: Universal Feed-Forward Metric 3D Reconstruction

    Keetha, N.; M \"u ller, N.; Sch \"o nberger, J.; Porzi, L.; Zhang, Y.; Fischer, T.; Knapitsch, A.; Zauss, D.; Weber, E.; Antunes, N.; et al. 2025. Mapanything: Universal feed-forward metric 3d reconstruction. arXiv preprint arXiv:2509.13414

  19. [19]

    Leroy, V.; Cabon, Y.; and Revaud, J. 2024. Grounding image matching in 3d with mast3r. In European conference on computer vision, 71--91. Springer

  20. [20]

    Li, Z.; Tucker, R.; Cole, F.; Wang, Q.; Jin, L.; Ye, V.; Kanazawa, A.; Holynski, A.; and Snavely, N. 2025. Megasam: Accurate, fast and robust structure and motion from casual dynamic videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 10486--10496

  21. [21]

    Liang, H.; Ren, J.; Mirzaei, A.; Torralba, A.; Liu, Z.; Gilitschenski, I.; Fidler, S.; Oztireli, C.; Ling, H.; Gojcic, Z.; et al. 2024. Feed-forward bullet-time reconstruction of dynamic scenes from monocular videos. arXiv preprint arXiv:2412.03526

  22. [22]

    Depth Anything 3: Recovering the Visual Space from Any Views

    Lin, H.; Chen, S.; Liew, J.; Chen, D. Y.; Li, Z.; Shi, G.; Feng, J.; and Kang, B. 2025. Depth anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647

  23. [23]

    Trace anything: Representing any video in 4d via trajectory fields.arXiv preprint arXiv:2510.13802, 2025

    Liu, X.; Xiao, Y.; Chen, D. Y.; Feng, J.; Tai, Y.-W.; Tang, C.-K.; and Kang, B. 2025. Trace anything: Representing any video in 4d via trajectory fields. arXiv preprint arXiv:2510.13802

  24. [24]

    D.; Zhuang, P.; Gan, C.; Kalogerakis, E.; Tulyakov, S.; Lee, H.-Y.; and Wang, C

    Ngo, T. D.; Zhuang, P.; Gan, C.; Kalogerakis, E.; Tulyakov, S.; Lee, H.-Y.; and Wang, C. 2024. DELTA: Dense Efficient Long-range 3D Tracking for Any video. arXiv preprint arXiv:2410.24211

  25. [25]

    Raji c , F.; Xu, H.; Mihajlovic, M.; Li, S.; Demir, I.; G \"u ndo g du, E.; Ke, L.; Prokudin, S.; Pollefeys, M.; and Tang, S. 2025 a . Multi-view 3d point tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 59--68

  26. [26]

    Raji c , F.; Xu, H.; Mihajlovic, M.; Li, S.; Demir, I.; G \"u ndo g du, E.; Ke, L.; Prokudin, S.; Pollefeys, M.; and Tang, S. 2025 b . Multi-View 3D Point Tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

  27. [27]

    Sturm, J.; Engelhard, N.; Endres, F.; Burgard, W.; and Cremers, D. 2012. A benchmark for the evaluation of RGB-D SLAM systems. In 2012 IEEE/RSJ international conference on intelligent robots and systems, 573--580. IEEE

  28. [28]

    Sucar, E.; Insafutdinov, E.; Lai, Z.; and Vedaldi, A. 2026. V-DPM: 4D Video Reconstruction with Dynamic Point Maps. arXiv preprint arXiv:2601.09499

  29. [29]

    Sucar, E.; Lai, Z.; Insafutdinov, E.; and Vedaldi, A. 2025. Dynamic point maps: A versatile representation for dynamic 3d reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 7295--7305

  30. [30]

    Sun, P.; Kretzschmar, H.; Dotiwalla, X.; Chouard, A.; Patnaik, V.; Tsui, P.; Guo, J.; Zhou, Y.; Chai, Y.; Caine, B.; et al. 2020. Scalability in perception for autonomous driving: Waymo open dataset. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2446--2454

  31. [31]

    Teed, Z.; and Deng, J. 2020. Raft: Recurrent all-pairs field transforms for optical flow. In European conference on computer vision, 402--419. Springer

  32. [32]

    Tumanyan, N.; Singer, A.; Bagon, S.; and Dekel, T. 2024. Dino-tracker: Taming dino for self-supervised point tracking in a single video. In European Conference on Computer Vision, 367--385. Springer

  33. [33]

    Wang, J.; Chen, M.; Karaev, N.; Vedaldi, A.; Rupprecht, C.; and Novotny, D. 2025 a . Vggt: Visual geometry grounded transformer. In Proceedings of the Computer Vision and Pattern Recognition Conference, 5294--5306

  34. [34]

    Wang, Q.; Ye, V.; Gao, H.; Zeng, W.; Austin, J.; Li, Z.; and Kanazawa, A. 2025 b . Shape of motion: 4d reconstruction from a single video. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9660--9672

  35. [35]

    A.; and Kanazawa, A

    Wang, Q.; Zhang, Y.; Holynski, A.; Efros, A. A.; and Kanazawa, A. 2025 c . Continuous 3d perception model with persistent state. In Proceedings of the Computer Vision and Pattern Recognition Conference, 10510--10522

  36. [36]

    Wang, S.; Leroy, V.; Cabon, Y.; Chidlovskii, B.; and Revaud, J. 2024. Dust3r: Geometric 3d vision made easy. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 20697--20709

  37. [37]

    Wang, Y.; Zhou, J.; Zhu, H.; Chang, W.; Zhou, Y.; Li, Z.; Chen, J.; Pang, J.; Shen, C.; and He, T. 2025 d . ^3 : Permutation-Equivariant Visual Geometry Learning. arXiv preprint arXiv:2507.13347

  38. [38]

    Xiao, Y.; Wang, J.; Xue, N.; Karaev, N.; Makarov, Y.; Kang, B.; Zhu, X.; Bao, H.; Shen, Y.; and Zhou, X. 2025. Spatialtrackerv2: Advancing 3d point tracking with explicit camera motion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6726--6737

  39. [39]

    Xiao, Y.; Wang, Q.; Zhang, S.; Xue, N.; Peng, S.; Shen, Y.; and Zhou, X. 2024. Spatialtracker: Tracking any 2d pixels in 3d space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 20406--20417

  40. [40]

    Xie, T.; Yang, P.; Jin, Y.; Cai, Y.; Yin, W.; Ren, W.; Zhang, Q.; Hua, W.; Peng, S.; Guo, X.; et al. 2026. Scal3R: Scalable Test-Time Training for Large-Scale 3D Reconstruction. arXiv preprint arXiv:2604.08542

  41. [41]

    Xiong, Z.; Zhang, C.; Xu, Q.; and Tao, W. 2026. VGGT-Motion: Motion-Aware Calibration-Free Monocular SLAM for Long-Range Consistency. arXiv preprint arXiv:2602.05508

  42. [42]

    Xu, L.; Guo, L.; Jiang, C.; and Wang, C. 2026. PAS3R: Pose-Adaptive Streaming 3D Reconstruction for Long Video Sequences. arXiv preprint arXiv:2603.21436

  43. [43]

    J.; Henaff, M.; Tang, H.; Cao, A.; Chai, J.; Meier, F.; and Feiszli, M

    Yang, J.; Sax, A.; Liang, K. J.; Henaff, M.; Tang, H.; Cao, A.; Chai, J.; Meier, F.; and Feiszli, M. 2025. Fast3r: Towards 3d reconstruction of 1000+ images in one forward pass. In Proceedings of the Computer Vision and Pattern Recognition Conference, 21924--21935

  44. [44]

    TAPIP3D: Tracking any point in persistent 3d geom- etry,

    Zhang, B.; Ke, L.; Harley, A. W.; and Fragkiadaki, K. 2025 a . Tapip3d: Tracking any point in persistent 3d geometry. arXiv preprint arXiv:2504.14717

  45. [45]

    L.; Koppula, S.; Rocco, I.; Momeni, L.; Xie, J.; Sun, S.; Sukthankar, R.; Barral, J

    Zhang, C.; Moing, G. L.; Koppula, S.; Rocco, I.; Momeni, L.; Xie, J.; Sun, S.; Sukthankar, R.; Barral, J. K.; Hadsell, R.; et al. 2025 b . Efficiently reconstructing dynamic scenes one d4rt at a time. arXiv preprint arXiv:2512.08924

  46. [46]

    Zhang, J.; Herrmann, C.; Hur, J.; Jampani, V.; Darrell, T.; Cole, F.; Sun, D.; and Yang, M.-H. 2024. Monst3r: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825

  47. [47]

    Zhang, J.; Herrmann, C.; Hur, J.; Sun, C.; Yang, M.-H.; Cole, F.; Darrell, T.; and Sun, D. 2026. Loger: Long-context geometric reconstruction with hybrid memory. arXiv preprint arXiv:2603.03269

  48. [48]

    Zhang, S.; Ge, Y.; Tian, J.; Xu, G.; Chen, H.; Lv, C.; and Shen, C. 2025 c . POMATO: Marrying Pointmap Matching with Temporal Motions for Dynamic 3D Reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 5680--5689

  49. [49]

    Zhang, S.; Wang, J.; Xu, Y.; Xue, N.; Rupprecht, C.; Zhou, X.; Shen, Y.; and Wetzstein, G. 2025 d . Flare: Feed-forward geometry, appearance and camera estimation from uncalibrated sparse views. In Proceedings of the Computer Vision and Pattern Recognition Conference, 21936--21947

  50. [50]

    W.; Shen, B.; Wetzstein, G.; and Guibas, L

    Zheng, Y.; Harley, A. W.; Shen, B.; Wetzstein, G.; and Guibas, L. J. 2023. Pointodyssey: A large-scale synthetic dataset for long-term point tracking. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 19855--19865

  51. [51]

    Zhuo, D.; Zheng, W.; Guo, J.; Wu, Y.; Zhou, J.; and Lu, J. 2025. Streaming 4d visual geometry transformer. arXiv preprint arXiv:2507.11539