pith. machine review for the scientific record.

arxiv: 2605.12774 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 theorem links


WildPose: A Unified Framework for Robust Pose Estimation in the Wild

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:34 UTC · model grok-4.3

classification: 💻 cs.CV
keywords: monocular pose estimation · dynamic scenes · bundle adjustment · motion mask · 3D features · visual SLAM · unified framework

The pith

WildPose unifies monocular pose estimation to stay accurate in dynamic scenes without losing ground on static or low-motion ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WildPose as a single monocular framework that estimates camera pose reliably whether objects in the scene are moving or the environment is entirely static. It links a rich perceptual frontend from a pre-trained 3D model directly to end-to-end differentiable bundle adjustment. A 3D-aware update operator refines the pose while a separate high-capacity detector identifies moving regions, both drawing from the same frozen backbone features at multiple levels. This design removes the need for scene-specific retraining or separate pipelines for different motion regimes. Experiments across dynamic, static, and low-ego-motion benchmarks show the method outperforms prior specialized approaches in each category.

Core claim

WildPose connects feedforward 3D vision features with differentiable bundle adjustment through a 3D-aware update operator and a multi-level motion mask detector, both built on a frozen pre-trained backbone, to produce accurate camera poses in dynamic environments while preserving state-of-the-art results on static and low-ego-motion data.

What carries the argument

The 3D-aware update operator and high-capacity motion mask detector that share multi-level features from a frozen pre-trained MASt3R backbone to drive differentiable bundle adjustment.
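To make the data flow concrete, here is a minimal sketch of the pattern the summary describes: frozen backbone features feed both a pose-refinement update and a per-edge motion mask, and the mask downweights moving pixels in a bundle-adjustment-style step. Function names, feature shapes, and the toy update rules below are illustrative stand-ins, not the paper's architecture or API.

```python
import numpy as np

def frozen_backbone(frame, seed):
    """Stand-in for the frozen pre-trained encoder: multi-level features at 1/4, 1/8, 1/16 resolution."""
    h, w = frame.shape[:2]
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((h // s, w // s, 64)) for s in (4, 8, 16)]

def motion_mask(feats_i, feats_j):
    """Toy per-edge mask: 1 where mid-level features of the two frames roughly agree (static), 0 otherwise."""
    diff = np.linalg.norm(feats_i[1] - feats_j[1], axis=-1)
    return (diff < np.median(diff)).astype(float)

def update_operator(feats_i, feats_j, flow):
    """Toy stand-in for the learned 3D-aware update: nudge the current flow toward feature agreement."""
    return flow + 0.1 * (feats_j[1][..., :2] - feats_i[1][..., :2])

def masked_ba_step(pose, flow, mask, lr=0.01):
    """Toy bundle-adjustment-style update: pixels flagged as moving contribute nothing to the residual."""
    residual = float((mask[..., None] * flow).mean())
    return pose - lr * residual * np.eye(4)

frame_i, frame_j = np.zeros((64, 64, 3)), np.zeros((64, 64, 3))
feats_i, feats_j = frozen_backbone(frame_i, seed=0), frozen_backbone(frame_j, seed=1)
mask = motion_mask(feats_i, feats_j)                        # per-edge motion mask for the pair (i, j)
flow = update_operator(feats_i, feats_j, np.zeros((8, 8, 2)))
pose = masked_ba_step(np.eye(4), flow, mask)                # refined pose for the toy pair
```

In the paper the update operator and mask detector are learned modules and the BA layer is differentiable end to end; the closed-form toy updates here only mirror how the pieces connect.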

If this is right

  • It delivers higher accuracy than prior methods on dynamic benchmarks such as Wild-SLAM and Bonn.
  • It matches or exceeds state-of-the-art results on static benchmarks including TUM and 7-Scenes.
  • It maintains strong performance on low-ego-motion sequences such as Sintel.
  • It removes the requirement for per-sequence optimization or semantic segmentation when moving objects are present.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Pre-trained 3D features appear sufficient to support motion robustness across varied scene types without task-specific fine-tuning.
  • The same backbone-plus-optimization pattern could simplify other 3D tasks that must tolerate mixed static and dynamic content.
  • Real-time implementations or integration into larger mapping systems would test whether the current accuracy carries over under computational constraints.

Load-bearing premise

The frozen pre-trained backbone already contains enough general 3D information to let the update operator and mask detector handle new dynamic situations without any further training.

What would settle it

A new set of dynamic video sequences, with motion patterns absent from the training distribution, on which WildPose's accuracy falls below that of existing dynamic-aware methods while its static-scene performance remains unchanged.
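As a concrete reading of that criterion, here is a small sketch of how such a stress test could be scored, assuming per-split ATE numbers were collected; the split names, the baseline label, and the tolerance are hypothetical, not from the paper.

```python
def settles_against_wildpose(ate, static_tolerance=0.01):
    """ate: dict mapping (method, split) -> ATE in metres, e.g. ate[("WildPose", "ood_dynamic")].

    Returns True if the falsifying pattern described above is observed: WildPose loses to an
    existing dynamic-aware method on the unseen-motion split while its static-scene accuracy
    stays essentially unchanged from the reported numbers.
    """
    worse_on_unseen_motion = ate[("WildPose", "ood_dynamic")] > ate[("baseline", "ood_dynamic")]
    static_unchanged = abs(ate[("WildPose", "static")] - ate[("WildPose_reported", "static")]) <= static_tolerance
    return worse_on_unseen_motion and static_unchanged
```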

Figures

Figures reproduced from arXiv: 2605.12774 by Iro Armeni, Jianhao Zheng, Liyuan Zhu, Zihan Zhu.

Figure 1: WildPose. Left: Given a calibrated video sequence captured in dynamic environments, our method effectively detects the moving distractors and accurately estimates the camera trajectory, whereas methods relying on semantic segmentation fail to identify all dynamic elements. Right: Pose estimation performance across multiple SLAM benchmarks, where our method achieves superior and stable performance on both d…

Figure 2: System Overview. WildPose robustly estimates the camera trajectory from a monocular RGB sequence. We leverage 3D-aware features from the frozen MASt3R encoder [25], which are fed into our update operator. Concurrently, a motion mask detector generates motion masks from the backbone's multi-layer features. These outputs, combined with the metric depth prior [48], enable our Dense Bundle Adjustment layer to…

Figure 3: Visualization of Motion Masks. Top: Our per-edge masks (Frame i → j and i → k) resolve temporal ambiguity by capturing motion relative to a second frame, enabling fine-grained detection of inconsistencies along each frame-graph edge. Bottom: Per-frame masks from prior methods (WildGS-SLAM [60] and ViPE [17]) are shown for comparison; these approaches produce frame-level predictions that are unable to ident…

Figure 4: Visualization of Camera Trajectories. The estimated trajectory is colorized by the translation error (ATE). The extracted caption also carried a flattened depth-evaluation table, reconstructed here:

Method | Abs.Rel. ↓ | Log-RMSE ↓ | δ1.25 ↑
DA-v2 [55] | 0.16 | 0.24 | 91.1
PPD [52] | 0.15 | 0.24 | 94.6
DepthCrafter [16] | 0.19 | 0.26 | 86.8
Video-Depth-Anything [4] | 0.14 | 0.23 | 95.7
MegaSaM [27] | 0.13 | 0.23 | 94.5
WildPose (Ours) | 0.12 | 0.22 | 96.3

Figure 5: Architecture of Update Operator. The ConvGRU iteratively updates the hidden state from the image feature correlation, context features, and the current optical flow. The updated hidden state is further decoded to variables that will be used to guide pose and disparity estimation in the differentiable BA process.

Figure 6: Architecture of the flow feature and context encoders. Both encoders take the MASt3R features as input and output features at 1/8 of the image resolution. For the context encoder, the dimension of the last convolution layer is 256. The extraction also pulled in adjacent body text: "…predicted by the update operator f̂. All three losses are applied across the 15 BA steps, utilizing an increasing temporal weighting scheme w_k = γ^(15−k), where γ = 0.9 and k is …"

Figure 7: Limitations. We visualize sampled images from the Bonn RGB-D Dynamic Dataset [34] (Person sequence). The dataset has inconsistent exposure, which is challenging to our approach. The extraction also pulled in adjacent body text: "…superior tracking accuracy (lowest ATE) with a higher FPS. Our peak memory usage stems from foundation models (MoGe-2, MASt3R) but is mitigatable via preprocessing, similar to MegaSaM, or using a distilled model. 9. Limitations: WildPose's l…"
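The Figure 6 caption spill mentions an increasing temporal weighting over the 15 bundle-adjustment steps, w_k = γ^(15−k) with γ = 0.9. A minimal sketch of that schedule follows; the per-step loss values are placeholders, not the paper's actual loss terms.

```python
gamma, num_steps = 0.9, 15
weights = [gamma ** (num_steps - k) for k in range(1, num_steps + 1)]
# Early BA iterations are downweighted (w_1 ≈ 0.23) and the final one gets full weight (w_15 = 1.0).
per_step_losses = [1.0] * num_steps  # placeholders for the per-step flow/pose/disparity losses
total_loss = sum(w * l for w, l in zip(weights, per_step_losses))
```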
read the original abstract

Estimating camera pose in dynamic environments is a critical challenge, as most visual SLAM and SfM methods assume static scenes. While recent dynamic-aware methods exist, they are often not unified: semantic-based approaches are brittle, per-sequence optimization methods fail on short sequences, and other learned models may degrade on static-only scenes. We present WildPose, a unified monocular pose estimation framework that is robust in dynamic environments while maintaining state-of-the-art performance on static and low-ego-motion datasets. Our key insight is to connect two powerful paradigms in modern 3D vision: the rich perceptual frontend of feedforward models and the end-to-end optimization of differentiable bundle adjustment (BA). We achieve this with a 3D-aware update operator built on a frozen, pre-trained MASt3R feature backbone, together with a high-capacity motion mask detector that uses multi-level 3D-aware features from the same backbone. Extensive experiments show WildPose consistently outperforms prior methods across dynamic (Wild-SLAM, Bonn), static (TUM, 7-Scenes), and low-ego-motion (Sintel) benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces WildPose, a unified monocular pose estimation framework for dynamic environments. It connects feedforward 3D models with differentiable bundle adjustment via a 3D-aware update operator and high-capacity motion mask detector, both built on a frozen pre-trained MASt3R feature backbone. The method claims to deliver robust performance in dynamic scenes while preserving state-of-the-art results on static and low-ego-motion datasets, with consistent outperformance reported across Wild-SLAM, Bonn, TUM, 7-Scenes, and Sintel benchmarks.

Significance. If the empirical claims hold after detailed verification, the work would offer a practical bridge between perceptual feedforward models and optimization-based SLAM, potentially reducing the need for scene-specific retraining or semantic priors in wild settings. The use of a frozen backbone for both update and masking components is a notable design choice that could generalize if the 3D features prove sufficiently rich.

major comments (2)
  1. Abstract and §3 (architecture): The central claim of unified robustness without scene-specific retraining or failure modes in unseen dynamics rests on the frozen MASt3R backbone supplying sufficiently general multi-level 3D-aware features to both the update operator and motion mask detector. No ablation or analysis is provided on how these features handle novel non-rigid or fast dynamics outside MASt3R's typical training distribution, which directly bears on whether the differentiable BA can reliably suppress outliers without over-penalizing static structure.
  2. Abstract and §4 (experiments): The reported outperformance across mixed benchmarks (dynamic Wild-SLAM/Bonn, static TUM/7-Scenes, low-ego Sintel) is stated without reference to specific quantitative tables, error distributions, or failure-case analysis. This makes it impossible to assess the magnitude of gains or whether they stem from the proposed components versus baseline strengths.
minor comments (2)
  1. Abstract: The phrasing 'high-capacity motion mask detector' is used without defining capacity (e.g., parameter count or architecture depth) relative to prior mask predictors.
  2. Abstract: No mention of runtime or memory overhead compared to prior dynamic-aware methods, which would help evaluate practicality.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and describe the revisions that will be incorporated to improve the manuscript.

read point-by-point responses
  1. Referee: Abstract and §3 (architecture): The central claim of unified robustness without scene-specific retraining or failure modes in unseen dynamics rests on the frozen MASt3R backbone supplying sufficiently general multi-level 3D-aware features to both the update operator and motion mask detector. No ablation or analysis is provided on how these features handle novel non-rigid or fast dynamics outside MASt3R's typical training distribution, which directly bears on whether the differentiable BA can reliably suppress outliers without over-penalizing static structure.

    Authors: We agree that targeted analysis of feature generalization to novel dynamics would strengthen the central claim. Although MASt3R was pretrained on diverse data, the manuscript currently lacks an explicit ablation isolating the backbone's contribution on fast non-rigid motion. In the revised version we will add a new ablation subsection (in §3 or §4) that evaluates the 3D-aware update operator and motion mask detector on Wild-SLAM subsets containing rapid non-rigid dynamics, comparing the frozen MASt3R features against alternative backbones and reporting the resulting impact on outlier suppression within the differentiable BA. revision: yes

  2. Referee: Abstract and §4 (experiments): The reported outperformance across mixed benchmarks (dynamic Wild-SLAM/Bonn, static TUM/7-Scenes, low-ego Sintel) is stated without reference to specific quantitative tables, error distributions, or failure-case analysis. This makes it impossible to assess the magnitude of gains or whether they stem from the proposed components versus baseline strengths.

    Authors: We thank the referee for highlighting this clarity issue. While quantitative results (ATE/RPE) appear in Tables 1–4 of §4, the abstract and opening paragraphs of the experiments section do not explicitly reference these tables or provide error-distribution or failure-case discussion. We will revise the abstract to include brief quantitative highlights with table citations and expand §4 with a new paragraph (and accompanying figure) summarizing error distributions across benchmark categories together with representative failure cases from dynamic, static, and low-ego-motion sequences. revision: yes
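For readers unfamiliar with the headline metric the rebuttal cites, here is a hedged sketch of how ATE is conventionally computed for monocular trackers: the estimated trajectory is first aligned to ground truth with a least-squares similarity transform (Umeyama, reference [45] in the bibliography below), then the RMSE of the position errors is reported. This is the standard recipe, not the paper's exact evaluation code; the choice of alignment (SE(3) vs Sim(3), keyframes vs all frames) is the paper's to specify.

```python
import numpy as np

def umeyama_alignment(est, gt):
    """Least-squares Sim(3) alignment of est (N,3) onto gt (N,3); returns scale, rotation, translation."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    cov = g.T @ e / len(est)                       # 3x3 cross-covariance between the two point sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:   # reflection guard
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (e ** 2).sum(axis=1).mean()
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt):
    """Absolute trajectory error: RMSE of positions after Sim(3) alignment."""
    s, R, t = umeyama_alignment(est, gt)
    aligned = (s * (R @ est.T)).T + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))
```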

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper's framework connects an external frozen pre-trained MASt3R backbone to a 3D-aware update operator and motion mask detector, then validates unified robustness via empirical benchmark comparisons on independent datasets (Wild-SLAM, Bonn, TUM, 7-Scenes, Sintel). No equations, self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations are present that would make any claimed performance equivalent to the inputs by construction. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework rests on the domain assumption that MASt3R features transfer effectively to dynamic pose estimation and on the introduction of two new components whose independent value is asserted via benchmark gains.

axioms (1)
  • domain assumption: Pre-trained MASt3R provides rich perceptual 3D features usable without fine-tuning for both motion masking and pose updates.
    The backbone is kept frozen; success depends on this transfer property holding across the tested dynamic and static datasets.
invented entities (2)
  • 3D-aware update operator (no independent evidence)
    purpose: To connect feedforward features with differentiable bundle adjustment
    New module introduced to enable end-to-end optimization while leveraging the MASt3R backbone.
  • high-capacity motion mask detector (no independent evidence)
    purpose: To identify moving objects using multi-level 3D features from the same backbone
    New detector component proposed to handle dynamic scenes without semantic brittleness.

pith-pipeline@v0.9.0 · 5500 in / 1423 out tokens · 63638 ms · 2026-05-14T20:34:19.048067+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 5 internal anchors

  1. [1] Berta Bescos, José M. Fácil, Javier Civera, and José Neira. DynaSLAM: Tracking, Mapping and Inpainting in Dynamic Scenes. RAL, 2018.
  2. [2] Daniel J. Butler, Jonas Wulff, Garrett B. Stanley, and Michael J. Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision, pages 611–625. Springer, 2012.
  3. [3] Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, José M. M. Montiel, and Juan D. Tardós. ORB-SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM. IEEE Transactions on Robotics, 37(6):1874–1890.
  4. [4] Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video Depth Anything: Consistent depth estimation for super-long videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22831–22840, 2025.
  5. [5] Weirong Chen, Ganlin Zhang, Felix Wimbauer, Rui Wang, Nikita Araslanov, Andrea Vedaldi, and Daniel Cremers. Back on Track: Bundle adjustment for dynamic scene reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4951–4960, 2025.
  6. [6] Jiyu Cheng, Yuxiang Sun, and Max Q.-H. Meng. Improving monocular visual SLAM in dynamic environments: an optical-flow-based approach. Advanced Robotics, 2019.
  7. [7] Shuhong Cheng, Changhe Sun, Shijun Zhang, and Dianfan Zhang. SG-SLAM: A real-time RGB-D visual SLAM toward dynamic scenes with semantic and geometric information. IEEE Transactions on Instrumentation and Measurement, 2022.
  8. [8] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR.
  9. [9] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  10. [10] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision, pages 834–849. Springer, 2014.
  11. [11] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE TPAMI, 2017.
  12. [12] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. PAMI, 2017.
  13. [13] Christian Forster, Matia Pizzoli, and Davide Scaramuzza. SVO: Fast semi-direct monocular visual odometry. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 15–22. IEEE, 2014.
  14. [14] Lily Goli, Sara Sabour, Mark Matthews, Marcus A. Brubaker, Dmitry Lagun, Alec Jacobson, David J. Fleet, Saurabh Saxena, and Andrea Tagliasacchi. RoMo: Robust motion segmentation improves structure from motion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6155–6164, 2025.
  15. [15] Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J. Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, et al. Kubric: A scalable dataset generator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3749–3761, 2022.
  16. [16] Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, and Ying Shan. DepthCrafter: Generating consistent long depth sequences for open-world videos. arXiv preprint arXiv:2409.02095, 2024.
  17. [17] Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. ViPE: Video pose engine for 3D geometric perception. arXiv preprint arXiv:2508.10934, 2025.
  18. [18] Haochen Jiang, Yueming Xu, Kejie Li, Jianfeng Feng, and Li Zhang. RoDyn-SLAM: Robust dynamic dense RGB-D SLAM with neural radiance fields. IEEE Robotics and Automation Letters, 2024.
  19. [19] Masaya Kaneko, Kazuya Iwami, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. Mask-SLAM: Robust feature-based monocular SLAM by masking using semantic segmentation. In CVPR Workshops, 2018.
  20. [20] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. DynamicStereo: Consistent dynamic depth from stereo videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13229–13239, 2023.
  21. [21] Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Universal feed-forward metric 3D reconstruction. In …
  22. [22] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM TOG, 2023.
  23. [23] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment Anything. In ICCV, 2023.
  24. [24] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R. In European Conference on Computer Vision, pages 71–91. Springer, 2024.
  25. [25] Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3D with MASt3R, 2024.
  26. [26] Mingrui Li, Zhetao Guo, Tianchen Deng, Yiming Zhou, Yuxiang Ren, and Hongyu Wang. DDN-SLAM: Real-time dense dynamic neural implicit SLAM. IEEE Robotics and Automation Letters, 2025.
  27. [27] Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. MegaSaM: Accurate, fast, and robust structure and motion from casual dynamic videos. CVPR.
  28. [28] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024.
  29. [29] Dominic Maggio, Hyungtae Lim, and Luca Carlone. VGGT-SLAM: Dense RGB SLAM optimized on the SL(4) manifold. NeurIPS, 2025.
  30. [30] Hidenobu Matsuki, Riku Murai, Paul H. J. Kelly, and Andrew J. Davison. Gaussian Splatting SLAM. In CVPR, 2024.
  31. [31] Raul Mur-Artal and Juan D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 2017.
  32. [32] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D. Tardos. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 2015.
  33. [33] Riku Murai, Eric Dexheimer, and Andrew J. Davison. MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors. In CVPR, 2025.
  34. [34] E. Palazzolo, J. Behley, P. Lottes, P. Giguère, and C. Stachniss. ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals. In IROS, 2019.
  35. [35] Manthan Patel, Fan Yang, Yuheng Qiu, Cesar Cadena, Sebastian Scherer, Marco Hutter, and Wenshan Wang. TartanGround: A large-scale dataset for ground robot perception and navigation. arXiv preprint arXiv:2505.10696, 2025.
  36. [36] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021.
  37. [37] Erik Sandström, Keisuke Tateno, Michael Oechsle, Michael Niemeyer, Luc Van Gool, Martin R. Oswald, and Federico Tombari. Splat-SLAM: Globally optimized RGB-only SLAM with 3D Gaussians. arXiv preprint arXiv:2405.16544, 2024.
  38. [38] Nicolas Schischka, Hannah Schieber, Mert Asim Karaoglu, Melih Gorgulu, Florian Grötzner, Alexander Ladikos, Nassir Navab, Daniel Roth, and Benjamin Busam. DynaMoN: Motion-aware fast and robust camera localization for dynamic neural radiance fields. IEEE Robotics and Automation Letters.
  39. [39] Raluca Scona, Mariano Jaimez, Yvan R. Petillot, Maurice Fallon, and Daniel Cremers. StaticFusion: Background reconstruction for dense RGB-D SLAM in dynamic environments.
  40. [40] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2930–2937, 2013.
  41. [41] João Carlos Virgolino Soares, Marcelo Gattass, and Marco Antonio Meggiolaro. Crowd-SLAM: Visual SLAM towards crowded environments using object detection. Journal of Intelligent & Robotic Systems, 2021.
  42. [42] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In IROS, 2012.
  43. [43] Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. In NeurIPS, 2021.
  44. [44] The AirLab. TartanAir-V2 Dataset. https://tartanair.org, 2022. Accessed: 2025-10-28.
  45. [45] Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE TPAMI, 1991.
  46. [46] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In CVPR, pages 5294–5306, 2025.
  47. [47] Qianqian Wang*, Yifei Zhang*, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3D perception model with persistent state. In CVPR, 2025.
  48. [48] Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. MoGe-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint arXiv:2507.02546.
  49. [49] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In CVPR, 2024.
  50. [50] Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π³: Scalable permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347.
  51. [51] Felix Wimbauer, Nan Yang, Lukas von Stumberg, Niclas Zeller, and Daniel Cremers. MonoRec: Semi-supervised dense reconstruction in dynamic environments from a single moving camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6112–6122.
  52. [52] Gangwei Xu, Haotong Lin, Hongcheng Luo, Xianqi Wang, Jingfeng Yao, Lianghui Zhu, Yuechuan Pu, Cheng Chi, Haiyang Sun, Bing Wang, et al. Pixel-perfect depth with semantics-prompted diffusion transformers. arXiv preprint arXiv:2510.07316, 2025.
  53. [53] Yueming Xu, Haochen Jiang, Zhongyang Xiao, Jianfeng Feng, and Li Zhang. DG-SLAM: Robust dynamic Gaussian splatting SLAM with hybrid pose optimization. Advances in Neural Information Processing Systems, 37:51577–51596.
  54. [54] Yueming Xu, Haochen Jiang, Zhongyang Xiao, Jianfeng Feng, and Li Zhang. DG-SLAM: Robust Dynamic Gaussian Splatting SLAM with Hybrid Pose Optimization. In NeurIPS, 2024.
  55. [55] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. In CVPR, pages 10371–10381, 2024.
  56. [56] Ganlin Zhang, Erik Sandström, Youmin Zhang, Manthan Patel, Luc Van Gool, and Martin R. Oswald. Glorie-SLAM: Globally optimized RGB-only implicit encoding point cloud SLAM. arXiv preprint arXiv:2403.19549, 2024.
  57. [57] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3R: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825.
  58. [58] Zhoutong Zhang, Forrester Cole, Zhengqi Li, Michael Rubinstein, Noah Snavely, and William T. Freeman. Structure and motion from casual videos. In ECCV, pages 20–37. Springer.
  59. [59] Wang Zhao, Shaohui Liu, Hengkai Guo, Wenping Wang, and Yong-Jin Liu. ParticleSfM: Exploiting dense point trajectories for localizing moving cameras in the wild. In ECCV, pages 523–542. Springer, 2022.
  60. [60] Jianhao Zheng, Zihan Zhu, Valentin Bieri, Marc Pollefeys, Songyou Peng, and Iro Armeni. WildGS-SLAM: Monocular Gaussian splatting SLAM in dynamic environments. In CVPR, pages 11461–11471, 2025.
  61. [61] Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, et al. OmniWorld: A multi-domain and multi-modal dataset for 4D world modeling. arXiv preprint arXiv:2509.12201, 2025.
  62. [62] Liyuan Zhu, Yue Li, Erik Sandström, Konrad Schindler, and Iro Armeni. LoopSplat: Loop closure by registering 3D Gaussian splats. In 3DV, 2025.
  63. [63] Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hujun Bao, Zhaopeng Cui, Martin R. Oswald, and Marc Pollefeys. NICE-SLAM: Neural implicit scalable encoding for SLAM. In CVPR, 2022.
  64. [64] Zihan Zhu, Songyou Peng, Viktor Larsson, Zhaopeng Cui, Martin R. Oswald, Andreas Geiger, and Marc Pollefeys. NICER-SLAM: Neural implicit scene encoding for RGB SLAM. 2024.
  65. [65] Supplementary material outline: more information about the training dataset (Sec. 6).
  66. [66] Supplementary material outline: implementation details of WildPose, including more training details and model architecture (Sec. 7).
  67. [67] Supplementary material outline: additional results and discussion (Sec. 8).
  68. [68] Supplementary material outline: limitations and future work (Sec. 9).
  69. [69] Supplementary Sec. 6, Training Dataset: the model is trained on four publicly available datasets plus data generated with the Kubric simulator [15]; the training datasets encompass both static and dynamic environments (full list in Table 7), with TartanAir V2 [44] and TartanGround [35] primarily…
  70. [70] Supplementary Sec. 7.1, Additional Training Details: following [43], 7 frames are sampled per batch from the training sequence, the average optical flow magnitude between neighboring pairs is constrained to the range of 8 to 96 pixels, and standard data augmentation (photometric transformations, color jitte…) is applied to all frames.
  71. [71] Supplementary Sec. 8, Additional Results: full tracking results on the static TUM RGB-D [42] and 7-Scenes [40] datasets; the main paper summarizes average ATE, with per-sequence results in Table 8 (TUM RGB-D) and…
  72. [72] Supplementary Sec. 9, Limitations: WildPose's learnable modules are trained exclusively on synthetic data; although the curriculum is diverse, a domain gap to real-world scenarios inevitably exists, evident in sequences with unobserved phenomena such as the significant photometric variations in the Bonn RGB-D Dynamic Dataset (Fig. 7). Our model, lacking expli…

    Limitations WildPose’s learnable modules are trained exclusively on syn- thetic data. Although our curriculum is diverse, a domain gap to real-world scenarios inevitably exists. This gap is evi- dent in sequences with unobserved phenomena, such as the significant photometric variations in the Bonn RGB-D Dy- namic Dataset (Fig. 7). Our model, lacking expli...