pith. machine review for the scientific record.

arxiv: 2605.12774 · v1 · submitted 2026-05-12 · 💻 cs.CV

Recognition: 2 theorem links


WildPose: A Unified Framework for Robust Pose Estimation in the Wild

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:34 UTC · model grok-4.3

classification: 💻 cs.CV
keywords: monocular pose estimation · dynamic scenes · bundle adjustment · motion mask · 3D features · visual SLAM · unified framework

The pith

WildPose unifies monocular pose estimation to stay accurate in dynamic scenes without losing ground on static or low-motion ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents WildPose as a single monocular framework that estimates camera pose reliably whether objects in the scene are moving or the environment is entirely static. It links a rich perceptual frontend from a pre-trained 3D model directly to end-to-end differentiable bundle adjustment. A 3D-aware update operator refines the pose while a separate high-capacity detector identifies moving regions, both drawing from the same frozen backbone features at multiple levels. This design removes the need for scene-specific retraining or separate pipelines for different motion regimes. Experiments across dynamic, static, and low-ego-motion benchmarks show the method outperforms prior specialized approaches in each category.

Core claim

WildPose connects feedforward 3D vision features with differentiable bundle adjustment through a 3D-aware update operator and a multi-level motion mask detector, both built on a frozen pre-trained backbone, to produce accurate camera poses in dynamic environments while preserving state-of-the-art results on static and low-ego-motion data.

What carries the argument

The 3D-aware update operator and high-capacity motion mask detector that share multi-level features from a frozen pre-trained MASt3R backbone to drive differentiable bundle adjustment.
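To make the data flow concrete, here is a minimal sketch of the pattern the summary describes: frozen backbone features feed both a pose-refinement update and a per-edge motion mask, and the mask downweights moving pixels in a bundle-adjustment-style step. Function names, feature shapes, and the toy update rules below are illustrative stand-ins, not the paper's architecture or API.

```python
import numpy as np

def frozen_backbone(frame, seed):
    """Stand-in for the frozen pre-trained encoder: multi-level features at 1/4, 1/8, 1/16 resolution."""
    h, w = frame.shape[:2]
    rng = np.random.default_rng(seed)
    return [rng.standard_normal((h // s, w // s, 64)) for s in (4, 8, 16)]

def motion_mask(feats_i, feats_j):
    """Toy per-edge mask: 1 where mid-level features of the two frames roughly agree (static), 0 otherwise."""
    diff = np.linalg.norm(feats_i[1] - feats_j[1], axis=-1)
    return (diff < np.median(diff)).astype(float)

def update_operator(feats_i, feats_j, flow):
    """Toy stand-in for the learned 3D-aware update: nudge the current flow toward feature agreement."""
    return flow + 0.1 * (feats_j[1][..., :2] - feats_i[1][..., :2])

def masked_ba_step(pose, flow, mask, lr=0.01):
    """Toy bundle-adjustment-style update: pixels flagged as moving contribute nothing to the residual."""
    residual = float((mask[..., None] * flow).mean())
    return pose - lr * residual * np.eye(4)

frame_i, frame_j = np.zeros((64, 64, 3)), np.zeros((64, 64, 3))
feats_i, feats_j = frozen_backbone(frame_i, seed=0), frozen_backbone(frame_j, seed=1)
mask = motion_mask(feats_i, feats_j)                        # per-edge motion mask for the pair (i, j)
flow = update_operator(feats_i, feats_j, np.zeros((8, 8, 2)))
pose = masked_ba_step(np.eye(4), flow, mask)                # refined pose for the toy pair
```

In the paper the update operator and mask detector are learned modules and the BA layer is differentiable end to end; the closed-form toy updates here only mirror how the pieces connect.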

If this is right

  • It delivers higher accuracy than prior methods on dynamic benchmarks such as Wild-SLAM and Bonn.
  • It matches or exceeds state-of-the-art results on static benchmarks including TUM and 7-Scenes.
  • It maintains strong performance on low-ego-motion sequences such as Sintel.
  • It removes the requirement for per-sequence optimization or semantic segmentation when moving objects are present.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Pre-trained 3D features appear sufficient to support motion robustness across varied scene types without task-specific fine-tuning.
  • The same backbone-plus-optimization pattern could simplify other 3D tasks that must tolerate mixed static and dynamic content.
  • Real-time implementations or integration into larger mapping systems would test whether the current accuracy carries over under computational constraints.

Load-bearing premise

The frozen pre-trained backbone already contains enough general 3D information to let the update operator and mask detector handle new dynamic situations without any further training.

What would settle it

A new set of dynamic video sequences, with motion patterns absent from the training distribution, on which WildPose's accuracy falls below that of existing dynamic-aware methods while its static-scene performance remains unchanged.
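As a concrete reading of that criterion, here is a small sketch of how such a stress test could be scored, assuming per-split ATE numbers were collected; the split names, the baseline label, and the tolerance are hypothetical, not from the paper.

```python
def settles_against_wildpose(ate, static_tolerance=0.01):
    """ate: dict mapping (method, split) -> ATE in metres, e.g. ate[("WildPose", "ood_dynamic")].

    Returns True if the falsifying pattern described above is observed: WildPose loses to an
    existing dynamic-aware method on the unseen-motion split while its static-scene accuracy
    stays essentially unchanged from the reported numbers.
    """
    worse_on_unseen_motion = ate[("WildPose", "ood_dynamic")] > ate[("baseline", "ood_dynamic")]
    static_unchanged = abs(ate[("WildPose", "static")] - ate[("WildPose_reported", "static")]) <= static_tolerance
    return worse_on_unseen_motion and static_unchanged
```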

Figures

Figures reproduced from arXiv: 2605.12774 by Iro Armeni, Jianhao Zheng, Liyuan Zhu, Zihan Zhu.

Figure 1: WildPose. Left: Given a calibrated video sequence captured in dynamic environments, our method effectively detects the moving distractors and accurately estimates the camera trajectory, whereas methods relying on semantic segmentation fail to identify all dynamic elements. Right: Pose estimation performance across multiple SLAM benchmarks, where our method achieves superior and stable performance on both d…

Figure 2: System Overview. WildPose robustly estimates the camera trajectory from a monocular RGB sequence. We leverage 3D-aware features from the frozen MASt3R encoder [25], which are fed into our update operator. Concurrently, a motion mask detector generates motion masks from the backbone's multi-layer features. These outputs, combined with the metric depth prior [48], enable our Dense Bundle Adjustment layer to…

Figure 3: Visualization of Motion Masks. Top: Our per-edge masks (Frame i → j and i → k) resolve temporal ambiguity by capturing motion relative to a second frame, enabling fine-grained detection of inconsistencies along each frame-graph edge. Bottom: Per-frame masks from prior methods (WildGS-SLAM [60] and ViPE [17]) are shown for comparison; these approaches produce frame-level predictions that are unable to ident…

Figure 4: Visualization of Camera Trajectories. The estimated trajectory is colorized by the translation error (ATE). The extracted caption also carried a flattened depth-evaluation table, reconstructed here:

Method | Abs.Rel. ↓ | Log-RMSE ↓ | δ1.25 ↑
DA-v2 [55] | 0.16 | 0.24 | 91.1
PPD [52] | 0.15 | 0.24 | 94.6
DepthCrafter [16] | 0.19 | 0.26 | 86.8
Video-Depth-Anything [4] | 0.14 | 0.23 | 95.7
MegaSaM [27] | 0.13 | 0.23 | 94.5
WildPose (Ours) | 0.12 | 0.22 | 96.3

Figure 5: Architecture of Update Operator. The ConvGRU iteratively updates the hidden state from the image feature correlation, context features, and the current optical flow. The updated hidden state is further decoded to variables that will be used to guide pose and disparity estimation in the differentiable BA process.

Figure 6: Architecture of the flow feature and context encoders. Both encoders take the MASt3R features as input and output features at 1/8 of the image resolution. For the context encoder, the dimension of the last convolution layer is 256. The extraction also pulled in adjacent body text: "…predicted by the update operator f̂. All three losses are applied across the 15 BA steps, utilizing an increasing temporal weighting scheme w_k = γ^(15−k), where γ = 0.9 and k is …"

Figure 7: Limitations. We visualize sampled images from the Bonn RGB-D Dynamic Dataset [34] (Person sequence). The dataset has inconsistent exposure, which is challenging to our approach. The extraction also pulled in adjacent body text: "…superior tracking accuracy (lowest ATE) with a higher FPS. Our peak memory usage stems from foundation models (MoGe-2, MASt3R) but is mitigatable via preprocessing, similar to MegaSaM, or using a distilled model. 9. Limitations: WildPose's l…"
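The Figure 6 caption spill mentions an increasing temporal weighting over the 15 bundle-adjustment steps, w_k = γ^(15−k) with γ = 0.9. A minimal sketch of that schedule follows; the per-step loss values are placeholders, not the paper's actual loss terms.

```python
gamma, num_steps = 0.9, 15
weights = [gamma ** (num_steps - k) for k in range(1, num_steps + 1)]
# Early BA iterations are downweighted (w_1 ≈ 0.23) and the final one gets full weight (w_15 = 1.0).
per_step_losses = [1.0] * num_steps  # placeholders for the per-step flow/pose/disparity losses
total_loss = sum(w * l for w, l in zip(weights, per_step_losses))
```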
read the original abstract

Estimating camera pose in dynamic environments is a critical challenge, as most visual SLAM and SfM methods assume static scenes. While recent dynamic-aware methods exist, they are often not unified: semantic-based approaches are brittle, per-sequence optimization methods fail on short sequences, and other learned models may degrade on static-only scenes. We present WildPose, a unified monocular pose estimation framework that is robust in dynamic environments while maintaining state-of-the-art performance on static and low-ego-motion datasets. Our key insight is to connect two powerful paradigms in modern 3D vision: the rich perceptual frontend of feedforward models and the end-to-end optimization of differentiable bundle adjustment (BA). We achieve this with a 3D-aware update operator built on a frozen, pre-trained MASt3R feature backbone, together with a high-capacity motion mask detector that uses multi-level 3D-aware features from the same backbone. Extensive experiments show WildPose consistently outperforms prior methods across dynamic (Wild-SLAM, Bonn), static (TUM, 7-Scenes), and low-ego-motion (Sintel) benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces WildPose, a unified monocular pose estimation framework for dynamic environments. It connects feedforward 3D models with differentiable bundle adjustment via a 3D-aware update operator and high-capacity motion mask detector, both built on a frozen pre-trained MASt3R feature backbone. The method claims to deliver robust performance in dynamic scenes while preserving state-of-the-art results on static and low-ego-motion datasets, with consistent outperformance reported across Wild-SLAM, Bonn, TUM, 7-Scenes, and Sintel benchmarks.

Significance. If the empirical claims hold after detailed verification, the work would offer a practical bridge between perceptual feedforward models and optimization-based SLAM, potentially reducing the need for scene-specific retraining or semantic priors in wild settings. The use of a frozen backbone for both update and masking components is a notable design choice that could generalize if the 3D features prove sufficiently rich.

major comments (2)
  1. Abstract and §3 (architecture): The central claim of unified robustness without scene-specific retraining or failure modes in unseen dynamics rests on the frozen MASt3R backbone supplying sufficiently general multi-level 3D-aware features to both the update operator and motion mask detector. No ablation or analysis is provided on how these features handle novel non-rigid or fast dynamics outside MASt3R's typical training distribution, which directly bears on whether the differentiable BA can reliably suppress outliers without over-penalizing static structure.
  2. Abstract and §4 (experiments): The reported outperformance across mixed benchmarks (dynamic Wild-SLAM/Bonn, static TUM/7-Scenes, low-ego Sintel) is stated without reference to specific quantitative tables, error distributions, or failure-case analysis. This makes it impossible to assess the magnitude of gains or whether they stem from the proposed components versus baseline strengths.
minor comments (2)
  1. Abstract: The phrasing 'high-capacity motion mask detector' is used without defining capacity (e.g., parameter count or architecture depth) relative to prior mask predictors.
  2. Abstract: No mention of runtime or memory overhead compared to prior dynamic-aware methods, which would help evaluate practicality.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and describe the revisions that will be incorporated to improve the manuscript.

read point-by-point responses
  1. Referee: Abstract and §3 (architecture): The central claim of unified robustness without scene-specific retraining or failure modes in unseen dynamics rests on the frozen MASt3R backbone supplying sufficiently general multi-level 3D-aware features to both the update operator and motion mask detector. No ablation or analysis is provided on how these features handle novel non-rigid or fast dynamics outside MASt3R's typical training distribution, which directly bears on whether the differentiable BA can reliably suppress outliers without over-penalizing static structure.

    Authors: We agree that targeted analysis of feature generalization to novel dynamics would strengthen the central claim. Although MASt3R was pretrained on diverse data, the manuscript currently lacks an explicit ablation isolating the backbone's contribution on fast non-rigid motion. In the revised version we will add a new ablation subsection (in §3 or §4) that evaluates the 3D-aware update operator and motion mask detector on Wild-SLAM subsets containing rapid non-rigid dynamics, comparing the frozen MASt3R features against alternative backbones and reporting the resulting impact on outlier suppression within the differentiable BA. revision: yes

  2. Referee: Abstract and §4 (experiments): The reported outperformance across mixed benchmarks (dynamic Wild-SLAM/Bonn, static TUM/7-Scenes, low-ego Sintel) is stated without reference to specific quantitative tables, error distributions, or failure-case analysis. This makes it impossible to assess the magnitude of gains or whether they stem from the proposed components versus baseline strengths.

    Authors: We thank the referee for highlighting this clarity issue. While quantitative results (ATE/RPE) appear in Tables 1–4 of §4, the abstract and opening paragraphs of the experiments section do not explicitly reference these tables or provide error-distribution or failure-case discussion. We will revise the abstract to include brief quantitative highlights with table citations and expand §4 with a new paragraph (and accompanying figure) summarizing error distributions across benchmark categories together with representative failure cases from dynamic, static, and low-ego-motion sequences. revision: yes
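For readers unfamiliar with the headline metric the rebuttal cites, here is a hedged sketch of how ATE is conventionally computed for monocular trackers: the estimated trajectory is first aligned to ground truth with a least-squares similarity transform (Umeyama, reference [45] in the bibliography below), then the RMSE of the position errors is reported. This is the standard recipe, not the paper's exact evaluation code; the choice of alignment (SE(3) vs Sim(3), keyframes vs all frames) is the paper's to specify.

```python
import numpy as np

def umeyama_alignment(est, gt):
    """Least-squares Sim(3) alignment of est (N,3) onto gt (N,3); returns scale, rotation, translation."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    e, g = est - mu_e, gt - mu_g
    cov = g.T @ e / len(est)                       # 3x3 cross-covariance between the two point sets
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:   # reflection guard
        S[2, 2] = -1.0
    R = U @ S @ Vt
    s = np.trace(np.diag(D) @ S) / (e ** 2).sum(axis=1).mean()
    t = mu_g - s * R @ mu_e
    return s, R, t

def ate_rmse(est, gt):
    """Absolute trajectory error: RMSE of positions after Sim(3) alignment."""
    s, R, t = umeyama_alignment(est, gt)
    aligned = (s * (R @ est.T)).T + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(axis=1).mean()))
```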

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper's framework connects an external frozen pre-trained MASt3R backbone to a 3D-aware update operator and motion mask detector, then validates unified robustness via empirical benchmark comparisons on independent datasets (Wild-SLAM, Bonn, TUM, 7-Scenes, Sintel). No equations, self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations are present that would make any claimed performance equivalent to the inputs by construction. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The framework rests on the domain assumption that MASt3R features transfer effectively to dynamic pose estimation and on the introduction of two new components whose independent value is asserted via benchmark gains.

axioms (1)
  • domain assumption: Pre-trained MASt3R provides rich perceptual 3D features usable without fine-tuning for both motion masking and pose updates.
    The backbone is kept frozen; success depends on this transfer property holding across the tested dynamic and static datasets.
invented entities (2)
  • 3D-aware update operator (no independent evidence)
    purpose: To connect feedforward features with differentiable bundle adjustment
    New module introduced to enable end-to-end optimization while leveraging the MASt3R backbone.
  • high-capacity motion mask detector (no independent evidence)
    purpose: To identify moving objects using multi-level 3D features from the same backbone
    New detector component proposed to handle dynamic scenes without semantic brittleness.

pith-pipeline@v0.9.0 · 5500 in / 1423 out tokens · 63638 ms · 2026-05-14T20:34:19.048067+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 5 internal anchors

  1. [1] Berta Bescos, José M. Fácil, Javier Civera, and José Neira. DynaSLAM: Tracking, Mapping and Inpainting in Dynamic Scenes. RAL, 2018.
  2. [2] Daniel J. Butler, Jonas Wulff, Garrett B. Stanley, and Michael J. Black. A naturalistic open source movie for optical flow evaluation. In European Conference on Computer Vision, pages 611–625. Springer, 2012.
  3. [3] Carlos Campos, Richard Elvira, Juan J. Gómez Rodríguez, José M. M. Montiel, and Juan D. Tardós. ORB-SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM. IEEE Transactions on Robotics, 37(6):1874–1890.
  4. [4] Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video Depth Anything: Consistent depth estimation for super-long videos. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22831–22840, 2025.
  5. [5] Weirong Chen, Ganlin Zhang, Felix Wimbauer, Rui Wang, Nikita Araslanov, Andrea Vedaldi, and Daniel Cremers. Back on Track: Bundle adjustment for dynamic scene reconstruction. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4951–4960, 2025.
  6. [6] Jiyu Cheng, Yuxiang Sun, and Max Q.-H. Meng. Improving monocular visual SLAM in dynamic environments: an optical-flow-based approach. Advanced Robotics, 2019.
  7. [7] Shuhong Cheng, Changhe Sun, Shijun Zhang, and Dianfan Zhang. SG-SLAM: A real-time RGB-D visual SLAM toward dynamic scenes with semantic and geometric information. IEEE Transactions on Instrumentation and Measurement, 2022.
  8. [8] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR.
  9. [9] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
  10. [10] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision, pages 834–849. Springer, 2014.
  11. [11] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE TPAMI, 2017.
  12. [12] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. PAMI, 2017.
  13. [13] Christian Forster, Matia Pizzoli, and Davide Scaramuzza. SVO: Fast semi-direct monocular visual odometry. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 15–22. IEEE, 2014.
  14. [14] Lily Goli, Sara Sabour, Mark Matthews, Marcus A. Brubaker, Dmitry Lagun, Alec Jacobson, David J. Fleet, Saurabh Saxena, and Andrea Tagliasacchi. RoMo: Robust motion segmentation improves structure from motion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6155–6164, 2025.
  15. [15] Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J. Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, et al. Kubric: A scalable dataset generator. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3749–3761, 2022.
  16. [16] Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, and Ying Shan. DepthCrafter: Generating consistent long depth sequences for open-world videos. arXiv preprint arXiv:2409.02095, 2024.
  17. [17] Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. ViPE: Video pose engine for 3D geometric perception. arXiv preprint arXiv:2508.10934, 2025.
  18. [18] Haochen Jiang, Yueming Xu, Kejie Li, Jianfeng Feng, and Li Zhang. RoDyn-SLAM: Robust dynamic dense RGB-D SLAM with neural radiance fields. IEEE Robotics and Automation Letters, 2024.
  19. [19] Masaya Kaneko, Kazuya Iwami, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. Mask-SLAM: Robust feature-based monocular SLAM by masking using semantic segmentation. In CVPR Workshops, 2018.
  20. [20] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. DynamicStereo: Consistent dynamic depth from stereo videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13229–13239, 2023.
  21. [21] Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Universal feed-forward metric 3D reconstruction. In …
  22. [22] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM TOG, 2023.
  23. [23] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, et al. Segment Anything. In ICCV, 2023.
  24. [24] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R. In European Conference on Computer Vision, pages 71–91. Springer, 2024.
  25. [25] Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3D with MASt3R, 2024.
  26. [26] Mingrui Li, Zhetao Guo, Tianchen Deng, Yiming Zhou, Yuxiang Ren, and Hongyu Wang. DDN-SLAM: Real-time dense dynamic neural implicit SLAM. IEEE Robotics and Automation Letters, 2025.
  27. [27] Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. MegaSaM: Accurate, fast, and robust structure and motion from casual dynamic videos. CVPR.
  28. [28] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision, pages 38–55. Springer, 2024.
  29. [29] Dominic Maggio, Hyungtae Lim, and Luca Carlone. VGGT-SLAM: Dense RGB SLAM optimized on the SL(4) manifold. NeurIPS, 2025.
  30. [30] Hidenobu Matsuki, Riku Murai, Paul H. J. Kelly, and Andrew J. Davison. Gaussian Splatting SLAM. In CVPR, 2024.
  31. [31] Raul Mur-Artal and Juan D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 2017.
  32. [32] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D. Tardos. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 2015.
  33. [33] Riku Murai, Eric Dexheimer, and Andrew J. Davison. MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors. In CVPR, 2025.
  34. [34] E. Palazzolo, J. Behley, P. Lottes, P. Giguère, and C. Stachniss. ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals. In IROS, 2019.
  35. [35] Manthan Patel, Fan Yang, Yuheng Qiu, Cesar Cadena, Sebastian Scherer, Marco Hutter, and Wenshan Wang. TartanGround: A large-scale dataset for ground robot perception and navigation. arXiv preprint arXiv:2505.10696, 2025.
  36. [36] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12179–12188, 2021.
  37. [37] Erik Sandström, Keisuke Tateno, Michael Oechsle, Michael Niemeyer, Luc Van Gool, Martin R. Oswald, and Federico Tombari. Splat-SLAM: Globally optimized RGB-only SLAM with 3D Gaussians. arXiv preprint arXiv:2405.16544, 2024.
  38. [38] Nicolas Schischka, Hannah Schieber, Mert Asim Karaoglu, Melih Gorgulu, Florian Grötzner, Alexander Ladikos, Nassir Navab, Daniel Roth, and Benjamin Busam. DynaMoN: Motion-aware fast and robust camera localization for dynamic neural radiance fields. IEEE Robotics and Automation Letters.
  39. [39] Raluca Scona, Mariano Jaimez, Yvan R. Petillot, Maurice Fallon, and Daniel Cremers. StaticFusion: Background reconstruction for dense RGB-D SLAM in dynamic environments.
  40. [40] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2930–2937, 2013.
  41. [41] João Carlos Virgolino Soares, Marcelo Gattass, and Marco Antonio Meggiolaro. Crowd-SLAM: Visual SLAM towards crowded environments using object detection. Journal of Intelligent & Robotic Systems, 2021.
  42. [42] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In IROS, 2012.
  43. [43] Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. In NeurIPS, 2021.
  44. [44] The AirLab. TartanAir-V2 Dataset. https://tartanair.org, 2022. Accessed: 2025-10-28.
  45. [45] Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE TPAMI, 1991.
  46. [46] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In CVPR, pages 5294–5306, 2025.
  47. [47] Qianqian Wang*, Yifei Zhang*, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3D perception model with persistent state. In CVPR, 2025.
  48. [48] Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. MoGe-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint arXiv:2507.02546.
  49. [49] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In CVPR, 2024.
  50. [50] Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π³: Scalable permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347.
  51. [51] Felix Wimbauer, Nan Yang, Lukas von Stumberg, Niclas Zeller, and Daniel Cremers. MonoRec: Semi-supervised dense reconstruction in dynamic environments from a single moving camera. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6112–6122.
  52. [52] Gangwei Xu, Haotong Lin, Hongcheng Luo, Xianqi Wang, Jingfeng Yao, Lianghui Zhu, Yuechuan Pu, Cheng Chi, Haiyang Sun, Bing Wang, et al. Pixel-perfect depth with semantics-prompted diffusion transformers. arXiv preprint arXiv:2510.07316, 2025.
  53. [53] Yueming Xu, Haochen Jiang, Zhongyang Xiao, Jianfeng Feng, and Li Zhang. DG-SLAM: Robust dynamic Gaussian splatting SLAM with hybrid pose optimization. Advances in Neural Information Processing Systems, 37:51577–51596.
  54. [54] Yueming Xu, Haochen Jiang, Zhongyang Xiao, Jianfeng Feng, and Li Zhang. DG-SLAM: Robust Dynamic Gaussian Splatting SLAM with Hybrid Pose Optimization. In NeurIPS, 2024.
  55. [55] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. In CVPR, pages 10371–10381, 2024.
  56. [56] Ganlin Zhang, Erik Sandström, Youmin Zhang, Manthan Patel, Luc Van Gool, and Martin R. Oswald. Glorie-SLAM: Globally optimized RGB-only implicit encoding point cloud SLAM. arXiv preprint arXiv:2403.19549, 2024.
  57. [57] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3R: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825.
  58. [58] Zhoutong Zhang, Forrester Cole, Zhengqi Li, Michael Rubinstein, Noah Snavely, and William T. Freeman. Structure and motion from casual videos. In ECCV, pages 20–37. Springer.
  59. [59] Wang Zhao, Shaohui Liu, Hengkai Guo, Wenping Wang, and Yong-Jin Liu. ParticleSfM: Exploiting dense point trajectories for localizing moving cameras in the wild. In ECCV, pages 523–542. Springer, 2022.
  60. [60] Jianhao Zheng, Zihan Zhu, Valentin Bieri, Marc Pollefeys, Songyou Peng, and Iro Armeni. WildGS-SLAM: Monocular Gaussian splatting SLAM in dynamic environments. In CVPR, pages 11461–11471, 2025.
  61. [61] Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, et al. OmniWorld: A multi-domain and multi-modal dataset for 4D world modeling. arXiv preprint arXiv:2509.12201, 2025.
  62. [62] Liyuan Zhu, Yue Li, Erik Sandström, Konrad Schindler, and Iro Armeni. LoopSplat: Loop closure by registering 3D Gaussian splats. In 3DV, 2025.
  63. [63] Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hujun Bao, Zhaopeng Cui, Martin R. Oswald, and Marc Pollefeys. NICE-SLAM: Neural implicit scalable encoding for SLAM. In CVPR, 2022.
  64. [64] Zihan Zhu, Songyou Peng, Viktor Larsson, Zhaopeng Cui, Martin R. Oswald, Andreas Geiger, and Marc Pollefeys. NICER-SLAM: Neural implicit scene encoding for RGB SLAM. 2024.
  65. [65] Supplementary material outline: more information about the training dataset (Sec. 6).
  66. [66] Supplementary material outline: implementation details of WildPose, including more training details and model architecture (Sec. 7).
  67. [67] Supplementary material outline: additional results and discussion (Sec. 8).
  68. [68] Supplementary material outline: limitations and future work (Sec. 9).
  69. [69] Supplementary Sec. 6, Training Dataset: the model is trained on four publicly available datasets plus data generated with the Kubric simulator [15]; the training datasets encompass both static and dynamic environments (full list in Table 7), with TartanAir V2 [44] and TartanGround [35] primarily…
  70. [70] Supplementary Sec. 7.1, Additional Training Details: following [43], 7 frames are sampled per batch from the training sequence, the average optical flow magnitude between neighboring pairs is constrained to the range of 8 to 96 pixels, and standard data augmentation (photometric transformations, color jitte…) is applied to all frames.
  71. [71] Supplementary Sec. 8, Additional Results: full tracking results on the static TUM RGB-D [42] and 7-Scenes [40] datasets; the main paper summarizes average ATE, with per-sequence results in Table 8 (TUM RGB-D) and…
  72. [72] Supplementary Sec. 9, Limitations: WildPose's learnable modules are trained exclusively on synthetic data; although the curriculum is diverse, a domain gap to real-world scenarios inevitably exists, evident in sequences with unobserved phenomena such as the significant photometric variations in the Bonn RGB-D Dynamic Dataset (Fig. 7). Our model, lacking expli…

    Limitations WildPose’s learnable modules are trained exclusively on syn- thetic data. Although our curriculum is diverse, a domain gap to real-world scenarios inevitably exists. This gap is evi- dent in sequences with unobserved phenomena, such as the significant photometric variations in the Bonn RGB-D Dy- namic Dataset (Fig. 7). Our model, lacking expli...