WildPose: A Unified Framework for Robust Pose Estimation in the Wild
Recognition: 2 Lean theorem links
Pith review · 2026-05-14 20:34 UTC · model grok-4.3
The pith
WildPose unifies monocular pose estimation to stay accurate in dynamic scenes without losing ground on static or low-motion ones.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WildPose connects feedforward 3D vision features with differentiable bundle adjustment through a 3D-aware update operator and a multi-level motion mask detector, both built on a frozen pre-trained backbone, to produce accurate camera poses in dynamic environments while preserving state-of-the-art results on static and low-ego-motion data.
What carries the argument
The 3D-aware update operator and high-capacity motion mask detector that share multi-level features from a frozen pre-trained MASt3R backbone to drive differentiable bundle adjustment.
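The load-bearing pattern here, a learned per-pixel mask gating the residuals of a differentiable optimizer, can be sketched in a few lines. This is a hypothetical illustration of that coupling, not WildPose's implementation: the function name, shapes, and the soft square-root weighting are assumptions.

```python
import numpy as np

def masked_ba_residual(points_cam, pixels_obs, K, static_prob):
    """Reprojection residuals, down-weighted where motion is likely.

    points_cam  : (N, 3) 3D points in the camera frame
    pixels_obs  : (N, 2) observed pixel coordinates
    K           : (3, 3) camera intrinsics
    static_prob : (N,)   mask in [0, 1]; ~0 for moving content
    """
    proj = (K @ points_cam.T).T            # project to homogeneous pixels
    proj = proj[:, :2] / proj[:, 2:3]      # perspective divide
    resid = proj - pixels_obs              # (N, 2) reprojection error
    # Soft weighting (rather than hard discarding) keeps the objective
    # differentiable, so mask and pose can be optimized end to end.
    return np.sqrt(static_prob)[:, None] * resid
```

The point of the sketch is the last line: because the mask enters the objective as a continuous weight, gradients flow back into both the pose update and the mask predictor during bundle adjustment.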
If this is right
- It delivers higher accuracy than prior methods on dynamic benchmarks such as Wild-SLAM and Bonn.
- It matches or exceeds state-of-the-art results on static benchmarks including TUM and 7-Scenes.
- It maintains strong performance on low-ego-motion sequences such as Sintel.
- It removes the requirement for per-sequence optimization or semantic segmentation when moving objects are present.
Where Pith is reading between the lines
- Pre-trained 3D features appear sufficient to support motion robustness across varied scene types without task-specific fine-tuning.
- The same backbone-plus-optimization pattern could simplify other 3D tasks that must tolerate mixed static and dynamic content.
- Real-time implementations or integration into larger mapping systems would test whether the current accuracy carries over under computational constraints.
Load-bearing premise
The frozen pre-trained backbone already contains enough general 3D information to let the update operator and mask detector handle new dynamic situations without any further training.
What would settle it
- A new set of dynamic video sequences containing motion patterns absent from the training distribution: if WildPose's accuracy there fell below that of existing dynamic-aware methods while its static-scene performance remained unchanged, the unified-robustness claim would fail.
Original abstract
Estimating camera pose in dynamic environments is a critical challenge, as most visual SLAM and SfM methods assume static scenes. While recent dynamic-aware methods exist, they are often not unified: semantic-based approaches are brittle, per-sequence optimization methods fail on short sequences, and other learned models may degrade on static-only scenes. We present WildPose, a unified monocular pose estimation framework that is robust in dynamic environments while maintaining state-of-the-art performance on static and low-ego-motion datasets. Our key insight is to connect two powerful paradigms in modern 3D vision: the rich perceptual frontend of feedforward models and the end-to-end optimization of differentiable bundle adjustment (BA). We achieve this with a 3D-aware update operator built on a frozen, pre-trained MASt3R feature backbone, together with a high-capacity motion mask detector that uses multi-level 3D-aware features from the same backbone. Extensive experiments show WildPose consistently outperforms prior methods across dynamic (Wild-SLAM, Bonn), static (TUM, 7-Scenes), and low-ego-motion (Sintel) benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WildPose, a unified monocular pose estimation framework for dynamic environments. It connects feedforward 3D models with differentiable bundle adjustment via a 3D-aware update operator and high-capacity motion mask detector, both built on a frozen pre-trained MASt3R feature backbone. The method claims to deliver robust performance in dynamic scenes while preserving state-of-the-art results on static and low-ego-motion datasets, with consistent outperformance reported across Wild-SLAM, Bonn, TUM, 7-Scenes, and Sintel benchmarks.
Significance. If the empirical claims hold after detailed verification, the work would offer a practical bridge between perceptual feedforward models and optimization-based SLAM, potentially reducing the need for scene-specific retraining or semantic priors in wild settings. The use of a frozen backbone for both update and masking components is a notable design choice that could generalize if the 3D features prove sufficiently rich.
major comments (2)
- Abstract and §3 (architecture): The central claim of unified robustness without scene-specific retraining or failure modes in unseen dynamics rests on the frozen MASt3R backbone supplying sufficiently general multi-level 3D-aware features to both the update operator and motion mask detector. No ablation or analysis is provided on how these features handle novel non-rigid or fast dynamics outside MASt3R's typical training distribution, which directly bears on whether the differentiable BA can reliably suppress outliers without over-penalizing static structure.
- Abstract and §4 (experiments): The reported outperformance across mixed benchmarks (dynamic Wild-SLAM/Bonn, static TUM/7-Scenes, low-ego Sintel) is stated without reference to specific quantitative tables, error distributions, or failure-case analysis. This makes it impossible to assess the magnitude of gains or whether they stem from the proposed components versus baseline strengths.
minor comments (2)
- Abstract: The phrasing 'high-capacity motion mask detector' is used without defining capacity (e.g., parameter count or architecture depth) relative to prior mask predictors.
- Abstract: No mention of runtime or memory overhead compared to prior dynamic-aware methods, which would help evaluate practicality.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below and describe the revisions that will be incorporated to improve the manuscript.
Point-by-point responses
-
Referee: Abstract and §3 (architecture): The central claim of unified robustness without scene-specific retraining or failure modes in unseen dynamics rests on the frozen MASt3R backbone supplying sufficiently general multi-level 3D-aware features to both the update operator and motion mask detector. No ablation or analysis is provided on how these features handle novel non-rigid or fast dynamics outside MASt3R's typical training distribution, which directly bears on whether the differentiable BA can reliably suppress outliers without over-penalizing static structure.
Authors: We agree that targeted analysis of feature generalization to novel dynamics would strengthen the central claim. Although MASt3R was pretrained on diverse data, the manuscript currently lacks an explicit ablation isolating the backbone's contribution on fast non-rigid motion. In the revised version we will add a new ablation subsection (in §3 or §4) that evaluates the 3D-aware update operator and motion mask detector on Wild-SLAM subsets containing rapid non-rigid dynamics, comparing the frozen MASt3R features against alternative backbones and reporting the resulting impact on outlier suppression within the differentiable BA.
Revision: yes
-
Referee: Abstract and §4 (experiments): The reported outperformance across mixed benchmarks (dynamic Wild-SLAM/Bonn, static TUM/7-Scenes, low-ego Sintel) is stated without reference to specific quantitative tables, error distributions, or failure-case analysis. This makes it impossible to assess the magnitude of gains or whether they stem from the proposed components versus baseline strengths.
Authors: We thank the referee for highlighting this clarity issue. While quantitative results (ATE/RPE) appear in Tables 1–4 of §4, the abstract and opening paragraphs of the experiments section do not explicitly reference these tables or provide error-distribution or failure-case discussion. We will revise the abstract to include brief quantitative highlights with table citations and expand §4 with a new paragraph (and accompanying figure) summarizing error distributions across benchmark categories together with representative failure cases from dynamic, static, and low-ego-motion sequences.
Revision: yes
Circularity Check
No significant circularity in the derivation chain
Full rationale
The paper's framework connects an external frozen pre-trained MASt3R backbone to a 3D-aware update operator and motion mask detector, then validates unified robustness via empirical benchmark comparisons on independent datasets (Wild-SLAM, Bonn, TUM, 7-Scenes, Sintel). No equations, self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citations are present that would make any claimed performance equivalent to the inputs by construction. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: pre-trained MASt3R provides rich perceptual 3D features usable without fine-tuning for both motion masking and pose updates.
invented entities (2)
- 3D-aware update operator (no independent evidence)
- high-capacity motion mask detector (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Relation between the paper passage and the cited Recognition theorem is unclear. Matched passage: "We achieve this by enhancing the differentiable BA pipeline... with a 3D-aware update operator built on a frozen, pre-trained MASt3R feature backbone, together with a high-capacity motion mask detector"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Relation between the paper passage and the cited Recognition theorem is unclear. Matched passage: "Our key insight is to connect... feed-forward models and... differentiable bundle adjustment"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Berta Bescos, José M. Fácil, Javier Civera, and José Neira. DynaSLAM: Tracking, mapping and inpainting in dynamic scenes. RAL, 2018.
- [2] Daniel J Butler, Jonas Wulff, Garrett B Stanley, and Michael J Black. A naturalistic open source movie for optical flow evaluation. In ECCV, pages 611–625. Springer, 2012.
- [3] Carlos Campos, Richard Elvira, Juan J Gómez Rodríguez, José MM Montiel, and Juan D Tardós. ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM. IEEE Transactions on Robotics, 37(6):1874–1890, 2021.
- [4] Sili Chen, Hengkai Guo, Shengnan Zhu, Feihu Zhang, Zilong Huang, Jiashi Feng, and Bingyi Kang. Video Depth Anything: Consistent depth estimation for super-long videos. In CVPR, pages 22831–22840, 2025.
- [5] Weirong Chen, Ganlin Zhang, Felix Wimbauer, Rui Wang, Nikita Araslanov, Andrea Vedaldi, and Daniel Cremers. Back on track: Bundle adjustment for dynamic scene reconstruction. In ICCV, pages 4951–4960, 2025.
- [6] Jiyu Cheng, Yuxiang Sun, and Max Q-H Meng. Improving monocular visual SLAM in dynamic environments: an optical-flow-based approach. Advanced Robotics, 2019.
- [7] Shuhong Cheng, Changhe Sun, Shijun Zhang, and Dianfan Zhang. SG-SLAM: A real-time RGB-D visual SLAM toward dynamic scenes with semantic and geometric information. IEEE Transactions on Instrumentation and Measurement, 2022.
- [8] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. ScanNet: Richly-annotated 3D reconstructions of indoor scenes. In CVPR, 2017.
- [9] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- [10] Jakob Engel, Thomas Schöps, and Daniel Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In ECCV, pages 834–849. Springer, 2014.
- [11] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE TPAMI, 2017.
- [12] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE TPAMI, 2017.
- [13] Christian Forster, Matia Pizzoli, and Davide Scaramuzza. SVO: Fast semi-direct monocular visual odometry. In ICRA, pages 15–22. IEEE, 2014.
- [14] Lily Goli, Sara Sabour, Mark Matthews, Marcus A Brubaker, Dmitry Lagun, Alec Jacobson, David J Fleet, Saurabh Saxena, and Andrea Tagliasacchi. RoMo: Robust motion segmentation improves structure from motion. In ICCV, pages 6155–6164, 2025.
- [15] Klaus Greff, Francois Belletti, Lucas Beyer, Carl Doersch, Yilun Du, Daniel Duckworth, David J Fleet, Dan Gnanapragasam, Florian Golemo, Charles Herrmann, et al. Kubric: A scalable dataset generator. In CVPR, pages 3749–3761, 2022.
- [16] Wenbo Hu, Xiangjun Gao, Xiaoyu Li, Sijie Zhao, Xiaodong Cun, Yong Zhang, Long Quan, and Ying Shan. DepthCrafter: Generating consistent long depth sequences for open-world videos. arXiv preprint arXiv:2409.02095, 2024.
- [17] Jiahui Huang, Qunjie Zhou, Hesam Rabeti, Aleksandr Korovko, Huan Ling, Xuanchi Ren, Tianchang Shen, Jun Gao, Dmitry Slepichev, Chen-Hsuan Lin, et al. ViPE: Video pose engine for 3D geometric perception. arXiv preprint arXiv:2508.10934, 2025.
- [18] Haochen Jiang, Yueming Xu, Kejie Li, Jianfeng Feng, and Li Zhang. RoDyn-SLAM: Robust dynamic dense RGB-D SLAM with neural radiance fields. IEEE Robotics and Automation Letters, 2024.
- [19] Masaya Kaneko, Kazuya Iwami, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. Mask-SLAM: Robust feature-based monocular SLAM by masking using semantic segmentation. In CVPR Workshops, 2018.
- [20] Nikita Karaev, Ignacio Rocco, Benjamin Graham, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. DynamicStereo: Consistent dynamic depth from stereo videos. In CVPR, pages 13229–13239, 2023.
- [21] Nikhil Keetha, Norman Müller, Johannes Schönberger, Lorenzo Porzi, Yuchen Zhang, Tobias Fischer, Arno Knapitsch, Duncan Zauss, Ethan Weber, Nelson Antunes, Jonathon Luiten, Manuel Lopez-Antequera, Samuel Rota Bulò, Christian Richardt, Deva Ramanan, Sebastian Scherer, and Peter Kontschieder. MapAnything: Universal feed-forward metric 3D reconstruction. In ..., 2025.
- [22] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM TOG, 2023.
- [23] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In ICCV, 2023.
- [24] Vincent Leroy, Yohann Cabon, and Jérôme Revaud. Grounding image matching in 3D with MASt3R. In ECCV, pages 71–91. Springer, 2024.
- [25] Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3D with MASt3R, 2024.
- [26] Mingrui Li, Zhetao Guo, Tianchen Deng, Yiming Zhou, Yuxiang Ren, and Hongyu Wang. DDN-SLAM: Real-time dense dynamic neural implicit SLAM. IEEE Robotics and Automation Letters, 2025.
- [27] Zhengqi Li, Richard Tucker, Forrester Cole, Qianqian Wang, Linyi Jin, Vickie Ye, Angjoo Kanazawa, Aleksander Holynski, and Noah Snavely. MegaSaM: Accurate, fast, and robust structure and motion from casual dynamic videos. CVPR, 2025.
- [28] Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. In ECCV, pages 38–55. Springer, 2024.
- [29] Dominic Maggio, Hyungtae Lim, and Luca Carlone. VGGT-SLAM: Dense RGB SLAM optimized on the SL(4) manifold. NeurIPS, 2025.
- [30] Hidenobu Matsuki, Riku Murai, Paul HJ Kelly, and Andrew J Davison. Gaussian splatting SLAM. In CVPR, 2024.
- [31] Raul Mur-Artal and Juan D Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 2017.
- [32] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D Tardos. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 2015.
- [33] Riku Murai, Eric Dexheimer, and Andrew J Davison. MASt3R-SLAM: Real-time dense SLAM with 3D reconstruction priors. In CVPR, 2025.
- [34] E. Palazzolo, J. Behley, P. Lottes, P. Giguère, and C. Stachniss. ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals. In IROS, 2019.
- [35] Manthan Patel, Fan Yang, Yuheng Qiu, Cesar Cadena, Sebastian Scherer, Marco Hutter, and Wenshan Wang. TartanGround: A large-scale dataset for ground robot perception and navigation. arXiv preprint arXiv:2505.10696, 2025.
- [36] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In ICCV, pages 12179–12188, 2021.
- [37] Erik Sandström, Keisuke Tateno, Michael Oechsle, Michael Niemeyer, Luc Van Gool, Martin R Oswald, and Federico Tombari. Splat-SLAM: Globally optimized RGB-only SLAM with 3D Gaussians. arXiv preprint arXiv:2405.16544, 2024.
- [38] Nicolas Schischka, Hannah Schieber, Mert Asim Karaoglu, Melih Gorgulu, Florian Grötzner, Alexander Ladikos, Nassir Navab, Daniel Roth, and Benjamin Busam. DynaMoN: Motion-aware fast and robust camera localization for dynamic neural radiance fields. IEEE Robotics and Automation Letters.
- [39] Raluca Scona, Mariano Jaimez, Yvan R Petillot, Maurice Fallon, and Daniel Cremers. StaticFusion: Background reconstruction for dense RGB-D SLAM in dynamic environments.
- [40] Jamie Shotton, Ben Glocker, Christopher Zach, Shahram Izadi, Antonio Criminisi, and Andrew Fitzgibbon. Scene coordinate regression forests for camera relocalization in RGB-D images. In CVPR, pages 2930–2937, 2013.
- [41] João Carlos Virgolino Soares, Marcelo Gattass, and Marco Antonio Meggiolaro. Crowd-SLAM: visual SLAM towards crowded environments using object detection. Journal of Intelligent & Robotic Systems, 2021.
- [42] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In IROS, 2012.
- [43] Zachary Teed and Jia Deng. DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras. In NeurIPS, 2021.
- [44] The AirLab. TartanAir-V2 Dataset. https://tartanair.org, 2022. Accessed: 2025-10-28.
- [45] Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE TPAMI, 1991.
- [46] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In CVPR, pages 5294–5306, 2025.
- [47] Qianqian Wang*, Yifei Zhang*, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3D perception model with persistent state. In CVPR, 2025.
- [48] Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. MoGe-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint arXiv:2507.02546, 2025.
- [49] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In CVPR, 2024.
- [50] Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, and Tong He. π³: Scalable permutation-equivariant visual geometry learning. arXiv preprint arXiv:2507.13347, 2025.
- [51] Felix Wimbauer, Nan Yang, Lukas von Stumberg, Niclas Zeller, and Daniel Cremers. MonoRec: Semi-supervised dense reconstruction in dynamic environments from a single moving camera. In CVPR, pages 6112–6122, 2021.
- [52] Gangwei Xu, Haotong Lin, Hongcheng Luo, Xianqi Wang, Jingfeng Yao, Lianghui Zhu, Yuechuan Pu, Cheng Chi, Haiyang Sun, Bing Wang, et al. Pixel-perfect depth with semantics-prompted diffusion transformers. arXiv preprint arXiv:2510.07316, 2025.
- [53] Yueming Xu, Haochen Jiang, Zhongyang Xiao, Jianfeng Feng, and Li Zhang. DG-SLAM: Robust dynamic Gaussian splatting SLAM with hybrid pose optimization. Advances in Neural Information Processing Systems, 37:51577–51596, 2024.
- [54] Yueming Xu, Haochen Jiang, Zhongyang Xiao, Jianfeng Feng, and Li Zhang. DG-SLAM: Robust dynamic Gaussian splatting SLAM with hybrid pose optimization. In NeurIPS, 2024.
- [55] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. In CVPR, pages 10371–10381, 2024.
- [56] Ganlin Zhang, Erik Sandström, Youmin Zhang, Manthan Patel, Luc Van Gool, and Martin R Oswald. GlORIE-SLAM: Globally optimized RGB-only implicit encoding point cloud SLAM. arXiv preprint arXiv:2403.19549, 2024.
- [57] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3R: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024.
- [58] Zhoutong Zhang, Forrester Cole, Zhengqi Li, Michael Rubinstein, Noah Snavely, and William T Freeman. Structure and motion from casual videos. In ECCV, pages 20–37. Springer, 2022.
- [59] Wang Zhao, Shaohui Liu, Hengkai Guo, Wenping Wang, and Yong-Jin Liu. ParticleSfM: Exploiting dense point trajectories for localizing moving cameras in the wild. In ECCV, pages 523–542. Springer, 2022.
- [60] Jianhao Zheng, Zihan Zhu, Valentin Bieri, Marc Pollefeys, Songyou Peng, and Iro Armeni. WildGS-SLAM: Monocular Gaussian splatting SLAM in dynamic environments. In CVPR, pages 11461–11471, 2025.
- [61] Yang Zhou, Yifan Wang, Jianjun Zhou, Wenzheng Chang, Haoyu Guo, Zizun Li, Kaijing Ma, Xinyue Li, Yating Wang, Haoyi Zhu, et al. OmniWorld: A multi-domain and multi-modal dataset for 4D world modeling. arXiv preprint arXiv:2509.12201, 2025.
- [62] Liyuan Zhu, Yue Li, Erik Sandström, Konrad Schindler, and Iro Armeni. LoopSplat: Loop closure by registering 3D Gaussian splats. In 3DV, 2025.
- [63] Zihan Zhu, Songyou Peng, Viktor Larsson, Weiwei Xu, Hujun Bao, Zhaopeng Cui, Martin R. Oswald, and Marc Pollefeys. NICE-SLAM: Neural implicit scalable encoding for SLAM. In CVPR, 2022.
- [64] Zihan Zhu, Songyou Peng, Viktor Larsson, Zhaopeng Cui, Martin R Oswald, Andreas Geiger, and Marc Pollefeys. NICER-SLAM: Neural implicit scene encoding for RGB SLAM. 2024.
WildPose: A Unified Framework for Robust Pose Estimation in the Wild · Supplementary Material
In the supplementary material, we provide additional details about the following:
- More information about the training dataset (Sec. 6)
- Implementation details of WildPose, including more training details and model architecture (Sec. 7)
- Additional results and discussion (Sec. 8)
- Limitations and future work (Sec. 9)
Training Dataset
We trained our model on four publicly available datasets supplemented with data that we generated using the Kubric simulator [15]. The training datasets encompass both static and dynamic environments. A comprehensive list of the datasets we used is provided in Table 7. While the TartanAir V2 [44] and TartanGround [35] datasets primarily ...
Implementation Details
7.1. Additional Training Details. Following [43], we sample 7 frames per batch from the training sequence. We constrain the average optical-flow magnitude between neighboring pairs to fall within the range of 8 to 96 pixels. For all frames, we apply standard data augmentation, comprising photometric transformations (color jitter, ...
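The flow-band sampling rule above reduces to a one-line filter. This is an illustrative sketch, not the authors' code: the function name is invented, the 8–96 px band comes from the text, and `flow` would be supplied by whatever optical-flow estimator the pipeline uses.

```python
import numpy as np

def pair_within_motion_band(flow, lo=8.0, hi=96.0):
    """Accept a neighboring frame pair only if its mean optical-flow
    magnitude lies inside [lo, hi] pixels.

    flow: (H, W, 2) forward optical flow between the two frames.
    """
    mag = np.linalg.norm(flow, axis=-1).mean()  # mean per-pixel magnitude
    return bool(lo <= mag <= hi)
```

A band like this rejects both near-static pairs (too little parallax to constrain pose) and violently moving pairs (flow too large for reliable matching).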
Additional Results
Full tracking results on the static datasets. In the main paper, we summarize the average ATE for the TUM RGB-D (static) [42] and 7-Scenes [40] datasets. Here, we present the results of full sequences in Table 8 (TUM RGB-D), whose columns cover the sequences 360, desk, desk2, floor, plant, room, rpy, teddy, and xyz, with keyframe-pose rows for MASt3R-SLAM [33] and other baselines. ...
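ATE numbers like those summarized here are conventionally computed by aligning the estimated trajectory to ground truth with the Umeyama similarity transform [45] and taking the RMSE of the remaining translation error. A minimal sketch of that standard metric (not WildPose's evaluation code; names are illustrative):

```python
import numpy as np

def umeyama_align(src, dst):
    """Similarity transform (s, R, t) mapping src points onto dst,
    following Umeyama's closed-form least-squares solution [45]."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    xs, xd = src - mu_s, dst - mu_d
    cov = xd.T @ xs / len(src)
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                    # guard against reflections
    R = U @ S @ Vt
    var_s = (xs ** 2).sum() / len(src)    # variance of the source cloud
    s = np.trace(np.diag(D) @ S) / var_s
    t = mu_d - s * R @ mu_s
    return s, R, t

def ate_rmse(est, gt):
    """Absolute trajectory error: RMSE after Sim(3) alignment."""
    s, R, t = umeyama_align(est, gt)
    aligned = (s * (R @ est.T)).T + t
    return float(np.sqrt(((aligned - gt) ** 2).sum(1).mean()))
```

Because monocular pipelines recover trajectories only up to scale, the Sim(3) (rather than SE(3)) alignment is what makes the reported ATE values comparable across methods.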
Limitations
WildPose's learnable modules are trained exclusively on synthetic data. Although our curriculum is diverse, a domain gap to real-world scenarios inevitably exists. This gap is evident in sequences with unobserved phenomena, such as the significant photometric variations in the Bonn RGB-D Dynamic Dataset (Fig. 7). Our model, lacking expli...