arxiv: 2604.22350 · v1 · submitted 2026-04-24 · 💻 cs.CV

PoseFM: Relative Camera Pose Estimation Through Flow Matching

Pith reviewed 2026-05-08 12:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual odometry · flow matching · camera pose estimation · generative modeling · monocular VO · uncertainty estimation · relative pose

The pith

PoseFM reformulates monocular visual odometry as a flow matching generative task to model camera motion as a distribution rather than a point estimate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents PoseFM as the first method to treat frame-to-frame camera pose estimation in monocular visual odometry as a generative modeling problem solved with flow matching. Instead of regressing a single pose value, the approach learns to map random noise into realistic pose samples by integrating continuous-time ordinary differential equations. This yields a distribution over possible motions, which supplies uncertainty estimates and supports more reliable inference when images lack clear structure or good lighting. Experiments on TartanAir, KITTI and TUM-RGBD show the method reaches the lowest absolute trajectory error on selected sequences while staying competitive with existing deterministic frame-to-frame techniques overall.

Core claim

PoseFM is the first framework to reformulate monocular frame-to-frame VO as a generative task using Flow Matching (FM). By leveraging FM, we model camera motion as a distribution rather than a point estimate, learning to transform noise into realistic pose predictions via continuous-time ODEs. This approach provides a principled mechanism for uncertainty estimation and enables robust motion inference under challenging visual conditions. In our evaluations, PoseFM achieves strong performance on TartanAir, KITTI and TUM-RGBD benchmarks, achieving the lowest absolute trajectory error (ATE) on some of the trajectories and overall being competitive with the best frame-to-frame monocular VO methods.

What carries the argument

Flow matching that learns a continuous-time vector field to transport noise samples into a distribution of relative camera poses.
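
To make the sampling mechanism concrete, here is a minimal sketch of the inference step: draw noise, then integrate a vector field from τ = 0 to τ = 1 with fixed Euler steps. Everything in it is an assumption for illustration. The "pose" is a single scalar, and `toy_vector_field` is a closed-form stand-in that transports N(0, 1) to N(mu, s²) along a linear path; it is not the network, probability path, or ODE solver used in the paper.

```python
import random


def toy_vector_field(x, tau, mu=1.0, s=0.5):
    """Closed-form stand-in for a learned vector field (hypothetical).

    Transports N(0, 1) at tau=0 to N(mu, s^2) at tau=1 along the linear
    path x_tau = (1 - tau) * x0 + tau * x1 with independent x0 and x1.
    """
    var_tau = (1.0 - tau) ** 2 + (tau * s) ** 2
    slope = (tau * s * s - (1.0 - tau)) / var_tau
    return mu + slope * (x - tau * mu)


def sample_pose(x0, vector_field, num_steps=100):
    """Integrate dx/dtau = v(x, tau) from tau=0 to tau=1 with Euler steps."""
    x = x0
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x += dt * vector_field(x, k * dt)
    return x


def draw_samples(num_samples, seed=0):
    """Draw noise from N(0, 1) and push each draw through the ODE."""
    rng = random.Random(seed)
    return [sample_pose(rng.gauss(0.0, 1.0), toy_vector_field)
            for _ in range(num_samples)]
```

Calling `draw_samples(n)` returns `n` scalar samples whose spread reflects the target distribution. In a real system the vector field would be a network conditioned on the image pair, and the state would be a full relative pose rather than a scalar.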

If this is right

  • The distribution over poses supplies per-estimate uncertainty that can be propagated into downstream navigation or mapping modules.
  • Robust motion inference becomes possible in texture-poor or poorly lit scenes where feature-based pipelines degrade.
  • Competitive accuracy is maintained while adding generative capabilities without requiring changes to the input image stream.
  • The same trained model can generate multiple plausible trajectories from one image pair for risk-aware planning.
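
One simple way to turn repeated samples into a per-estimate uncertainty is to report the per-axis mean and variance of the sampled translations. The sketch below assumes a hypothetical interface in which each sample is a `(tx, ty, tz)` tuple drawn for the same image pair. It covers translation only, since averaging rotations needs a treatment on SO(3) that is not shown here, and the paper may summarize its samples differently.

```python
import statistics


def summarize_translation_samples(samples):
    """Per-axis mean and sample variance of translation samples.

    `samples` is a list of (tx, ty, tz) tuples, assumed to come from
    repeated sampling for the same image pair (hypothetical interface).
    At least two samples are required for the variance.
    """
    axes = list(zip(*samples))
    mean = tuple(statistics.fmean(axis) for axis in axes)
    variance = tuple(statistics.variance(axis) for axis in axes)
    return mean, variance
```

A downstream filter or planner could then use the variance as a rough measurement-noise estimate for that frame.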

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The continuous ODE formulation could be adapted to variable frame-rate inputs by adjusting the integration time step without retraining.
  • Sampling multiple poses from the learned distribution might improve robustness when fused with inertial measurements in a visual-inertial system.
  • Extending the flow-matching head to also predict scene flow or depth could create a joint generative model for both motion and structure.

Load-bearing premise

Transforming noise into pose predictions through continuous-time ODEs in a flow-matching setup will produce meaningfully better uncertainty estimates and robustness than direct deterministic regression when visual conditions are difficult.

What would settle it

The claim would be undermined if PoseFM fails to match or beat the absolute trajectory error of leading deterministic monocular VO methods on the majority of sequences in TartanAir, KITTI and TUM-RGBD, or if its sampled pose variance shows no correlation with actual per-frame errors on held-out challenging footage.
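
The variance-versus-error condition can be checked with a rank correlation. The following is a sketch of Spearman's coefficient in plain Python, assuming two equal-length lists: one of per-frame sampled variances and one of per-frame errors. Both inputs, and the choice of Spearman over another statistic, are assumptions made here rather than anything the paper specifies.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman_correlation(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks.

    Raises ZeroDivisionError if either input is constant.
    """
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

A coefficient near zero on challenging sequences would suggest the sampled variance carries little information about actual error.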

Figures

Figures reproduced from arXiv: 2604.22350 by Dominik Kuczkowski, Laura Ruotsalainen.

Figure 1: Overview of the PoseFM framework. (a) PoseFM Pipeline: The pipeline consists of an optical flow estimator f_ϕ and parametrized vector field network. The output of the pipeline is a point estimate of the vector field û_τ. (b) Inference Procedure: Given an image pair (I_t, I_{t+1}), we sample a pose X_0 from the initial distribution and numerically integrate the learned vector field using an ODE solver to recover…

Original abstract

Monocular visual odometry (VO) is a fundamental computer vision problem with applications in autonomous navigation, augmented reality and more. While deep learning-based methods have recently shown superior accuracy compared to traditional geometric pipelines, particularly in environments where handcrafted features struggle due to poor structure or lighting conditions, most rely on deterministic regression, which lacks the uncertainty awareness required for robust applications. We propose PoseFM, the first framework to reformulate monocular frame-to-frame VO as a generative task using Flow Matching (FM). By leveraging FM, we model camera motion as a distribution rather than a point estimate, learning to transform noise into realistic pose predictions via continuous-time ODEs. This approach provides a principled mechanism for uncertainty estimation and enables robust motion inference under challenging visual conditions. In our evaluations, PoseFM achieves strong performance on TartanAir, KITTI and TUM-RGBD benchmarks, achieving the lowest absolute trajectory error (ATE) on some of the trajectories and overall being competitive with the best frame-to-frame monocular VO methods. Code and model checkpoints will be made available at https://github.com/helsinki-sda-group/posefm.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes PoseFM, the first framework to reformulate monocular frame-to-frame visual odometry as a generative task via Flow Matching. Camera motion is modeled as a distribution rather than a point estimate, with noise transformed into pose predictions through continuous-time ODEs. This is claimed to enable principled uncertainty estimation and robust inference under challenging conditions. On TartanAir, KITTI, and TUM-RGBD, PoseFM reports competitive ATE overall and the lowest ATE on some trajectories compared to other frame-to-frame monocular VO methods.

Significance. If the central claim holds—that the flow-matching generative formulation yields meaningfully superior uncertainty calibration and robustness over deterministic regression—this would represent a useful advance for deep learning-based VO in safety-critical settings. The planned release of code and checkpoints is a strength that would aid reproducibility.

major comments (2)
  1. [Abstract and Experiments] The central claim that Flow Matching provides a 'principled mechanism for uncertainty estimation' and enables 'robust motion inference under challenging visual conditions' (abstract) is not supported by any quantitative evidence. The reported results consist solely of ATE on three benchmarks; no uncertainty calibration metrics (e.g., expected calibration error), negative log-likelihood, or controlled ablations (e.g., mean pose vs. sampled poses, performance stratified by low-texture/low-light sequences) are described. Without these, it remains unclear whether any observed gains derive from the generative formulation rather than architecture or training details.
  2. [Experiments] The manuscript positions PoseFM as achieving 'strong performance' and 'lowest ATE on some of the trajectories' while remaining 'competitive with the best frame-to-frame monocular VO methods,' yet supplies no error bars, statistical significance tests, or per-trajectory breakdowns that would allow assessment of whether the generative approach improves reliability over deterministic baselines under the conditions where uncertainty matters most.
minor comments (1)
  1. [Abstract] The abstract states that 'code and model checkpoints will be made available' but does not specify the license or exact repository path beyond the GitHub link; this should be clarified for reproducibility.
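
As an illustration of the negative log-likelihood metric the first major comment asks for, the sketch below scores a ground-truth value under a Gaussian fitted to model samples. The Gaussian fit, the scalar pose component, and the variance floor are all assumptions made here for brevity; a sampled distribution need not be Gaussian, and the paper does not describe how such a metric would be computed.

```python
import math
import statistics


def gaussian_nll(ground_truth, mean, variance):
    """Negative log-likelihood of a scalar under N(mean, variance)."""
    return 0.5 * (math.log(2.0 * math.pi * variance)
                  + (ground_truth - mean) ** 2 / variance)


def nll_from_samples(samples, ground_truth, min_variance=1e-12):
    """Fit a Gaussian to the samples, then score the ground truth.

    `min_variance` is a hypothetical floor that avoids log(0) when
    all samples coincide.
    """
    mean = statistics.fmean(samples)
    variance = max(statistics.variance(samples), min_variance)
    return gaussian_nll(ground_truth, mean, variance)
```

Lower values mean the ground truth is more plausible under the sampled distribution; averaging over frames gives a single score that penalizes both inaccurate and overconfident predictions.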

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We appreciate the emphasis on providing stronger quantitative support for the uncertainty estimation claims and more rigorous statistical analysis of the results. We address each major comment below and will incorporate the suggested improvements in the revised manuscript.

  1. Referee: [Abstract and Experiments] The central claim that Flow Matching provides a 'principled mechanism for uncertainty estimation' and enables 'robust motion inference under challenging visual conditions' (abstract) is not supported by any quantitative evidence. The reported results consist solely of ATE on three benchmarks; no uncertainty calibration metrics (e.g., expected calibration error), negative log-likelihood, or controlled ablations (e.g., mean pose vs. sampled poses, performance stratified by low-texture/low-light sequences) are described. Without these, it remains unclear whether any observed gains derive from the generative formulation rather than architecture or training details.

    Authors: We agree that the original submission would benefit from additional quantitative evidence to substantiate the uncertainty-related claims. The flow-matching formulation models camera motion as a conditional distribution, which enables uncertainty quantification in principle through sampling or analysis of the learned probability path. However, we did not report calibration metrics or targeted ablations in the initial version. In the revision, we will add expected calibration error (ECE) for the predicted pose distributions, negative log-likelihood of ground-truth poses, an ablation comparing mean-pose estimates against multiple samples drawn from the model, and performance breakdowns stratified by challenging conditions (low-texture and low-light sequences). These additions will help demonstrate that observed improvements stem from the generative approach. revision: yes

  2. Referee: [Experiments] The manuscript positions PoseFM as achieving 'strong performance' and 'lowest ATE on some of the trajectories' while remaining 'competitive with the best frame-to-frame monocular VO methods,' yet supplies no error bars, statistical significance tests, or per-trajectory breakdowns that would allow assessment of whether the generative approach improves reliability over deterministic baselines under the conditions where uncertainty matters most.

    Authors: We acknowledge that the experimental presentation lacks error bars, statistical tests, and detailed per-trajectory analysis. In the revised manuscript, we will report standard deviations across multiple runs (different random seeds for training and sampling), include full per-trajectory ATE tables with direct comparisons to deterministic baselines, and add statistical significance testing (e.g., paired Wilcoxon tests) on the relevant metrics. This will allow readers to better evaluate reliability gains, particularly on sequences where uncertainty estimation is most relevant. revision: yes
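
For readers unfamiliar with the paired Wilcoxon test mentioned in the second response, here is a sketch of its signed-rank statistic in plain Python. It assumes two equal-length lists of per-trajectory ATE values, one per method, which is a hypothetical input format. It returns only the rank sums and does not compute a p-value; a statistics package such as SciPy (`scipy.stats.wilcoxon`) would normally be used for that.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def wilcoxon_signed_rank(errors_a, errors_b):
    """Return (W_plus, W_minus) for paired samples.

    Zero differences are dropped; the remaining absolute differences
    are ranked, and ranks are summed separately by sign.
    """
    diffs = [a - b for a, b in zip(errors_a, errors_b) if a != b]
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    return w_plus, w_minus
```

A large imbalance between the two sums indicates that one method's errors are consistently higher across trajectories; the smaller sum is the usual test statistic.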

Circularity Check

0 steps flagged

No circularity detected; direct application of established flow matching

The paper reformulates monocular frame-to-frame VO as a generative task by applying Flow Matching to model camera motion as a distribution transformed via continuous-time ODEs from noise. No equations, derivations, or self-citations in the abstract or described method reduce the claimed performance or uncertainty mechanism to a fitted parameter, self-definition, or prior result by the same authors. The central premise relies on an external generative modeling technique evaluated on independent benchmarks (TartanAir, KITTI, TUM-RGBD), with no load-bearing steps that collapse by construction to the inputs. This is a standard, non-circular application of an existing framework.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only abstract available; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.0 · 5496 in / 1064 out tokens · 39465 ms · 2026-05-08T12:34:46.969194+00:00 · methodology
