
arxiv: 2607.00579 · v1 · pith:OLYHCAJUnew · submitted 2026-07-01 · 💻 cs.CV

EPO: Boosting 3D Foundation Models with Edge-based Pose Optimization

Pith reviewed 2026-07-02 14:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords edge-based pose optimization · structure from motion · 3D foundation models · geometric refinement · differentiable optimization · bundle adjustment alternative

The pith

Edge map alignment refines camera poses in 3D reconstructions without building feature tracks or point correspondences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EPO as a way to improve the geometric accuracy of 3D foundation models that produce fast but imprecise Structure-from-Motion results. It replaces the usual requirement for explicit feature tracks with a differentiable process that aligns edge maps extracted from the input images. This proxy signal drives pose and geometry optimization directly. If the approach holds, it would let users refine foundation-model outputs at much lower cost in time and memory than traditional bundle-adjustment pipelines, including on ordinary hardware.

Core claim

EPO is a fully differentiable, trackless framework that treats edge-map alignment as a proxy for geometric optimization, recovering camera poses and 3D structure from 3D foundation model outputs while avoiding feature extraction and track construction entirely; evaluations show it matches or exceeds the accuracy of Bundle Adjustment-style methods at substantially lower runtime and memory footprint.

What carries the argument

Edge map alignment, which supplies the optimization objective by measuring consistency of edge maps across views to adjust poses and points without explicit multi-view correspondences.
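
As an illustration of what an edge-alignment objective can look like, the sketch below scores a candidate pose by projecting 3D edge points into a target view and averaging that view's distance transform at the projected pixels. This is a minimal sketch under stated assumptions, not the paper's formulation: the brute-force distance transform, the pinhole camera model, the nearest-pixel lookup, and all function names (`distance_transform`, `project`, `edge_alignment_cost`) are hypothetical choices made for readability. A differentiable implementation would more likely sample the distance field with interpolation so that gradients can reach the pose parameters.

```python
import math

def distance_transform(edge_map):
    """Brute-force Euclidean distance from each pixel to the nearest edge pixel."""
    h, w = len(edge_map), len(edge_map[0])
    edge_px = [(y, x) for y in range(h) for x in range(w) if edge_map[y][x]]
    return [[min(math.hypot(y - ey, x - ex) for ey, ex in edge_px)
             for x in range(w)] for y in range(h)]

def project(point, R, t, f, cx, cy):
    """Apply pose (R, t) to a 3D point, then project with a pinhole camera."""
    X = [sum(R[i][k] * point[k] for k in range(3)) + t[i] for i in range(3)]
    return f * X[0] / X[2] + cx, f * X[1] / X[2] + cy

def edge_alignment_cost(points_3d, R, t, dt, f, cx, cy):
    """Mean distance-to-nearest-edge at the projected points (nearest-pixel lookup)."""
    h, w = len(dt), len(dt[0])
    total, count = 0.0, 0
    for p in points_3d:
        u, v = project(p, R, t, f, cx, cy)
        x, y = int(round(u)), int(round(v))
        if 0 <= x < w and 0 <= y < h:  # ignore points that leave the image
            total += dt[y][x]
            count += 1
    return total / count if count else float("inf")

# Toy target view: a 12x12 edge map with one vertical edge at column 6.
edge_map = [[x == 6 for x in range(12)] for y in range(12)]
dt = distance_transform(edge_map)

# Hypothetical 3D edge points at depth 2 that land on that edge under the identity pose.
points = [(0.0, y, 2.0) for y in (-1.0, -0.6, -0.2, 0.2, 0.6, 1.0)]
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

cost_aligned = edge_alignment_cost(points, identity, [0.0, 0.0, 0.0], dt, 10.0, 6.0, 6.0)
cost_shifted = edge_alignment_cost(points, identity, [0.4, 0.0, 0.0], dt, 10.0, 6.0, 6.0)
```

In this toy setup the cost is zero when the projected points fall on the target edge and grows to two pixels when the camera is displaced sideways, which is the kind of signal an optimizer could drive toward zero without any point-to-point correspondences.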

If this is right

  • Refinement becomes possible immediately after a foundation-model forward pass without any intermediate feature detection or matching.
  • Memory consumption drops enough for the method to execute on consumer-grade GPUs where competing refinement techniques run out of memory.
  • Runtime per scene decreases while accuracy on standard SfM benchmarks stays comparable to or better than track-based methods.
  • The same edge-alignment objective can be applied across multiple downstream 3D tasks that start from foundation-model reconstructions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be inserted into any pipeline that already produces rough camera estimates, lowering the barrier to high-quality 3D output on limited hardware.
  • Because the signal comes from edges rather than points, the method might tolerate scenes where repeatable point features are scarce but edge structure remains visible.
  • An end-to-end differentiable chain from image to refined 3D model becomes feasible if foundation-model components are also made differentiable with respect to the same edge objective.

Load-bearing premise

Edge maps extracted from the input images supply a sufficiently complete and unbiased geometric signal to recover accurate poses and structure without tracked point features.

What would settle it

A controlled test on a dataset with known ground-truth poses where EPO converges to solutions whose reprojection error or pose error exceeds that of a standard bundle-adjustment baseline by a clear margin.
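
Pose error in such a test is commonly reported as the angle of the relative rotation between estimate and ground truth, together with a distance between camera positions. The sketch below computes both for a single camera. It assumes the estimated and ground-truth poses have already been aligned to a common frame and scale, and the function names are illustrative rather than taken from the paper or from any benchmark toolkit.

```python
import math

def rotation_error_deg(R_est, R_gt):
    """Angle (degrees) of the relative rotation R_gt^T R_est."""
    # trace(R_gt^T R_est) equals the element-wise product summed over all entries.
    trace = sum(R_gt[i][j] * R_est[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))  # clamp against round-off
    return math.degrees(math.acos(cos_angle))

def translation_error(t_est, t_gt):
    """Euclidean distance between estimated and ground-truth camera positions."""
    return math.dist(t_est, t_gt)

def rot_z(deg):
    """Rotation matrix about the z-axis, used here only to build test poses."""
    a = math.radians(deg)
    return [[math.cos(a), -math.sin(a), 0.0],
            [math.sin(a),  math.cos(a), 0.0],
            [0.0, 0.0, 1.0]]

R_gt, R_est = rot_z(0.0), rot_z(5.0)
rot_err = rotation_error_deg(R_est, R_gt)                          # about 5 degrees
trans_err = translation_error([1.0, 2.0, 2.0], [1.0, 2.0, 0.0])    # 2 units
```

Averaging these per-camera errors over a scene for both EPO and a bundle-adjustment baseline would give the two numbers whose gap the test above asks about.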

Figures

Figures reproduced from arXiv: 2607.00579 by Christian Sormann, Friedrich Fraundorfer, Mattia D'Urso, Mattia Rossi.

Figure 1: Visualization of three stages of our pose optimization method and the ground truth sparse point cloud for the Graz Town Hall scene (TerraSky3D). Starting from the initial optimization stage (a), provided by the VGGT output, we illustrate an intermediate state (b) and the final refined poses (c). The ground truth poses are shown in green and the optimized ones in red.
Figure 2: Example of input RGB image, VGGT raw depth, and the corresponding DTF
Figure 4: Qualitative Novel View Synthesis Results.
Figure 5: Absolute and relative inference time breakdown for the optimization methods evaluated in Tab. 3. † indicates runs on an NVIDIA H200.

6 Discussion and Limitations
6.1 Boosting Different 3D Foundation Models
We conducted the same experiments described in Sec. 5.1 using various 3DFMs reconstructions as a starting point to demonstrate our method’s generalization across different models. Aside from the RGB imag…
Original abstract

We introduce Edge-based Pose Optimization (EPO), a trackless geometric optimization framework specifically designed to boost the Structure-from-Motion reconstructions generated by 3D Foundation Models. These models achieve rapid inference by bypassing the time-consuming feature extraction and matching stages of traditional pipelines, where explicit correspondences between each 3D point and multiple images, referred to as tracks, are established. However, their geometric accuracy currently falls short of traditional pipelines. While this can be addressed in a post-processing step via Bundle Adjustment-like refinement, doing so requires extracting feature tracks, thus defeating the original speed advantage. Instead, our fully differentiable framework uses edge map alignment as a proxy for geometric optimization, avoiding feature extraction and track construction entirely. Through extensive evaluation across multiple datasets and tasks, we demonstrate that EPO matches or outperforms Bundle Adjustment-like methods while requiring significantly lower runtime and memory. Notably, its reduced memory footprint makes EPO suitable for consumer-grade hardware, where competing refinement methods cannot run.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Edge-based Pose Optimization (EPO), a fully differentiable, trackless framework for refining camera poses and 3D geometry from 3D foundation model outputs. It replaces explicit feature tracks and Bundle Adjustment with edge-map alignment as a geometric proxy, claiming to achieve comparable or superior accuracy to BA-like methods while using substantially lower runtime and memory, thereby enabling refinement on consumer-grade hardware.

Significance. If the performance claims hold under rigorous verification, EPO would represent a meaningful advance in efficient post-processing for foundation-model SfM pipelines by eliminating the need for feature extraction and multi-view track construction. The reduced memory footprint is a practical strength for deployment scenarios where traditional refinement is infeasible.

major comments (2)
  1. [Abstract] The central performance claim ('matches or outperforms Bundle Adjustment-like methods') is presented without any quantitative metrics, tables, datasets, or ablation results. Because the soundness of the method rests on this empirical demonstration, the absence of supporting evidence in the manuscript is load-bearing.
  2. [Abstract] The claim that edge-map alignment supplies a sufficiently constraining and unbiased signal for pose recovery is not accompanied by analysis of known limitations (aperture problem, viewpoint-dependent edge fragmentation, low-texture regions). Without explicit handling or empirical mitigation of these issues, the substitution for explicit multi-view tracks remains unverified.
minor comments (1)
  1. [Abstract] The phrase 'Bundle Adjustment-like methods' is imprecise; the manuscript should name the specific baselines (e.g., COLMAP BA, certain differentiable renderers) used in the claimed comparisons.
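
The aperture problem named in major comment 2 can be seen in a toy example: with a single straight edge, sliding the edge points along the edge direction leaves a distance-transform cost unchanged, so that motion is unconstrained, whereas a second edge orientation removes the ambiguity. The sketch below is purely illustrative, uses hypothetical helper names, and makes no claim about how EPO behaves on real data.

```python
import math

def distance_transform(edge_map):
    """Brute-force Euclidean distance from each pixel to the nearest edge pixel."""
    h, w = len(edge_map), len(edge_map[0])
    edge_px = [(y, x) for y in range(h) for x in range(w) if edge_map[y][x]]
    return [[min(math.hypot(y - ey, x - ex) for ey, ex in edge_px)
             for x in range(w)] for y in range(h)]

def shifted_cost(points, dt, dx, dy):
    """Mean distance-to-edge after translating (x, y) points by (dx, dy)."""
    return sum(dt[y + dy][x + dx] for x, y in points) / len(points)

N = 16

# Scene A: a single vertical edge at x = 8 spanning the whole image.
line = [[x == 8 for x in range(N)] for y in range(N)]
line_pts = [(8, y) for y in range(4, 12)]
dt_line = distance_transform(line)
along = shifted_cost(line_pts, dt_line, 0, 3)    # slide along the edge
across = shifted_cost(line_pts, dt_line, 3, 0)   # slide across the edge

# Scene B: the same vertical edge plus a horizontal edge at y = 8.
cross = [[x == 8 or y == 8 for x in range(N)] for y in range(N)]
cross_pts = line_pts + [(x, 8) for x in range(4, 12)]
dt_cross = distance_transform(cross)
along_cross = shifted_cost(cross_pts, dt_cross, 0, 3)  # same slide, now penalized
```

With only the vertical edge, the slide along the edge costs nothing while the slide across it costs three pixels. Once the horizontal edge is present, the same slide yields a strictly positive cost, which is why the variety of edge orientations in a scene matters for how well such an objective constrains the pose.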

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment point-by-point below.

  1. Referee: [Abstract] The central performance claim ('matches or outperforms Bundle Adjustment-like methods') is presented without any quantitative metrics, tables, datasets, or ablation results. Because the soundness of the method rests on this empirical demonstration, the absence of supporting evidence in the manuscript is load-bearing.

    Authors: The abstract is a concise summary. The full manuscript provides the required empirical support through quantitative metrics, tables, datasets, and ablations in the Experiments section, demonstrating that EPO matches or outperforms BA-like methods with lower runtime and memory. No change to the abstract is needed as the evidence is already present in the body of the paper.

    revision: no

  2. Referee: [Abstract] The claim that edge-map alignment supplies a sufficiently constraining and unbiased signal for pose recovery is not accompanied by analysis of known limitations (aperture problem, viewpoint-dependent edge fragmentation, low-texture regions). Without explicit handling or empirical mitigation of these issues, the substitution for explicit multi-view tracks remains unverified.

    Authors: We agree that explicit discussion of these limitations strengthens the paper. In the revised manuscript we will add a dedicated analysis subsection addressing the aperture problem, edge fragmentation, and low-texture regions, including how EPO's differentiable multi-view alignment provides mitigation.

    revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain


The provided abstract and context describe EPO as a differentiable edge-map alignment proxy for pose optimization that avoids explicit tracks and feature extraction. No equations, fitted parameters renamed as predictions, self-citations used as load-bearing uniqueness theorems, or ansatzes smuggled via prior work are present in the given material. Performance claims rest on external dataset evaluations rather than any reduction of outputs to inputs by construction. The framework is therefore self-contained against the benchmarks it reports.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no concrete free parameters, axioms, or invented entities; the method itself is the contribution rather than any new postulated object.

