pith. sign in

arxiv: 2606.31077 · v2 · pith:V53T534Bnew · submitted 2026-06-30 · 💻 cs.CV

AnyMatch: Supercharging Universal Multi-Modal Image Matching with Large-Scale Single-View Images

Pith reviewed 2026-07-02 20:10 UTC · model grok-4.3

classification 💻 cs.CV
keywords multi-modal image matchingsynthetic dataset generation3D geometric consistencycross-modal translationvisual localizationmonocular depth estimationdiffusion inpaintingimage registration
0
0 comments X

The pith

AnyMatch turns single-view images into large-scale multi-modal training pairs with explicit 3D geometric consistency for image matching networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that scarce real multi-modal image pairs with accurate geometry limit progress in visual localization and sensor fusion. It proposes AnyMatch, a synthesis method that starts from abundant single-view photos and applies monocular depth estimation followed by 3D reprojection, diffusion inpainting, and cross-modal translation to produce multi-view, multi-modal pairs. Because the geometry is enforced by explicit reprojection rather than SfM-MVS reconstruction, the resulting annotations avoid accumulated errors. The authors build the Any-syn dataset with this pipeline and show that fine-tuning standard matchers such as LoFTR, EDM, and RoMa on Any-syn yields higher accuracy and better generalization on existing multi-modal benchmarks. The central bet is that the synthetic data distribution is close enough to real captures that the observed gains reflect improved robustness rather than overfitting to synthesis artifacts.

Core claim

AnyMatch generates multi-modal image pairs by feeding single-view images through monocular depth estimation, 3D reprojection to create consistent multi-view geometry, diffusion-based inpainting, and cross-modal translation; the resulting Any-syn dataset supplies training pairs whose 3D correspondences are guaranteed by the reprojection step, enabling matching networks to achieve substantial performance gains on multi-modal benchmarks when fine-tuned on this data.

What carries the argument

AnyMatch synthesis pipeline that enforces 3D geometric consistency through explicit monocular-depth-driven reprojection before cross-modal translation.

If this is right

  • Matching networks fine-tuned on Any-syn exhibit higher accuracy and robustness on multi-modal benchmarks than networks trained on prior real or synthetic collections.
  • The explicit 3D reprojection step removes the need for error-prone SfM-MVS pipelines in data preparation.
  • Scene diversity and annotation difficulty can be controlled by varying the input single-view images and chosen camera parameters.
  • Large-scale multi-modal training data becomes obtainable at low cost from existing single-view photo collections.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reprojection-based consistency mechanism could be applied to generate training pairs for other geometry-sensitive tasks such as stereo depth or visual odometry.
  • If the synthesis artifacts prove negligible, the approach reduces dependence on expensive synchronized multi-sensor captures for future dataset creation.
  • Controllable camera parameters in the pipeline may allow targeted generation of hard matching cases to stress-test specific failure modes of current networks.

Load-bearing premise

The images produced by depth estimation, reprojection, inpainting, and translation are close enough in geometry and appearance to real multi-modal captures that fine-tuning on them produces genuine generalization rather than overfitting to synthesis artifacts.

What would settle it

Fine-tuned matching networks show no accuracy improvement, or show degraded performance, when evaluated on held-out real multi-modal image pairs captured by actual sensors.

Figures

Figures reproduced from arXiv: 2606.31077 by Fan Fan, Jiayi Ma, Linfeng Tang, Meng Yang, Zizhuo Li.

Figure 1
Figure 1. Figure 1: Overview of AnyMatch pipeline. AnyMatch aims to generate a large-scale multi￾modal and multi-view image matching dataset for training matching models, thereby enabling robust cross-modal and cross-view matching capabilities. under varying conditions [45], cross-spectral object detection [24], and multi￾sensor fusion for autonomous systems [1]. Traditional feature matching meth￾ods, from handcrafted descrip… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of view transformation module. The view transformation module is designed to generate an arbitrary number of multi-view image pairs that strictly adhere to 3D geometric consistency. GLDv2 and SA-1B), predominantly in the visible modality, to establish the single-view database T_{\text {sin}} . All images are uniformly resized to 512 × 512 via isotropic scaling and center cropping. As shown in [PI… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of sample-level geometric consistency verification module. The image pair is retained if PCK@τ ≥ η and removed if PCK@τ < η. where x i inp is the predicted correspondence in Iinp (from the matching model), and xˆ i inp is the geometrically projected GT correspondence. The proportion PCK@τ of correct correspondences at threshold τ is defined as: \label {equal7} \text {PCK}@\tau = \frac {\left | \{(… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative results on real multi-modal images. Comparison of LoFTR, LoFTRMINIMA, LoFTRAnyMatch, EDM, EDMMINIMA, EDMAnyMatch, RoMa, RoMaMINIMA, and RoMaAnyMatch in multiple image pairs, including RGB-IR, RGB-Normal, RGB-Depth, RGB-Event, Retina, and Optical-Map (from left to right). Matches generated by each method are drawn, where the red lines indicate epipolar error (pose) or projection error (homograph… view at source ↗
read the original abstract

Multi-modal image matching is essential for visual localization and multi-sensor fusion, but it is hindered by the scarcity of large-scale training data with precise geometric annotations. Existing real-world datasets suffer from prohibitive costs, limited scene diversity, and errors in SfM-MVS pipelines, while synthetic methods struggle to maintain 3D geometric consistency or achieve photorealistic appearance. To address this, we propose AnyMatch, a novel framework that leverages abundant, easily accessible single-view images at minimal cost to generate rich multi-modal training data. AnyMatch integrates monocular depth estimation, 3D reprojection, diffusion-based inpainting, and crossmodal image translation to synthesize multi-view, multi-modal image pairs with 3D geometric fidelity. Crucially, our method provides annotations that strictly adhere to 3D geometric consistency through explicit 3D reprojection, avoiding SfM-MVS error accumulation. Furthermore, AnyMatch offers strong scalability, enabling controllable scene diversity and annotation difficulty via adjustable input and camera parameters. We construct Any-syn, a large-scale synthetic multi-modal dataset using AnyMatch. Experimental results show that matching networks (e.g., LoFTR, EDM, RoMa) fine-tuned on Any-syn achieve substantial performance gains on multi-modal benchmarks, exhibiting superior generalization and robustness compared to models trained on existing data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces AnyMatch, a pipeline that generates large-scale synthetic multi-modal image pairs and correspondences from single-view images by combining monocular depth estimation, 3D reprojection, diffusion-based inpainting, and cross-modal translation. It constructs the Any-syn dataset and reports that fine-tuning standard matching networks (LoFTR, EDM, RoMa) on Any-syn yields substantial gains in performance, generalization, and robustness on multi-modal benchmarks relative to models trained on existing real or synthetic data. The central selling point is that the generated annotations maintain strict 3D geometric consistency via explicit reprojection while offering controllable scale and diversity at low cost.

Significance. If the synthesized data truly delivers geometric fidelity and appearance distributions that support genuine generalization rather than artifact overfitting, the work would be significant: it offers a scalable route to training data that bypasses the cost and error accumulation of real SfM-MVS pipelines. The controllability of scene diversity and annotation difficulty via input parameters is a concrete practical strength.

major comments (3)
  1. [Abstract, §3] Abstract and §3 (Method description): The repeated claim that annotations 'strictly adhere to 3D geometric consistency through explicit 3D reprojection, avoiding SfM-MVS error accumulation' is load-bearing for the central argument that Any-syn is superior to real data. However, the pipeline begins with monocular depth estimation; any depth error (known to reach 10-30% under domain shift) propagates directly into the reprojected correspondences. The manuscript provides no quantitative bound on reprojection error induced by the depth estimator, nor a comparison against real SfM baselines on the same scenes.
  2. [§4] §4 (Experiments): The abstract and results section assert 'substantial performance gains' and 'superior generalization' for networks fine-tuned on Any-syn, yet the provided description contains no numerical values for the key metrics (e.g., AUC@5°, mAA, or matching score), no statistical significance tests, and no ablation isolating the contribution of geometric consistency versus appearance translation. Without these, it is impossible to verify that gains arise from the claimed 3D fidelity rather than synthesis artifacts.
  3. [§3.2] §3.2 (Data generation pipeline): The weakest assumption—that monocular depth plus reprojection plus inpainting produces pairs whose geometric and photometric statistics match real multi-modal captures—is not stress-tested. A controlled experiment comparing Any-syn correspondences against ground-truth SfM on a held-out real multi-modal sequence would directly address whether depth-estimator artifacts are being learned.
minor comments (2)
  1. [§3] Notation for camera parameters and reprojection equations in §3 should be made fully explicit (e.g., define the exact form of the 3D-to-2D projection used after depth lifting).
  2. [§4] Figure captions and axis labels in the experimental figures should include the exact benchmark names and metric definitions to allow direct comparison with prior work.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing honest responses and indicating planned revisions to the manuscript where the concerns are valid.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (Method description): The repeated claim that annotations 'strictly adhere to 3D geometric consistency through explicit 3D reprojection, avoiding SfM-MVS error accumulation' is load-bearing for the central argument that Any-syn is superior to real data. However, the pipeline begins with monocular depth estimation; any depth error (known to reach 10-30% under domain shift) propagates directly into the reprojected correspondences. The manuscript provides no quantitative bound on reprojection error induced by the depth estimator, nor a comparison against real SfM baselines on the same scenes.

    Authors: We agree that errors from monocular depth estimation propagate into the reprojected correspondences. The core claim is that, by construction, all correspondences remain strictly consistent with the single estimated depth map and the synthetically chosen camera parameters, thereby avoiding the additional sources of error inherent to SfM-MVS (feature matching failures, drift, and scale inconsistencies across views). We did not provide a quantitative bound or direct SfM comparison in the original manuscript. In the revision we will add an analysis of depth-induced reprojection error on standard depth benchmarks together with a discussion of how this compares to typical SfM-MVS error accumulation. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract and results section assert 'substantial performance gains' and 'superior generalization' for networks fine-tuned on Any-syn, yet the provided description contains no numerical values for the key metrics (e.g., AUC@5°, mAA, or matching score), no statistical significance tests, and no ablation isolating the contribution of geometric consistency versus appearance translation. Without these, it is impossible to verify that gains arise from the claimed 3D fidelity rather than synthesis artifacts.

    Authors: The experimental section contains tables reporting AUC@5°, mAA, and matching scores for LoFTR, EDM, and RoMa on the multi-modal benchmarks. To improve clarity we will quote the key numerical improvements directly in the text of §4. We also acknowledge the absence of statistical tests and a dedicated ablation. In the revision we will add paired statistical significance tests on the reported gains and an ablation that isolates the 3D reprojection component (e.g., by replacing it with 2D homography-based warping while keeping appearance translation fixed). revision: yes

  3. Referee: [§3.2] §3.2 (Data generation pipeline): The weakest assumption—that monocular depth plus reprojection plus inpainting produces pairs whose geometric and photometric statistics match real multi-modal captures—is not stress-tested. A controlled experiment comparing Any-syn correspondences against ground-truth SfM on a held-out real multi-modal sequence would directly address whether depth-estimator artifacts are being learned.

    Authors: We agree that a direct comparison against real SfM ground truth on held-out multi-modal sequences would be the most rigorous validation. Such sequences are, however, precisely the scarce and costly resource our method seeks to circumvent. Our current evidence rests on downstream benchmark gains and qualitative geometric checks. In the revision we will expand the discussion of this limitation and outline why the suggested controlled experiment is difficult to realize at scale without reintroducing the original data-acquisition bottleneck. revision: partial

Circularity Check

0 steps flagged

No circularity; independent data synthesis and external evaluation

full rationale

The paper describes an independent pipeline (monocular depth estimation + 3D reprojection + diffusion inpainting + cross-modal translation) to synthesize Any-syn data from single-view images, then reports empirical gains when fine-tuning existing matchers (LoFTR, EDM, RoMa) on that data and testing on separate multi-modal benchmarks. No equations, fitted parameters, or self-citations are presented that reduce the claimed generalization gains to a definitional identity or to the synthesis inputs themselves. The central result remains an external empirical claim rather than a self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claim depends on domain assumptions about the accuracy of off-the-shelf monocular depth estimators and the geometric fidelity of diffusion-based synthesis steps; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption Monocular depth estimation yields depth maps accurate enough for 3D reprojection across diverse scenes without systematic bias.
    Invoked to guarantee 3D geometric consistency in the generated pairs.
  • domain assumption Diffusion inpainting and cross-modal translation preserve 3D structure and produce photorealistic outputs without introducing matching-irrelevant artifacts.
    Required for the synthesized data to improve rather than degrade downstream matching performance.

pith-pipeline@v0.9.1-grok · 5776 in / 1338 out tokens · 26828 ms · 2026-07-02T20:10:17.988492+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

45 extracted references · 4 canonical work pages · 2 internal anchors

  1. [1]

    In: Proceedings of the IEEE International Conference on Robotics and Automation

    Aditya, N., Dhruval, P., et al.: Thermal voyager: A comparative study of rgb and thermal cameras for night-time autonomous navigation. In: Proceedings of the IEEE International Conference on Robotics and Automation. pp. 14116–14122 (2024)

  2. [2]

    In: Proceedings of the European Conference on Computer Vision

    Bay, H., Tuytelaars, T., Van Gool, L.: SURF: Speeded up robust features. In: Proceedings of the European Conference on Computer Vision. pp. 404–417 (2006)

  3. [3]

    In: Proceedings of the Asian Conference on Computer Vision

    Cadar, F., Potje, G., Martins, R., Demonceaux, C., Nascimento, E.R.: Leveraging semantic cues from foundation vision models for enhanced local feature correspon- dence. In: Proceedings of the Asian Conference on Computer Vision. pp. 1268–1283 (2024)

  4. [4]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    Dai, A., Chang, A.X., Savva, M., Halber, M., Funkhouser, T., Nießner, M.: Scan- Net: Richly-annotated 3d reconstructions of indoor scenes. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5828–5839 (2017)

  5. [5]

    IEEE Transactions on Pattern Analysis and Machine Intelligence46(8), 5725–5742 (2024)

    Deng, X., Liu, E., Gao, C., Li, S., Gu, S., Xu, M.: CrossHomo: Cross-modality and cross-resolution homography estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence46(8), 5725–5742 (2024)

  6. [6]

    IEEE Transactions on Image Processing32, 591–602 (2022)

    Deng, Y., Ma, J.: Redfeat: Recoupling detection and description for multimodal feature learning. IEEE Transactions on Image Processing32, 591–602 (2022)

  7. [7]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    Edstedt, J., Sun, Q., Bökman, G., Wadenbäck, M., Felsberg, M.: RoMa: Robust dense feature matching. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 19790–19800 (2024)

  8. [8]

    Communi- cations of the ACM24(6), 381–395 (1981)

    Fischler, M.A., Bolles, R.C.: Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communi- cations of the ACM24(6), 381–395 (1981)

  9. [9]

    In: Pro- ceedings of the IEEE Winter Conference on Applications of Computer Vision

    Frolov, A., Rodehorst, V.: Patch your matcher: Correspondence-aware image-to- image translation unlocks cross-modal matching via single-modality priors. In: Pro- ceedings of the IEEE Winter Conference on Applications of Computer Vision. pp. 7913–7924 (2026)

  10. [10]

    In: Proceedings of the IEEE International Conference on Computer Vision

    Gao, C., Li, W., Weng, D.: HOMO-feature: Cross-arbitrary-modal image matching with homomorphism of organized major orientation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 10538–10548 (2025)

  11. [11]

    In: Proceedings of the IEEE Conference on Computer Vision and Pat- tern Recognition

    Greff, K., Belletti, F., Beyer, L., Doersch, C., Du, Y., Duckworth, D., Fleet, D.J., Gnanapragasam, D., Golemo, F., Herrmann, C., et al.: Kubric: A scalable dataset generator. In: Proceedings of the IEEE Conference on Computer Vision and Pat- tern Recognition. pp. 3749–3761 (2022)

  12. [12]

    In: Proceedings of the IEEE International Conference on Computer Vision

    Han, Z., Mao, C., Jiang, Z., Pan, Y., Zhang, J.: Stylebooth: Image style editing with multimodal instruction. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 1947–1957 (2025)

  13. [13]

    arXiv preprint arXiv:2501.07556 (2025)

    He, X., Yu, H., Peng, S., Tan, D., Shen, Z., Bao, H., Zhou, X.: MatchAnything: Uni- versal cross-modality image matching with large-scale pre-training. arXiv preprint arXiv:2501.07556 (2025)

  14. [14]

    Information Fusion102, 102027 (2024)

    Hou,Z.,Liu,Y.,Zhang,L.:POS-GIFT:Ageometricandintensity-invariantfeature transformation for multimodal images. Information Fusion102, 102027 (2024)

  15. [15]

    In: Proceedings of the International Conference on Learning Representations (2022) AnyMatch 17

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-Rank adaptation of large language models. In: Proceedings of the International Conference on Learning Representations (2022) AnyMatch 17

  16. [16]

    Information Fusion73, 22–71 (2021)

    Jiang, X., Ma, J., Xiao, G., Shao, Z., Guo, X.: A review of multimodal image matching: Methods and applications. Information Fusion73, 22–71 (2021)

  17. [17]

    In: Proceedings of the IEEE International Conference on Computer Vision

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W.Y., et al.: Segment anything. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4015–4026 (2023)

  18. [18]

    IEEE Transactions on Image Processing 29, 3296–3310 (2019)

    Li, J., Hu, Q., Ai, M.: RIFT: Multi-modal image matching based on radiation- variation insensitive feature transform. IEEE Transactions on Image Processing 29, 3296–3310 (2019)

  19. [19]

    ISPRS Journal of Photogrammetry and Remote Sensing204, 77–88 (2023)

    Li, J., Hu, Q., Zhang, Y.: Multimodal image matching: A scale-invariant algorithm and an open dataset. ISPRS Journal of Photogrammetry and Remote Sensing204, 77–88 (2023)

  20. [20]

    In: Proceedings of the IEEE International Conference on Computer Vision

    Li, X., Rao, T., Pan, C.: EDM: Efficient deep feature matching. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 26198–26208 (2025)

  21. [21]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    Li, Z., Snavely, N.: Megadepth: Learning single-view depth prediction from internet photos. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2041–2050 (2018)

  22. [22]

    In: Proceedings of the European Conference on Computer Vision

    Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Proceedings of the European Conference on Computer Vision. pp. 740–755 (2014)

  23. [23]

    A Survey on Hallucination in Large Vision-Language Models

    Liu, H., Xue, W., Chen, Y., Chen, D., Zhao, X., Wang, K., Hou, L., Li, R., Peng, W.: A survey on hallucination in large vision-language models. arXiv preprint arXiv:2402.00253 (2024)

  24. [24]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    Liu, J., Fan, X., Huang, Z., Wu, G., Liu, R., Zhong, W., Luo, Z.: Target-aware dual adversarial learning and a multi-scenario multi-modality benchmark to fuse infrared and visible for object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5802–5811 (2022)

  25. [25]

    Interna- tional Journal of Computer Vision60(2), 91–110 (2004)

    Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Interna- tional Journal of Computer Vision60(2), 91–110 (2004)

  26. [26]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    Mayer, N., Ilg, E., Hausser, P., Fischer, P., Cremers, D., Dosovitskiy, A., Brox, T.: A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4040–4048 (2016)

  27. [27]

    DINOv2: Learning Robust Visual Features without Supervision

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez,P.,Haziza,D.,Massa,F.,El-Nouby,A.,etal.:DINOv2:Learningrobust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023)

  28. [28]

    arXiv preprint arXiv:2503.19012 (2025)

    Ran,L.,Wang,L.,Wang,G.,Wang,P.,Zhang,Y.:DiffV2IR:visible-to-infrareddif- fusion model via vision-language understanding. arXiv preprint arXiv:2503.19012 (2025)

  29. [29]

    In: Proceedings of the Computer Vision and Pattern Recognition Conference

    Ren,J.,Jiang,X.,Li,Z.,Liang,D.,Zhou,X.,Bai,X.:MINIMA:Modalityinvariant image matching. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 23059–23068 (2025)

  30. [30]

    In: Proceedings of the IEEE Confer- ence on Computer Vision and Pattern Recognition

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE Confer- ence on Computer Vision and Pattern Recognition. pp. 10684–10695 (2022)

  31. [31]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    Schönberger, J.L., Frahm, J.M.: Structure-from-motion revisited. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4104– 4113 (2016)

  32. [32]

    In: Proceedings of the European Conference on Computer Vision

    Schönberger, J.L., Zheng, E., Frahm, J.M., Pollefeys, M.: Pixelwise view selection for unstructured multi-view stereo. In: Proceedings of the European Conference on Computer Vision. pp. 501–518 (2016) 18 M. Yang et al

  33. [33]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    Sun, J., Shen, Z., Wang, Y., Bao, H., Zhou, X.: LoFTR: Detector-free local feature matching with transformers. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8922–8931 (2021)

  34. [34]

    In: Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition

    Tuzcuoğlu, Ö., Köksal, A., Sofu, B., Kalkan, S., Alatan, A.A.: Xoftr: Cross-modal feature matching transformer. In: Proceedings of the IEEE Conference on Com- puter Vision and Pattern Recognition. pp. 4275–4286 (2024)

  35. [35]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    Wang,R.,Xu,S.,Dai,C.,Xiang,J.,Deng,Y.,Tong,X.,Yang,J.:MoGe:Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 5261–5271 (2025)

  36. [36]

    In: Proceedings of the Advances in Neural Information Processing Systems

    Wang, R., Xu, S., Dong, Y., Deng, Y., Xiang, J., Lv, Z., Sun, G., Tong, X., Yang, J.: MoGe-2: Accurate monocular geometry with metric scale and sharp details. In: Proceedings of the Advances in Neural Information Processing Systems. pp. 35928–35959 (2026)

  37. [37]

    IEEE Transactions on Cybernetics54(3), 1997–2010 (2023)

    Wang, X., Li, J., Zhu, L., Zhang, Z., Chen, Z., Li, X., Wang, Y., Tian, Y., Wu, F.: VisEvent: Reliable object tracking via collaboration of frame and event flows. IEEE Transactions on Cybernetics54(3), 1997–2010 (2023)

  38. [38]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition

    Weyand, T., Araujo, A., Cao, B., Sim, J.: Google landmarks dataset v2-a large- scale benchmark for instance-level recognition and retrieval. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2575–2584 (2020)

  39. [39]

    IEEE Transactions on Image Processing34, 2097–2111 (2025)

    Wu, J., Xu, R., Wood-Doughty, Z., Wang, C., Xu, S., Lam, E.Y.: Segment anything model is a good teacher for local feature learning. IEEE Transactions on Image Processing34, 2097–2111 (2025)

  40. [40]

    IEEE Transactions on Pattern Analysis and Machine Intelligence 45(10), 12148–12166 (2023)

    Xu, H., Yuan, J., Ma, J.: Murf: Mutually reinforcing multi-modal image registra- tion and fusion. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(10), 12148–12166 (2023)

  41. [41]

    In: Proceedings of the Advances in Neural Information Processing Systems

    Yang, L., Kang, B., Huang, Z., Zhao, Z., Xu, X., Feng, J., Zhao, H.: Depth anything v2. In: Proceedings of the Advances in Neural Information Processing Systems. pp. 21875–21911 (2024)

  42. [42]

    In: Proceedings of the International Joint Conference on Artificial Intelligence

    Yang, M., Fan, F., Huang, J., Ma, Y., Mei, X., Cai, Z., Ma, J.: Multimodal image matching based on cross-modality completion pre-training. In: Proceedings of the International Joint Conference on Artificial Intelligence. pp. 2206–2214 (2025)

  43. [43]

    Information Fusion76, 323–336 (2021)

    Zhang, H., Xu, H., Tian, X., Jiang, J., Ma, J.: Image fusion meets deep learning: A survey and perspective. Information Fusion76, 323–336 (2021)

  44. [44]

    In: Proceedings of the International Conference on Machine Learning (2024)

    Zhang, K., Ma, J.: Sparse-to-dense multimodal image registration via multi-task learning. In: Proceedings of the International Conference on Machine Learning (2024)

  45. [45]

    In: Pro- ceedings of the AAAI Conference on Artificial Intelligence

    Zhou, K., Chen, C., Wang, B., Saputra, M.R.U., Trigoni, N., Markham, A.: VM- Loc: Variational fusion for learning-based multimodal camera localization. In: Pro- ceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 6165–6173 (2021) AnyMatch 19 Supplementary Material A Details of View Transformation The view transformation module in AnyM...