
arxiv: 2604.13941 · v1 · submitted 2026-04-15 · 💻 cs.CV

SceneGlue: Scene-Aware Transformer for Feature Matching without Scene-Level Annotation

Pith reviewed 2026-05-10 13:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords feature matching · scene awareness · transformer · cross-view visibility · local descriptors · image correspondence · pose estimation

The pith

SceneGlue improves cross-view feature matching by adding implicit and explicit scene awareness trained only on local matches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SceneGlue as a way to overcome the limits of purely local feature descriptors in matching images taken from different viewpoints. Traditional descriptors miss broader scene context that helps decide which points actually correspond, leading to errors in tasks like estimating camera poses or aligning images. SceneGlue adds two forms of scene awareness: a parallel attention step that lets descriptors exchange information across both images at once, and a separate visibility transformer that labels which parts of the scene are visible in each view. The entire system learns from ordinary local match supervision with no need for scene-level ground-truth labels. A sympathetic reader would care because this could make matching more accurate and robust in real-world settings such as robotics and augmented reality while avoiding costly extra annotations.

Core claim

SceneGlue uses a hybrid matching approach that combines implicit parallel attention across local descriptors with an explicit Visibility Transformer that classifies features into visible and invisible regions. Together these supply the global scene context that local descriptors alone cannot provide, and the model trains exclusively on local feature matches, without any scene-level ground-truth annotations.
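As a rough illustration of the explicit branch, the sketch below scores each local descriptor against two scene descriptors (one for the commonly visible area, one for the commonly invisible area) and applies a softmax, which is how the Figure 5 caption describes the final classification step. The descriptor values, the plain dot-product scoring, and the names `softmax`, `classify_visibility`, `scene_visible`, and `scene_invisible` are assumptions made for illustration. In the paper the scene descriptors are learned and related to the local descriptors by a Transformer, not by a single dot product.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def classify_visibility(descriptors, scene_visible, scene_invisible):
    """Return P(visible) for each local descriptor (hypothetical scoring rule)."""
    probs = []
    for d in descriptors:
        p_vis, _ = softmax([dot(d, scene_visible), dot(d, scene_invisible)])
        probs.append(p_vis)
    return probs

# Toy values; in a trained model these would be learned parameters.
scene_visible = [1.0, 0.0]
scene_invisible = [0.0, 1.0]
descriptors = [[2.0, 0.1], [0.2, 1.5], [1.0, 1.0]]

probs = classify_visibility(descriptors, scene_visible, scene_invisible)
labels = ["visible" if p >= 0.5 else "invisible" for p in probs]
print(labels)
```

The two-way softmax makes the output directly readable as a per-feature visibility probability, which is the property the interpretability claim relies on.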

What carries the argument

A hybridizable matching paradigm: parallel attention exchanges global context within and across images, while the Visibility Transformer explicitly estimates cross-view visibility and labels regions as visible or invisible.
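A minimal sketch of the parallel-attention idea, assuming the reading suggested by the body-text fragment in the figure list: the descriptors of each image are projected to queries, keys, and values, and self- and cross-attention are computed from the same inputs in one step rather than one after the other. The identity projections, the averaging used to fuse the two messages, and the names `attention` and `parallel_attention` are placeholders chosen for illustration; the paper's actual projections and fusion rule are not given in this summary.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def parallel_attention(desc_s, desc_t):
    """Update both descriptor sets from the same, unmodified inputs."""
    def update(own, other):
        self_msg = attention(own, own, own)       # context within the image
        cross_msg = attention(own, other, other)  # context from the other image
        # Placeholder fusion (assumption): residual plus the mean of both messages.
        return [[d + 0.5 * (s + c) for d, s, c in zip(dv, sv, cv)]
                for dv, sv, cv in zip(own, self_msg, cross_msg)]
    return update(desc_s, desc_t), update(desc_t, desc_s)

desc_s = [[1.0, 0.0], [0.0, 1.0]]
desc_t = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
new_s, new_t = parallel_attention(desc_s, desc_t)
print(len(new_s), len(new_t), len(new_s[0]))
```

Because both branches read the original descriptors, neither message depends on the other, which is what distinguishes this layout from alternating self- and cross-attention layers.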

If this is right

  • Homography estimation between image pairs becomes more accurate because visible-region cues reduce mismatches in overlapping areas.
  • Camera pose estimation improves when the model can distinguish invisible scene parts that would otherwise produce false correspondences.
  • Visual localization tasks gain robustness since global context helps match features even under large viewpoint changes.
  • Interpretability increases because the explicit visibility output shows which scene regions the matcher considered for each correspondence.
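One way to picture how visibility cues could reduce false correspondences is to restrict candidate matches to features predicted visible in both views before a mutual-nearest-neighbour check. The sketch below is an illustration under that assumption only: the thresholding rule, the function name `mutual_nearest_matches`, and the toy descriptors are hypothetical, and this summary does not say how SceneGlue actually combines visibility with matching.

```python
def mutual_nearest_matches(desc_a, desc_b, vis_a, vis_b, threshold=0.5):
    """Mutual nearest neighbours among features predicted visible in both views."""
    def sim(x, y):
        return sum(p * q for p, q in zip(x, y))

    idx_a = [i for i, v in enumerate(vis_a) if v >= threshold]
    idx_b = [j for j, v in enumerate(vis_b) if v >= threshold]
    if not idx_a or not idx_b:
        return []
    best_for_a = {i: max(idx_b, key=lambda j: sim(desc_a[i], desc_b[j])) for i in idx_a}
    best_for_b = {j: max(idx_a, key=lambda i: sim(desc_a[i], desc_b[j])) for j in idx_b}
    return [(i, j) for i, j in best_for_a.items() if best_for_b[j] == i]

# Feature 2 in image A is a distractor: it resembles feature 0 in image B
# but is given a low visibility probability in this toy example.
desc_a = [[0.8, 0.2], [0.0, 1.0], [1.0, 0.0]]
desc_b = [[1.0, 0.0], [0.0, 1.0]]

ungated = mutual_nearest_matches(desc_a, desc_b, [1.0, 1.0, 1.0], [1.0, 1.0])
gated = mutual_nearest_matches(desc_a, desc_b, [0.9, 0.8, 0.1], [0.9, 0.7])
print(ungated)  # the distractor wins the match to feature 0 of image B
print(gated)    # the distractor is excluded before matching
```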

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same visibility-labeling idea could be tested on other correspondence problems such as video tracking where scene parts enter and leave the frame.
  • If the method works without scene labels, it might allow larger training sets drawn from existing local-match datasets that lack expensive annotations.
  • Combining the approach with semantic segmentation could further refine which scene elements are treated as visible or occluded.

Load-bearing premise

Local feature matches by themselves supply enough signal to train accurate cross-view visibility estimates and global scene context without any scene-level ground-truth annotations.

What would settle it

Running the full SceneGlue pipeline on standard matching benchmarks and finding no accuracy gain over a baseline that uses only local descriptors would indicate that the added scene-awareness components are not delivering the claimed compensation.

Figures

Figures reproduced from arXiv: 2604.13941 by Guobao Xiao, Songlin Du, Takeshi Ikenaga, Xiaobo Lu, Xiaoyong Lu, Yaping Yan.

Figure 1. Graphical illustration of the intuition lying behin…
Figure 2. Graphical illustration of the proposed SceneGlue. The proposed scene-aware matching method consists of three parts, namely informative feature representation, parallel attention, and scene-aware matching. The informative feature representation first encodes the position by a Wave Position Encoder (Wave-PE) to obtain position-aware descriptors and then combines each local position-aware descriptor with a le…
Figure 3. Multi-scale Feature Detector. Features in 1, 1/2, 1/4, 1/8 resolution are sampled and fused by lightweight linear layers. The Multi-scale feature network allows feature matching to be more robust to large-scale variation scenarios.
…projected to query, key and value, i.e., (Q_s, K_s, V_s) and (Q_t, K_t, V_t), respectively. Then self- and cross-attention are computed in a parallel manner. In the self-attention mod…
Figure 4. Graphical illustration on the wave position encoder and the parallel attention. (a) The wave position encoder fuses the amplitude A estimated with the descriptor d and the phase θ estimated with the position p to generate position encoding. (b) Stacked parallel attention layers utilize self- and cross-attention to enhance the descriptors and find potential matches, where self- and cross-attention are adapt…
Figure 5. Visibility Transformer. The Visibility Transformer is proposed for cross-view visible area estimation. It adopts a Transformer architecture to establish the relationship between multi-scale local descriptors and learnable scene descriptors before matching and assigns the multi-scale local descriptors to the commonly-visible area or commonly-invisible area through a Softmax classifier.
…the corresponding key…
Figure 6. Visualization of cross-view visibility estimation and feature matching results on (a) homography estimation and (b) outdoor pose estimation tasks. SceneGlue precisely estimates the visible regions in cross-view images and further results in more robust and accurate point-level matching for both homography estimation and outdoor pose estimation.
…SceneGlue at only one threshold 10°, and is inferior at other…
Figure 7. Failure cases.

Table XI. Ablation study on the number of parameters of the multi-scale feature network using the R1M dataset. The best result is highlighted in bold.

  Parameters   Precision (%)   Recall (%)   F1-score (%)
  55K          93.0            98.7         95.76
  111K         92.9            98.8         95.78
  222K         93.2            98.9         95.97
  444K         93.0            98.8         95.84
  888K         92.9            98.9         95.83

…in general, as LoFTR and ASpanFormer take 3 ∼ 5 times of runtime of SceneGlue. SuperGlue is…
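The Table XI values can be sanity-checked with the standard F1 definition, F1 = 2PR / (P + R). The short script below recomputes F1 from the reported precision and recall. Since those two columns are rounded to one decimal place, the recomputed figures can differ from the reported ones in the second decimal. The variable names are illustrative; the numbers are copied from the table as extracted.

```python
# (parameters, precision %, recall %, reported F1 %) as listed in Table XI
rows = [
    ("55K", 93.0, 98.7, 95.76),
    ("111K", 92.9, 98.8, 95.78),
    ("222K", 93.2, 98.9, 95.97),
    ("444K", 93.0, 98.8, 95.84),
    ("888K", 92.9, 98.9, 95.83),
]

def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2 * precision * recall / (precision + recall)

for params, p, r, reported in rows:
    print(f"{params:>4}  recomputed={f1_score(p, r):.2f}  reported={reported:.2f}")

# Configuration with the highest reported F1-score.
best = max(rows, key=lambda row: row[3])[0]
print("best reported F1:", best)
```

The reported F1 spread across a 16-fold change in parameter count is about 0.2 percentage points, so the table reads as evidence that the multi-scale feature network is not sensitive to its size in this range.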
Original abstract

Local feature matching plays a critical role in understanding the correspondence between cross-view images. However, traditional methods are constrained by the inherent local nature of feature descriptors, limiting their ability to capture non-local scene information that is essential for accurate cross-view correspondence. In this paper, we introduce SceneGlue, a scene-aware feature matching framework designed to overcome these limitations. SceneGlue leverages a hybridizable matching paradigm that integrates implicit parallel attention and explicit cross-view visibility estimation. The parallel attention mechanism simultaneously exchanges information among local descriptors within and across images, enhancing the scene's global context. To further enrich the scene awareness, we propose the Visibility Transformer, which explicitly categorizes features into visible and invisible regions, providing an understanding of cross-view scene visibility. By combining explicit and implicit scene-level awareness, SceneGlue effectively compensates for the local descriptor constraints. Notably, SceneGlue is trained using only local feature matches, without requiring scene-level groundtruth annotations. This scene-aware approach not only improves accuracy and robustness but also enhances interpretability compared to traditional methods. Extensive experiments on applications such as homography estimation, pose estimation, image matching, and visual localization validate SceneGlue's superior performance. The source code is available at https://github.com/songlin-du/SceneGlue.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SceneGlue, a transformer-based framework for local feature matching that augments standard descriptors with implicit scene context (via parallel attention exchanging information within and across images) and explicit scene awareness (via a Visibility Transformer that categorizes features as visible or invisible). The model is trained end-to-end using only local feature match supervision and no scene-level ground-truth annotations, with claims of improved accuracy, robustness, and interpretability on homography estimation, pose estimation, image matching, and visual localization tasks. Source code is released.

Significance. If the Visibility Transformer indeed recovers geometrically meaningful cross-view visibility from local-match supervision alone, the hybrid explicit-implicit design would meaningfully extend local feature matching beyond descriptor limitations, offering both performance gains and added interpretability. The public code release strengthens the contribution by enabling direct verification and extension.

major comments (2)
  1. [§3.2] §3.2 (Visibility Transformer description): The architecture is trained solely via the indirect matching loss on local correspondences, with no scene-level ground truth or auxiliary regularization. This leaves open the possibility that the visibility head converges to a locally consistent but geometrically inaccurate proxy, particularly when matches are sparse under large viewpoint changes. Direct evidence (e.g., quantitative alignment of predicted visibility maps with pose-derived or depth-derived visibility) is required to substantiate the central claim that explicit visibility estimation supplies true scene-level awareness.
  2. [§4] §4 (Experiments and ablations): While superior performance is reported on standard benchmarks, the manuscript does not present ablations that isolate the contribution of the Visibility Transformer versus the parallel attention mechanism, nor does it quantify how much each component improves over a baseline transformer matcher. Without these controls, the attribution of gains specifically to the hybrid scene-aware design remains under-supported.
minor comments (2)
  1. [Abstract / §3] The term 'hybridizable matching paradigm' is introduced in the abstract but not formally defined or contrasted with prior matching paradigms in the method section; a concise definition or diagram would improve clarity.
  2. [§3.1–3.3] Notation for the parallel attention and visibility heads (e.g., symbols for visible/invisible logits) should be introduced once and used consistently to avoid reader confusion when tracing the loss terms.
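The quantitative check requested in major comment 1 could look roughly like the following: derive a reference visibility label for each keypoint by back-projecting it with its depth, moving it into the second camera with the ground-truth pose, and testing whether it lands inside the image, then score the predicted labels against that reference with intersection over union. This is a sketch under simplifying assumptions, not the authors' protocol: it uses a pinhole camera with shared intrinsics `(fx, fy, cx, cy)`, it ignores occlusion (which would need the second view's depth map), and all function names and numbers are hypothetical.

```python
def backproject(pixel, depth, intrinsics):
    """Lift a pixel with known depth to a 3D point in camera A's frame."""
    fx, fy, cx, cy = intrinsics
    u, v = pixel
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def project(point_a, rotation, translation, intrinsics):
    """Move a camera-A point into camera B and project it; None if behind B."""
    x, y, z = (sum(rotation[r][c] * point_a[c] for c in range(3)) + translation[r]
               for r in range(3))
    if z <= 0:
        return None
    fx, fy, cx, cy = intrinsics
    return (fx * x / z + cx, fy * y / z + cy)

def reference_visibility(pixels, depths, rotation, translation, intrinsics, width, height):
    """True where a keypoint of view A reprojects inside view B (no occlusion test)."""
    labels = []
    for pixel, depth in zip(pixels, depths):
        uv = project(backproject(pixel, depth, intrinsics), rotation, translation, intrinsics)
        labels.append(uv is not None and 0 <= uv[0] < width and 0 <= uv[1] < height)
    return labels

def visibility_iou(predicted, reference):
    """Intersection over union of two boolean label lists."""
    inter = sum(p and r for p, r in zip(predicted, reference))
    union = sum(p or r for p, r in zip(predicted, reference))
    return inter / union if union else 1.0

# Toy setup: identity rotation, camera B shifted sideways, 100x100 images.
intrinsics = (100.0, 100.0, 50.0, 50.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
pixels = [(50, 50), (90, 50), (10, 50), (90, 50)]
depths = [2.0, 2.0, 2.0, 10.0]
reference = reference_visibility(pixels, depths, identity, (0.5, 0.0, 0.0),
                                 intrinsics, 100, 100)
predicted = [True, True, True, False]  # stand-in for a model's thresholded output
print(reference, visibility_iou(predicted, reference))
```

A high IoU against such pose- and depth-derived labels would support the claim that the visibility head tracks real geometry; a low one would point to the locally consistent proxy the referee warns about.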

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive comments on our paper. We address the major concerns point by point below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Visibility Transformer description): The architecture is trained solely via the indirect matching loss on local correspondences, with no scene-level ground truth or auxiliary regularization. This leaves open the possibility that the visibility head converges to a locally consistent but geometrically inaccurate proxy, particularly when matches are sparse under large viewpoint changes. Direct evidence (e.g., quantitative alignment of predicted visibility maps with pose-derived or depth-derived visibility) is required to substantiate the central claim that explicit visibility estimation supplies true scene-level awareness.

    Authors: We thank the referee for highlighting this important point. The Visibility Transformer is indeed trained indirectly through the matching loss without explicit scene-level supervision. While we believe the performance improvements and qualitative visualizations in the original manuscript support its effectiveness, we agree that direct quantitative validation is valuable. In the revised manuscript, we will include experiments that compare the predicted visibility maps against visibility derived from ground-truth poses and depth information on appropriate datasets (e.g., those with available 3D data). This will provide the requested evidence for the geometric accuracy of the visibility estimation. revision: yes

  2. Referee: [§4] §4 (Experiments and ablations): While superior performance is reported on standard benchmarks, the manuscript does not present ablations that isolate the contribution of the Visibility Transformer versus the parallel attention mechanism, nor does it quantify how much each component improves over a baseline transformer matcher. Without these controls, the attribution of gains specifically to the hybrid scene-aware design remains under-supported.

    Authors: We acknowledge the lack of detailed ablations isolating the contributions of the parallel attention and the Visibility Transformer. To better attribute the performance gains to the hybrid design, we will add ablation studies in the revised version. These will include a baseline transformer matcher, a variant with only parallel attention, a variant with only the Visibility Transformer, and the full model. Results will be reported on the main benchmarks to quantify the improvement from each component over the baseline. revision: yes
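The four-way ablation promised in the second response amounts to a 2×2 grid over the two components. A small sketch of how such a grid and the per-variant gain over the baseline might be organized is shown below. The configuration keys, variant names, and helper functions are hypothetical, and no benchmark scores are included because the summary reports none for these variants.

```python
from itertools import product

def ablation_variants():
    """Enumerate the 2x2 grid: parallel attention on/off, Visibility Transformer on/off."""
    names = {
        (False, False): "baseline",
        (False, True): "visibility transformer only",
        (True, False): "parallel attention only",
        (True, True): "full model",
    }
    return [
        {"name": names[(use_pa, use_vt)],
         "parallel_attention": use_pa,
         "visibility_transformer": use_vt}
        for use_pa, use_vt in product([False, True], repeat=2)
    ]

def gains_over_baseline(scores):
    """Difference between each variant's score and the baseline score."""
    base = scores["baseline"]
    return {name: score - base for name, score in scores.items() if name != "baseline"}

variants = ablation_variants()
for variant in variants:
    print(variant)
```

Reporting the two single-component variants alongside the full model is what allows the gain to be split between the implicit and explicit branches, and it also exposes any interaction between them.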

Circularity Check

0 steps flagged

No circularity: new architecture with independent training signal

Full rationale

The paper presents SceneGlue as a novel hybrid matching framework that adds a Visibility Transformer and parallel attention to local descriptors. These are architectural additions, not derivations that reduce to prior equations or self-fitted quantities. Training uses only local feature match supervision without scene-level ground truth; this is an empirical learning claim, not a mathematical tautology where a 'prediction' is defined as the input. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing steps. The derivation chain consists of standard transformer blocks plus new heads, with no reduction by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that local supervision suffices for learning scene-level visibility and on the introduction of two new architectural components whose effectiveness is demonstrated empirically rather than derived from first principles.

free parameters (1)
  • model hyperparameters
    Standard deep-learning choices such as number of layers, attention heads, and learning rate are required to train the transformer but are not enumerated in the abstract.
axioms (1)
  • [domain assumption] Local feature matches alone suffice to supervise scene-aware visibility estimation
    The training procedure described in the abstract relies on this premise without scene-level ground truth.
invented entities (2)
  • Visibility Transformer [no independent evidence]
    purpose: Explicitly categorize features into visible and invisible regions across views
    New module introduced to provide explicit scene awareness; no independent falsifiable prediction outside the paper is given.
  • Parallel attention mechanism [no independent evidence]
    purpose: Simultaneously exchange information among local descriptors within and across images
    New integration of attention for implicit global context; no external validation cited.

pith-pipeline@v0.9.0 · 5538 in / 1406 out tokens · 45815 ms · 2026-05-10T13:38:49.678430+00:00 · methodology

