SceneGlue: Scene-Aware Transformer for Feature Matching without Scene-Level Annotation

Guobao Xiao; Songlin Du; Takeshi Ikenaga; Xiaobo Lu; Xiaoyong Lu; Yaping Yan

arxiv: 2604.13941 · v1 · submitted 2026-04-15 · 💻 cs.CV

Songlin Du, Xiaoyong Lu, Yaping Yan, Guobao Xiao, Xiaobo Lu, Takeshi Ikenaga

Pith reviewed 2026-05-10 13:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords: feature matching, scene awareness, transformer, cross-view visibility, local descriptors, image correspondence, pose estimation

The pith

SceneGlue improves cross-view feature matching by adding implicit and explicit scene awareness trained only on local matches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SceneGlue as a way to overcome the limits of purely local feature descriptors in matching images taken from different viewpoints. Traditional descriptors miss broader scene context that helps decide which points actually correspond, leading to errors in tasks like estimating camera poses or aligning images. SceneGlue adds two forms of scene awareness: a parallel attention step that lets descriptors exchange information across both images at once, and a separate visibility transformer that labels which parts of the scene are visible in each view. The entire system learns from ordinary local match supervision with no need for scene-level ground-truth labels. A sympathetic reader would care because this could make matching more accurate and robust in real-world settings such as robotics and augmented reality while avoiding costly extra annotations.

Core claim

SceneGlue uses a hybrid matching approach that combines implicit parallel attention across local descriptors with an explicit Visibility Transformer that classifies features into visible and invisible regions, thereby supplying global scene context that local descriptors alone cannot provide, all while training exclusively on local feature matches without any scene-level groundtruth annotations.

What carries the argument

The hybridizable matching paradigm that runs parallel attention to exchange global context within and across images while the Visibility Transformer explicitly estimates cross-view visibility to label visible versus invisible regions.
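
A minimal sketch of the parallel-attention idea, assuming identity query/key/value projections and a fixed blend weight `alpha`. Both are illustrative simplifications: the paper's layers are learned, and this page does not show how the two branches are fused. The function names are hypothetical and are not taken from the released code.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def parallel_attention(desc_s, desc_t, alpha=0.5):
    """One layer: self- and cross-attention read the same inputs, then blend.

    alpha is a hypothetical fixed fusion weight (assumption, not from the paper).
    """
    def update(own, other):
        self_msg = attend(own, own, own)        # context within the image
        cross_msg = attend(own, other, other)   # context from the other image
        return [[d + alpha * s + (1 - alpha) * c
                 for d, s, c in zip(dv, sv, cv)]
                for dv, sv, cv in zip(own, self_msg, cross_msg)]

    # Both branches use the un-updated descriptors, so neither waits on the other.
    return update(desc_s, desc_t), update(desc_t, desc_s)


source = [[1.0, 0.0], [0.0, 1.0]]
target = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
new_source, new_target = parallel_attention(source, target)
```

The point of the sketch is the data flow: a serial design would feed the self-attention output into the cross-attention step, whereas here both messages are computed from the same descriptors and merged afterwards.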

If this is right

Homography estimation between image pairs becomes more accurate because visible-region cues reduce mismatches in overlapping areas.
Camera pose estimation improves when the model can distinguish invisible scene parts that would otherwise produce false correspondences.
Visual localization tasks gain robustness since global context helps match features even under large viewpoint changes.
Interpretability increases because the explicit visibility output shows which scene regions the matcher considered for each correspondence.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same visibility-labeling idea could be tested on other correspondence problems such as video tracking where scene parts enter and leave the frame.
If the method works without scene labels, it might allow larger training sets drawn from existing local-match datasets that lack expensive annotations.
Combining the approach with semantic segmentation could further refine which scene elements are treated as visible or occluded.

Load-bearing premise

Local feature matches by themselves supply enough signal to train accurate cross-view visibility estimates and global scene context without any scene-level groundtruth annotations.

What would settle it

Running the full SceneGlue pipeline on standard matching benchmarks and finding no accuracy gain over a baseline that uses only local descriptors would indicate that the added scene-awareness components are not delivering the claimed compensation.

Figures

Figures reproduced from arXiv: 2604.13941 by Guobao Xiao, Songlin Du, Takeshi Ikenaga, Xiaobo Lu, Xiaoyong Lu, Yaping Yan.

**Figure 2.** Graphical illustration of the proposed SceneGlue. The proposed scene-aware matching method consists of three parts, namely informative feature representation, parallel attention, and scene-aware matching. The informative feature representation first encodes the position by a Wave Position Encoder (Wave-PE) to obtain position-aware descriptors and then combines each local position-aware descriptor with a le…

**Figure 3.** Multi-scale Feature Detector. Features in 1, 1/2, 1/4, 1/8 resolution are sampled and fused by lightweight linear layers. The Multi-scale feature network allows feature matching to be more robust to large-scale variation scenarios.

**Figure 4.** Graphical illustration on the wave position encoder and the parallel attention. (a) The wave position encoder fuses the amplitude A estimated with the descriptor d and the phase θ estimated with the position p to generate position encoding. (b) Stacked parallel attention layers utilize self- and cross-attention to enhance the descriptors and find potential matches, where self- and cross-attention are adapt…

**Figure 5.** Visibility Transformer. The Visibility Transformer is proposed for cross-view visible area estimation. It adopts a Transformer architecture to establish the relationship between multi-scale local descriptors and learnable scene descriptors before matching and assigns the multi-scale local descriptors to the commonly-visible area or commonly-invisible area through a Softmax classifier.

**Figure 6.** Visualization of cross-view visibility estimation and feature matching results on (a) homography estimation and (b) outdoor pose estimation tasks. SceneGlue precisely estimates the visible regions in cross-view images and further results in more robust and accurate point-level matching for both homography estimation and outdoor pose estimation.

**Figure 7.** Failure cases.

**Table XI.** Ablation study on the number of parameters of the multi-scale feature network using the R1M dataset. The best result is highlighted in bold.

| Parameters | Precision (%) | Recall (%) | F1-score (%) |
| --- | --- | --- | --- |
| 55K | 93.0 | 98.7 | 95.76 |
| 111K | 92.9 | 98.8 | 95.78 |
| 222K | 93.2 | 98.9 | 95.97 |
| 444K | 93.0 | 98.8 | 95.84 |
| 888K | 92.9 | 98.9 | 95.83 |
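
The wave position encoder described in the Figure 4(a) caption can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: the amplitude and phase estimators are reduced to plain linear maps (`w_amp`, `w_phase` are hypothetical weight matrices), and the two wave components are simply concatenated, since the caption does not say how they are fused.

```python
import math


def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def wave_position_encoding(descriptor, position, w_amp, w_phase):
    """Combine amplitude A (from descriptor d) with phase theta (from position p)."""
    amplitude = linear(w_amp, descriptor)   # A estimated from d
    phase = linear(w_phase, position)       # theta estimated from p
    real = [a * math.cos(t) for a, t in zip(amplitude, phase)]
    imag = [a * math.sin(t) for a, t in zip(amplitude, phase)]
    return real + imag                      # concatenation is an assumption


identity = [[1.0, 0.0], [0.0, 1.0]]
zero_phase = [[0.0, 0.0], [0.0, 0.0]]
encoding = wave_position_encoding([2.0, 3.0], [0.5, 0.25], identity, zero_phase)
```

With a zero phase the encoding reduces to the amplitude itself, which makes the role of position explicit: it only rotates each channel between the cosine and sine parts and never changes the channel's magnitude.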

Original abstract

Local feature matching plays a critical role in understanding the correspondence between cross-view images. However, traditional methods are constrained by the inherent local nature of feature descriptors, limiting their ability to capture non-local scene information that is essential for accurate cross-view correspondence. In this paper, we introduce SceneGlue, a scene-aware feature matching framework designed to overcome these limitations. SceneGlue leverages a hybridizable matching paradigm that integrates implicit parallel attention and explicit cross-view visibility estimation. The parallel attention mechanism simultaneously exchanges information among local descriptors within and across images, enhancing the scene's global context. To further enrich the scene awareness, we propose the Visibility Transformer, which explicitly categorizes features into visible and invisible regions, providing an understanding of cross-view scene visibility. By combining explicit and implicit scene-level awareness, SceneGlue effectively compensates for the local descriptor constraints. Notably, SceneGlue is trained using only local feature matches, without requiring scene-level groundtruth annotations. This scene-aware approach not only improves accuracy and robustness but also enhances interpretability compared to traditional methods. Extensive experiments on applications such as homography estimation, pose estimation, image matching, and visual localization validate SceneGlue's superior performance. The source code is available at https://github.com/songlin-du/SceneGlue.
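
To make the explicit branch concrete, here is a heavily simplified sketch of a visible/invisible classifier. It assumes exactly two scene descriptors (one per class) and scores each local descriptor against them with a dot product followed by a softmax. The real Visibility Transformer relates local and learnable scene descriptors through attention layers, so everything below, including the names `scene_visible` and `scene_invisible`, is a hypothetical stand-in.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_visibility(local_descs, scene_visible, scene_invisible):
    """Return P(commonly visible) for each local descriptor."""
    probs = []
    for d in local_descs:
        logits = [dot(d, scene_visible), dot(d, scene_invisible)]
        p_visible, _ = softmax(logits)
        probs.append(p_visible)
    return probs


# Toy inputs: the first descriptor aligns with the "visible" scene descriptor,
# the second is equally similar to both.
probabilities = classify_visibility(
    [[1.0, 0.0], [1.0, 1.0]],
    scene_visible=[1.0, 0.0],
    scene_invisible=[0.0, 1.0],
)
```

In the paper the scene descriptors receive gradients only through the matching loss, which is the indirect supervision discussed in the editorial analysis below.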

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SceneGlue adds a visibility transformer on top of parallel attention for feature matching trained only on local matches, but the indirect supervision for cross-view visibility looks like the weakest link.

The main thing to know is that SceneGlue combines implicit parallel attention with an explicit Visibility Transformer to inject scene awareness into local feature matching, all without scene-level ground truth, and it reports gains on standard tasks with code released. The hybrid setup is the concrete novelty here: parallel attention exchanges information within and across images for global context, while the visibility module classifies features as visible or invisible to handle occlusions and viewpoint changes. They train end-to-end using only local match supervision, which avoids the need for expensive annotations. Experiments cover homography estimation, pose estimation, image matching, and visual localization, showing improvements over baselines in accuracy and robustness. The code release is a plus for anyone wanting to test it directly. The approach addresses a genuine limitation in descriptor-based matching by trying to bring in non-local scene information.

The soft spot is the supervision story for the visibility head. Local matches provide the only training signal, so the module could learn a degenerate proxy that correlates with matches but fails to capture real geometry, especially when matches are sparse under large viewpoint shifts. The abstract gives no direct evidence, such as qualitative visibility maps or metrics against ground-truth structure, that the predictions align with actual cross-view visibility. Without strong ablations isolating the visibility component's contribution, it is unclear how much of the reported gains come from this part versus the attention alone.

This paper is for computer vision researchers working on feature matching pipelines for SLAM, SfM, or localization. Readers who need practical incremental improvements with public code will find it useful. It is not a fundamental shift but a reasonable engineering step.

It deserves peer review because the architecture is well-specified, the problem is motivated, and the implementation allows checking the claims. I would send it out, asking referees to focus on whether the visibility learning actually delivers meaningful scene awareness or just rides along with the attention.

Referee Report

2 major / 2 minor

Summary. The paper introduces SceneGlue, a transformer-based framework for local feature matching that augments standard descriptors with implicit scene context (via parallel attention exchanging information within and across images) and explicit scene awareness (via a Visibility Transformer that categorizes features as visible or invisible). The model is trained end-to-end using only local feature match supervision and no scene-level ground-truth annotations, with claims of improved accuracy, robustness, and interpretability on homography estimation, pose estimation, image matching, and visual localization tasks. Source code is released.

Significance. If the Visibility Transformer indeed recovers geometrically meaningful cross-view visibility from local-match supervision alone, the hybrid explicit-implicit design would meaningfully extend local feature matching beyond descriptor limitations, offering both performance gains and added interpretability. The public code release strengthens the contribution by enabling direct verification and extension.

major comments (2)

[§3.2] Visibility Transformer description: The architecture is trained solely via the indirect matching loss on local correspondences, with no scene-level ground truth or auxiliary regularization. This leaves open the possibility that the visibility head converges to a locally consistent but geometrically inaccurate proxy, particularly when matches are sparse under large viewpoint changes. Direct evidence (e.g., quantitative alignment of predicted visibility maps with pose-derived or depth-derived visibility) is required to substantiate the central claim that explicit visibility estimation supplies true scene-level awareness.
[§4] Experiments and ablations: While superior performance is reported on standard benchmarks, the manuscript does not present ablations that isolate the contribution of the Visibility Transformer versus the parallel attention mechanism, nor does it quantify how much each component improves over a baseline transformer matcher. Without these controls, the attribution of gains specifically to the hybrid scene-aware design remains under-supported.
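
One way the check requested in the first major comment could be run is sketched below, under stated assumptions: a pinhole camera given as `(fx, fy, cx, cy)`, a known relative pose `(rotation, translation)` from source to target, and per-keypoint depth. A keypoint counts as visible in the reference labels if it reprojects in front of the target camera and inside its image bounds. Occlusion testing against the target depth map is deliberately omitted, and all function names are hypothetical.

```python
def project_to_target(u, v, depth, intr_s, intr_t, rotation, translation):
    """Reproject a source pixel with known depth into the target image."""
    fx, fy, cx, cy = intr_s
    point = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    moved = [sum(r * c for r, c in zip(row, point)) + t
             for row, t in zip(rotation, translation)]
    if moved[2] <= 0:                       # behind the target camera
        return None
    fx_t, fy_t, cx_t, cy_t = intr_t
    return (fx_t * moved[0] / moved[2] + cx_t,
            fy_t * moved[1] / moved[2] + cy_t)


def reference_visibility(keypoints, depths, intr_s, intr_t,
                         rotation, translation, width, height):
    """Pose- and depth-derived visibility labels (no occlusion check)."""
    labels = []
    for (u, v), z in zip(keypoints, depths):
        uv = project_to_target(u, v, z, intr_s, intr_t, rotation, translation)
        labels.append(uv is not None
                      and 0 <= uv[0] < width and 0 <= uv[1] < height)
    return labels


def visibility_iou(pred_probs, ref_labels, threshold=0.5):
    """Intersection over union between predicted and reference visible sets."""
    pred = [p >= threshold for p in pred_probs]
    inter = sum(1 for p, r in zip(pred, ref_labels) if p and r)
    union = sum(1 for p, r in zip(pred, ref_labels) if p or r)
    return inter / union if union else 1.0


intrinsics = (100.0, 100.0, 50.0, 50.0)
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
same_view = reference_visibility([(10.0, 10.0)], [1.0], intrinsics, intrinsics,
                                 identity, [0.0, 0.0, 0.0], 100, 100)
shifted = reference_visibility([(10.0, 10.0)], [1.0], intrinsics, intrinsics,
                               identity, [1.0, 0.0, 0.0], 100, 100)
score = visibility_iou([0.9, 0.2, 0.8], [True, False, False])
```

A high IoU against such labels would indicate that the visibility head tracks geometry. A low IoU alongside good matching accuracy would point to the proxy behaviour the comment warns about.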

minor comments (2)

[Abstract / §3] The term 'hybridizable matching paradigm' is introduced in the abstract but not formally defined or contrasted with prior matching paradigms in the method section; a concise definition or diagram would improve clarity.
[§3.1–3.3] Notation for the parallel attention and visibility heads (e.g., symbols for visible/invisible logits) should be introduced once and used consistently to avoid reader confusion when tracing the loss terms.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive comments on our paper. We address the major concerns point by point below and outline the revisions we will make to strengthen the manuscript.

Referee: [§3.2] Visibility Transformer description: The architecture is trained solely via the indirect matching loss on local correspondences, with no scene-level ground truth or auxiliary regularization. This leaves open the possibility that the visibility head converges to a locally consistent but geometrically inaccurate proxy, particularly when matches are sparse under large viewpoint changes. Direct evidence (e.g., quantitative alignment of predicted visibility maps with pose-derived or depth-derived visibility) is required to substantiate the central claim that explicit visibility estimation supplies true scene-level awareness.

Authors: We thank the referee for highlighting this important point. The Visibility Transformer is indeed trained indirectly through the matching loss without explicit scene-level supervision. While we believe the performance improvements and qualitative visualizations in the original manuscript support its effectiveness, we agree that direct quantitative validation is valuable. In the revised manuscript, we will include experiments that compare the predicted visibility maps against visibility derived from ground-truth poses and depth information on appropriate datasets (e.g., those with available 3D data). This will provide the requested evidence for the geometric accuracy of the visibility estimation.

revision: yes

Referee: [§4] Experiments and ablations: While superior performance is reported on standard benchmarks, the manuscript does not present ablations that isolate the contribution of the Visibility Transformer versus the parallel attention mechanism, nor does it quantify how much each component improves over a baseline transformer matcher. Without these controls, the attribution of gains specifically to the hybrid scene-aware design remains under-supported.

Authors: We acknowledge the lack of detailed ablations isolating the contributions of the parallel attention and the Visibility Transformer. To better attribute the performance gains to the hybrid design, we will add ablation studies in the revised version. These will include: a baseline transformer matcher, variants with only parallel attention, only Visibility Transformer, and the full model. Results will be reported on the main benchmarks to quantify the improvement from each component over the baseline.

revision: yes
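
The four-way ablation promised in the second response amounts to a two-by-two grid over the two components. A small sketch of how such a grid might be enumerated follows. The flag and variant names are hypothetical and do not correspond to options in the released code.

```python
from itertools import product

# Hypothetical variant names keyed by (parallel_attention, visibility_transformer).
VARIANT_NAMES = {
    (False, False): "baseline",
    (False, True): "visibility_only",
    (True, False): "parallel_attention_only",
    (True, True): "full",
}


def ablation_variants():
    """Enumerate every on/off combination of the two scene-awareness components."""
    variants = []
    for use_pa, use_vt in product([False, True], repeat=2):
        variants.append({
            "name": VARIANT_NAMES[(use_pa, use_vt)],
            "parallel_attention": use_pa,
            "visibility_transformer": use_vt,
        })
    return variants


variants = ablation_variants()
```

Reporting all four cells, rather than only the baseline and the full model, is what allows a gain to be attributed to one component or to their interaction.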

Circularity Check

0 steps flagged

No circularity: new architecture with independent training signal

The paper presents SceneGlue as a novel hybrid matching framework that adds a Visibility Transformer and parallel attention to local descriptors. These are architectural additions, not derivations that reduce to prior equations or self-fitted quantities. Training uses only local feature match supervision without scene-level ground truth; this is an empirical learning claim, not a mathematical tautology where a 'prediction' is defined as the input. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing steps. The derivation chain consists of standard transformer blocks plus new heads, with no reduction by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the domain assumption that local supervision suffices for learning scene-level visibility and on the introduction of two new architectural components whose effectiveness is demonstrated empirically rather than derived from first principles.

free parameters (1)

model hyperparameters
Standard deep-learning choices such as number of layers, attention heads, and learning rate are required to train the transformer but are not enumerated in the abstract.

axioms (1)

domain assumption: Local feature matches alone suffice to supervise scene-aware visibility estimation
The training procedure described in the abstract relies on this premise without scene-level ground truth.

invented entities (2)

Visibility Transformer (no independent evidence)
purpose: Explicitly categorize features into visible and invisible regions across views
New module introduced to provide explicit scene awareness; no independent falsifiable prediction outside the paper is given.
Parallel attention mechanism (no independent evidence)
purpose: Simultaneously exchange information among local descriptors within and across images
New integration of attention for implicit global context; no external validation cited.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 2 internal anchors

[1]

Deep learning reforms image matching: A survey and ou tlook,

S. Zhang, Z. Li, K. Zhang, Y . Lu, Y . Deng, L. Tang, X. Jiang, and J. Ma, “Deep learning reforms image matching: A survey and ou tlook,” arXiv, 2025. [Online]. Available: https://arxiv.org/abs/2506 .04619

work page 2025
[2]

EC-SfM : Efﬁcient covisibility-based structure-from-motion for b oth sequential and unordered images,

Z. Y e, C. Bao, X. Zhou, H. Liu, H. Bao, and G. Zhang, “EC-SfM : Efﬁcient covisibility-based structure-from-motion for b oth sequential and unordered images,” IEEE Trans. Circuits Syst. Video Technol. , vol. 34, no. 1, pp. 110–123, 2024

work page 2024
[3]

PAS-SLA M: A visual SLAM system for planar-ambiguous scenes,

X. Hu, Y . Wu, M. Zhao, L. Y ang, X. Zhang, and X. Ji, “PAS-SLA M: A visual SLAM system for planar-ambiguous scenes,” IEEE Trans. Circuits Syst. Video Technol. , vol. 35, no. 3, pp. 2026–2044, 2025

work page 2026
[4]

Distinctive image features from scale-inva riant keypoints,

D. G. Lowe, “Distinctive image features from scale-inva riant keypoints,” Int. J. Comput. Vis. , vol. 60, no. 2, pp. 91–110, 2004

work page 2004
[5]

D2-Net: A trainable CNN for joint descripti on and detection of local features,

M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A . Torii, and T. Sattler, “D2-Net: A trainable CNN for joint descripti on and detection of local features,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2019, pp. 8092–8101

work page 2019
[6]

SuperPoi nt: Self- supervised interest point detection and description,

D. DeTone, T. Malisiewicz, and A. Rabinovich, “SuperPoi nt: Self- supervised interest point detection and description,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. W orkshops (CVPRW) , 2018, pp. 224–236

work page 2018
[7]

MatchMa mba: Correspondence pruning via selective state space model,

Y . Wu, X. Li, H. Chen, C. Y ang, L. Wei, and R. Chen, “MatchMa mba: Correspondence pruning via selective state space model,” IEEE Trans. Circuits Syst. Video Technol. , 2025

work page 2025
[8]

CGR-Net: Consistency guided ResFormer for two-view corre spondence learning,

C. Y ang, X. Li, J. Ma, F. Zhuang, L. Wei, R. Chen, and G. Chen , “CGR-Net: Consistency guided ResFormer for two-view corre spondence learning,” IEEE Trans. Circuits Syst. Video Technol. , vol. 34, no. 12, pp. 12 450–12 465, 2024. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOL OGY 13

work page 2024
[9]

MGCNet: Multi-granularity consensus network for remote s ensing image correspondence pruning,

F. Zhuang, Y . Liu, X. Li, J. Zhou, R. Chen, L. Wei, C. Y ang, a nd J. Ma, “MGCNet: Multi-granularity consensus network for remote s ensing image correspondence pruning,” ISPRS J. Photogramm. Remote Sens. , vol. 219, pp. 38–51, 2025

work page 2025
[10]

U-Match: Exploring hierarch y-aware local context for two-view correspondence learning,

Z. Li, S. Zhang, and J. Ma, “U-Match: Exploring hierarch y-aware local context for two-view correspondence learning,” IEEE Trans. Pattern Anal. Mach. Intell. , vol. 46, no. 12, pp. 10 960–10 977, 2024

work page 2024
[11]

MIN IMA: Modality invariant image matching,

J. Ren, X. Jiang, Z. Li, D. Liang, X. Zhou, and X. Bai, “MIN IMA: Modality invariant image matching,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2025, pp. 23 059–23 068

work page 2025
[12]

ASLFeat: Learning local features of accurate s hape and localization,

Z. Luo, L. Zhou, X. Bai, H. Chen, J. Zhang, Y . Y ao, S. Li, T. Fang, and L. Quan, “ASLFeat: Learning local features of accurate s hape and localization,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2020, pp. 6588–6597

work page 2020
[13]

Super- Glue: Learning feature matching with graph neural networks ,

P . Sarlin, D. DeTone, T. Malisiewicz, and A. Rabinovich , “Super- Glue: Learning feature matching with graph neural networks ,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2020, pp. 4937–4946

work page 2020
[14]

LoFTR: Dete ctor-free local feature matching with transformers,

J. Sun, Z. Shen, Y . Wang, H. Bao, and X. Zhou, “LoFTR: Dete ctor-free local feature matching with transformers,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2021, pp. 8922–8931

work page 2021
[15]

Ligh tGlue: Local feature matching at light speed,

P . Lindenberger, P .-E. Sarlin, and M. Pollefeys, “Ligh tGlue: Local feature matching at light speed,” in Proc. Int. Conf. Comput. Vis. (ICCV) , 2023, pp. 17 581–17 592

work page 2023
[16]

ORB: An efﬁcient alternative to SIFT or SURF,

E. Rublee, V . Rabaud, K. Konolige, and G. Bradski, “ORB: An efﬁcient alternative to SIFT or SURF,” in Proc. Int. Conf. Comput. Vis. (ICCV) , 2011, pp. 2564–2571

work page 2011
[17]

R2D2: Reliable and repeatable detector and descriptor,

J. Revaud, C. De Souza, M. Humenberger, and P . Weinzaepf el, “R2D2: Reliable and repeatable detector and descriptor,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS) , 2019, pp. 12 414–12 424

work page 2019
[18]

Attention is all you need,

A. V aswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jone s, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. Adv. Neural Inf. Process. Syst. (NeurIPS) , 2017, pp. 5998–6008

work page 2017
[19]

Swin Transformer: Hierarchical vision Transformer using shifted win- dows,

Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B . Guo, “Swin Transformer: Hierarchical vision Transformer using shifted win- dows,” in Proc. Int. Conf. Comput. Vis. (ICCV), 2021, pp. 10 012–10 022

work page 2021
[20]

Match- Former: Interleaving attention in Transformers for featur e matching,

Q. Wang, J. Zhang, K. Y ang, K. Peng, and R. Stiefelhagen, “Match- Former: Interleaving attention in Transformers for featur e matching,” in Proc. Asia. Conf. Comput. Vis. (ACCV) , 2022, pp. 2746–2762

work page 2022
[21]

An image patch is a wave: Phase-aware vision MLP,

Y . Tang, K. Han, J. Guo, C. Xu, Y . Li, C. Xu, and Y . Wang, “An image patch is a wave: Phase-aware vision MLP,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2022, pp. 10 925–10 934

work page 2022
[22]

Learning feature matching via matchabl e keypoint- assisted graph neural network,

Z. Li and J. Ma, “Learning feature matching via matchabl e keypoint- assisted graph neural network,” IEEE Trans. Image Process. , vol. 34, pp. 154–169, 2025

work page 2025
[23]

Guide local f eature matching by overlap estimation,

Y . Chen, D. Huang, S. Xu, J. Liu, and Y . Liu, “Guide local f eature matching by overlap estimation,” in Proc. AAAI Conf. Artif. Intell. (AAAI), 2022, pp. 365–373

work page 2022
[24]

Le arning accurate dense correspondences and when to trust them,

P . Truong, M. Danelljan, L. V an Gool, and R. Timofte, “Le arning accurate dense correspondences and when to trust them,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2021, pp. 5710–5720

work page 2021
[25]

DeepMatcher: A deep transformer-based network for robust and accurate loc al feature matching,

T. Xie, K. Dai, K. Wang, R. Li, and L. Zhao, “DeepMatcher: A deep transformer-based network for robust and accurate loc al feature matching,” Expert Syst. Appl. , vol. 237, p. 121361, 2024

work page 2024
[26]

VD-Matcher: A very deep local feature matcher w ith weight recycling and keypoint detection,

K. Dai, Z. Zhou, Z. Jiang, Q. Sun, T. Xie, H. Gao, T. An, R. L i, and L. Zhao, “VD-Matcher: A very deep local feature matcher w ith weight recycling and keypoint detection,” IEEE Trans. Circuits Syst. Video Technol., 2025

work page 2025
[27]

Adaptiv e spot-guided Transformer for consistent local feature matching,

J. Y u, J. Chang, J. He, T. Zhang, J. Y u, and F. Wu, “Adaptiv e spot-guided Transformer for consistent local feature matching,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2023, pp. 21 898–21 908

work page 2023
[28]

ContextDesc: Local descriptor augmentation with cross- modality context,

Z. Luo, T. Shen, L. Zhou, J. Zhang, Y . Y ao, S. Li, T. Fang, a nd L. Quan, “ContextDesc: Local descriptor augmentation with cross- modality context,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2019, pp. 2522–2531

work page 2019
[29]

Attention weighted local descriptors,

C. Wang, R. Xu, K. Lu, S. Xu, W. Meng, Y . Zhang, B. Fan, and X. Zhang, “Attention weighted local descriptors,” IEEE Trans. Pattern Anal. Mach. Intell. , vol. 45, no. 9, pp. 10 632–10 649, 2023

work page 2023
[30]

OAMa tcher: An overlapping areas-based network with label credibility for robust and accurate feature matching,

K. Dai, T. Xie, K. Wang, Z. Jiang, R. Li, and L. Zhao, “OAMa tcher: An overlapping areas-based network with label credibility for robust and accurate feature matching,” Pattern Recognit., vol. 147, pp. 110 094:1– 110 094:14, 2024

work page 2024
[31]

Adaptive assignment for geometry aware local f eature matching,

D. Huang, Y . Chen, Y . Liu, J. Liu, S. Xu, W. Wu, Y . Ding, F. T ang, and C. Wang, “Adaptive assignment for geometry aware local f eature matching,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2023, pp. 5425–5434

work page 2023
[32]

Scene-aware feature mat ching,

X. Lu, Y . Y an, T. Wei, and S. Du, “Scene-aware feature mat ching,” in Proc. Int. Conf. Comput. Vis. (ICCV) , 2023, pp. 3704–3710

work page 2023
[33]

CoMatch: Dynam ic covisibility-aware transformer for bilateral subpixel-l evel semi-dense image matching,

Z. Li, y. Lu, L. Tang, S. Zhang, and J. Ma, “CoMatch: Dynam ic covisibility-aware transformer for bilateral subpixel-l evel semi-dense image matching,” in Proc. Int. Conf. Comput. Vis. (ICCV) , 2025, pp. 18 521–18 530

work page 2025
[34]

Object retrieval with large vocabularies and fast spatial matchin g,

J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman , “Object retrieval with large vocabularies and fast spatial matchin g,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2007, pp. 1– 8

work page 2007
[35]

MegaDepth: Learning single-view depth pre- diction from internet photos,

Z. Li and N. Snavely, “MegaDepth: Learning single-view depth pre- diction from internet photos,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2018, pp. 2041–2050

work page 2018
[36]

Adam: A Method for Stochastic Optimization

D. P . Kingma and J. Ba, “Adam: A method for stochastic opt imization,” arXiv preprint arXiv:1412.6980 , 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[37]

HP atches: A benchmark and evaluation of handcrafted and learned local d escriptors,

V . Balntas, K. Lenc, A. V edaldi, and K. Mikolajczyk, “HP atches: A benchmark and evaluation of handcrafted and learned local d escriptors,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2017, pp. 5173–5182

work page 2017
[38]

Patch2Pix: Epi polar-guided pixel-level correspondences,

Q. Zhou, T. Sattler, and L. Leal-Taixe, “Patch2Pix: Epi polar-guided pixel-level correspondences,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR) , 2021, pp. 4669–4678

work page 2021
[39]

I. Rocco, M. Cimpoi, R. Arandjelović, A. Torii, T. Pajdla, and J. Sivic, “NCNet: Neighbourhood consensus networks for estimating image correspondences,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 2, pp. 1020–1034, 2022.

[40]

Q. Wang, X. Zhou, B. Hariharan, and N. Snavely, “Learning feature descriptors using camera pose supervision,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 757–774.

[41]

F. Radenović, A. Iscen, G. Tolias, Y. Avrithis, and O. Chum, “Revisiting Oxford and Paris: Large-scale image retrieval benchmarking,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 5706–5715.

[42]

K. M. Yi, E. Trulls, Y. Ono, V. Lepetit, M. Salzmann, and P. Fua, “Learning to find good correspondences,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 2666–2674.

[43]

J. Zhang, D. Sun, Z. Luo, A. Yao, H. Chen, L. Zhou, T. Shen, Y. Chen, L. Quan, and H. Liao, “OANet: Learning two-view correspondences and geometry using order-aware network,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 44, no. 6, pp. 3110–3122, 2022.

[44]

B. Thomee, B. Elizalde, D. A. Shamma, K. Ni, G. Friedland, D. Poland, D. Borth, and L. J. Li, “YFCC100M: The new data in multimedia research,” Commun. ACM, pp. 64–73, 2016.

[45]

H. Chen, Z. Luo, J. Zhang, L. Zhou, X. Bai, Z. Hu, C.-L. Tai, and L. Quan, “Learning to match features with seeded graph matching network,” in Proc. Int. Conf. Comput. Vis. (ICCV), 2021, pp. 6301–6310.

[46]

Z. Kuang, J. Li, M. He, T. Wang, and Y. Zhao, “DenseGAP: Graph-structured dense correspondence learning with anchor points,” in Proc. Int. Conf. Pattern Recognit. (ICPR), 2022, pp. 542–549.

[47]

Y. Shi, J.-X. Cai, Y. Shavit, T.-J. Mu, W. Feng, and K. Zhang, “ClusterGNN: Cluster-based coarse-to-fine graph neural network for efficient feature matching,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2022, pp. 12 507–12 516.

[48]

M. Oquab, T. Darcet, T. Moutakanni, H. V. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, R. Howes, P.-Y. Huang, H. Xu, V. Sharma, S.-W. Li, W. Galuba, M. Rabbat, M. Assran, N. Ballas, G. Synnaeve, I. Misra, H. Jégou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “DINOv2: Learning robust visual features without supervision,” ... arXiv, 2023.

[49]

A. Dai, A. X. Chang, M. Savva, M. Halber, T. Funkhouser, and M. Nießner, “ScanNet: Richly-annotated 3D reconstructions of indoor scenes,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), 2017, pp. 5828–5839.

[50]

H. Taira, M. Okutomi, T. Sattler, M. Cimpoi, M. Pollefeys, J. Sivic, T. Pajdla, and A. Torii, “InLoc: Indoor visual localization with dense matching and view synthesis,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 7199–7209.

[51]

S. Zhang and J. Ma, “DiffGlue: Diffusion-aided image feature matching,” in Proc. ACM Int. Conf. Multimedia, 2024, pp. 8451–8460.

[52]

L. Cavalli, V. Larsson, M. R. Oswald, T. Sattler, and M. Pollefeys, “Handcrafted outlier detection revisited,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2020, pp. 770–787.

[53]

Y. Deng, K. Zhang, S. Zhang, Y. Li, and J. Ma, “ResMatch: Residual attention learning for feature matching,” in Proc. AAAI Conf. Artif. Intell. (AAAI), 2024, pp. 1501–1509.

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY

[54]

T. Sattler, W. Maddern, C. Toft, A. Torii, L. Hammarstrand, E. Stenborg, D. Safari, M. Okutomi, M. Pollefeys, J. Sivic, F. Kahl, and T. Pajdla, “Benchmarking 6DOF outdoor visual localization in changing conditions,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2018, pp. 8601–8610.

[55]

P.-E. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk, “From coarse to fine: Robust hierarchical localization at large scale,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), 2019, pp. 12 708–12 717.

[56]

H. Chen, Z. Luo, L. Zhou, Y. Tian, M. Zhen, T. Fang, D. McKinnon, Y. Tsin, and L. Quan, “ASpanFormer: Detector-free image matching with adaptive span Transformer,” in Proc. Eur. Conf. Comput. Vis. (ECCV), 2022, pp. 20–36.

Songlin Du received the Ph.D. degree in Physics from Lanzhou University, Lanzhou, China, and the second Ph.D. degree in Engineerin...

He is currently a Professor at Tongji University, Shanghai, China. He has published over 50 papers in journals and conferences including IEEE TPAMI/TIP, IJCV, ICCV, and ECCV. He was awarded the Best Ph.D. Thesis Award by China Society of Image and Graphics (a total of ten awardees in China). He also served on the program committee (PC) of CVPR, IC...

He is currently a Professor with the School of Automation and the Deputy Director of the Detection Technology and Automation Research Institute, Southeast University. He is a co-author of the book An Introduction to the Intelligent Transportation Systems (China Communications Press, Beijing, 2008). His research interests include image processing, sig...

