SceneGlue: Scene-Aware Transformer for Feature Matching without Scene-Level Annotation
Pith reviewed 2026-05-10 13:38 UTC · model grok-4.3
The pith
SceneGlue improves cross-view feature matching by adding implicit and explicit scene awareness trained only on local matches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SceneGlue uses a hybrid matching approach that combines implicit parallel attention across local descriptors with an explicit Visibility Transformer that classifies features into visible and invisible regions. Together these supply global scene context that local descriptors alone cannot provide, and the model trains exclusively on local feature matches, without any scene-level ground-truth annotations.
What carries the argument
A hybridizable matching paradigm: parallel attention exchanges global context within and across images, while the Visibility Transformer explicitly estimates cross-view visibility and labels regions as visible or invisible.
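To make the implicit half of that pairing concrete, here is a minimal sketch of one parallel attention step in dependency-free Python. It assumes, purely for illustration, that self-attention and cross-attention messages are both computed from the same input descriptors and added back residually. The names `softmax`, `attend`, and `parallel_attention` are hypothetical, and learned projections, multiple heads, and positional encodings are left out. The layer in the paper may well differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        out.append([
            sum(w * v[c] for w, v in zip(weights, values))
            for c in range(len(values[0]))
        ])
    return out


def parallel_attention(desc_a, desc_b):
    """One update step (illustrative assumption, not the paper's exact layer).

    Self and cross messages are both read from the input state, so neither
    branch waits for the other, and both are added residually.
    """
    def update(own, other):
        self_msg = attend(own, own, own)       # context within the same image
        cross_msg = attend(own, other, other)  # context from the other view
        return [
            [x + s + c for x, s, c in zip(d, sm, cm)]
            for d, sm, cm in zip(own, self_msg, cross_msg)
        ]

    return update(desc_a, desc_b), update(desc_b, desc_a)
```

With a single keypoint per image every attention weight is 1, so `parallel_attention([[1.0, 2.0]], [[3.0, 4.0]])` returns `([[5.0, 8.0]], [[7.0, 10.0]])`: each descriptor plus its own self message plus the other image's descriptor.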
If this is right
- Homography estimation between image pairs becomes more accurate because visible-region cues reduce mismatches in overlapping areas.
- Camera pose estimation improves when the model can distinguish invisible scene parts that would otherwise produce false correspondences.
- Visual localization tasks gain robustness since global context helps match features even under large viewpoint changes.
- Interpretability increases because the explicit visibility output shows which scene regions the matcher considered for each correspondence.
Where Pith is reading between the lines
- The same visibility-labeling idea could be tested on other correspondence problems such as video tracking where scene parts enter and leave the frame.
- If the method works without scene labels, it might allow larger training sets drawn from existing local-match datasets that lack expensive annotations.
- Combining the approach with semantic segmentation could further refine which scene elements are treated as visible or occluded.
Load-bearing premise
Local feature matches by themselves supply enough signal to train accurate cross-view visibility estimates and global scene context without any scene-level ground-truth annotations.
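One plausible reading of this premise, offered as an assumption rather than as the paper's stated procedure, is that match labels double as visibility pseudo-labels: a keypoint with a ground-truth match is treated as visible in the other view, and an unmatched keypoint as invisible. The sketch below derives such labels and scores predicted visibility probabilities with binary cross-entropy. The function names and the `-1` convention for unmatched keypoints are hypothetical.

```python
import math


def visibility_labels_from_matches(matches_a_to_b, num_b):
    """Derive pseudo-labels from local matches (assumed scheme, not from the paper).

    matches_a_to_b[i] is the index in image B matched to keypoint i of image A,
    or -1 when keypoint i has no match.
    """
    visible_a = [j >= 0 for j in matches_a_to_b]
    visible_b = [False] * num_b
    for j in matches_a_to_b:
        if j >= 0:
            visible_b[j] = True
    return visible_a, visible_b


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and boolean labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        total -= math.log(p) if y else math.log(1.0 - p)
    return total / len(probs)
```

Under this scheme the labels are only as complete as the matches: a truly co-visible point whose keypoint was never matched would be marked invisible, which is the kind of gap the premise has to survive.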
What would settle it
Running the full SceneGlue pipeline on standard matching benchmarks and finding no accuracy gain over a baseline that uses only local descriptors would indicate that the added scene-awareness components are not delivering the claimed compensation.
Original abstract
Local feature matching plays a critical role in understanding the correspondence between cross-view images. However, traditional methods are constrained by the inherent local nature of feature descriptors, limiting their ability to capture non-local scene information that is essential for accurate cross-view correspondence. In this paper, we introduce SceneGlue, a scene-aware feature matching framework designed to overcome these limitations. SceneGlue leverages a hybridizable matching paradigm that integrates implicit parallel attention and explicit cross-view visibility estimation. The parallel attention mechanism simultaneously exchanges information among local descriptors within and across images, enhancing the scene's global context. To further enrich the scene awareness, we propose the Visibility Transformer, which explicitly categorizes features into visible and invisible regions, providing an understanding of cross-view scene visibility. By combining explicit and implicit scene-level awareness, SceneGlue effectively compensates for the local descriptor constraints. Notably, SceneGlue is trained using only local feature matches, without requiring scene-level groundtruth annotations. This scene-aware approach not only improves accuracy and robustness but also enhances interpretability compared to traditional methods. Extensive experiments on applications such as homography estimation, pose estimation, image matching, and visual localization validate SceneGlue's superior performance. The source code is available at https://github.com/songlin-du/SceneGlue.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SceneGlue, a transformer-based framework for local feature matching that augments standard descriptors with implicit scene context (via parallel attention exchanging information within and across images) and explicit scene awareness (via a Visibility Transformer that categorizes features as visible or invisible). The model is trained end-to-end using only local feature match supervision and no scene-level ground-truth annotations, with claims of improved accuracy, robustness, and interpretability on homography estimation, pose estimation, image matching, and visual localization tasks. Source code is released.
Significance. If the Visibility Transformer indeed recovers geometrically meaningful cross-view visibility from local-match supervision alone, the hybrid explicit-implicit design would meaningfully extend local feature matching beyond descriptor limitations, offering both performance gains and added interpretability. The public code release strengthens the contribution by enabling direct verification and extension.
major comments (2)
- [§3.2] §3.2 (Visibility Transformer description): The architecture is trained solely via the indirect matching loss on local correspondences, with no scene-level ground truth or auxiliary regularization. This leaves open the possibility that the visibility head converges to a locally consistent but geometrically inaccurate proxy, particularly when matches are sparse under large viewpoint changes. Direct evidence (e.g., quantitative alignment of predicted visibility maps with pose-derived or depth-derived visibility) is required to substantiate the central claim that explicit visibility estimation supplies true scene-level awareness.
- [§4] §4 (Experiments and ablations): While superior performance is reported on standard benchmarks, the manuscript does not present ablations that isolate the contribution of the Visibility Transformer versus the parallel attention mechanism, nor does it quantify how much each component improves over a baseline transformer matcher. Without these controls, the attribution of gains specifically to the hybrid scene-aware design remains under-supported.
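The check requested in the first major comment could be sketched roughly as follows. The first function derives a reference visibility label for one keypoint from its depth and the relative camera pose, using a plain pinhole reprojection. It assumes both cameras share the same intrinsics, and it ignores occlusion, which would need the second view's depth map. The second function scores predicted visibility probabilities against such reference labels. All names, the intrinsics layout `(fx, fy, cx, cy)`, and the 0.5 threshold are assumptions for illustration.

```python
def reproject_in_view(u, v, depth, intrinsics, rotation, translation, width, height):
    """Return True if pixel (u, v) of view A lands inside view B's image bounds.

    Simplified reference label: same intrinsics for both views, no occlusion test.
    """
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel into camera-A coordinates using its depth.
    point = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Rigid transform into camera-B coordinates: X_b = R * X_a + t.
    moved = [
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    ]
    if moved[2] <= 0:  # behind camera B
        return False
    u_b = fx * moved[0] / moved[2] + cx
    v_b = fy * moved[1] / moved[2] + cy
    return 0 <= u_b < width and 0 <= v_b < height


def visibility_agreement(pred_probs, ref_visible, threshold=0.5):
    """Precision, recall, and IoU of thresholded predictions against reference labels."""
    pred = [p >= threshold for p in pred_probs]
    tp = sum(1 for p, r in zip(pred, ref_visible) if p and r)
    fp = sum(1 for p, r in zip(pred, ref_visible) if p and not r)
    fn = sum(1 for p, r in zip(pred, ref_visible) if not p and r)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "iou": tp / (tp + fp + fn) if tp + fp + fn else 0.0,
    }
```

Reporting these numbers separately for small and large viewpoint changes would speak directly to the concern about sparse matches.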
minor comments (2)
- [Abstract / §3] The term 'hybridizable matching paradigm' is introduced in the abstract but not formally defined or contrasted with prior matching paradigms in the method section; a concise definition or diagram would improve clarity.
- [§3.1–3.3] Notation for the parallel attention and visibility heads (e.g., symbols for visible/invisible logits) should be introduced once and used consistently to avoid reader confusion when tracing the loss terms.
Simulated Author's Rebuttal
Thank you for the constructive comments on our paper. We address the major concerns point by point below and outline the revisions we will make to strengthen the manuscript.
- Referee: [§3.2] §3.2 (Visibility Transformer description): The architecture is trained solely via the indirect matching loss on local correspondences, with no scene-level ground truth or auxiliary regularization. This leaves open the possibility that the visibility head converges to a locally consistent but geometrically inaccurate proxy, particularly when matches are sparse under large viewpoint changes. Direct evidence (e.g., quantitative alignment of predicted visibility maps with pose-derived or depth-derived visibility) is required to substantiate the central claim that explicit visibility estimation supplies true scene-level awareness.
Authors: We thank the referee for highlighting this important point. The Visibility Transformer is indeed trained indirectly through the matching loss without explicit scene-level supervision. While we believe the performance improvements and qualitative visualizations in the original manuscript support its effectiveness, we agree that direct quantitative validation is valuable. In the revised manuscript, we will include experiments that compare the predicted visibility maps against visibility derived from ground-truth poses and depth information on appropriate datasets (e.g., those with available 3D data). This will provide the requested evidence for the geometric accuracy of the visibility estimation.
revision: yes
- Referee: [§4] §4 (Experiments and ablations): While superior performance is reported on standard benchmarks, the manuscript does not present ablations that isolate the contribution of the Visibility Transformer versus the parallel attention mechanism, nor does it quantify how much each component improves over a baseline transformer matcher. Without these controls, the attribution of gains specifically to the hybrid scene-aware design remains under-supported.
Authors: We acknowledge the lack of detailed ablations isolating the contributions of the parallel attention and the Visibility Transformer. To better attribute the performance gains to the hybrid design, we will add ablation studies in the revised version. These will include: a baseline transformer matcher, variants with only parallel attention, only Visibility Transformer, and the full model. Results will be reported on the main benchmarks to quantify the improvement from each component over the baseline.
revision: yes
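The four variants named in the second response form a 2x2 grid over the two components. A small sketch of how that grid and the per-variant gains over the baseline might be organized is shown below. The variant names, the flag keys, and the helper functions are hypothetical, and no benchmark numbers are implied.

```python
from itertools import product


def ablation_variants():
    """Enumerate the 2x2 grid: each component switched off or on."""
    names = {
        (False, False): "baseline",
        (True, False): "parallel_attention_only",
        (False, True): "visibility_transformer_only",
        (True, True): "full",
    }
    return [
        {
            "name": names[(use_pa, use_vt)],
            "parallel_attention": use_pa,
            "visibility_transformer": use_vt,
        }
        for use_pa, use_vt in product([False, True], repeat=2)
    ]


def gains_over_baseline(scores):
    """Difference of each variant's score from the baseline score.

    `scores` maps variant name to one metric value on one benchmark.
    """
    base = scores["baseline"]
    return {name: value - base for name, value in scores.items() if name != "baseline"}
```

Comparing the gain of the full model with the sum of the two single-component gains would also show whether the components are complementary or largely redundant.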
Circularity Check
No circularity: new architecture with independent training signal
The paper presents SceneGlue as a novel hybrid matching framework that adds a Visibility Transformer and parallel attention to local descriptors. These are architectural additions, not derivations that reduce to prior equations or self-fitted quantities. Training uses only local feature match supervision without scene-level ground truth; this is an empirical learning claim, not a mathematical tautology where a 'prediction' is defined as the input. No self-citations, uniqueness theorems, or ansatzes from prior author work are invoked as load-bearing steps. The derivation chain consists of standard transformer blocks plus new heads, with no reduction by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- model hyperparameters
axioms (1)
- domain assumption: Local feature matches alone suffice to supervise scene-aware visibility estimation
invented entities (2)
- Visibility Transformer: no independent evidence
- Parallel attention mechanism: no independent evidence