SA-Homo: Scale Adaptive Homography Estimation for Scale Variation Scenarios

Haifeng Wu; Huarong Jia; Shangxuan Xie; Wen Li; Yuhang Wang

arxiv: 2606.30408 · v1 · pith:Q3BHYPH2new · submitted 2026-06-29 · 💻 cs.CV

Shangxuan Xie, Haifeng Wu, Yuhang Wang, Huarong Jia, Wen Li

Pith reviewed 2026-06-30 06:03 UTC · model grok-4.3

classification 💻 cs.CV

keywords: homography estimation, scale variation, scale adaptive, deep learning, computer vision, satellite imagery, HMSA dataset

The pith

SA-Homo uses a global-to-local module sequence to estimate homography accurately when image scales differ by up to eight times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SA-Homo to solve homography estimation when image pairs show large scale differences that break standard local-feature assumptions. It applies a heavy module first to align scales from a global view using attention and similarity structures, then hands the result to a lightweight module for local polishing. The authors also release the HMSA dataset of high-resolution multi-modal satellite images to test these conditions. If the two-stage bridge works as described, homography-based tasks such as image registration can maintain precision across wider zoom ranges without extra preprocessing. Experiments claim the method beats prior approaches on both ordinary and extreme scale-variation cases.

Core claim

SA-Homo maintains high precision even under 8× scale discrepancies by adopting a hierarchical scale alignment strategy that transitions from the global perspective with a heavy module to a local perspective with a light module. The Scale-aware Discrepancy Bridging Module uses Multi-scale Linear Attention Cascade to capture long-range dependencies and a Cross-scale Similarity Matrix Block for robust correlation, after which the Iterative Homography Estimation Refinement Module progressively refines the result using local correlations.
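To make the idea of a global correlation concrete, the following minimal sketch builds an all-pairs cosine-similarity matrix between two feature sets of different sizes, as might arise when one image covers the scene at a coarser scale. It is a generic illustration and not the paper's CSMB; the function names and the toy feature vectors are assumptions introduced here.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def similarity_matrix(feats_a, feats_b):
    """All-pairs cosine similarity: entry [i][j] compares feats_a[i] with feats_b[j]."""
    a = [l2_normalize(v) for v in feats_a]
    b = [l2_normalize(v) for v in feats_b]
    return [[sum(x * y for x, y in zip(va, vb)) for vb in b] for va in a]

# Hypothetical toy features: 3 locations in image A, 2 locations in image B.
feats_a = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
feats_b = [[3.0, 0.0], [0.0, 0.5]]

sim = similarity_matrix(feats_a, feats_b)

# For each location in A, the index of its most similar location in B.
best_match = [max(range(len(row)), key=row.__getitem__) for row in sim]
```

Because every location in one image is compared with every location in the other, a matrix of this kind places no limit on how far a match may lie, which is why a global view is a natural fit for large scale gaps.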

What carries the argument

A hierarchical scale alignment strategy: the Scale-aware Discrepancy Bridging Module (SDBM), built from a Multi-scale Linear Attention Cascade (MLAC) and a Cross-scale Similarity Matrix Block (CSMB), first reduces the global scale gap, and the lightweight Iterative Homography Estimation Refinement Module (IHERM) then refines the result.
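The two-stage idea can be pictured as a composition of homographies: a coarse transform that absorbs most of the scale gap, followed by a small residual correction. The sketch below only illustrates that composition; `h_coarse` and `h_residual` are made-up matrices, not outputs of SDBM or IHERM.

```python
def matmul3(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def warp_point(h, x, y):
    """Apply homography h to the point (x, y) using homogeneous coordinates."""
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return u / w, v / w

# Stage 1 (hypothetical coarse output): an 8x zoom plus a translation.
h_coarse = [[8.0, 0.0, 40.0], [0.0, 8.0, -24.0], [0.0, 0.0, 1.0]]
# Stage 2 (hypothetical residual): a small correction applied after stage 1.
h_residual = [[1.02, 0.01, 1.5], [-0.01, 0.98, -2.0], [1e-5, 0.0, 1.0]]

# The residual acts on the output of the coarse stage, so it multiplies from the left.
h_total = matmul3(h_residual, h_coarse)

p = (10.0, 20.0)
step_by_step = warp_point(h_residual, *warp_point(h_coarse, *p))
composed = warp_point(h_total, *p)
```

Warping a point through the two stages in sequence gives the same result as warping it once with `h_total`, so the final estimate can be reported as a single 3×3 matrix regardless of how many stages produced it.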

If this is right

Homography estimation remains accurate in satellite and multi-modal image pairs that differ by factors up to 8 in scale.
The same framework improves results on conventional scale-similar pairs as well as the new challenging cases.
The released HMSA dataset supplies a benchmark for developing and comparing future scale-robust methods.
The two-stage design allows heavy computation only at the start and keeps later refinement efficient.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same global-to-local bridging pattern may transfer to other geometric tasks such as essential-matrix estimation when focal lengths or distances vary.
Image-registration pipelines could drop explicit scale-normalization steps if the bridging module generalizes beyond the tested satellite domain.
Real-time video applications with changing camera distances could be tested to measure whether the refinement stage stays fast enough for live use.

Load-bearing premise

The initial global bridging step reduces the scale gap enough for the later lightweight local module to finish the alignment without itself handling the full discrepancy.
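One way to probe this premise numerically is to compare how far the image corners still have to move before and after a coarse alignment. The sketch below does that for a hypothetical 128×128 pair; the matrices `h_true` and `h_coarse` are invented for illustration and do not come from the paper.

```python
import math

def warp_point(h, x, y):
    """Apply homography h to the point (x, y) using homogeneous coordinates."""
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return u / w, v / w

def max_corner_displacement(h_a, h_b, width, height):
    """Largest distance between the images of the four corners under h_a and h_b."""
    corners = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    return max(math.dist(warp_point(h_a, x, y), warp_point(h_b, x, y))
               for x, y in corners)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical ground truth with roughly an 8x scale change.
h_true = [[8.1, 0.05, 42.0], [-0.04, 7.9, -20.0], [0.0, 0.0, 1.0]]
# Hypothetical coarse estimate that captures the bulk of the scale change.
h_coarse = [[8.0, 0.0, 40.0], [0.0, 8.0, -24.0], [0.0, 0.0, 1.0]]

# Displacement a local module would face with no bridging at all.
gap_without_bridging = max_corner_displacement(h_true, identity, 128, 128)
# Displacement left over once the coarse estimate has been applied.
gap_after_bridging = max_corner_displacement(h_true, h_coarse, 128, 128)
```

If the leftover displacement stays within what a local correlation window can cover, the premise holds for that pair; if the coarse stage leaves a large remainder, the lightweight stage inherits the hard part of the problem.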

What would settle it

On the HMSA dataset, a direct comparison showing that SA-Homo's homography error under 8× scale difference is not lower than that of current state-of-the-art methods would falsify the performance claim.
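Such a comparison needs a concrete error measure. The Figure 12 caption mentions MACE; assuming this is the mean average corner error commonly used in the deep homography literature, a minimal sketch of the metric looks like this. The function names and the toy homographies are assumptions introduced here.

```python
import math

def warp_point(h, x, y):
    """Apply homography h to the point (x, y) using homogeneous coordinates."""
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return u / w, v / w

def average_corner_error(h_pred, h_true, width, height):
    """Mean distance between predicted and true positions of the four image corners."""
    corners = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    errors = [math.dist(warp_point(h_pred, x, y), warp_point(h_true, x, y))
              for x, y in corners]
    return sum(errors) / len(errors)

def mace(pairs, width, height):
    """Average the per-pair corner error over a list of (h_pred, h_true) pairs."""
    return sum(average_corner_error(p, t, width, height) for p, t in pairs) / len(pairs)

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical prediction that is off by a pure translation of (3, 4) pixels.
shift = [[1.0, 0.0, 3.0], [0.0, 1.0, 4.0], [0.0, 0.0, 1.0]]

pairs = [(identity, identity), (shift, identity)]
score = mace(pairs, 128, 128)
```

Reporting this score separately for each scale ratio, rather than pooled over a dataset, is what would show whether accuracy actually holds at the 8× end of the range.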

Figures

Figures reproduced from arXiv: 2606.30408 by Haifeng Wu, Huarong Jia, Shangxuan Xie, Wen Li, Yuhang Wang.

**Figure 2.** Impact of increasing scale discrepancy ratio on estimation accuracy across three datasets: (a) MSCOCO […

**Figure 3.** Overview of the proposed SA-Homo. The architecture consists of two main modules: (a) the Scale-aware Discrep…

**Figure 5.** (a) Illustration of the Multi-scale Linear Attention …

**Figure 6.** (a) Illustration of the local correlation computation. (b) …

**Figure 7.** Overview of the data generation pipeline.

**Figure 8.** Scale Discrepancy Ratio (SDR) distributions of the validation datasets. (a)–(c) illustrate the SDR distributions of the …

**Figure 9.** Performance comparison under two settings: the scale variation scenario on (a) MSCOCO […

**Figure 10.** The Visualization results on (a) MSCOCO dataset […

**Figure 11.** Visualization of boundary cases, where only part of the …

**Figure 12.** Runtime and efficiency analysis. We present the inference latency, total parameters, and MACE of state-of-the-art …

**Figure 13.** Ablation studies on the HMSA dataset. We investigate the impact of: (a) attention mechanisms (linear vs. multi-scale) …

Original abstract

Homography estimation, as one of the fundamental problems in computer vision, remains challenged by scale variation scenarios where image pairs potentially exhibit significant scale discrepancies. Existing deep learning frameworks frequently suffer from a significant performance degradation in such cases, as they rely on limited displacement assumptions and local feature consistency that might not hold under large scale gaps. In this paper, we propose SA-Homo, a novel scale-adaptive homography estimation framework designed to achieve robust alignment across a wide range of scale discrepancy ratios. We adopt a hierarchical scale alignment strategy that transitions from the global perspective with a heavy module to a local perspective with a light module. Specifically, we introduce the Scale-aware Discrepancy Bridging Module (SDBM) for initial alignment, which utilizes a Multi-scale Linear Attention Cascade (MLAC) to capture long-range dependencies and mitigate feature inconsistencies, along with a global Cross-scale Similarity Matrix Block (CSMB) for scale robust correlation representation. Once the initial scale gap is bridged, a lightweight Iterative Homography Estimation Refinement Module (IHERM) progressively polishes the result using local correlations. To facilitate this research, we contribute the HMSA dataset, a high-resolution, multi-modal satellite benchmark specifically tailored for scale-variant challenges. Extensive experiments demonstrate that SA-Homo maintains high precision even under 8× scale discrepancies, outperforming state-of-the-art methods in both conventional scale-similar scenarios and challenging scale variation scenarios. Code and collected datasets are available at https://github.com/shangxuanx330/SA_Homo

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SA-Homo adds a hierarchical global-to-local pipeline and a new satellite dataset for large-scale homography gaps, but the abstract gives no numbers or ablations to judge whether the gains hold up.

The main thing to know is that this paper targets scale variation in homography estimation with a two-stage setup: a heavier Scale-aware Discrepancy Bridging Module that uses multi-scale linear attention and a cross-scale similarity block to reduce large gaps first, then a lightweight iterative refinement module to polish the result. They also release the HMSA dataset of high-resolution multi-modal satellite pairs built specifically for this problem.

The work is solid in identifying a practical weakness in existing deep homography networks, which often assume limited displacement or similar scales. The hierarchical split makes internal sense (handle the global mismatch before local correlation), and releasing code plus data is a clear positive for anyone who wants to test or extend it. The stress-test note is right that the pipeline description does not contain an obvious internal contradiction.

The soft spot is the lack of visible experimental grounding. The abstract claims better results at 8× scale gaps and in both scale-similar and scale-variant cases, yet supplies no baselines, error tables, ablation results, or implementation specifics. Without those, it is impossible to tell whether the new modules deliver real improvement or whether the gains depend on particular training choices or dataset quirks. The reader's low soundness score tracks with this gap.

This paper is for computer vision groups working on geometric tasks in remote sensing or multi-view settings where scale changes matter. A reader who needs a benchmark for scale-variant homography or wants to try the released code would get direct value. It is worth sending to peer review because the dataset and code release are concrete contributions even if the method itself is an engineering combination rather than a conceptual leap.

Referee Report

0 major / 3 minor

Summary. The paper claims to introduce SA-Homo, a scale-adaptive homography estimation framework that uses a hierarchical scale alignment strategy. It features the Scale-aware Discrepancy Bridging Module (SDBM) utilizing Multi-scale Linear Attention Cascade (MLAC) to capture long-range dependencies and Cross-scale Similarity Matrix Block (CSMB) for scale robust correlation, followed by the lightweight Iterative Homography Estimation Refinement Module (IHERM) for progressive refinement. The work also presents the HMSA dataset for scale-variant challenges and demonstrates through experiments that SA-Homo outperforms state-of-the-art methods in both conventional and challenging scale variation scenarios, maintaining high precision under 8× scale discrepancies.

Significance. If validated, the results would represent a meaningful advance in handling scale variations in homography estimation, which is critical for multi-modal and satellite imaging applications. The contribution of the HMSA dataset and the open-sourcing of code are notable strengths that facilitate reproducibility and future work in the area.

minor comments (3)

[Abstract] The description of the modules is high-level; consider adding a brief mention of key performance metrics or number of baselines in the abstract to strengthen the claim.
[Method] Ensure that the transition from global to local perspective is clearly motivated with references to prior work on hierarchical methods if applicable.
[Experiments] Verify that all figures and tables have clear captions and that error bars or statistical significance are reported where appropriate.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work, recognition of its significance for multi-modal and satellite imaging applications, and recommendation for minor revision. We appreciate the acknowledgment of the HMSA dataset and code release as strengths for reproducibility.

Circularity Check

0 steps flagged

No significant circularity detected

The paper proposes a new neural architecture (SA-Homo) with modules SDBM (MLAC + CSMB) and IHERM, plus a contributed dataset HMSA, and validates performance claims empirically on scale-variation benchmarks. No mathematical derivation chain, fitted-parameter predictions, or self-referential equations appear; the central claim rests on experimental results rather than reducing to inputs by construction. No load-bearing self-citations or ansatz smuggling are present in the text.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim depends on the unverified effectiveness of the proposed hierarchical modules for bridging scale gaps and on standard deep learning training assumptions; the full paper would be needed to list all design choices.

free parameters (1)

Network hyperparameters and training settings for MLAC, CSMB, and IHERM
Deep learning models contain many tunable parameters whose values are fitted to data during training.

axioms (1)

domain assumption: Multi-scale linear attention can capture long-range dependencies to mitigate feature inconsistencies under large scale gaps
Invoked to justify the SDBM design for initial alignment.
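For readers unfamiliar with linear attention, the sketch below shows the generic kernelized form described by Katharopoulos et al. (reference [44]), using the feature map elu(x) + 1, and checks it against an explicit all-pairs computation. It is not the paper's MLAC and has no multi-scale component; the toy queries, keys, and values are assumptions introduced here.

```python
import math

def phi(vec):
    """Positive feature map elu(x) + 1, applied element-wise."""
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]

def linear_attention(queries, keys, values):
    """Non-causal linear attention: keys and values are summarized once, then reused."""
    d, dv = len(keys[0]), len(values[0])
    kv = [[0.0] * dv for _ in range(d)]  # sum over tokens of phi(k) v^T
    k_sum = [0.0] * d                    # sum over tokens of phi(k)
    for k, v in zip(keys, values):
        fk = phi(k)
        for a in range(d):
            k_sum[a] += fk[a]
            for b in range(dv):
                kv[a][b] += fk[a] * v[b]
    out = []
    for q in queries:
        fq = phi(q)
        denom = sum(fq[a] * k_sum[a] for a in range(d))
        out.append([sum(fq[a] * kv[a][b] for a in range(d)) / denom
                    for b in range(dv)])
    return out

def quadratic_reference(queries, keys, values):
    """Same output via an explicit weight for every query-key pair, for checking."""
    out = []
    for q in queries:
        fq = phi(q)
        weights = [sum(x * y for x, y in zip(fq, phi(k))) for k in keys]
        total = sum(weights)
        out.append([sum(w * v[b] for w, v in zip(weights, values)) / total
                    for b in range(len(values[0]))])
    return out

# Hypothetical toy inputs: 2 query tokens, 3 key/value tokens, feature size 2.
queries = [[0.5, -1.0], [1.0, 2.0]]
keys = [[0.1, 0.2], [-0.3, 0.4], [1.0, -1.0]]
values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]

fast = linear_attention(queries, keys, values)
slow = quadratic_reference(queries, keys, values)
```

Every query still draws on every key, which is the long-range property the assumption relies on, but the cost grows linearly with the number of tokens because the key-value summary is computed once rather than per query.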

invented entities (2)

Scale-aware Discrepancy Bridging Module (SDBM) · no independent evidence
purpose: Perform initial global scale alignment using MLAC and CSMB
New module introduced by the paper to address scale variation.
Iterative Homography Estimation Refinement Module (IHERM) · no independent evidence
purpose: Progressively refine alignment using local correlations
New lightweight module introduced by the paper.

pith-pipeline@v0.9.1-grok · 5819 in / 1522 out tokens · 37742 ms · 2026-06-30T06:03:13.599371+00:00 · methodology

Reference graph

Works this paper leans on

52 extracted references · 4 canonical work pages · 3 internal anchors

[1]

Fast georeferenced aerial image stitching with absolute rotation averaging and planar-restricted pose graph,

Y. Zhao, G. Liu, S. Xu, S. Bu, H. Jiang, and G. Wan, “Fast georeferenced aerial image stitching with absolute rotation averaging and planar-restricted pose graph,” IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 4, pp. 3502–3517, 2020

2020
[2]

Seam-adaptive structure-preserving image stitching for drone images,

J. Li and Y. Zhou, “Seam-adaptive structure-preserving image stitching for drone images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–12, 2024

2024
[3]

Megastitch: Robust large-scale image stitching,

A. Zarei, E. Gonzalez, N. Merchant, D. Pauli, E. Lyons, and K. Barnard, “Megastitch: Robust large-scale image stitching,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–9, 2022

2022
[4]

Multimodal image fusion framework for end-to-end remote sensing image registration,

L. Li, L. Han, M. Ding, and H. Cao, “Multimodal image fusion framework for end-to-end remote sensing image registration,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–14, 2023

2023
[5]

Uncertainty guided deep lucas-kanade homography for multimodal image alignment,

Z. Zhou, J. Luo, Q. Zhu, Y. Wang, H. Zhong, M. Feng, and L. Chen, “Uncertainty guided deep lucas-kanade homography for multimodal image alignment,” IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–14, 2024

2024
[6]

Rcvs: A unified registration and fusion framework for video streams,

H. Xie, M. Sang, Y. Zhang, Y. Yang, S. Zhao, and J. Zhong, “Rcvs: A unified registration and fusion framework for video streams,” IEEE Transactions on Multimedia, vol. 26, pp. 11031–11043, 2024

2024
[7]

An integrated inter-frame stabilization and fast imaging method for video synthetic aperture radar,

S. Wang, G. Wang, Y. Wang, R. Zhou, M. Zhao, and Y. Wang, “An integrated inter-frame stabilization and fast imaging method for video synthetic aperture radar,” IEEE Transactions on Geoscience and Remote Sensing, 2025

2025
[8]

Cinematic-l1 video stabilization with a log-homography model,

A. Bradley, J. Klivington, J. Triscari, and R. van der Merwe, “Cinematic-l1 video stabilization with a log-homography model,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 1041–1049

2021
[9]

Dut: Learning video stabilization by simply watching unstable videos,

Y. Xu, J. Zhang, S. J. Maybank, and D. Tao, “Dut: Learning video stabilization by simply watching unstable videos,” IEEE Transactions on Image Processing, vol. 31, pp. 4306–4320, 2022

2022
[10]

Homography decomposition networks for planar object tracking,

X. Zhan, Y. Liu, J. Zhu, and Y. Li, “Homography decomposition networks for planar object tracking,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 3, 2022, pp. 3234–3242

2022
[11]

Smalltrack: Wavelet pooling and graph enhanced classification for uav small object tracking,

Y. Xue, G. Jin, T. Shen, L. Tan, N. Wang, J. Gao, and L. Wang, “Smalltrack: Wavelet pooling and graph enhanced classification for uav small object tracking,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2023

2023
[12]

Aerial image registration for tracking,

M. E. Linger and A. A. Goshtasby, “Aerial image registration for tracking,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 4, pp. 2137–2145, 2014

2014
[13]

Deep Image Homography Estimation

D. DeTone, T. Malisiewicz, and A. Rabinovich, “Deep image homography estimation,” arXiv preprint arXiv:1606.03798, 2016

2016
[14]

Homography estimation from image pairs with hierarchical convolutional networks,

F. Erlik Nowruzi, R. Laganiere, and N. Japkowicz, “Homography estimation from image pairs with hierarchical convolutional networks,” in Proceedings of the IEEE international conference on computer vision workshops, 2017, pp. 913–920

2017
[15]

Stn-homography: Direct estimation of homography parameters for image pairs,

Q. Zhou and X. Li, “Stn-homography: Direct estimation of homography parameters for image pairs,” Applied Sciences, vol. 9, no. 23, p. 5187, 2019

2019
[16]

Clkn: Cascaded lucas-kanade networks for image alignment,

C.-H. Chang, C.-N. Chou, and E. Y. Chang, “Clkn: Cascaded lucas-kanade networks for image alignment,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2213–2221

2017
[17]

Deep lucas-kanade homography for multimodal image alignment,

Y. Zhao, X. Huang, and Z. Zhang, “Deep lucas-kanade homography for multimodal image alignment,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 15950–15959

2021
[18]

Image stitching via deep homography estimation,

Q. Zhao, Y. Ma, C. Zhu, C. Yao, B. Feng, and F. Dai, “Image stitching via deep homography estimation,” Neurocomputing, vol. 450, pp. 219–229, 2021

2021
[19]

Iterative deep homography estimation,

S.-Y. Cao, J. Hu, Z. Sheng, and H.-L. Shen, “Iterative deep homography estimation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1879–1888

2022
[20]

Recurrent homography estimation using homography-guided image warping and focus transformer,

S.-Y. Cao, R. Zhang, L. Luo, B. Yu, Z. Sheng, J. Li, and H.-L. Shen, “Recurrent homography estimation using homography-guided image warping and focus transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9833–9842

2023
[21]

Mcnet: Rethinking the core ingredients for accurate and efficient homography estimation,

H. Zhu, S.-Y. Cao, J. Hu, S. Zuo, B. Yu, J. Ying, J. Li, and H.-L. Shen, “Mcnet: Rethinking the core ingredients for accurate and efficient homography estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 25932–25941

2024
[22]

Adapting dense matching for homography estimation with grid-based acceleration,

K. Zhang, Y. Deng, J. Ma, and P. Favaro, “Adapting dense matching for homography estimation with grid-based acceleration,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 6294–6303

2025
[23]

Deep homography estimation for dynamic scenes,

H. Le, F. Liu, S. Zhang, and A. Agarwala, “Deep homography estimation for dynamic scenes,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 7652–7661

2020
[24]

Localtrans: A multiscale local transformer network for cross-resolution homography estimation,

R. Shao, G. Wu, Y. Zhou, Y. Fu, L. Fang, and Y. Liu, “Localtrans: A multiscale local transformer network for cross-resolution homography estimation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 14890–14899

2021
[25]

Microsoft coco: Common objects in context,

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755

2014
[26]

Multi-spectral sift for scene category recognition,

M. Brown and S. Süsstrunk, “Multi-spectral sift for scene category recognition,” in CVPR 2011. IEEE, 2011, pp. 177–184

2011
[27]

Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,

M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981

1981
[28]

Landsat-8: Science and product vision for terrestrial global change research,

D. P. Roy, M. A. Wulder, T. R. Loveland, W. Ce, R. G. Allen, M. C. Anderson, D. Helder, J. R. Irons, D. M. Johnson, R. Kennedy et al., “Landsat-8: Science and product vision for terrestrial global change research,” Remote sensing of Environment, vol. 145, pp. 154–172, 2014

2014
[29]

Distinctive image features from scale-invariant keypoints,

D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, pp. 91–110, 2004

2004
[30]

Surf: Speeded up robust features,

H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” in Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006. Proceedings, Part I 9. Springer, 2006, pp. 404–417

2006
[31]

Orb: An efficient alternative to sift or surf,

E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in 2011 International conference on computer vision. IEEE, 2011, pp. 2564–2571

2011
[32]

Superpoint: Self-supervised interest point detection and description,

D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-supervised interest point detection and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 224–236

2018
[33]

Sosnet: Second order similarity regularization for local descriptor learning,

Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, and V. Balntas, “Sosnet: Second order similarity regularization for local descriptor learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 11016–11025

2019
[34]

D2-net: A trainable cnn for joint description and detection of local features,

M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler, “D2-net: A trainable cnn for joint description and detection of local features,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8092–8101

2019
[35]

Loftr: Detector-free local feature matching with transformers,

J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou, “Loftr: Detector-free local feature matching with transformers,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8922–8931

2021
[36]

Efficient loftr: Semi-dense local feature matching with sparse-like speed,

Y. Wang, X. He, S. Peng, D. Tan, and X. Zhou, “Efficient loftr: Semi-dense local feature matching with sparse-like speed,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 21666–21675

2024
[37]

Robust regression using iteratively reweighted least-squares,

P. W. Holland and R. E. Welsch, “Robust regression using iteratively reweighted least-squares,” Communications in Statistics-theory and Methods, vol. 6, no. 9, pp. 813–827, 1977

1977
[38]

Magsac: marginalizing sample consensus,

D. Barath, J. Matas, and J. Noskova, “Magsac: marginalizing sample consensus,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 10197–10205

2019
[39]

Multiple view geometry in computer vision,

R. Hartley, “Multiple view geometry in computer vision,” 2003

2003
[40]

Codinghomo: Bootstrapping deep homography with video coding,

Y. Liu, H. Li, S. Liu, and B. Zeng, “Codinghomo: Bootstrapping deep homography with video coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 11, pp. 11214–11228, 2024

2024
[41]

Crosshomo: Cross-modality and cross-resolution homography estimation,

X. Deng, E. Liu, C. Gao, S. Li, S. Gu, and M. Xu, “Crosshomo: Cross-modality and cross-resolution homography estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

2024
[42]

Roma: Robust dense feature matching,

J. Edstedt, Q. Sun, G. Bökman, M. Wadenbäck, and M. Felsberg, “Roma: Robust dense feature matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19790–19800

2024
[43]

P2wnet: Homography estimation for part-to-whole and cross-modality scenarios,

S. Xie, H. Wu, W. Li, and L. Duan, “P2wnet: Homography estimation for part-to-whole and cross-modality scenarios,” 06 2025, pp. 1–6

2025
[44]

Transformers are rnns: Fast autoregressive transformers with linear attention,

A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Transformers are rnns: Fast autoregressive transformers with linear attention,” in International conference on machine learning. PMLR, 2020, pp. 5156–5165

2020
[45]

Efficientvit: Multi-scale linear attention for high-resolution dense prediction,

H. Cai, J. Li, M. Hu, C. Gan, and S. Han, “Efficientvit: Multi-scale linear attention for high-resolution dense prediction,” arXiv preprint arXiv:2205.14756, 2022

2022
[46]

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017

2017
[47]

Sinkhorn distances: Lightspeed computation of optimal transport,

M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” Advances in neural information processing systems, vol. 26, 2013

2013
[48]

Siamcorners: Siamese corner networks for visual tracking,

K. Yang, Z. He, W. Pei, Z. Zhou, X. Li, D. Yuan, and H. Zhang, “Siamcorners: Siamese corner networks for visual tracking,” IEEE Transactions on Multimedia, vol. 24, pp. 1956–1967, 2021

2021
[49]

Focal loss for dense object detection,

T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988

2017
[50]

Generalized intersection over union: A metric and a loss for bounding box regression,

H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 658–666

2019
[51]

Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning,

Y. Sun, B. Cao, P. Zhu, and Q. Hu, “Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 10, pp. 6700–6713, 2022

2022
[52]

Decoupled Weight Decay Regularization

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

2017

[1] [1]

Fast georeferenced aerial image stitching with absolute rotation averaging and planar- restricted pose graph,

Y . Zhao, G. Liu, S. Xu, S. Bu, H. Jiang, and G. Wan, “Fast georeferenced aerial image stitching with absolute rotation averaging and planar- restricted pose graph,”IEEE Transactions on Geoscience and Remote Sensing, vol. 59, no. 4, pp. 3502–3517, 2020

2020

[2] [2]

Seam-adaptive structure-preserving image stitching for drone images,

J. Li and Y . Zhou, “Seam-adaptive structure-preserving image stitching for drone images,”IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–12, 2024

2024

[3] [3]

Megastitch: Robust large-scale image stitching,

A. Zarei, E. Gonzalez, N. Merchant, D. Pauli, E. Lyons, and K. Barnard, “Megastitch: Robust large-scale image stitching,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–9, 2022

2022

[4] [4]

Multimodal image fusion framework for end-to-end remote sensing image registration,

L. Li, L. Han, M. Ding, and H. Cao, “Multimodal image fusion framework for end-to-end remote sensing image registration,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–14, 2023

2023

[5] [5]

Uncertainty guided deep lucas-kanade homography for multimodal im- age alignment,

Z. Zhou, J. Luo, Q. Zhu, Y . Wang, H. Zhong, M. Feng, and L. Chen, “Uncertainty guided deep lucas-kanade homography for multimodal im- age alignment,”IEEE Transactions on Geoscience and Remote Sensing, vol. 63, pp. 1–14, 2024

2024

[6] [6]

Rcvs: A unified registration and fusion framework for video streams,

H. Xie, M. Sang, Y . Zhang, Y . Yang, S. Zhao, and J. Zhong, “Rcvs: A unified registration and fusion framework for video streams,”IEEE Transactions on Multimedia, vol. 26, pp. 11 031–11 043, 2024

2024

[7] [7]

An integrated inter-frame stabilization and fast imaging method for video synthetic aperture radar,

S. Wang, G. Wang, Y . Wang, R. Zhou, M. Zhao, and Y . Wang, “An integrated inter-frame stabilization and fast imaging method for video synthetic aperture radar,”IEEE Transactions on Geoscience and Remote Sensing, 2025

2025

[8] [8]

Cinematic- l1 video stabilization with a log-homography model,

A. Bradley, J. Klivington, J. Triscari, and R. van der Merwe, “Cinematic- l1 video stabilization with a log-homography model,” inProceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 1041–1049

2021

[9] [9]

Dut: Learning video stabilization by simply watching unstable videos,

Y . Xu, J. Zhang, S. J. Maybank, and D. Tao, “Dut: Learning video stabilization by simply watching unstable videos,”IEEE Transactions on Image Processing, vol. 31, pp. 4306–4320, 2022

2022

[10] [10]

Homography decomposition networks for planar object tracking,

X. Zhan, Y . Liu, J. Zhu, and Y . Li, “Homography decomposition networks for planar object tracking,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 36, no. 3, 2022, pp. 3234– 3242

2022

[11] [11]

Smalltrack: Wavelet pooling and graph enhanced classification for uav small object tracking,

Y . Xue, G. Jin, T. Shen, L. Tan, N. Wang, J. Gao, and L. Wang, “Smalltrack: Wavelet pooling and graph enhanced classification for uav small object tracking,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–15, 2023

2023

[12] [12]

Aerial image registration for tracking,

M. E. Linger and A. A. Goshtasby, “Aerial image registration for tracking,” IEEE Transactions on Geoscience and Remote Sensing, vol. 53, no. 4, pp. 2137–2145, 2014

2014

[13] [13]

Deep Image Homography Estimation

D. DeTone, T. Malisiewicz, and A. Rabinovich, “Deep image homography estimation,” arXiv preprint arXiv:1606.03798, 2016

2016

[14] [14]

Homography estimation from image pairs with hierarchical convolutional networks,

F. Erlik Nowruzi, R. Laganiere, and N. Japkowicz, “Homography estimation from image pairs with hierarchical convolutional networks,” in Proceedings of the IEEE international conference on computer vision workshops, 2017, pp. 913–920

2017

[15] [15]

Stn-homography: Direct estimation of homography parameters for image pairs,

Q. Zhou and X. Li, “Stn-homography: Direct estimation of homography parameters for image pairs,” Applied Sciences, vol. 9, no. 23, p. 5187, 2019

2019

[16] [16]

Clkn: Cascaded lucas-kanade networks for image alignment,

C.-H. Chang, C.-N. Chou, and E. Y. Chang, “Clkn: Cascaded lucas-kanade networks for image alignment,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2213–2221

2017

[17] [17]

Deep lucas-kanade homography for multimodal image alignment,

Y. Zhao, X. Huang, and Z. Zhang, “Deep lucas-kanade homography for multimodal image alignment,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 15 950–15 959

2021

[18] [18]

Image stitching via deep homography estimation,

Q. Zhao, Y. Ma, C. Zhu, C. Yao, B. Feng, and F. Dai, “Image stitching via deep homography estimation,” Neurocomputing, vol. 450, pp. 219–229, 2021

2021

[19] [19]

Iterative deep homography estimation,

S.-Y. Cao, J. Hu, Z. Sheng, and H.-L. Shen, “Iterative deep homography estimation,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 1879–1888

2022

[20] [20]

Recurrent homography estimation using homography-guided image warping and focus transformer,

S.-Y. Cao, R. Zhang, L. Luo, B. Yu, Z. Sheng, J. Li, and H.-L. Shen, “Recurrent homography estimation using homography-guided image warping and focus transformer,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 9833–9842

2023

[21] [21]

Mcnet: Rethinking the core ingredients for accurate and efficient homography estimation,

H. Zhu, S.-Y. Cao, J. Hu, S. Zuo, B. Yu, J. Ying, J. Li, and H.-L. Shen, “Mcnet: Rethinking the core ingredients for accurate and efficient homography estimation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 25 932–25 941

2024

[22] [22]

Adapting dense matching for homography estimation with grid-based acceleration,

K. Zhang, Y. Deng, J. Ma, and P. Favaro, “Adapting dense matching for homography estimation with grid-based acceleration,” in Proceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 6294–6303

2025

[23] [23]

Deep homography estimation for dynamic scenes,

H. Le, F. Liu, S. Zhang, and A. Agarwala, “Deep homography estimation for dynamic scenes,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 7652–7661

2020

[24] [24]

Localtrans: A multiscale local transformer network for cross-resolution homography estimation,

R. Shao, G. Wu, Y. Zhou, Y. Fu, L. Fang, and Y. Liu, “Localtrans: A multiscale local transformer network for cross-resolution homography estimation,” in Proceedings of the IEEE/CVF international conference on computer vision, 2021, pp. 14 890–14 899

2021

[25] [25]

Microsoft coco: Common objects in context,

T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European conference on computer vision. Springer, 2014, pp. 740–755

2014

[26] [26]

Multi-spectral sift for scene category recognition,

M. Brown and S. Süsstrunk, “Multi-spectral sift for scene category recognition,” in CVPR 2011. IEEE, 2011, pp. 177–184

2011

[27] [27]

Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,

M. A. Fischler and R. C. Bolles, “Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography,” Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981

1981

[28] [28]

Landsat-8: Science and product vision for terrestrial global change research,

D. P. Roy, M. A. Wulder, T. R. Loveland, C. E. Woodcock, R. G. Allen, M. C. Anderson, D. Helder, J. R. Irons, D. M. Johnson, R. Kennedy et al., “Landsat-8: Science and product vision for terrestrial global change research,” Remote sensing of Environment, vol. 145, pp. 154–172, 2014

2014

[29] [29]

Distinctive image features from scale-invariant keypoints,

D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International journal of computer vision, vol. 60, pp. 91–110, 2004

2004

[30] [30]

Surf: Speeded up robust features,

H. Bay, T. Tuytelaars, and L. Van Gool, “Surf: Speeded up robust features,” in Computer Vision–ECCV 2006: 9th European Conference on Computer Vision, Graz, Austria, May 7-13, 2006. Proceedings, Part I 9. Springer, 2006, pp. 404–417

2006

[31] [31]

Orb: An efficient alternative to sift or surf,

E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An efficient alternative to sift or surf,” in 2011 International conference on computer vision. IEEE, 2011, pp. 2564–2571

2011

[32] [32]

Superpoint: Self-supervised interest point detection and description,

D. DeTone, T. Malisiewicz, and A. Rabinovich, “Superpoint: Self-supervised interest point detection and description,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2018, pp. 224–236

2018

[33] [33]

Sosnet: Second order similarity regularization for local descriptor learning,

Y. Tian, X. Yu, B. Fan, F. Wu, H. Heijnen, and V. Balntas, “Sosnet: Second order similarity regularization for local descriptor learning,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 11 016–11 025

2019

[34] [34]

D2-net: A trainable cnn for joint description and detection of local features,

M. Dusmanu, I. Rocco, T. Pajdla, M. Pollefeys, J. Sivic, A. Torii, and T. Sattler, “D2-net: A trainable cnn for joint description and detection of local features,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8092–8101

2019

[35] [35]

Loftr: Detector-free local feature matching with transformers,

J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou, “Loftr: Detector-free local feature matching with transformers,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 8922–8931

2021

[36] [36]

Efficient loftr: Semi-dense local feature matching with sparse-like speed,

Y. Wang, X. He, S. Peng, D. Tan, and X. Zhou, “Efficient loftr: Semi-dense local feature matching with sparse-like speed,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2024, pp. 21 666–21 675

2024

[37] [37]

Robust regression using iteratively reweighted least-squares,

P. W. Holland and R. E. Welsch, “Robust regression using iteratively reweighted least-squares,” Communications in Statistics-theory and Methods, vol. 6, no. 9, pp. 813–827, 1977

1977

[38] [38]

Magsac: marginalizing sample consensus,

D. Barath, J. Matas, and J. Noskova, “Magsac: marginalizing sample consensus,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 10 197–10 205

2019

[39] [39]

Multiple view geometry in computer vision,

R. Hartley, “Multiple view geometry in computer vision,” 2003

2003

[40] [40]

Codinghomo: Bootstrapping deep homography with video coding,

Y. Liu, H. Li, S. Liu, and B. Zeng, “Codinghomo: Bootstrapping deep homography with video coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 11, pp. 11 214–11 228, 2024

2024

[41] [41]

Crosshomo: Cross-modality and cross-resolution homography estimation,

X. Deng, E. Liu, C. Gao, S. Li, S. Gu, and M. Xu, “Crosshomo: Cross-modality and cross-resolution homography estimation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

2024

[42] [42]

Roma: Robust dense feature matching,

J. Edstedt, Q. Sun, G. Bökman, M. Wadenbäck, and M. Felsberg, “Roma: Robust dense feature matching,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19 790–19 800

2024

[43] [43]

P2wnet: Homography estimation for part-to-whole and cross-modality scenarios,

S. Xie, H. Wu, W. Li, and L. Duan, “P2wnet: Homography estimation for part-to-whole and cross-modality scenarios,” 06 2025, pp. 1–6

2025

[44] [44]

Transformers are rnns: Fast autoregressive transformers with linear attention,

A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Transformers are rnns: Fast autoregressive transformers with linear attention,” in International conference on machine learning. PMLR, 2020, pp. 5156–5165

2020

[45] [45]

Efficientvit: Multi-scale linear attention for high-resolution dense prediction,

H. Cai, J. Li, M. Hu, C. Gan, and S. Han, “Efficientvit: Multi-scale linear attention for high-resolution dense prediction,” arXiv preprint arXiv:2205.14756, 2022

2022

[46] [46]

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017

2017

[47] [47]

Sinkhorn distances: Lightspeed computation of optimal transport,

M. Cuturi, “Sinkhorn distances: Lightspeed computation of optimal transport,” Advances in neural information processing systems, vol. 26, 2013

2013

[48] [48]

Siamcorners: Siamese corner networks for visual tracking,

K. Yang, Z. He, W. Pei, Z. Zhou, X. Li, D. Yuan, and H. Zhang, “Siamcorners: Siamese corner networks for visual tracking,” IEEE Transactions on Multimedia, vol. 24, pp. 1956–1967, 2021

2021

[49] [49]

Focal loss for dense object detection,

T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2980–2988

2017

[50] [50]

Generalized intersection over union: A metric and a loss for bounding box regression,

H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 658–666

2019

[51] [51]

Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning,

Y. Sun, B. Cao, P. Zhu, and Q. Hu, “Drone-based rgb-infrared cross-modality vehicle detection via uncertainty-aware learning,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 10, pp. 6700–6713, 2022

2022

[52] [52]

Decoupled Weight Decay Regularization

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” arXiv preprint arXiv:1711.05101, 2017

2017