
arxiv: 2606.30408 · v1 · pith:Q3BHYPH2new · submitted 2026-06-29 · 💻 cs.CV

SA-Homo: Scale Adaptive Homography Estimation for Scale Variation Scenarios

Pith reviewed 2026-06-30 06:03 UTC · model grok-4.3

classification 💻 cs.CV
keywords: homography estimation, scale variation, scale adaptive, deep learning, computer vision, satellite imagery, HMSA dataset

The pith

SA-Homo uses a global-to-local module sequence to estimate homography accurately when image scales differ by up to eight times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SA-Homo to solve homography estimation when image pairs show large scale differences that break standard local-feature assumptions. It applies a heavy module first to align scales from a global view using attention and similarity structures, then hands the result to a lightweight module for local polishing. The authors also release the HMSA dataset of high-resolution multi-modal satellite images to test these conditions. If the two-stage bridge works as described, homography-based tasks such as image registration can maintain precision across wider zoom ranges without extra preprocessing. The authors report that the method beats prior approaches on both ordinary and extreme scale-variation cases.

Core claim

SA-Homo maintains high precision even under 8× scale discrepancies by adopting a hierarchical scale alignment strategy that transitions from the global perspective with a heavy module to a local perspective with a light module. The Scale-aware Discrepancy Bridging Module uses Multi-scale Linear Attention Cascade to capture long-range dependencies and a Cross-scale Similarity Matrix Block for robust correlation, after which the Iterative Homography Estimation Refinement Module progressively refines the result using local correlations.
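The Multi-scale Linear Attention Cascade is described here only by name. As a rough illustration of the linear-attention idea it builds on (not the authors' MLAC), the sketch below computes kernelized attention with the feature map phi(x) = elu(x) + 1. The function names, the choice of feature map, and the pure-Python list representation are assumptions made for this example. Keys and values are summarized once, so the cost grows linearly with the number of tokens, which is what makes a global view affordable on high-resolution features.

```python
import math

def phi(x):
    # Positive feature map elu(x) + 1 (an assumed choice for this sketch).
    return x + 1.0 if x > 0 else math.exp(x)

def linear_attention(queries, keys, values):
    """Kernelized attention: out_i = phi(q_i)^T S / (phi(q_i)^T z),
    with S = sum_j phi(k_j) v_j^T and z = sum_j phi(k_j)."""
    d_k = len(keys[0])
    d_v = len(values[0])
    # Summarize all keys and values once: S is d_k x d_v, z has length d_k.
    S = [[0.0] * d_v for _ in range(d_k)]
    z = [0.0] * d_k
    for k, v in zip(keys, values):
        fk = [phi(x) for x in k]
        for a in range(d_k):
            z[a] += fk[a]
            for b in range(d_v):
                S[a][b] += fk[a] * v[b]
    out = []
    for q in queries:
        fq = [phi(x) for x in q]
        norm = sum(fq[a] * z[a] for a in range(d_k))
        out.append([sum(fq[a] * S[a][b] for a in range(d_k)) / norm
                    for b in range(d_v)])
    return out

def quadratic_attention(queries, keys, values):
    # Reference version: the same kernel weights computed pairwise, O(N^2).
    out = []
    for q in queries:
        fq = [phi(x) for x in q]
        w = [sum(a * phi(b) for a, b in zip(fq, k)) for k in keys]
        total = sum(w)
        out.append([sum(wj * v[c] for wj, v in zip(w, values)) / total
                    for c in range(len(values[0]))])
    return out
```

Both functions return the same values up to floating-point error; only the order of the sums differs, which is where the linear cost comes from.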

What carries the argument

A hierarchical scale alignment strategy: the Scale-aware Discrepancy Bridging Module (SDBM), built from the Multi-scale Linear Attention Cascade (MLAC) and the Cross-scale Similarity Matrix Block (CSMB), first reduces the global scale gap, and the lightweight Iterative Homography Estimation Refinement Module (IHERM) then refines the result.
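The internals of the CSMB are not given on this page. To make "global similarity" concrete, the sketch below builds a generic all-pairs similarity matrix between two feature sets and turns each row into a soft match by a softmax-weighted average of target coordinates. Everything here (the function names, the scaled dot product, the soft-argmax readout) is an assumption for illustration and should not be read as the paper's block. The two feature sets may have different sizes, as they would when the images are sampled at different scales.

```python
import math

def similarity_matrix(feats_a, feats_b):
    """All-pairs scaled dot products between two sets of feature vectors."""
    scale = 1.0 / math.sqrt(len(feats_a[0]))
    return [[scale * sum(x * y for x, y in zip(fa, fb)) for fb in feats_b]
            for fa in feats_a]

def row_softmax(matrix):
    out = []
    for row in matrix:
        m = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - m) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def soft_matches(feats_a, feats_b, coords_b):
    """Expected (x, y) position in image B for each feature of image A."""
    probs = row_softmax(similarity_matrix(feats_a, feats_b))
    return [(sum(p * c[0] for p, c in zip(row, coords_b)),
             sum(p * c[1] for p, c in zip(row, coords_b)))
            for row in probs]
```

Because every position in one image is compared with every position in the other, a match can be found regardless of how far the scale change has moved it, unlike a local correlation window.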

If this is right

  • Homography estimation remains accurate in satellite and multi-modal image pairs that differ by factors up to 8 in scale.
  • The same framework improves results on conventional scale-similar pairs as well as the new challenging cases.
  • The released HMSA dataset supplies a benchmark for developing and comparing future scale-robust methods.
  • The two-stage design allows heavy computation only at the start and keeps later refinement efficient.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same global-to-local bridging pattern may transfer to other geometric tasks such as essential-matrix estimation when focal lengths or distances vary.
  • Image-registration pipelines could drop explicit scale-normalization steps if the bridging module generalizes beyond the tested satellite domain.
  • Real-time video applications with changing camera distances could be tested to measure whether the refinement stage stays fast enough for live use.

Load-bearing premise

The initial global bridging step reduces the scale gap enough for the later lightweight local module to finish the alignment without itself handling the full discrepancy.
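This premise can be phrased as a composition of homographies: a coarse estimate absorbs most of the scale gap, and only a near-identity residual is left for the refinement stage. The sketch below is a minimal illustration under that reading. The two matrices are made-up values (an 8× zoom and a small residual warp), not outputs of SA-Homo, and the helper names are hypothetical.

```python
def matmul3(a, b):
    """Product of two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def apply_h(h, pt):
    """Map a 2D point through a homography, including the projective divide."""
    x, y = pt
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return (u / w, v / w)

# Hypothetical coarse estimate: an 8x zoom about the origin.
H_COARSE = [[8.0, 0.0, 0.0], [0.0, 8.0, 0.0], [0.0, 0.0, 1.0]]
# Hypothetical residual left for refinement: small shift, mild perspective.
H_RESIDUAL = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.5], [1e-4, 0.0, 1.0]]

# The coarse warp is applied first, then the residual.
H_FULL = matmul3(H_RESIDUAL, H_COARSE)
```

Mapping a point through `H_FULL` gives the same result as applying `H_COARSE` and then `H_RESIDUAL`. If the coarse stage were inaccurate, the residual would no longer be close to the identity, which is the failure mode the premise rules out.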

What would settle it

On the HMSA dataset, a direct comparison showing that SA-Homo's homography error under 8× scale difference is not lower than that of current state-of-the-art methods would falsify the performance claim.
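Such a comparison needs a concrete error measure. The Figure 12 caption names MACE, which in the deep homography literature usually stands for mean average corner error. The sketch below follows that common definition. The exact evaluation protocol of the paper (image size, which corners, how pairs are sampled) is not given on this page, so the `width` and `height` arguments and the function names are assumptions.

```python
import math

def warp(h, pt):
    """Map a 2D point through a 3x3 homography given as nested lists."""
    x, y = pt
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)

def corner_error(h_est, h_gt, width, height):
    """Average Euclidean distance between the four image corners warped by
    the estimated and by the ground-truth homography."""
    corners = [(0.0, 0.0), (width - 1.0, 0.0),
               (width - 1.0, height - 1.0), (0.0, height - 1.0)]
    dists = [math.dist(warp(h_est, c), warp(h_gt, c)) for c in corners]
    return sum(dists) / len(dists)

def mace(pairs, width, height):
    """Mean of the per-pair corner error over (h_est, h_gt) pairs."""
    errors = [corner_error(h_est, h_gt, width, height)
              for h_est, h_gt in pairs]
    return sum(errors) / len(errors)
```

Computing `mace` for SA-Homo and for each baseline on the same 8× pairs would give the numbers this test calls for.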

Figures

Figures reproduced from arXiv: 2606.30408 by Haifeng Wu, Huarong Jia, Shangxuan Xie, Wen Li, Yuhang Wang.

Figure 1: Homography estimation results under scale variation
Figure 2: Impact of increasing scale discrepancy ratio on estimation accuracy across three datasets: (a) MSCOCO …
Figure 3: Overview of the proposed SA-Homo. The architecture consists of two main modules: (a) the Scale-aware Discrep…
Figure 5: (a) Illustration of the Multi-scale Linear Attention …
Figure 6: (a) Illustration of the local correlation computation. (b) …
Figure 7: Overview of the data generation pipeline.
Figure 8: Scale Discrepancy Ratio (SDR) distributions of the validation datasets. (a)–(c) illustrate the SDR distributions of the …
Figure 9: Performance comparison under two settings: the scale variation scenario on (a) MSCOCO …
Figure 10: The Visualization results on (a) MSCOCO dataset …
Figure 11: Visualization of boundary cases, where only part of the …
Figure 12: Runtime and efficiency analysis. We present the inference latency, total parameters, and MACE of state-of-the-art …
Figure 13: Ablation studies on the HMSA dataset. We investigate the impact of: (a) attention mechanisms (linear vs. multi-scale) …
Original abstract

Homography estimation, as one of the fundamental problems in computer vision, remains challenged by scale variation scenarios where image pairs potentially exhibit significant scale discrepancies. Existing deep learning frameworks frequently suffer from a significant performance degradation in such cases, as they rely on limited displacement assumptions and local feature consistency that might not hold under large scale gaps. In this paper, we propose SA-Homo, a novel scale-adaptive homography estimation framework designed to achieve robust alignment across a wide range of scale discrepancy ratios. We adopt a hierarchical scale alignment strategy that transitions from the global perspective with a heavy module to a local perspective with a light module. Specifically, we introduce the Scale-aware Discrepancy Bridging Module (SDBM) for initial alignment, which utilizes a Multi-scale Linear Attention Cascade (MLAC) to capture long-range dependencies and mitigate feature inconsistencies, along with a global Cross-scale Similarity Matrix Block (CSMB) for scale robust correlation representation. Once the initial scale gap is bridged, a lightweight Iterative Homography Estimation Refinement Module (IHERM) progressively polishes the result using local correlations. To facilitate this research, we contribute the HMSA dataset, a high-resolution, multi-modal satellite benchmark specifically tailored for scale-variant challenges. Extensive experiments demonstrate that SA-Homo maintains high precision even under 8× scale discrepancies, outperforming state-of-the-art methods in both conventional scale-similar scenarios and challenging scale variation scenarios. Code and collected datasets are available at https://github.com/shangxuanx330/SA_Homo

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces SA-Homo, a scale-adaptive homography estimation framework that uses a hierarchical scale alignment strategy. It features the Scale-aware Discrepancy Bridging Module (SDBM), which uses the Multi-scale Linear Attention Cascade (MLAC) to capture long-range dependencies and the Cross-scale Similarity Matrix Block (CSMB) for scale-robust correlation, followed by the lightweight Iterative Homography Estimation Refinement Module (IHERM) for progressive refinement. The work also presents the HMSA dataset for scale-variant challenges and reports experiments in which SA-Homo outperforms state-of-the-art methods in both conventional and challenging scale variation scenarios, maintaining high precision under 8× scale discrepancies.

Significance. If validated, the results would represent a meaningful advance in handling scale variations in homography estimation, which is critical for multi-modal and satellite imaging applications. The contribution of the HMSA dataset and the open-sourcing of code are notable strengths that facilitate reproducibility and future work in the area.

minor comments (3)
  1. [Abstract] The description of the modules is high-level; consider adding a brief mention of key performance metrics or number of baselines in the abstract to strengthen the claim.
  2. [Method] Ensure that the transition from global to local perspective is clearly motivated with references to prior work on hierarchical methods if applicable.
  3. [Experiments] Verify that all figures and tables have clear captions and that error bars or statistical significance are reported where appropriate.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary of our work, recognition of its significance for multi-modal and satellite imaging applications, and recommendation for minor revision. We appreciate the acknowledgment of the HMSA dataset and code release as strengths for reproducibility.

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper proposes a new neural architecture (SA-Homo) with modules SDBM (MLAC + CSMB) and IHERM, plus a contributed dataset HMSA, and validates performance claims empirically on scale-variation benchmarks. No mathematical derivation chain, fitted-parameter predictions, or self-referential equations appear; the central claim rests on experimental results rather than reducing to inputs by construction. No load-bearing self-citations or ansatz smuggling are present in the text.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim depends on the unverified effectiveness of the proposed hierarchical modules for bridging scale gaps and on standard deep learning training assumptions; the full paper would be needed to list all design choices.

free parameters (1)
  • Network hyperparameters and training settings for MLAC, CSMB, and IHERM
    Deep learning models contain many tunable parameters whose values are fitted to data during training.
axioms (1)
  • domain assumption: Multi-scale linear attention can capture long-range dependencies to mitigate feature inconsistencies under large scale gaps
    Invoked to justify the SDBM design for initial alignment.
invented entities (2)
  • Scale-aware Discrepancy Bridging Module (SDBM) · no independent evidence
    purpose: Perform initial global scale alignment using MLAC and CSMB
    New module introduced by the paper to address scale variation.
  • Iterative Homography Estimation Refinement Module (IHERM) · no independent evidence
    purpose: Progressively refine alignment using local correlations
    New lightweight module introduced by the paper.

pith-pipeline@v0.9.1-grok · 5819 in / 1522 out tokens · 37742 ms · 2026-06-30T06:03:13.599371+00:00 · methodology

