arxiv: 2505.04864 · v2 · submitted 2025-05-08 · 💻 cs.CV · cs.AI

Auto-regressive transformation for image alignment

Pith reviewed 2026-05-22 16:24 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords image alignment · auto-regressive transformation · cross-attention · multi-scale features · transformation estimation · iterative refinement · computer vision · feature-sparse alignment

The pith

Auto-regressive transformation iteratively refines image alignments by focusing cross-attention on critical regions at multiple scales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Existing image alignment methods struggle with feature-sparse regions, extreme scale differences, and large deformations. This paper introduces Auto-Regressive Transformation to address these issues through an iterative pipeline that refines transformations from coarse to fine. The approach uses hierarchical multi-scale features and randomly samples points at each scale while cross-attention directs focus to important areas. This design aims to maintain accuracy even when traditional feature matching would fail. Experiments indicate stronger results on planar images and similar performance on 3D scenes compared with current techniques.

Core claim

The paper claims that an auto-regressive pipeline can iteratively estimate coarse-to-fine transformations for image alignment. At each scale it refines the transform field parameters using randomly sampled points drawn from hierarchical multi-scale features, while a cross-attention layer directs attention to critical regions. The claimed result is accurate alignment under sparse features, large scale changes, and substantial deformations.

What carries the argument

The auto-regressive pipeline that refines the transform field iteratively at each scale using randomly sampled points and cross-attention guidance from multi-scale features.
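
The iterative refinement idea can be illustrated with a toy sketch. It assumes a global scale-plus-translation model, known point correspondences, and a damped least-squares correction standing in for the learned network. None of these come from the paper, and the names `fit_residual`, `refine`, and `damping` are hypothetical.

```python
import random

def fit_residual(points, residuals):
    """Least-squares fit of residual ~= ds * p + dt (scalar ds, 2D dt)."""
    n = len(points)
    px = sum(p[0] for p in points) / n
    py = sum(p[1] for p in points) / n
    rx = sum(r[0] for r in residuals) / n
    ry = sum(r[1] for r in residuals) / n
    num = sum((p[0] - px) * (r[0] - rx) + (p[1] - py) * (r[1] - ry)
              for p, r in zip(points, residuals))
    den = sum((p[0] - px) ** 2 + (p[1] - py) ** 2 for p in points)
    ds = num / den
    return ds, (rx - ds * px, ry - ds * py)

def refine(true_scale, true_shift, steps=6, damping=0.5, seed=0):
    """Refine (scale, shift) step by step, each step starting from the last estimate."""
    rng = random.Random(seed)
    scale, shift = 1.0, (0.0, 0.0)  # identity initialization
    history = []
    for k in range(steps):
        n_points = 8 * (k + 1)  # hypothetical schedule: more points at later steps
        pts = [(rng.random(), rng.random()) for _ in range(n_points)]
        # Residual between the target positions and the current estimate.
        res = [(true_scale * x + true_shift[0] - (scale * x + shift[0]),
                true_scale * y + true_shift[1] - (scale * y + shift[1]))
               for x, y in pts]
        ds, dt = fit_residual(pts, res)
        # Damped update: only part of the correction is applied per step.
        scale += damping * ds
        shift = (shift[0] + damping * dt[0], shift[1] + damping * dt[1])
        history.append(abs(true_scale - scale))
    return scale, shift, history

scale, shift, history = refine(true_scale=1.8, true_shift=(0.3, -0.2))
print([round(e, 4) for e in history])
```

With noise-free correspondences and a damping of 0.5, the scale error roughly halves at every step. A learned refiner has no such guarantee, so this only shows the control flow of a step-wise refinement loop.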

If this is right

  • Outperforms existing methods on planar image alignment tasks
  • Achieves performance comparable to state-of-the-art on 3D scene images
  • Handles feature-sparse regions, extreme scale and field-of-view differences, and large deformations more reliably
  • Provides a versatile pipeline for precise alignment across varied imaging conditions

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The sampling-plus-attention strategy could be adapted to other dense correspondence tasks such as optical flow estimation in low-texture video.
  • If the iterative refinement proves stable, the method might reduce reliance on hand-crafted feature detectors in downstream applications like panoramic stitching.
  • Extending the pipeline to include temporal auto-regression could support alignment across video sequences without separate tracking modules.

Load-bearing premise

Randomly sampling points at each scale combined with cross-attention guidance is sufficient to accurately refine the transform field even in feature-sparse regions.
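
A minimal sketch of per-scale random sampling, assuming a square feature pyramid whose resolution doubles at each finer level and uniform sampling with replacement. The function name, the pyramid layout, and the point counts are illustrative assumptions, not details from the paper.

```python
import random

def sample_points_per_scale(base_size, num_scales, points_per_scale, seed=0):
    """Draw uniform random grid locations on each pyramid level.

    Level 0 is the coarsest; each finer level doubles the resolution.
    Returns {level: [(x, y), ...]} in normalized (0, 1) coordinates.
    """
    rng = random.Random(seed)
    samples = {}
    for level in range(num_scales):
        size = base_size * (2 ** level)
        pts = []
        for _ in range(points_per_scale):
            col, row = rng.randrange(size), rng.randrange(size)
            # Use cell centers so coordinates are comparable across levels.
            pts.append(((col + 0.5) / size, (row + 0.5) / size))
        samples[level] = pts
    return samples

samples = sample_points_per_scale(base_size=8, num_scales=3, points_per_scale=16)
print({level: len(pts) for level, pts in samples.items()})
```

Because the locations are drawn uniformly, the sampled set does not depend on a keypoint detector. Whether such a set stays informative in feature-sparse regions is exactly the premise in question.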

What would settle it

A dataset of images containing large feature-sparse areas where cross-attention maps consistently miss the true correspondence regions and produce alignment errors exceeding those of current state-of-the-art methods.

Figures

Figures reproduced from arXiv: 2505.04864 by Kanggeon Lee, Kyoung Mu Lee, Soochahn Lee.

Figure 1
Figure 1: Alignment Results in Challenging Scenarios. For image pairs with sparse features, scale differences, deformations, degradations, and domain shifts, our method performs coarse-to-fine auto-regressive transformation refinement, achieving accurate alignment even in challenging scenarios where state-of-the-art methods struggle. The zoomed-in boxes show the local alignment results, and the highlighted vessel…
Figure 2
Figure 2: Method Overview. Auto-Regressive Transformation (ART) iteratively refines the transformation D for image pairs I in a coarse-to-fine manner. Its sampling strategy enables effective operation across diverse domains and datasets. …ance cues from the entire image pair as conditioning signals, ART achieves robustness to initialization. Extensive evaluations demonstrate that ART significantly outperforms existi…
Figure 3
Figure 3: Overall Framework. ART first extracts multi-scale features $F_s$ and $F_d$ from the input image pair $I_s$ and $I_d$. At each sampling step $k$, the corresponding features, $F_s^k$ and $F_d^k$, are passed through the Cross-Attention Layer (CAL) to identify the correlated features that guide the network’s focus on regions requiring refinement. The attentive feature map $\tilde{F}^k_{s \to d}$ is then used to refine the transform field para…
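
As a rough illustration of what a cross-attention step computes, here is a generic single-head scaled dot-product sketch in which source features act as queries over target features. The internals of the paper's CAL are not given in the caption, so this is an assumption. Learned projections are omitted and the function name is hypothetical.

```python
import math

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: list of d-dim vectors (e.g. sampled source features)
    keys, values: lists of equal length (e.g. target features)
    Returns (outputs, weights); each row of weights sums to 1.
    """
    d = len(queries[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        w = [e / total for e in exps]
        # Weighted sum of the value vectors, one output channel at a time.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights

source_feats = [[1.0, 0.0]]
target_feats = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = cross_attention(source_feats, target_feats, target_feats)
print(weights[0])
```

The query puts more weight on the target feature it correlates with, which is the sense in which attention can steer refinement toward matching regions.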
Figure 4
Figure 4: Point-based Image Warping. At sampling step $k$, the extracted source point set $P_s^k$ is warped to $\tilde{P}^k_{s \to d}$ by sequentially multiplying with the corresponding values of the transform field parameter $D_M^k$ and adding $D_A^k$ for each point. These point pairs are then used to compute the warped image $\tilde{I}_{s \to d}$. … where $k$ is the current and $K$ is the maximum transform field parameters sampling step. Each output’s spatia…
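
The multiply-then-add warp in this caption can be sketched as below. The sketch assumes each point has its own 2x2 multiplicative matrix and 2D additive offset. The caption does not state the parameter shapes, so the layout of `d_m` and `d_a` is an assumption, and `warp_points` is a hypothetical name.

```python
def warp_points(points, d_m, d_a):
    """Warp each 2D point p to M @ p + a using its own parameters.

    points: list of (x, y)
    d_m: list of 2x2 matrices [[m00, m01], [m10, m11]], one per point
    d_a: list of (ax, ay) offsets, one per point
    """
    warped = []
    for (x, y), m, (ax, ay) in zip(points, d_m, d_a):
        warped.append((m[0][0] * x + m[0][1] * y + ax,
                       m[1][0] * x + m[1][1] * y + ay))
    return warped

identity = [[1.0, 0.0], [0.0, 1.0]]
double = [[2.0, 0.0], [0.0, 2.0]]
pts = [(1.0, 2.0), (3.0, 4.0)]
warped = warp_points(pts, [identity, double], [(0.0, 0.0), (1.0, -1.0)])
print(warped)  # [(1.0, 2.0), (7.0, 7.0)]
```

The first point is left unchanged by the identity parameters, while the second is scaled by 2 and then shifted by (1, -1).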
Figure 5
Figure 5: Qualitative Evaluation on Retinal Datasets. Across various domains, ART robustly identifies sufficient matches compared to SuperRetina [4], GeoFormer [1], and RetinaRegNet [26]. Correct and incorrect matches are shown in green and red, respectively. The zoomed-in boxes highlight overlaid local regions.
Figure 7
Figure 7: Qualitative Evaluation on Scene-LR Datasets. On the GoogleEarth [16], GoogleMap [16], and MSCOCO [61] datasets, ART successfully finds the correct transformation between input image pairs, even with sparse features from low resolution, domain gaps, and scale differences.
Figure 8
Figure 8: Ablation Study on Sampling. ART performance varies with (a) the number of sampling steps and (b) different initialization strategies, across HR (left) and LR (right) datasets. 4.5 Understanding ART. Here, we present ablation studies to gain a deeper understanding of the key components that constitute ART. Sampling Efficiency. The aforementioned number of iteration steps, 6 for HR images and 4 for LR image…
Original abstract

Existing methods for image alignment struggle in cases involving feature-sparse regions, extreme scale and field-of-view differences, and large deformations, often resulting in suboptimal accuracy. Robustness to these challenges can be improved through iterative refinement of the transform field while focusing on critical regions in multi-scale image representations. We thus propose Auto-Regressive Transformation (ART), a novel method that iteratively estimates the coarse-to-fine transformations through an auto-regressive pipeline. Leveraging hierarchical multi-scale features, our network refines the transform field parameters using randomly sampled points at each scale. By incorporating guidance from the cross-attention layer, the model focuses on critical regions, ensuring accurate alignment even in challenging, feature-limited conditions. Extensive experiments demonstrate that ART significantly outperforms state-of-the-art methods on planar images and achieves comparable performance on 3D scene images, establishing it as a powerful and versatile solution for precise image alignment.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proposes Auto-Regressive Transformation (ART), a novel image alignment method that iteratively estimates coarse-to-fine transformations via an auto-regressive pipeline. It extracts hierarchical multi-scale features, refines transform parameters from randomly sampled points at each scale, and uses cross-attention to focus on critical regions, with the goal of improving robustness to feature-sparse areas, extreme scale/FOV differences, and large deformations. Experiments are claimed to show significant outperformance versus state-of-the-art on planar images and comparable results on 3D scenes.

Significance. If the performance claims are substantiated, the auto-regressive coarse-to-fine refinement with cross-attention guidance offers a plausible route to better handling of challenging alignment cases that defeat current methods. The approach is internally consistent and does not reduce to prior fitted parameters; the novelty lies in the pipeline design rather than circular reuse of earlier results.

major comments (1)
  1. [Method] Method section (pipeline description): the central robustness claim for feature-sparse regions rests on the combination of random point sampling at each scale plus cross-attention guidance. No ablation isolating the sampling strategy or error stratified by local feature density is described, leaving open whether the sampled set remains informative when repeatable features are absent; this directly threatens the headline outperformance result on planar images.
minor comments (2)
  1. [Abstract] Abstract: the statement that 'extensive experiments demonstrate' superiority would be strengthened by naming the primary datasets, key metrics (e.g., mean endpoint error or success rate), and at least one quantitative delta versus the strongest baseline.
  2. [Experiments] Experiments: absence of detailed quantitative tables, error analysis, or ablation studies in the visible text makes the superiority claim only moderately verifiable; adding these would allow readers to assess effect sizes and failure modes.
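
For concreteness, the two metrics the referee names can be computed from matched point sets as follows. This is a minimal sketch: the per-point definition of success rate and the threshold value are assumptions, since benchmarks often define success per image pair.

```python
import math

def endpoint_errors(pred, gt):
    """Euclidean distance between each predicted and ground-truth point."""
    return [math.hypot(px - gx, py - gy) for (px, py), (gx, gy) in zip(pred, gt)]

def mean_endpoint_error(pred, gt):
    errs = endpoint_errors(pred, gt)
    return sum(errs) / len(errs)

def success_rate(pred, gt, threshold):
    """Fraction of points whose endpoint error is at most `threshold`."""
    errs = endpoint_errors(pred, gt)
    return sum(e <= threshold for e in errs) / len(errs)

pred = [(0.0, 0.0), (3.0, 4.0)]
gt = [(0.0, 0.0), (0.0, 0.0)]
print(mean_endpoint_error(pred, gt))          # 2.5
print(success_rate(pred, gt, threshold=1.0))  # 0.5
```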

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Method] Method section (pipeline description): the central robustness claim for feature-sparse regions rests on the combination of random point sampling at each scale plus cross-attention guidance. No ablation isolating the sampling strategy or error stratified by local feature density is described, leaving open whether the sampled set remains informative when repeatable features are absent; this directly threatens the headline outperformance result on planar images.

    Authors: We acknowledge the value of an explicit ablation isolating the random sampling strategy and an error analysis stratified by local feature density. The design rationale is that random sampling at each scale deliberately avoids dependence on repeatable keypoints, allowing the network to draw from any image locations while cross-attention modulates focus toward regions that contribute most to alignment. In the revised manuscript we will add an ablation that replaces random sampling with feature-based point selection (e.g., using SIFT or SuperPoint) and report both overall alignment error and error binned by local feature density computed via keypoint counts in image patches. These results will be placed in the experiments section to directly support the robustness claim on planar images. revision: yes
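
The stratified analysis promised in this response could look roughly like the sketch below. It assumes keypoints are already available as (x, y) pixel coordinates (for example from SIFT or SuperPoint, which are not run here) and that density is the keypoint count in the grid patch containing each evaluation point. The function names, patch size, and bin edges are hypothetical.

```python
def patch_keypoint_counts(keypoints, image_size, patch_size):
    """Count keypoints per patch on a regular grid. Returns {(row, col): count}."""
    width, height = image_size
    counts = {}
    for x, y in keypoints:
        if 0 <= x < width and 0 <= y < height:
            key = (int(y // patch_size), int(x // patch_size))
            counts[key] = counts.get(key, 0) + 1
    return counts

def error_by_density(eval_points, errors, keypoints, image_size, patch_size, bin_edges):
    """Mean error per density bin; bin i holds counts in [edges[i], edges[i+1])."""
    counts = patch_keypoint_counts(keypoints, image_size, patch_size)
    sums = [0.0] * (len(bin_edges) - 1)
    nums = [0] * (len(bin_edges) - 1)
    for (x, y), err in zip(eval_points, errors):
        c = counts.get((int(y // patch_size), int(x // patch_size)), 0)
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= c < bin_edges[i + 1]:
                sums[i] += err
                nums[i] += 1
                break
    # None marks a bin that received no evaluation points.
    return [s / n if n else None for s, n in zip(sums, nums)]

keypoints = [(5, 5), (6, 6), (7, 7)]          # all in the top-left patch
eval_points = [(10, 10), (40, 40), (50, 10)]  # one dense, two sparse locations
errors = [1.0, 4.0, 2.0]
binned = error_by_density(eval_points, errors, keypoints, (64, 64), 32,
                          [0, 1, float("inf")])
print(binned)  # [3.0, 1.0]
```

The first bin collects points in patches with no keypoints and the second collects the rest, so a large gap between the two would point to degraded accuracy in feature-sparse regions.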

Circularity Check

0 steps flagged

No significant circularity in ART derivation or claims

full rationale

The paper introduces a new neural architecture (ART) that performs iterative coarse-to-fine transform refinement via an auto-regressive pipeline on hierarchical features, random point sampling per scale, and cross-attention guidance. All load-bearing elements are presented as design choices in a novel network, with performance claims resting on experimental comparisons rather than any mathematical reduction, fitted-parameter renaming, or self-citation chain. No equations or steps in the abstract or described method reduce to inputs by construction; the derivation is self-contained as an empirical architecture proposal.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The method relies on standard supervised training of a neural network for regression of transformation parameters; design choices such as number of scales and sampling density function as hyperparameters.

free parameters (2)
  • number of hierarchical scales
    Choice of multi-scale levels is a design decision that affects the coarse-to-fine refinement schedule.
  • number of randomly sampled points per scale
    The count of points used to estimate transform parameters at each level is a tunable hyperparameter.
axioms (1)
  • domain assumption: A neural network trained on image pairs can learn to predict accurate transformation parameters from multi-scale features and attention signals.
    The entire pipeline presupposes that end-to-end learning from data will produce reliable iterative refinements.

pith-pipeline@v0.9.0 · 5676 in / 1398 out tokens · 108213 ms · 2026-05-22T16:24:23.895596+00:00 · methodology

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages

  1. [1] Jiazhen Liu and Xirong Li. Geometrized transformer for self-supervised homography estimation. In ICCV, 2023.

  2. [2] Carlos Hernandez-Matas, Xenophon Zabulis, and Antonis A Argyros. Rempe: Registration of retinal images through eye modelling and pose estimation. IEEE Journal of Biomedical and Health Informatics, 24, 2020.

  3. [3] Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. Loftr: Detector-free local feature matching with transformers. In CVPR, 2021.

  4. [4] Jiazhen Liu, Xirong Li, Qijie Wei, Jie Xu, and Dayong Ding. Semi-supervised keypoint detector and descriptor for retinal image matching. In ECCV, 2022.

  5. [5] Yu Wang, Xiaoye Wang, Zaiwang Gu, Weide Liu, Wee Siong Ng, Weimin Huang, and Jun Cheng. Superjunction: Learning-based junction detection for retinal image registration. Proceedings of the AAAI Conference on Artificial Intelligence, 38(1):292–300, Mar. 2024.

  6. [6] Guha Balakrishnan, Amy Zhao, Mert R Sabuncu, John Guttag, and Adrian V Dalca. Voxelmorph: a learning framework for deformable medical image registration. IEEE Transactions on Medical Imaging, 38, 2019.

  7. [7] Boah Kim, Dong Hwan Kim, Seong Ho Park, Jieun Kim, June-Goo Lee, and Jong Chul Ye. Cyclemorph: cycle consistent unsupervised deformable image registration. Medical Image Analysis, 71, 2021.

  8. [8] Boah Kim, Inhwa Han, and Jong Chul Ye. Diffusemorph: unsupervised deformable image registration using diffusion model. In ECCV, 2022.

  9. [9] Junyu Chen, Eric C. Frey, Yufan He, William P. Segars, Ye Li, and Yong Du. Transmorph: Transformer for unsupervised medical image registration. Medical Image Analysis, 82:102615, November 2022.

  10. [10] Mingyuan Meng, Lei Bi, Michael Fulham, Dagan Feng, and Jinman Kim. Non-iterative Coarse-to-Fine Transformer Networks for Joint Affine and Deformable Image Registration, page 750–760. Springer Nature Switzerland, 2023.

  11. [11] Morteza Ghahremani, Mohammad Khateri, Bailiang Jian, Benedikt Wiestler, Ehsan Adeli, and Christian Wachinger. H-ViT: A Hierarchical Vision Transformer for Deformable Image Registration. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11513–11523, Los Alamitos, CA, USA, June 2024. IEEE Computer Society.

  12. [12] P.J. Besl and Neil D. McKay. A method for registration of 3-d shapes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 14, 1992.

  13. [13] Timothy F Cootes, Christopher J Taylor, David H Cooper, and Jim Graham. Active shape models-their training and application. Computer Vision and Image Understanding, 61, 1995.

  14. [14] Matthew C.H. Lee, Ozan Oktay, Andreas Schuh, Michiel Schaap, and Ben Glocker. Image-and-spatial transformer networks for structure-guided image registration. In MICCAI,

  15. [15] Bob D De Vos, Floris F Berendsen, Max A Viergever, Hessam Sokooti, Marius Staring, and Ivana Išgum. A deep learning framework for unsupervised affine and deformable image registration. Medical Image Analysis, 52, 2019.

  16. [16] Yiming Zhao, Xinming Huang, and Ziming Zhang. Deep lucas-kanade homography for multimodal image alignment. In CVPR,

  17. [17] Haokai Zhu, Si-Yuan Cao, Jianxin Hu, Sitong Zuo, Beinan Yu, Jiacheng Ying, Junwei Li, and Hui-Liang Shen. Mcnet: Rethinking the core ingredients for accurate and efficient homography estimation. In CVPR, 2024.

  18. [18] Mingyuan Meng, Dagan Feng, Lei Bi, and Jinman Kim. Correlation-aware coarse-to-fine mlps for deformable medical image registration, 2024.

  19. [19] Xiaoran Zhang, John C. Stendahl, Lawrence Staib, Albert J. Sinusas, Alex Wong, and James S. Duncan. Adaptive correspondence scoring for unsupervised medical image registration,

  20. [20] Tai Ma, Suwei Zhang, Jiafeng Li, and Ying Wen. Iirp-net: Iterative inference residual pyramid network for enhanced image registration. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11546–11555,

  21. [21] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In CVPRW, 2018.

  22. [22] Prune Truong, Stefanos Apostolopoulos, Agata Mosinska, Samuel Stucky, Carlos Ciller, and Sandro De Zanet. Glampoints: Greedily learned accurate match points. In ICCV, 2019.

  23. [23] Ignacio Rocco, Mircea Cimpoi, Relja Arandjelović, Akihiko Torii, Tomas Pajdla, and Josef Sivic. Ncnet: Neighbourhood consensus networks for estimating image correspondences. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 2020.

  24. [24] Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. In CVPR, 2020.

  25. [25] Hongkai Chen, Zixin Luo, Lei Zhou, Yurun Tian, Mingmin Zhen, Tian Fang, David McKinnon, Yanghai Tsin, and Long Quan. Aspanformer: Detector-free image matching with adaptive span transformer. In ECCV, 2022.

  26. [26] Vishal Balaji Sivaraman, Muhammad Imran, Qingyue Wei, Preethika Muralidharan, Michelle R. Tamplin, Isabella M. Grumbach, Randy H. Kardon, Jui-Kai Wang, Yuyin Zhou, and Wei Shao. Retinaregnet: A zero-shot approach for retinal image registration, 2024.

  27. [27] Si-Yuan Cao, Jianxin Hu, Zehua Sheng, and Hui-Liang Shen. Iterative deep homography estimation. In CVPR, 2022.

  28. [28] David G Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60, 2004.

  29. [29] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In ECCV, 2006.

  30. [30] Edward Rosten, Reid Porter, and Tom Drummond. Faster and better: A machine learning approach to corner detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32,

  31. [31] Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. Brief: Binary robust independent elementary features. In ECCV, 2010.

  32. [32] Jerome Revaud, Cesar De Souza, Martin Humenberger, and Philippe Weinzaepfel. R2d2: Repeatable and reliable detector and descriptor. arXiv preprint, 2019.

  33. [33] Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. LightGlue: Local Feature Matching at Light Speed. In ICCV,

  34. [34] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Deep image homography estimation. arXiv preprint, 2016.

  35. [35] Haofei Xu, Jing Zhang, Jianfei Cai, Hamid Rezatofighi, and Dacheng Tao. Gmflow: Learning optical flow via global matching. In CVPR, 2022.

  36. [36] Zhaoyang Huang, Xiaoyu Shi, Chao Zhang, Qiang Wang, Ka Chun Cheung, Hongwei Qin, Jifeng Dai, and Hongsheng Li. Flowformer: A transformer architecture for optical flow. In ECCV, 2022.

  37. [37] Johan Edstedt, Qiyu Sun, Georg Bökman, Mårten Wadenbäck, and Michael Felsberg. RoMa: Robust Dense Feature Matching. IEEE Conference on Computer Vision and Pattern Recognition,

  38. [38] Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent correspondence from image diffusion. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

  39. [39] Paul Viola and William M. Wells III. Alignment by maximization of mutual information. International Journal of Computer Vision, 24(2):137–154, 1997.

  40. [40] Barbara Zitová and Jan Flusser. Image registration methods: a survey. Image and Vision Computing, 21(11):977–1000, 2003.

  41. [41] Frederik Maes, André Collignon, Dirk Vandermeulen, Guy Marchal, and Paul Suetens. Multimodality image registration by maximization of mutual information. IEEE Transactions on Medical Imaging, 16(2):187–198, 1997.

  42. [42] Jean-Philippe Thirion. Image matching as a diffusion process: an analogy with maxwell’s demons. Medical Image Analysis, 2(3):243–260, 1998.

  43. [43] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. arXiv preprint,

  44. [44] Feihu Zhang, Oliver J Woodford, Victor Adrian Prisacariu, and Philip HS Torr. Separable flow: Learning motion cost volumes for optical flow estimation. In ICCV, 2021.

  45. [45] Xiaohuan Cao, Jianhua Yang, Jun Zhang, Dong Nie, Minjeong Kim, Qian Wang, and Dinggang Shen. Deformable image registration based on similarity-steered cnn regression. In MICCAI,

  46. [46] Yipeng Hu, Marc Modat, Eli Gibson, Wenqi Li, Nooshin Ghavami, Ester Bonmati, Guotai Wang, Steven Bandula, Caroline M Moore, Mark Emberton, et al. Weakly-supervised convolutional neural networks for multimodal image registration. Medical Image Analysis, 49, 2018.

  47. [47] Zhenlin Xu and Marc Niethammer. Deepatlas: Joint semi-supervised learning of image registration and segmentation. In MICCAI, 2019.

  48. [48] Mingyuan Meng, Lei Bi, Michael Fulham, Dagan Feng, and Jinman Kim. Non-iterative Coarse-to-Fine Transformer Networks for Joint Affine and Deformable Image Registration, page 750–760. Springer Nature Switzerland, 2023.

  49. [49] Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

  50. [50] Bruce D. Lucas and Takeo Kanade. An iterative image registration technique with an application to stereo vision. In Proceedings of the 7th International Joint Conference on Artificial Intelligence (IJCAI), pages 674–679, 1981.

  51. [51] Jianyuan Wang, Christian Rupprecht, and David Novotny. Posediffusion: Solving pose estimation via diffusion-aided bundle adjustment. In ICCV, 2023.

  52. [52] Jason Y Zhang, Amy Lin, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, and Shubham Tulsiani. Cameras as rays: Pose estimation via ray diffusion. In ICLR, 2024.

  53. [53] Yechao Bai, Ziyuan Huang, Lyuyu Shen, Hongliang Guo, Marcelo H. Ang Jr, and Daniela Rus. Multi-scale feature aggregation by cross-scale pixel-to-region relation operation for semantic segmentation. IEEE Robotics and Automation Letters, 6(3):5889–5896, July 2021.

  54. [54] Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge, 2003.

  55. [55] Brachmann et al. Dsac* - differentiable ransac for camera localization. In CVPR, 2019.

  56. [56] Carlos Hernandez-Matas, Xenophon Zabulis, Areti Triantafyllou, Panagiota Anyfanti, Stella Douma, and Antonis A Argyros. Fire: Fundus image registration dataset. Journal for Modeling in Ophthalmology, 1, 2017.

  57. [57] Li Ding, Tony Kang, Ajay Kuriyan, Rajeev Ramchandran, Charles Wykoff, and Gaurav Sharma. Flori21: Fluorescein angiography longitudinal retinal image registration dataset, 2021.

  58. [58] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors, 2017.

  59. [59] Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Computer Vision and Pattern Recognition (CVPR), 2018.

  60. [60] Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proc. Computer Vision and Pattern Recognition (CVPR), IEEE, 2017.

  61. [61] Tsung-Yi Lin, Michael Maire, Serge Belongie, Lubomir Bourdev, Ross Girshick, James Hays, Pietro Perona, Deva Ramanan, C. Lawrence Zitnick, and Piotr Dollár. Microsoft coco: Common objects in context, 2015.

  62. [62] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019.

  63. [63] Charles Stewart, Chia-Ling Tsai, and Badrinath Roysam. The dual-bootstrap iterative closest point algorithm with application to retinal image registration. IEEE Transactions on Medical Imaging, 22, 2003.

  64. [64] Qing Wang, Jiaming Zhang, Kailun Yang, Kunyu Peng, and Rainer Stiefelhagen. Matchformer: Interleaving attention in transformers for feature matching. In Asian Conference on Computer Vision, 2022.

  65. [65] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Object retrieval with large vocabularies and fast spatial matching. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.

  66. [66] James Philbin, Ondrej Chum, Michael Isard, Josef Sivic, and Andrew Zisserman. Lost in quantization: Improving particular object retrieval in large scale image databases. In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2008.

  67. [67] Si-Yuan Cao, Runmin Zhang, Lun Luo, Beinan Yu, Zehua Sheng, Junwei Li, and Hui-Liang Shen. Recurrent homography estimation using homography-guided image warping and focus transformer. In CVPR, 2023.