Auto-regressive transformation for image alignment
Pith reviewed 2026-05-22 16:24 UTC · model grok-4.3
The pith
Auto-regressive transformation iteratively refines image alignments by focusing cross-attention on critical regions at multiple scales.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that an auto-regressive pipeline can iteratively estimate coarse-to-fine transformations for image alignment. At each scale, the network refines the transform field parameters using randomly sampled points drawn from hierarchical multi-scale features, while a cross-attention layer directs attention to critical regions. The claimed result is accurate alignment under sparse features, large scale changes, and substantial deformations.
What carries the argument
The auto-regressive pipeline that refines the transform field iteratively at each scale using randomly sampled points and cross-attention guidance from multi-scale features.
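A toy sketch may help make the coarse-to-fine, auto-regressive structure concrete. It is not the paper's network: the learned estimator is replaced by a hypothetical `noisy_residual` function, the motion is assumed to be a pure translation (`TRUE_SHIFT`), and the noise level and search radius per scale are invented for illustration. What it does show is that each stage starts from the previous estimate, draws fresh random points, and applies a correction whose precision improves as the cell size shrinks.

```python
import random

TRUE_SHIFT = (13.7, -6.2)  # hypothetical ground-truth translation, in pixels


def noisy_residual(point, estimate, cell, rng):
    """Toy stand-in for a learned per-point estimator at one scale.

    `point` is unused because the toy motion is the same everywhere.
    Assumed limitations: noise of up to half a cell, search radius of four cells.
    """
    radius = 4.0 * cell
    out = []
    for axis in range(2):
        r = TRUE_SHIFT[axis] - estimate[axis] + rng.uniform(-cell / 2, cell / 2)
        out.append(max(-radius, min(radius, r)))  # clamp to the search radius
    return out


def refine(cells, n_points=64, size=256, seed=0):
    """Refine a translation estimate over the given cell sizes, in order."""
    rng = random.Random(seed)
    estimate = [0.0, 0.0]
    for cell in cells:
        # Fresh random sample locations at every scale.
        points = [(rng.uniform(0, size), rng.uniform(0, size)) for _ in range(n_points)]
        residuals = [noisy_residual(p, estimate, cell, rng) for p in points]
        # The update is conditioned on the previous estimate (auto-regressive step).
        for axis in range(2):
            estimate[axis] += sum(r[axis] for r in residuals) / n_points
    return estimate


def error(estimate):
    """Largest per-axis deviation from the true shift."""
    return max(abs(TRUE_SHIFT[a] - estimate[a]) for a in range(2))


coarse_to_fine = refine(cells=(16, 8, 4, 2, 1))
fine_only = refine(cells=(1,))
print("coarse-to-fine error:", error(coarse_to_fine))
print("fine-only error:", error(fine_only))
```

In this toy, the fine-only run cannot reach the true shift because its search radius is too small, while the coarse-to-fine run narrows the error at every stage. Whether the same behaviour holds for the paper's learned estimator is an empirical question that the sketch does not address.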
If this is right
- Outperforms existing methods on planar image alignment tasks
- Achieves performance comparable to state-of-the-art on 3D scene images
- Handles feature-sparse regions, extreme scale and field-of-view differences, and large deformations more reliably
- Provides a versatile pipeline for precise alignment across varied imaging conditions
Where Pith is reading between the lines
- The sampling-plus-attention strategy could be adapted to other dense correspondence tasks such as optical flow estimation in low-texture video.
- If the iterative refinement proves stable, the method might reduce reliance on hand-crafted feature detectors in downstream applications like panoramic stitching.
- Extending the pipeline to include temporal auto-regression could support alignment across video sequences without separate tracking modules.
Load-bearing premise
Randomly sampling points at each scale combined with cross-attention guidance is sufficient to accurately refine the transform field even in feature-sparse regions.
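The premise pairs random sampling with attention guidance. The page does not specify the attention layer, so the following is a minimal sketch of standard scaled dot-product cross-attention, assuming queries come from features at sampled source points and keys and values come from target features. The function names and the toy feature vectors are illustrative, not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: features at sampled source points (list of vectors)
    keys, values: features from the other image (lists of vectors)
    Returns (outputs, weights), one row per query.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return outputs, weights


# Toy example: one query that aligns with the first key.
queries = [[4.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
outputs, weights = cross_attention(queries, keys, values)
print(weights[0])
```

The attention weights concentrate on the key most similar to the query. The premise under scrutiny is whether such weights remain informative when few distinctive features exist, which this toy cannot test.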
What would settle it
A dataset of images containing large feature-sparse areas where cross-attention maps consistently miss the true correspondence regions and produce alignment errors exceeding those of current state-of-the-art methods.
Original abstract
Existing methods for image alignment struggle in cases involving feature-sparse regions, extreme scale and field-of-view differences, and large deformations, often resulting in suboptimal accuracy. Robustness to these challenges can be improved through iterative refinement of the transform field while focusing on critical regions in multi-scale image representations. We thus propose Auto-Regressive Transformation (ART), a novel method that iteratively estimates the coarse-to-fine transformations through an auto-regressive pipeline. Leveraging hierarchical multi-scale features, our network refines the transform field parameters using randomly sampled points at each scale. By incorporating guidance from the cross-attention layer, the model focuses on critical regions, ensuring accurate alignment even in challenging, feature-limited conditions. Extensive experiments demonstrate that ART significantly outperforms state-of-the-art methods on planar images and achieves comparable performance on 3D scene images, establishing it as a powerful and versatile solution for precise image alignment.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Auto-Regressive Transformation (ART), a novel image alignment method that iteratively estimates coarse-to-fine transformations via an auto-regressive pipeline. It extracts hierarchical multi-scale features, refines transform parameters from randomly sampled points at each scale, and uses cross-attention to focus on critical regions, with the goal of improving robustness to feature-sparse areas, extreme scale/FOV differences, and large deformations. Experiments are claimed to show significant outperformance versus state-of-the-art on planar images and comparable results on 3D scenes.
Significance. If the performance claims are substantiated, the auto-regressive coarse-to-fine refinement with cross-attention guidance offers a plausible route to better handling of challenging alignment cases that defeat current methods. The approach is internally consistent and does not reduce to prior fitted parameters; the novelty lies in the pipeline design rather than circular reuse of earlier results.
major comments (1)
- [Method] Method section (pipeline description): the central robustness claim for feature-sparse regions rests on the combination of random point sampling at each scale plus cross-attention guidance. No ablation isolating the sampling strategy or error stratified by local feature density is described, leaving open whether the sampled set remains informative when repeatable features are absent; this directly threatens the headline outperformance result on planar images.
minor comments (2)
- [Abstract] Abstract: the statement that 'extensive experiments demonstrate' superiority would be strengthened by naming the primary datasets, key metrics (e.g., mean endpoint error or success rate), and at least one quantitative delta versus the strongest baseline.
- [Experiments] Experiments: absence of detailed quantitative tables, error analysis, or ablation studies in the visible text makes the superiority claim only moderately verifiable; adding these would allow readers to assess effect sizes and failure modes.
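For readers unfamiliar with the metrics named in the minor comments, the sketch below shows one common way to compute a mean endpoint error between an estimated and a ground-truth homography, and a success rate at a threshold. The choice of evaluation points (image corners) and the threshold are assumptions for illustration; the paper's exact evaluation protocol is not given on this page.

```python
import math


def warp(H, p):
    """Apply a 3x3 homography H (nested lists) to a 2D point p."""
    x, y = p
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (
        (H[0][0] * x + H[0][1] * y + H[0][2]) / w,
        (H[1][0] * x + H[1][1] * y + H[1][2]) / w,
    )


def mean_endpoint_error(H_est, H_gt, points):
    """Average Euclidean distance between points warped by each homography."""
    return sum(math.dist(warp(H_est, p), warp(H_gt, p)) for p in points) / len(points)


def success_rate(errors, threshold):
    """Fraction of image pairs whose error falls below the threshold."""
    return sum(e < threshold for e in errors) / len(errors)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
shifted = [[1.0, 0.0, 3.0], [0.0, 1.0, 4.0], [0.0, 0.0, 1.0]]  # translate by (3, 4)
corners = [(0.0, 0.0), (100.0, 0.0), (100.0, 100.0), (0.0, 100.0)]

mee = mean_endpoint_error(identity, shifted, corners)
rate = success_rate([1.0, 2.0, 5.0, 10.0], threshold=3.0)
print(mee, rate)
```

Reporting both numbers per dataset, along with the gap to the strongest baseline, would address the referee's request for a quantitative delta.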
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comment below and outline the revisions we will make to strengthen the manuscript.
- Referee: [Method] Method section (pipeline description): the central robustness claim for feature-sparse regions rests on the combination of random point sampling at each scale plus cross-attention guidance. No ablation isolating the sampling strategy or error stratified by local feature density is described, leaving open whether the sampled set remains informative when repeatable features are absent; this directly threatens the headline outperformance result on planar images.
  Authors: We acknowledge the value of an explicit ablation isolating the random sampling strategy and an error analysis stratified by local feature density. The design rationale is that random sampling at each scale deliberately avoids dependence on repeatable keypoints, allowing the network to draw from any image locations while cross-attention modulates focus toward regions that contribute most to alignment. In the revised manuscript we will add an ablation that replaces random sampling with feature-based point selection (e.g., using SIFT or SuperPoint) and report both overall alignment error and error binned by local feature density computed via keypoint counts in image patches. These results will be placed in the experiments section to directly support the robustness claim on planar images.
  Revision: yes
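The stratified analysis promised in the rebuttal could look roughly like the sketch below. It counts keypoints per image patch and averages alignment error within density bins. The patch size, the bin thresholds in `density_bin`, and the sample data are hypothetical; the rebuttal only commits to binning by keypoint counts in patches.

```python
from collections import defaultdict


def patch_counts(keypoints, patch):
    """Count keypoints falling in each (column, row) patch of side `patch`."""
    counts = defaultdict(int)
    for x, y in keypoints:
        counts[(int(x // patch), int(y // patch))] += 1
    return counts


def density_bin(count):
    """Map a keypoint count to a density label (thresholds are assumptions)."""
    if count == 0:
        return "sparse"
    if count < 3:
        return "medium"
    return "dense"


def stratified_error(samples, keypoints, patch=32):
    """Mean alignment error per density bin.

    samples: list of ((x, y), error) pairs at evaluated locations.
    """
    counts = patch_counts(keypoints, patch)
    sums, hits = defaultdict(float), defaultdict(int)
    for (x, y), err in samples:
        label = density_bin(counts.get((int(x // patch), int(y // patch)), 0))
        sums[label] += err
        hits[label] += 1
    return {label: sums[label] / hits[label] for label in sums}


keypoints = [(1, 1), (2, 2), (3, 3), (40, 40)]
samples = [((5, 5), 1.0), ((50, 50), 2.0), ((50, 5), 6.0), ((5, 50), 4.0)]
report = stratified_error(samples, keypoints)
print(report)
```

If the error in the sparse bin stays close to the dense bin for random sampling but not for detector-based selection, that would support the robustness claim; the reverse would undercut it.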
Circularity Check
No significant circularity in ART derivation or claims
The paper introduces a new neural architecture (ART) that performs iterative coarse-to-fine transform refinement via an auto-regressive pipeline on hierarchical features, random point sampling per scale, and cross-attention guidance. All load-bearing elements are presented as design choices in a novel network, with performance claims resting on experimental comparisons rather than any mathematical reduction, fitted-parameter renaming, or self-citation chain. No equations or steps in the abstract or described method reduce to inputs by construction; the derivation is self-contained as an empirical architecture proposal.
Axiom & Free-Parameter Ledger
free parameters (2)
- number of hierarchical scales
- number of randomly sampled points per scale
axioms (1)
- domain assumption: A neural network trained on image pairs can learn to predict accurate transformation parameters from multi-scale features and attention signals.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: ART employs an auto-regressive approach, iteratively sampling and refining local transform parameters by joint estimation for a set of points in a coarse-to-fine manner guided by multi-scale representations.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.