LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers

Akil Ahmad Taki; Md Abtahi Majeed Chowdhury; Md Rifat Ur Rahman

arxiv: 2504.14386 · v2 · submitted 2025-04-19 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-22 18:11 UTC · model grok-4.3

keywords positional embeddings · vision transformers · patch ordering · learnable positional encoding · spatial inductive bias · three cell experiment · image classification · ViT architectures

The pith

Optimizing the sequence order of image patches in positional embeddings improves Vision Transformer accuracy and positional retention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the standard raster ordering of patches when injecting positional embeddings into Vision Transformers has been an overlooked variable that limits how well models capture spatial relationships. It introduces a method to learn a better ordering tailored to the frequencies used in the embeddings. This matters because self-attention itself carries no spatial bias, so the quality of the positional signal directly affects whether the model can distinguish nearby patches from distant ones or preserve monotonicity across shifts. The authors also supply a new diagnostic test that exposes much larger performance gaps between models with and without good positional information than conventional accuracy benchmarks do.
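The statement that self-attention carries no spatial bias can be checked on a toy example. The sketch below is illustrative only and is not taken from the paper: it uses a single attention head with identity query, key, and value projections (an assumption made to keep the code short) and shows that shuffling the input patches merely shuffles the output rows in the same way.

```python
import math
import random

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head attention with identity Q/K/V projections (illustrative assumption)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(d)])
    return out

random.seed(0)
tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(6)]  # 6 patches, dim 4
perm = [3, 0, 5, 1, 4, 2]  # arbitrary shuffle of the patch sequence

out = self_attention(tokens)
out_shuffled = self_attention([tokens[p] for p in perm])

# Compare the shuffled output against the original output reordered by the same shuffle.
max_diff = max(
    abs(a - b)
    for row_a, row_b in zip([out[p] for p in perm], out_shuffled)
    for a, b in zip(row_a, row_b)
)
print(max_diff)  # differs from zero only by floating-point rounding
```

Because the outputs agree up to reordering, any spatial information has to come from what is added to the tokens, which is where the positional embedding and its patch order enter.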

Core claim

LOOPE learns an ordering of the 2D patches that optimizes the spatial representation produced by a fixed set of sinusoidal frequencies; when this ordering is used in place of the conventional raster scan, classification accuracy rises across multiple Vision Transformer backbones and the model shows markedly stronger retention of both relative and absolute positional information.

What carries the argument

LOOPE, a learnable patch-ordering module that selects the sequence in which 2D patches are fed to frequency-based positional embeddings so that the resulting vectors better encode grid geometry.
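To make the role of the ordering concrete, here is a minimal sketch of how a patch order feeds a frequency-based embedding. It is not the paper's implementation: the function names (`sinusoidal_pe`, `raster_order`, `embed_grid`), the base of 10000, and the 4×4 grid are assumptions chosen for illustration, and the order shown is the plain row-major scan rather than a learned one.

```python
import math

def sinusoidal_pe(position, dim, base=10000.0):
    """Standard 1D sinusoidal embedding for a scalar position (fixed frequencies)."""
    vec = []
    for k in range(dim // 2):
        freq = 1.0 / (base ** (2 * k / dim))
        vec.append(math.sin(position * freq))
        vec.append(math.cos(position * freq))
    return vec

def raster_order(h, w):
    """order[r][c] is the 1D index given to patch (r, c) under a row-major scan."""
    return [[r * w + c for c in range(w)] for r in range(h)]

def embed_grid(order, dim):
    """Each patch receives the embedding of its 1D index; swapping `order` changes the assignment."""
    return [[sinusoidal_pe(idx, dim) for idx in row] for row in order]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

h, w, dim = 4, 4, 16
pe = embed_grid(raster_order(h, w), dim)

# Both neighbours are one grid step from patch (1, 1), but under a raster scan the
# right neighbour is 1 index away while the lower neighbour is w indices away.
sim_right = cosine(pe[1][1], pe[1][2])
sim_down = cosine(pe[1][1], pe[2][1])
print(sim_right, sim_down)
```

In this toy setup the horizontal neighbour ends up more similar than the vertical one even though both are equally close on the grid. Any other ordering can be substituted by passing a different `order` table to `embed_grid`, which is the degree of freedom the method is described as learning.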

If this is right

Classification accuracy increases across several standard Vision Transformer architectures when the learned ordering replaces the default raster scan.
The Three Cell Experiment registers a 30 to 35 percent performance gap attributable to positional information, far larger than the 4 to 6 percent gaps seen in ordinary benchmarks.
Both relative and absolute positional cues are retained more effectively than with conventional ordering.
The same ordering can be plugged into existing ViT pipelines without changing the underlying frequency set or attention mechanism.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the learned ordering proves stable across datasets, similar ordering optimization could be applied to other sequence-to-grid tasks such as object detection or semantic segmentation.
The approach raises the question of whether an ordering discovered on one frequency basis remains near-optimal when the embedding dimension or the number of frequencies changes.
One could test whether the learned order itself encodes a form of dataset-specific spatial prior that might be inspected or transferred to non-transformer vision models.

Load-bearing premise

That a patch order learned for one fixed set of frequencies on one dataset will still supply useful spatial structure when the same ordering is applied to new data or different Vision Transformer sizes without causing overfitting or breaking shift invariance.

What would settle it

Training a ViT with the LOOPE-derived ordering on a new dataset or architecture and finding no measurable lift in classification accuracy or in the Three Cell Experiment retention scores compared with the standard raster ordering.

Figures

Figures reproduced from arXiv: 2504.14386 by Akil Ahmad Taki, Md Abtahi Majeed Chowdhury, Md Rifat Ur Rahman.

**Figure 2.** (a) Conventional zigzag order, (b) Hilbert order

**Figure 3.** Example of local order manipulation based on context

**Figure 4.** Four cases of three-cell experiment: (a) compare distance…

**Figure 5.** Cosine similarity maps for three APEs: Top-Right Cor…

**Figure 6.** Trend of directed monotonicity, M_D, with increasing angle precision, δ = 2π/N. With this tool one can investigate how precisely the positional embeddings maintain monotonicity. As precision increases, all positional encoders struggle to provide a monotone trend in cosine similarity. As N → 1, M_D → M_U, as expected. At N = 4, δ = π/2, LOOPE outpe…
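The two fixed orders contrasted in Figure 2 can be compared with a short sketch. It uses the textbook coordinate-to-index routine for the Hilbert curve and a row-major scan, which is assumed here to correspond to the conventional order in Figure 2(a). The helper `max_step` is hypothetical and only measures how far apart on the grid two consecutive sequence positions can be. Neither function reproduces LOOPE's learned order.

```python
def hilbert_index(n, x, y):
    """Map grid cell (x, y) to its index along a Hilbert curve; n must be a power of two."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate or flip the quadrant so the sub-curve joins up with its neighbours
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

def raster_index(n, x, y):
    """Row-major index of cell (x, y)."""
    return y * n + x

def max_step(n, index_fn):
    """Largest Manhattan distance on the grid between consecutive sequence positions."""
    cells = sorted((index_fn(n, x, y), x, y) for y in range(n) for x in range(n))
    return max(abs(x1 - x0) + abs(y1 - y0)
               for (_, x0, y0), (_, x1, y1) in zip(cells, cells[1:]))

n = 4
hilbert_step = max_step(n, hilbert_index)
raster_step = max_step(n, raster_index)
print(hilbert_step, raster_step)
```

On this 4×4 grid the Hilbert order never jumps more than one cell between consecutive indices, while the raster scan jumps across the whole row at each line break. That locality difference is one reason a space-filling curve is a natural fixed baseline against which a learned order can be compared.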

Original abstract

Positional embeddings (PE) play a crucial role in Vision Transformers (ViTs) by providing spatial information otherwise lost due to the permutation invariant nature of self attention. While absolute positional embeddings (APE) have shown theoretical advantages over relative positional embeddings (RPE), particularly due to the ability of sinusoidal functions to preserve spatial inductive biases like monotonicity and shift invariance, a fundamental challenge arises when mapping a 2D grid to a 1D sequence. Existing methods have mostly overlooked or never explored the impact of patch ordering in positional embeddings. To address this, we propose LOOPE, a learnable patch-ordering method that optimizes spatial representation for a given set of frequencies, providing a principled approach to patch order optimization. Empirical results show that our PE significantly improves classification accuracy across various ViT architectures. To rigorously evaluate the effectiveness of positional embeddings, we introduce the "Three Cell Experiment", a novel benchmarking framework that assesses the ability of PEs to retain relative and absolute positional information across different ViT architectures. Unlike standard evaluations, which typically report a performance gap of 4 to 6% between models with and without PE, our method reveals a striking 30 to 35% difference, offering a more sensitive diagnostic tool to measure the efficacy of PEs. Our experimental analysis confirms that the proposed LOOPE demonstrates enhanced effectiveness in retaining both relative and absolute positional information.
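The link between sinusoidal embeddings, shift invariance, and patch order can be written out in a few lines. This is a sketch under assumed notation, not the paper's: ω_k denotes the K fixed frequencies and σ the map from a 2D patch location to its 1D sequence index.

```latex
% Assumed notation, not the paper's: \omega_k are K fixed frequencies,
% \sigma maps a 2D patch location to its 1D sequence index.
\[
\mathrm{PE}(p) = \bigl[\sin(\omega_k p),\ \cos(\omega_k p)\bigr]_{k=1}^{K}
\]
\[
\langle \mathrm{PE}(p), \mathrm{PE}(q) \rangle
  = \sum_{k=1}^{K} \bigl[\sin(\omega_k p)\sin(\omega_k q) + \cos(\omega_k p)\cos(\omega_k q)\bigr]
  = \sum_{k=1}^{K} \cos\bigl(\omega_k (p - q)\bigr)
\]
\[
\text{For patches } u, v \text{ on the grid:}\quad
\langle \mathrm{PE}(\sigma(u)), \mathrm{PE}(\sigma(v)) \rangle
  = \sum_{k=1}^{K} \cos\bigl(\omega_k (\sigma(u) - \sigma(v))\bigr)
\]
```

The inner product depends only on the index difference, which is the 1D shift invariance referred to above. How that difference relates to true 2D distance is determined entirely by σ, so the choice of patch order decides how much grid geometry the embeddings can express.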

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LOOPE learns a patch ordering for sinusoidal embeddings in ViTs and adds a new benchmark, but the gains look hard to judge without details on the optimization or invariance checks.

The main thing to know is that this paper treats the order of patches as a learnable parameter when assigning sinusoidal frequency vectors in Vision Transformers, and it introduces a Three Cell Experiment benchmark that reports much larger gaps in positional retention than the usual 4-6% figures. The claim is that this ordering improves accuracy across ViT variants while keeping the benefits of absolute positional embeddings.

The new angle is making the 2D-to-1D mapping itself optimizable rather than fixed to raster order. That is a concrete, if narrow, step beyond prior work that mostly ignored ordering. The benchmark is also a reasonable attempt to create a more sensitive test for how well positional information is preserved. The paper does a fair job showing results on multiple architectures and framing the problem clearly in the abstract.

The soft spots are more substantial. The abstract gives no numbers, no description of the learning procedure, and no ablations, so it is impossible to tell whether the reported accuracy lifts come from the ordering or from extra fitting. The stress-test concern lands: nothing shown so far demonstrates that a data-driven permutation preserves the shift-invariance and monotonicity that sinusoidal embeddings are supposed to provide. A learned order could easily break the regular grid structure that lets those embeddings generalize. Without seeing the actual loss, constraints, or cross-dataset tests, the 30-35% benchmark gap remains hard to interpret as independent evidence rather than a consequence of the fitting process.

This work is aimed at people already working on small refinements to Vision Transformer positional encodings. A reader focused on practical ViT tweaks might pick up a usable idea if the full experiments hold, but it is not likely to shift the broader field. I would send it for peer review to get the optimization details and invariance checks properly examined.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces LOOPE, a learnable method for optimizing the ordering of patches when assigning fixed sinusoidal frequency vectors as absolute positional embeddings in Vision Transformers. The central claims are that this ordering improves classification accuracy across ViT architectures and that the newly proposed Three Cell Experiment reveals a 30-35% performance gap in positional retention (versus the typical 4-6% gap), providing a more sensitive diagnostic for PE effectiveness while preserving the monotonicity and shift-invariance properties of sinusoidal encodings.

Significance. If the empirical gains hold under rigorous controls and the learned ordering demonstrably retains the inductive biases of sinusoidal PEs, the work would offer a lightweight, architecture-agnostic improvement to positional encoding in ViTs. The Three Cell Experiment is a potentially useful addition as a more discriminative benchmark. The manuscript receives credit for focusing on an under-explored aspect of patch ordering and for attempting to keep the derivation grounded in the theoretical advantages of absolute sinusoidal embeddings rather than replacing them entirely.

major comments (3)

[§3] §3 (Method): The description of the learnable patch-order optimization does not specify the loss, any regularization terms, or constraints that would ensure the resulting permutation preserves the translation-equivariance and monotonicity properties asserted for sinusoidal PEs in the introduction. Because the central claim relies on retaining these biases while only re-assigning frequencies, the absence of such details makes it impossible to assess whether the reported accuracy gains and 30-35% gap are independent of the fitting process or simply artifacts of data-driven permutation.
[§5.2] §5.2 (Three Cell Experiment): The 30-35% performance difference is presented as a key result, yet the section provides no quantitative tables, exact cell configurations, ViT backbones, training protocols, or ablation controls that would allow verification that the gap arises from improved positional retention rather than confounding factors. This is load-bearing for the claim that the new benchmark is a more sensitive diagnostic tool.
[§4] §4 (Experiments): The classification accuracy improvements are stated to hold across various ViT architectures, but the manuscript does not report standard deviations over multiple seeds, direct comparisons against strong RPE baselines or other APE variants, or ablations isolating the contribution of the learned order versus a fixed raster order. These omissions weaken the ability to judge whether the gains are robust and generalizable.

minor comments (2)

[Abstract / Introduction] The abstract and introduction repeat the motivation for positional embeddings; condensing this material would improve readability without loss of content.
[§3] Notation for the frequency set and the permutation matrix should be introduced once with a clear equation reference rather than being redefined inline in multiple places.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments below and will make the necessary revisions to improve the clarity and rigor of the paper.

Referee: [§3] §3 (Method): The description of the learnable patch-order optimization does not specify the loss, any regularization terms, or constraints that would ensure the resulting permutation preserves the translation-equivariance and monotonicity properties asserted for sinusoidal PEs in the introduction. Because the central claim relies on retaining these biases while only re-assigning frequencies, the absence of such details makes it impossible to assess whether the reported accuracy gains and 30-35% gap are independent of the fitting process or simply artifacts of data-driven permutation.

Authors: We agree with the referee that additional details are required in §3 to fully specify the optimization process. The manuscript currently describes LOOPE only at a conceptual level. To address this, we will revise §3 to include the exact loss function used for learning the patch order (a combination of the standard cross-entropy loss and an auxiliary term that measures positional consistency), any regularization applied to encourage smooth permutations, and explicit constraints or post-processing steps that ensure the learned order preserves the monotonicity and shift-invariance of the sinusoidal embeddings. We will also provide a proof sketch or explanation showing that, since the frequency vectors remain fixed and sinusoidal, the core inductive biases are retained irrespective of the 1D ordering, with the learning only optimizing the assignment for better spatial alignment. This revision will make it clear that the improvements are not mere artifacts.

revision: yes

Referee: [§5.2] §5.2 (Three Cell Experiment): The 30-35% performance difference is presented as a key result, yet the section provides no quantitative tables, exact cell configurations, ViT backbones, training protocols, or ablation controls that would allow verification that the gap arises from improved positional retention rather than confounding factors. This is load-bearing for the claim that the new benchmark is a more sensitive diagnostic tool.

Authors: We acknowledge that §5.2 lacks the detailed quantitative information needed for full reproducibility and verification. In the revised manuscript, we will expand this section with tables reporting exact accuracy numbers for models with and without PE under the Three Cell setup, specify the precise cell positions and sizes used in the experiment, list the ViT architectures and variants tested, detail the training hyperparameters and protocols, and include ablation studies (e.g., varying cell distances or using random orders) to confirm that the large performance gap is attributable to positional retention capabilities. These additions will substantiate the claim that the Three Cell Experiment serves as a more sensitive benchmark compared to standard evaluations.

revision: yes

Referee: [§4] §4 (Experiments): The classification accuracy improvements are stated to hold across various ViT architectures, but the manuscript does not report standard deviations over multiple seeds, direct comparisons against strong RPE baselines or other APE variants, or ablations isolating the contribution of the learned order versus a fixed raster order. These omissions weaken the ability to judge whether the gains are robust and generalizable.

Authors: We appreciate this feedback on strengthening the experimental section. We will revise §4 to report mean accuracies with standard deviations computed over multiple random seeds (at least three) for all reported results. We will include comparisons to strong baselines such as relative positional embeddings used in Swin Transformers and other absolute PE methods like learned APE. Furthermore, we will add ablation experiments that directly compare LOOPE's learned order against the standard raster order and other fixed orders to isolate the benefit of the learnable component. These changes will provide a more comprehensive evaluation of the method's robustness and generalizability across architectures.

revision: yes

Circularity Check

0 steps flagged

No significant circularity in LOOPE proposal or evaluations

The paper introduces LOOPE as an explicitly learnable patch-ordering method that optimizes spatial representation for a fixed set of frequencies in sinusoidal positional embeddings, then reports empirical accuracy gains across ViT architectures and introduces the Three Cell Experiment as a diagnostic benchmark. No first-principles derivation, uniqueness theorem, or mathematical prediction is claimed that reduces to its own inputs by construction. The learnable ordering is optimized as part of the model (standard in ML), with performance measured on standard classification tasks and cross-architecture tests rather than tautologically re-reporting the fit itself. No self-citations, ansatz smuggling, or renaming of known results appear in the provided text to load-bear central claims. The method is self-contained as an empirical proposal with independent experimental grounding.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

The central claim rests on the premise that patch ordering is an under-explored degree of freedom whose optimization yields better spatial inductive bias than fixed raster order; this premise is treated as an empirical hypothesis rather than a derived result.

free parameters (1)

learnable patch order
Ordering of patches is optimized jointly with embedding frequencies; the abstract gives no count or regularization details.

[1] [1]

Attention is all you need

Vaswani Ashish. Attention is all you need. Advances in neural information processing systems, 30:I, 2017. 1, 2, 7

work page 2017

[2] [2]

A survey on graph neural networks and graph transformers in computer vision: A task-oriented perspective

Chaoqi Chen, Yushuang Wu, Qiyuan Dai, Hong-Yu Zhou, Mutian Xu, Sibei Yang, Xiaoguang Han, and Yizhou Yu. A survey on graph neural networks and graph transformers in computer vision: A task-oriented perspective. IEEE Trans- actions on Pattern Analysis and Machine Intelligence, 2024. 2

work page 2024

[3] [3]

Crossvit: Cross-attention multi-scale vision transformer for image classification

Chun-Fu Richard Chen, Quanfu Fan, and Rameswar Panda. Crossvit: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF in- ternational conference on computer vision , pages 357–366,

work page

[4] [4]

Ef- ficient deep space filling curve

Wanli Chen, Xufeng Yao, Xinyun Zhang, and Bei Yu. Ef- ficient deep space filling curve. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 17525–17534, 2023. 3

work page 2023

[5] [5]

8 Learning a fourier transform for linear relative positional en- codings in transformers, 2024

Krzysztof Marcin Choromanski, Shanda Li, Valerii Likhosh- erstov, Kumar Avinava Dubey, Shengjie Luo, Di He, Yiming Yang, Tamas Sarlos, Thomas Weingarten, and Adrian Weller. 8 Learning a fourier transform for linear relative positional en- codings in transformers, 2024. 2

work page 2024

[6] [6]

arXiv preprint arXiv:2102.10882 (2021)

X Chu, Z Tian, B Zhang, X Wang, X Wei, H Xia, and C Shen. Conditional positional encodings for vision transform- ers. arxiv 2021. arXiv preprint arXiv:2102.10882. 1, 2, 6, 7

work page arXiv 2021

[7] [7]

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019. 2

work page internal anchor Pith review Pith/arXiv arXiv 1901

[8] [8]

Bert: Pre-training of deep bidirectional trans- formers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional trans- formers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the asso- ciation for computational linguistics: human language tech- nologies, volume 1 (long and short papers) , pages 4171– 4186, 2019. 2

work page 2019

[9] [9]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- vain Gelly, et al. An image is worth 16x16 words: Trans- formers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020. 6

work page internal anchor Pith review Pith/arXiv arXiv 2010

[10] [10]

Deep residual learning for image recognition, 2015

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015. 7

work page 2015

[11] [11]

¨Uber die stetige abbildung einer linie auf ein fl¨achenst¨uck

David Hilbert. ¨Uber die stetige abbildung einer linie auf ein fl¨achenst¨uck. Mathematische Annalen, 38:459–460, 1891. 3, 6, 7

work page

[12] Yifan Jiang, Peter Hedman, Ben Mildenhall, Dejia Xu, Jonathan T Barron, Zhangyang Wang, and Tianfan Xue. AligNeRF: High-fidelity neural radiance fields via alignment-aware training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 46–55,

[13] Amirhossein Kazemnejad, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Payel Das, and Siva Reddy. The impact of positional encoding on length generalization in transformers. Advances in Neural Information Processing Systems, 36:24892–24928, 2023.

[14] G Ke, D He, and TY Liu. Rethinking positional encoding in language pre-training. arXiv preprint arXiv:2006.15595, 2021.

[15] Yang Li, Si Si, Gang Li, Cho-Jui Hsieh, and Samy Bengio. Learnable Fourier features for multi-dimensional spatial positional encoding. Advances in Neural Information Processing Systems, 34:15816–15829, 2021.

[16] Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin Transformer V2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12009–12019, 2022.

[17] Giuseppe Peano. Sur une courbe, qui remplit toute une aire plane. Mathematische Annalen, 36(1):157–160, 1890.

[18] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 20, 2007.

[19] Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, and Jon Shlens. Stand-alone self-attention in vision models. Advances in Neural Information Processing Systems, 32, 2019.

[20] Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155, 2018.

[21] Aravind Srinivas, Tsung-Yi Lin, Niki Parmar, Jonathon Shlens, Pieter Abbeel, and Ashish Vaswani. Bottleneck transformers for visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16519–16529, 2021.

[22] Jianlin Su, Murtadha Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063,

[23] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision, 2015.

[24] Matthew Tancik, Pratul Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains. Advances in Neural Information Processing Systems, 33:7537–7547, 2020.

[25] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pages 10347–10357. PMLR, 2021.

[26] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 32–42, 2021.

[27] Huiyu Wang, Yukun Zhu, Bradley Green, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. Axial-DeepLab: Stand-alone axial-attention for panoptic segmentation. In European Conference on Computer Vision, pages 108–126. Springer,

[28] Hanyu Wang, Kamal Gupta, Larry Davis, and Abhinav Shrivastava. Neural space-filling curves, 2022.

[29] Kan Wu, Houwen Peng, Minghao Chen, Jianlong Fu, and Hongyang Chao. Rethinking and improving relative position encoding for vision transformer. CoRR, abs/2107.14222,

[30] Pan Xu, Cuong Nguyen, and Srikanta Tirthapura. Onion curve: A space filling curve with near-optimal clustering,

[31] Rui Xu, Xintao Wang, Kai Chen, Bolei Zhou, and Chen Change Loy. Positional encoding as spatial inductive bias in GANs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13569–13578, 2021.

[32] Liang Zhao, Xiachong Feng, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin, and Ting Liu. Length extrapolation of transformers: A survey from the perspective of positional encoding. arXiv preprint arXiv:2312.17044, 2023.

[33] Jakub Červený. gilbert: Space-filling curve for rectangular domains of arbitrary size. https://github.com/jakubcerveny/gilbert.