LOOPE: Learnable Optimal Patch Order in Positional Embeddings for Vision Transformers
Pith reviewed 2026-05-22 18:11 UTC · model grok-4.3
The pith
Optimizing the sequence order of image patches in positional embeddings improves Vision Transformer accuracy and positional retention.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LOOPE learns an ordering of the 2D patches that optimizes the spatial representation produced by a fixed set of sinusoidal frequencies; when this ordering is used in place of the conventional raster scan, classification accuracy rises across multiple Vision Transformer backbones and the model shows markedly stronger retention of both relative and absolute positional information.
What carries the argument
LOOPE, a learnable patch-ordering module that selects the sequence in which 2D patches are fed to frequency-based positional embeddings so that the resulting vectors better encode grid geometry.
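A minimal sketch of what "selecting the sequence" means in practice, assuming the standard 1D sinusoidal formulation. The function names are hypothetical, and the snake (boustrophedon) scan is only a hand-written stand-in for a learned order; this is not the authors' implementation. The frequencies stay fixed, and only the 1D position assigned to each patch changes.

```python
import math

def sinusoidal_pe(position, dim, base=10000.0):
    """Standard 1D sinusoidal embedding for one integer position."""
    vec = []
    for i in range(0, dim, 2):
        freq = 1.0 / (base ** (i / dim))
        vec.append(math.sin(position * freq))
        vec.append(math.cos(position * freq))
    return vec[:dim]

def raster_order(h, w):
    """rank[r][c] is the 1D position of patch (r, c) under a row-major scan."""
    return [[r * w + c for c in range(w)] for r in range(h)]

def snake_order(h, w):
    """Stand-in for a learned order: odd rows are traversed right to left."""
    return [[r * w + (c if r % 2 == 0 else w - 1 - c) for c in range(w)]
            for r in range(h)]

def grid_pe(rank, dim):
    """Look up the fixed-frequency embedding for each patch's 1D position."""
    return [[sinusoidal_pe(p, dim) for p in row] for row in rank]
```

Swapping `raster_order` for another rank grid leaves `sinusoidal_pe` untouched, which is the sense in which the ordering can change without altering the frequency set.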
If this is right
- Classification accuracy increases across several standard Vision Transformer architectures when the learned ordering replaces the default raster scan.
- The Three Cell Experiment registers a 30 to 35 percent performance gap attributable to positional information, far larger than the 4 to 6 percent gaps seen in ordinary benchmarks.
- Both relative and absolute positional cues are retained more effectively than with conventional ordering.
- The same ordering can be plugged into existing ViT pipelines without changing the underlying frequency set or attention mechanism.
Where Pith is reading between the lines
- If the learned ordering proves stable across datasets, similar ordering optimization could be applied to other sequence-to-grid tasks such as object detection or semantic segmentation.
- The approach raises the question of whether an ordering discovered on one frequency basis remains near-optimal when the embedding dimension or the number of frequencies changes.
- One could test whether the learned order itself encodes a form of dataset-specific spatial prior that might be inspected or transferred to non-transformer vision models.
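One simple way to inspect an ordering, sketched here under the assumption that an order is stored as a grid of 1D ranks: measure how far apart grid-adjacent patches land in the sequence. The metric and the names `neighbour_gaps` and `locality_summary` are illustrative choices, not something the paper is known to use.

```python
def neighbour_gaps(rank):
    """1D position gaps between all horizontally and vertically adjacent patches."""
    h, w = len(rank), len(rank[0])
    gaps = []
    for r in range(h):
        for c in range(w):
            if c + 1 < w:  # right-hand neighbour
                gaps.append(abs(rank[r][c] - rank[r][c + 1]))
            if r + 1 < h:  # neighbour below
                gaps.append(abs(rank[r][c] - rank[r + 1][c]))
    return gaps

def locality_summary(rank):
    """Return (mean gap, max gap) over all 4-neighbour pairs."""
    gaps = neighbour_gaps(rank)
    return sum(gaps) / len(gaps), max(gaps)
```

On a 2×3 grid, a raster scan and a snake scan have the same mean gap but different maximum gaps, so a single summary number can hide differences between orderings; reporting more than one statistic is the safer choice.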
Load-bearing premise
That a patch order learned for one fixed set of frequencies on one dataset will still supply useful spatial structure when the same ordering is applied to new data or different Vision Transformer sizes without causing overfitting or breaking shift invariance.
What would settle it
Training a ViT with the LOOPE-derived ordering on a new dataset or architecture and finding no measurable lift in classification accuracy or in the Three Cell Experiment retention scores compared with the standard raster ordering.
Original abstract
Positional embeddings (PE) play a crucial role in Vision Transformers (ViTs) by providing spatial information otherwise lost due to the permutation invariant nature of self attention. While absolute positional embeddings (APE) have shown theoretical advantages over relative positional embeddings (RPE), particularly due to the ability of sinusoidal functions to preserve spatial inductive biases like monotonicity and shift invariance, a fundamental challenge arises when mapping a 2D grid to a 1D sequence. Existing methods have mostly overlooked or never explored the impact of patch ordering in positional embeddings. To address this, we propose LOOPE, a learnable patch-ordering method that optimizes spatial representation for a given set of frequencies, providing a principled approach to patch order optimization. Empirical results show that our PE significantly improves classification accuracy across various ViT architectures. To rigorously evaluate the effectiveness of positional embeddings, we introduce the "Three Cell Experiment", a novel benchmarking framework that assesses the ability of PEs to retain relative and absolute positional information across different ViT architectures. Unlike standard evaluations, which typically report a performance gap of 4 to 6% between models with and without PE, our method reveals a striking 30 to 35% difference, offering a more sensitive diagnostic tool to measure the efficacy of PEs. Our experimental analysis confirms that the proposed LOOPE demonstrates enhanced effectiveness in retaining both relative and absolute positional information.
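The 2D-to-1D difficulty mentioned in the abstract can be illustrated with a short sketch, assuming standard sinusoidal embeddings and a 14-patch-wide grid (the width, dimension, and helper names are assumptions for illustration). The dot product of two embeddings depends only on the difference of their 1D positions, so under a raster scan a horizontal neighbour (offset 1) and a vertical neighbour (offset equal to the grid width) receive different similarities even though both are one patch away.

```python
import math

def pe(position, dim=8, base=10000.0):
    """1D sinusoidal embedding with interleaved sin/cos pairs."""
    out = []
    for i in range(0, dim, 2):
        w = base ** (-i / dim)
        out += [math.sin(position * w), math.cos(position * w)]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

GRID_WIDTH = 14  # assumed grid width, e.g. a 224-pixel image with 16-pixel patches

horizontal = dot(pe(0), pe(1))          # right-hand neighbour under raster scan
shifted = dot(pe(5), pe(6))             # same offset, shifted start position
vertical = dot(pe(0), pe(GRID_WIDTH))   # neighbour directly below under raster scan
```

Here `horizontal` and `shifted` agree up to floating-point error (shift invariance in 1D), while `vertical` differs from both, which is the asymmetry a better patch order would aim to reduce.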
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces LOOPE, a learnable method for optimizing the ordering of patches when assigning fixed sinusoidal frequency vectors as absolute positional embeddings in Vision Transformers. The central claims are that this ordering improves classification accuracy across ViT architectures and that the newly proposed Three Cell Experiment reveals a 30-35% performance gap in positional retention (versus the typical 4-6% gap), providing a more sensitive diagnostic for PE effectiveness while preserving the monotonicity and shift-invariance properties of sinusoidal encodings.
Significance. If the empirical gains hold under rigorous controls and the learned ordering demonstrably retains the inductive biases of sinusoidal PEs, the work would offer a lightweight, architecture-agnostic improvement to positional encoding in ViTs. The Three Cell Experiment is a potentially useful addition as a more discriminative benchmark. The manuscript receives credit for focusing on an under-explored aspect of patch ordering and for attempting to keep the derivation grounded in the theoretical advantages of absolute sinusoidal embeddings rather than replacing them entirely.
major comments (3)
- [§3] §3 (Method): The description of the learnable patch-order optimization does not specify the loss, any regularization terms, or constraints that would ensure the resulting permutation preserves the translation-equivariance and monotonicity properties asserted for sinusoidal PEs in the introduction. Because the central claim relies on retaining these biases while only re-assigning frequencies, the absence of such details makes it impossible to assess whether the reported accuracy gains and 30-35% gap are independent of the fitting process or simply artifacts of data-driven permutation.
- [§5.2] §5.2 (Three Cell Experiment): The 30-35% performance difference is presented as a key result, yet the section provides no quantitative tables, exact cell configurations, ViT backbones, training protocols, or ablation controls that would allow verification that the gap arises from improved positional retention rather than confounding factors. This is load-bearing for the claim that the new benchmark is a more sensitive diagnostic tool.
- [§4] §4 (Experiments): The classification accuracy improvements are stated to hold across various ViT architectures, but the manuscript does not report standard deviations over multiple seeds, direct comparisons against strong RPE baselines or other APE variants, or ablations isolating the contribution of the learned order versus a fixed raster order. These omissions weaken the ability to judge whether the gains are robust and generalizable.
minor comments (2)
- [Abstract / Introduction] The abstract and introduction repeat the motivation for positional embeddings; condensing this material would improve readability without loss of content.
- [§3] Notation for the frequency set and the permutation matrix should be introduced once with a clear equation reference rather than being redefined inline in multiple places.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments on our manuscript. We address each of the major comments below and will make the necessary revisions to improve the clarity and rigor of the paper.
- Referee: [§3] §3 (Method): The description of the learnable patch-order optimization does not specify the loss, any regularization terms, or constraints that would ensure the resulting permutation preserves the translation-equivariance and monotonicity properties asserted for sinusoidal PEs in the introduction. Because the central claim relies on retaining these biases while only re-assigning frequencies, the absence of such details makes it impossible to assess whether the reported accuracy gains and 30-35% gap are independent of the fitting process or simply artifacts of data-driven permutation.
  Authors: We agree with the referee that additional details are required in §3 to fully specify the optimization process. The manuscript currently describes LOOPE only at a conceptual level. We will revise §3 to include the exact loss function used for learning the patch order (a combination of the standard cross-entropy loss and an auxiliary term that measures positional consistency), any regularization applied to encourage smooth permutations, and explicit constraints or post-processing steps that ensure the learned order preserves the monotonicity and shift-invariance of the sinusoidal embeddings. We will also provide a proof sketch or explanation showing that, because the frequency vectors remain fixed and sinusoidal, the core inductive biases are retained irrespective of the 1D ordering, with the learning only optimizing the assignment for better spatial alignment. This revision will make it clear that the improvements are not mere artifacts.
  Revision: yes
- Referee: [§5.2] §5.2 (Three Cell Experiment): The 30-35% performance difference is presented as a key result, yet the section provides no quantitative tables, exact cell configurations, ViT backbones, training protocols, or ablation controls that would allow verification that the gap arises from improved positional retention rather than confounding factors. This is load-bearing for the claim that the new benchmark is a more sensitive diagnostic tool.
  Authors: We acknowledge that §5.2 lacks the detailed quantitative information needed for full reproducibility and verification. In the revised manuscript, we will expand this section with tables reporting exact accuracy numbers for models with and without PE under the Three Cell setup, specify the precise cell positions and sizes used in the experiment, list the ViT architectures and variants tested, detail the training hyperparameters and protocols, and include ablation studies (e.g., varying cell distances or using random orders) to confirm that the large performance gap is attributable to positional retention capabilities. These additions will substantiate the claim that the Three Cell Experiment serves as a more sensitive benchmark compared to standard evaluations.
  Revision: yes
- Referee: [§4] §4 (Experiments): The classification accuracy improvements are stated to hold across various ViT architectures, but the manuscript does not report standard deviations over multiple seeds, direct comparisons against strong RPE baselines or other APE variants, or ablations isolating the contribution of the learned order versus a fixed raster order. These omissions weaken the ability to judge whether the gains are robust and generalizable.
  Authors: We appreciate this feedback on strengthening the experimental section. We will revise §4 to report mean accuracies with standard deviations computed over multiple random seeds (at least three) for all reported results. We will include comparisons to strong baselines such as relative positional embeddings used in Swin Transformers and other absolute PE methods like learned APE. Furthermore, we will add ablation experiments that directly compare LOOPE's learned order against the standard raster order and other fixed orders to isolate the benefit of the learnable component. These changes will provide a more comprehensive evaluation of the method's robustness and generalizability across architectures.
  Revision: yes
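The per-seed reporting promised in the last response could be computed along the following lines. This is a sketch using Python's standard `statistics` module; the helper names are hypothetical, and the inputs would be the per-seed accuracies from actual training runs, none of which are given in the source.

```python
import statistics

def summarize_runs(accuracies):
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two seeds to estimate a standard deviation")
    return statistics.mean(accuracies), statistics.stdev(accuracies)

def format_result(accuracies):
    """Format a result as 'mean ± std (n=seeds)' for a results table."""
    mean, std = summarize_runs(accuracies)
    return f"{mean:.2f} ± {std:.2f} (n={len(accuracies)})"
```

`statistics.stdev` is the sample (n-1) standard deviation, which is the usual choice when only a handful of seeds are available.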
Circularity Check
No significant circularity in LOOPE proposal or evaluations
The paper introduces LOOPE as an explicitly learnable patch-ordering method that optimizes spatial representation for a fixed set of frequencies in sinusoidal positional embeddings, then reports empirical accuracy gains across ViT architectures and introduces the Three Cell Experiment as a diagnostic benchmark. It claims no first-principles derivation, uniqueness theorem, or mathematical prediction that reduces to its own inputs by construction. The learnable ordering is optimized as part of the model, as is standard in ML, and performance is measured on standard classification tasks and cross-architecture tests rather than by re-reporting the fit itself. In the provided text, the central claims do not rest on self-citations, smuggled ansatzes, or renamed known results. The method is self-contained as an empirical proposal with independent experimental grounding.
Axiom & Free-Parameter Ledger
free parameters (1)
- learnable patch order