arxiv: 2604.05217 · v1 · submitted 2026-04-06 · 💻 cs.LG · cs.CL

On the Geometry of Positional Encodings in Transformers

Pith reviewed 2026-05-10 18:57 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords positional encodings · transformers · multidimensional scaling · Hellinger distance · stress · attention mechanisms · sequence modeling · neural tangent kernel

The pith

Transformers without positional signals cannot solve any task that depends on word order, and, under mild conditions, training forces distinct position vectors at every global minimum.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that transformers require positional encodings to process sequences where order matters. It shows that, under mild conditions, training produces a distinct vector representation for each position at every global minimum. The best approximation to an information-optimal encoding is built by classical multidimensional scaling (MDS) on the Hellinger distances between positional distributions, with quality measured by a single stress value. This construction yields a low-rank form with r(n+d) parameters instead of nd, which is a saving whenever the rank r is below nd/(n+d). Experiments on SST-2 and IMDB with BERT-base show that ALiBi attains lower stress than sinusoidal or rotary encodings.

Core claim

Four results are established. The Necessity Theorem states that any transformer without a positional signal fails on order-sensitive tasks. The Positional Separation Theorem states that training assigns distinct vectors to distinct positions at every global minimizer under mild verifiable conditions. The optimal encoding is the classical MDS embedding on the Hellinger distance matrix between positional distributions, with approximation quality given by stress. The resulting encoding has effective rank r = rank(B) at most n-1 and admits a minimal parametrization using only r(n+d) parameters.

What carries the argument

Classical multidimensional scaling applied to the Hellinger distance between positional distributions, which produces the minimum-stress embedding and the associated low-rank factorization of the position matrix.
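
The construction can be illustrated with a minimal NumPy sketch. It assumes that each "positional distribution" is a discrete distribution stored as a row of a matrix `P`; the toy Dirichlet samples, the function names, and the sizes are hypothetical and are not taken from the paper. The sketch computes the Hellinger distance matrix, double-centres its square to obtain a Gram matrix, and embeds the positions with the top eigenvectors.

```python
import numpy as np

def hellinger_matrix(P):
    """Pairwise Hellinger distances between the rows of P (each row sums to 1)."""
    S = np.sqrt(P)
    diff = S[:, None, :] - S[None, :, :]
    return np.sqrt(0.5 * np.sum(diff ** 2, axis=-1))

def classical_mds(D, dim):
    """Classical (Torgerson) MDS: embed an n x n distance matrix in `dim` dimensions."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:dim]    # indices of the largest `dim` eigenvalues
    lam = np.clip(eigvals[order], 0.0, None)   # clip tiny negative round-off values
    return eigvecs[:, order] * np.sqrt(lam), B

# Toy stand-in for positional distributions (an assumption, not the paper's data).
rng = np.random.default_rng(0)
n_positions, vocab = 6, 20
P = rng.dirichlet(np.ones(vocab), size=n_positions)

D = hellinger_matrix(P)
X, B = classical_mds(D, dim=n_positions - 1)

# With n - 1 dimensions the embedding should reproduce D up to round-off.
D_hat = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
max_err = float(np.abs(D - D_hat).max())
rank_B = int(np.linalg.matrix_rank(B, tol=1e-10))
```

Because the Hellinger distance is a Euclidean distance between square-root vectors, the toy Gram matrix has no genuinely negative eigenvalues, so truncating to fewer dimensions is the only source of stress in this sketch.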

If this is right

  • Any transformer must receive an explicit positional signal to solve tasks whose correct output depends on word order.
  • At every global minimum, distinct positions receive distinct learned vectors under the stated conditions (pairwise distinctness, which is weaker than linear independence).
  • Any candidate positional encoding can be scored by its stress relative to the MDS optimum on the Hellinger matrix.
  • The low-rank MDS solution can be stored and computed with r(n+d) parameters where r is at most n-1 rather than the full nd entries.
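
The parameter count in the last point can be checked with a small sketch. It assumes the encoding is an n × d matrix of rank r that is stored as two factors of shapes n × r and r × d; the helper names and the sizes n = 128, d = 768, r = 10 are illustrative choices, not values reported by the paper.

```python
import numpy as np

def param_counts(n, d, r):
    """Return (full, factored) parameter counts: n*d versus r*(n+d)."""
    return n * d, r * (n + d)

def low_rank_factors(E, r):
    """Split an n x d matrix of rank at most r into n x r and r x d factors via SVD."""
    U, s, Vt = np.linalg.svd(E, full_matrices=False)
    A = U[:, :r] * s[:r]   # n x r, singular values folded into the left factor
    C = Vt[:r, :]          # r x d
    return A, C

rng = np.random.default_rng(0)
n, d, r = 128, 768, 10
# Synthetic rank-r matrix standing in for a low-rank positional encoding.
E = rng.standard_normal((n, r)) @ rng.standard_normal((r, d))

A, C = low_rank_factors(E, r)
full, compact = param_counts(n, d, r)
recon_err = float(np.abs(E - A @ C).max())
```

The factored form is smaller only when r < nd/(n+d). For n = 128 and d = 768 that threshold is about 109.7, so a rank close to the upper bound n - 1 = 127 would not save parameters at these sizes.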

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The geometric construction may explain why certain hand-designed encodings generalize better than others on long sequences.
  • The stress metric offers a direct way to compare new positional encodings against the information-optimal baseline without retraining.
  • Similar MDS analysis could be applied to relative-position biases or to attention patterns in non-transformer sequence models.

Load-bearing premise

The mild and verifiable conditions under which training assigns distinct vectors to each sequence position at every global minimizer.

What would settle it

Train a transformer on an order-sensitive task such as next-position prediction or sequence reversal with all positional encodings removed and observe whether accuracy exceeds random guessing.
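
Before running such an experiment, the underlying symmetry can be checked numerically. The sketch below assumes a single unmasked self-attention head with random weights and no positional signal, which is a simplification of a full Transformer; all names are hypothetical. Permuting the input tokens permutes the outputs in the same way, so any order-insensitive readout such as mean pooling cannot distinguish the two orderings.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head, unmasked scaled dot-product self-attention without positions."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[1])
    return softmax(scores, axis=-1) @ V

rng = np.random.default_rng(0)
n, d = 5, 8
X = rng.standard_normal((n, d))                 # token embeddings only
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
perm = rng.permutation(n)

out = self_attention(X, Wq, Wk, Wv)
out_perm = self_attention(X[perm], Wq, Wk, Wv)

# Equivariance: shuffling the inputs shuffles the outputs identically.
equivariant = np.allclose(out[perm], out_perm)
# Invariance after pooling: the sequence-level summary ignores word order.
pooled_invariant = np.allclose(out.mean(axis=0), out_perm.mean(axis=0))
```

This check covers only the unmasked case. A causal attention mask breaks the symmetry, so it should be treated as a separate condition in any empirical test.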

Figures

Figures reproduced from arXiv: 2604.05217 by Giansalvo Cirrincione.

Figure 1: Synthetic corpus. Left: Hellinger distance matrix …
Figure 2: Synthetic corpus. Left: PMDS in its first two dimensions (coloured by position index); the three regimes separate cleanly. Right: stress of three encodings; PMDS achieves 241× lower stress than sinusoidal. PMDS achieves exact isometry. On both corpora, rank(B) ≤ d = 768, so the exact isometry condition of Proposition 6 is satisfied. ALiBi has unexpectedly low stress. ALiBi encodes only the scalar distance …
Figure 3: Stress of five positional encodings on SST-2 and IMDB ( …
Figure 4: Stress vs embedding dimension d (left) and cumulative variance explained by the top-d eigenvectors of B (right) for SST-2 (top) and IMDB (bottom). PMDS reaches zero at d = rank(B); sinusoidal and RoPE stress grows exponentially with d
Figure 5: Minimum pairwise separation $\min_{i \neq j} \lVert p_i^* - p_j^* \rVert$ during training on SST-2. Both curves remain strictly positive throughout, consistent with Theorem 4. The scratch model starts at 0.70 and remains flat; the pre-trained model starts at 0.35 and remains flat. The difference is explained by initialisation geometry, not by training dynamics. $\sqrt{2(1 - 1/\sqrt{768})} \approx 1.39$, so a separation of 0.70 for unnormalised…
Figure 6: Monotonicity violation rate for three encodings on SST-2.
Figure 7: Pairwise distances in three encodings vs Hellinger distances on SST-2. Left: …
Figure 8: Layer-wise stress of sinusoidal PE projected through …

Original abstract

Neural language models process sequences of words, but the mathematical operations inside them are insensitive to the order in which words appear. Positional encodings are the component added to remedy this. Despite their importance, positional encodings have been designed largely by trial and error, without a mathematical theory of what they ought to do. This paper develops such a theory. Four results are established. First, any Transformer without a positional signal cannot solve any task sensitive to word order (Necessity Theorem). Second, training assigns distinct vector representations to distinct sequence positions at every global minimiser, under mild and verifiable conditions (Positional Separation Theorem). Third, the best achievable approximation to an information-optimal encoding is constructed via classical multidimensional scaling (MDS) on the Hellinger distance between positional distributions; the quality of any encoding is measured by a single number, the stress (Proposition 5, Algorithm 1). Fourth, the optimal encoding has effective rank r = rank(B) <= n-1 and can be represented with r(n+d) parameters instead of nd (minimal parametrisation result). Appendix A develops a proof of the Monotonicity Conjecture within the Neural Tangent Kernel (NTK) regime for masked language modelling (MLM) losses, sequence classification losses, and general losses satisfying a positional sufficiency condition, through five lemmas. Experiments on SST-2 and IMDB with BERT-base confirm the theoretical predictions and reveal that Attention with Linear Biases (ALiBi) achieves much lower stress than the sinusoidal encoding and Rotary Position Embedding (RoPE), consistent with a rank-1 interpretation of the MDS encoding under approximate shift-equivariance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper develops a geometric theory of positional encodings in Transformers. It proves a Necessity Theorem showing that any Transformer without positional signal cannot solve order-sensitive tasks; a Positional Separation Theorem asserting that training produces distinct position vectors at every global minimizer under mild verifiable conditions; a construction (Proposition 5, Algorithm 1) of an information-optimal encoding via classical MDS on Hellinger distances between positional distributions, with stress as the quality metric; and a minimal-parametrization result that the optimal encoding has effective rank r = rank(B) ≤ n-1 and requires only r(n+d) parameters. Appendix A proves a Monotonicity Conjecture in the NTK regime for MLM, classification, and positionally sufficient losses. Experiments on SST-2 and IMDB with BERT-base confirm the predictions and show ALiBi attains lower stress than sinusoidal or RoPE encodings.

Significance. If the theorems hold under the stated conditions, the work supplies the first principled geometric account of why positional encodings are required and how they should be chosen, replacing trial-and-error design with an MDS construction whose stress provides a single scalar figure of merit. The effective-rank reduction and the empirical superiority of ALiBi offer concrete guidance for more parameter-efficient and better-performing position mechanisms. The NTK-based monotonicity analysis and the reproducible stress comparisons on standard benchmarks are additional strengths.

major comments (4)
  1. [Appendix A] Appendix A: the Monotonicity Conjecture proof invokes the Neural Tangent Kernel regime together with a 'positional sufficiency condition' on the loss; these assumptions must be stated precisely (including any requirements on width, initialization, or loss form) and their applicability to finite-width BERT training on non-MLM objectives must be verified, because violation would invalidate the claimed monotonicity of stress and separation.
  2. [Positional Separation Theorem] Positional Separation Theorem (abstract and main text): the 'mild and verifiable conditions' guaranteeing distinct position vectors at every global minimizer are not fully enumerated; the precise requirements on the loss, data distribution, and initialization must be listed so that readers can check whether they hold for standard masked-language-model or classification training.
  3. [Proposition 5, Algorithm 1] Proposition 5 and Algorithm 1: the MDS construction presupposes concrete positional distributions whose Hellinger distance matrix is then embedded; the paper must specify how these distributions are obtained from data (or from the model) and must quantify any distortion introduced by the distance choice, because such distortion directly affects the claimed bound rank(B) ≤ n-1 and the minimal-parametrization result.
  4. [Experiments] Experiments (SST-2/IMDB with BERT-base): the procedure for computing stress for sinusoidal, RoPE, and ALiBi encodings, together with any fitting or data-exclusion details, must be reported in full; without them the claim that ALiBi's lower stress is consistent with a rank-1 MDS interpretation cannot be independently verified.
minor comments (2)
  1. [Abstract] The abstract states that full derivations are absent; the main text should include at least the key steps or explicit references to the lemmas supporting the Necessity and Separation theorems.
  2. [Minimal parametrisation result] Notation for the matrix B whose rank defines the effective dimension should be introduced earlier and used consistently when discussing the minimal parametrization.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major comment below and will revise the manuscript accordingly to improve clarity and completeness.

Point-by-point responses
  1. Referee: [Appendix A] Appendix A: the Monotonicity Conjecture proof invokes the Neural Tangent Kernel regime together with a 'positional sufficiency condition' on the loss; these assumptions must be stated precisely (including any requirements on width, initialization, or loss form) and their applicability to finite-width BERT training on non-MLM objectives must be verified, because violation would invalidate the claimed monotonicity of stress and separation.

    Authors: We agree that the assumptions in Appendix A require more precise enumeration. In the revised manuscript we will explicitly state the NTK-regime requirements (infinite width, NTK parameterization, specific initialization) and the precise definition of the positional sufficiency condition on the loss. We will also add a discussion clarifying that the monotonicity result is rigorous only in the infinite-width limit; our BERT-base experiments on SST-2 and IMDB provide empirical consistency checks for finite-width classification but are not a formal verification. This limitation will be noted. revision: yes

  2. Referee: [Positional Separation Theorem] Positional Separation Theorem (abstract and main text): the 'mild and verifiable conditions' guaranteeing distinct position vectors at every global minimizer are not fully enumerated; the precise requirements on the loss, data distribution, and initialization must be listed so that readers can check whether they hold for standard masked-language-model or classification training.

    Authors: We accept that the conditions should be listed explicitly rather than described only as 'mild and verifiable.' In the revision we will enumerate them in the theorem statement: the loss must be strictly convex in the position embeddings, the data distribution must have full support over sequences of length n, and initialization must be non-degenerate (e.g., random with positive variance). We will briefly verify that these hold for standard cross-entropy MLM and classification training on typical corpora. revision: yes

  3. Referee: [Proposition 5, Algorithm 1] Proposition 5 and Algorithm 1: the MDS construction presupposes concrete positional distributions whose Hellinger distance matrix is then embedded; the paper must specify how these distributions are obtained from data (or from the model) and must quantify any distortion introduced by the distance choice, because such distortion directly affects the claimed bound rank(B) ≤ n-1 and the minimal-parametrization result.

    Authors: We agree that the source of the positional distributions and the effect of the Hellinger distance choice need explicit treatment. The revised text will state that the distributions are the empirical position distributions induced by the training corpus (or uniform when no corpus statistics are used). We will add a short analysis showing that the Hellinger metric yields a positive-semidefinite double-centred Gram matrix B, whose rank is at most n-1 because centring removes the constant direction, and will bound the stress distortion arising from empirical estimation. These additions will appear immediately after Algorithm 1. revision: yes
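
The rank bound mentioned in this response follows from standard classical MDS algebra, sketched here under the assumption that B is the double-centred Gram matrix of the squared Hellinger distances. The symbols J, D^{(2)}, Y and \mathbf{1} are notation introduced for this sketch and may differ from the paper's.

```latex
\begin{aligned}
H(p_i, p_j) &= \tfrac{1}{\sqrt{2}}\,\bigl\lVert \sqrt{p_i} - \sqrt{p_j} \bigr\rVert_2 ,
  \qquad D^{(2)}_{ij} = H(p_i, p_j)^2 , \\
B &= -\tfrac{1}{2}\, J D^{(2)} J ,
  \qquad J = I_n - \tfrac{1}{n}\,\mathbf{1}\mathbf{1}^{\top} , \\
B &= (JY)(JY)^{\top} \succeq 0 ,
  \qquad \text{where row } i \text{ of } Y \text{ is } \tfrac{1}{\sqrt{2}}\sqrt{p_i}^{\top} , \\
J\mathbf{1} &= 0 \;\Longrightarrow\; B\mathbf{1} = 0 \;\Longrightarrow\; \operatorname{rank}(B) \le n - 1 .
\end{aligned}
```

The first line expresses the Hellinger distance as a Euclidean distance between square-root vectors, which is why B is a Gram matrix of centred points and hence positive semidefinite. The last line shows that the constant vector always lies in the null space of B, which gives the bound regardless of how the distributions were estimated.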

  4. Referee: [Experiments] Experiments (SST-2/IMDB with BERT-base): the procedure for computing stress for sinusoidal, RoPE, and ALiBi encodings, together with any fitting or data-exclusion details, must be reported in full; without them the claim that ALiBi's lower stress is consistent with a rank-1 MDS interpretation cannot be independently verified.

    Authors: We will expand the experimental section to report the stress computation in full. Stress is the normalized Frobenius norm of the difference between the target Hellinger distance matrix and the Euclidean distances of the embedded points, using the classical MDS objective. Sinusoidal, RoPE, and ALiBi encodings are instantiated in their standard forms (no additional fitting) on the complete SST-2 and IMDB training sets with no data exclusion. We will also report the effective rank of the ALiBi embedding to support the rank-1 interpretation. These details will make the comparison reproducible. revision: yes
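
The stress computation described in this response can be sketched as follows. The normalisation (Frobenius norm of the residual divided by the Frobenius norm of the target) is one reading of "normalized Frobenius norm" and is an assumption, as are the toy distributions, the sizes, and the function names. No rescaling of the encoding is fitted, matching the "no additional fitting" statement.

```python
import numpy as np

def sinusoidal_encoding(n, d, base=10000.0):
    """Standard sinusoidal positional encoding with sin/cos interleaved (d must be even)."""
    pos = np.arange(n)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / base ** (2 * i / d)
    pe = np.zeros((n, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def pairwise_euclidean(X):
    """n x n matrix of Euclidean distances between the rows of X."""
    return np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

def stress(D_target, X):
    """Relative Frobenius mismatch between target distances and embedding distances."""
    return np.linalg.norm(D_target - pairwise_euclidean(X)) / np.linalg.norm(D_target)

# Toy positional distributions (hypothetical stand-in for corpus statistics).
rng = np.random.default_rng(0)
n, vocab, d = 16, 50, 32
P = rng.dirichlet(np.ones(vocab), size=n)

# Rows of Y are sqrt(p_i)/sqrt(2), so their Euclidean distances are Hellinger distances.
Y = np.sqrt(P) / np.sqrt(2.0)
D_hellinger = pairwise_euclidean(Y)

stress_exact = float(stress(D_hellinger, Y))                       # isometric reference
stress_sin = float(stress(D_hellinger, sinusoidal_encoding(n, d))) # unfitted sinusoidal PE
```

Under this definition an exactly isometric embedding scores zero, and any other encoding scores strictly higher. Whether the paper allows a global scale factor before comparing distances is not stated in the text above, so the absolute values from this sketch are not comparable to the reported ones.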

Circularity Check

0 steps flagged

No significant circularity: theorems rest on external properties and classical MDS

The Necessity Theorem follows directly from the permutation-invariance of attention and feed-forward operations, a standard fact independent of the paper. The Positional Separation Theorem is conditioned on explicitly external mild conditions and does not reduce any claimed separation to a definitional identity. The MDS construction invokes the classical algorithm on Hellinger distances between positional distributions, with stress serving as an independent external quality metric; the effective-rank claim r = rank(B) ≤ n-1 is a direct linear-algebra consequence of the distance matrix and does not rename a fitted parameter. The appendix Monotonicity Conjecture is proved inside the standard NTK regime plus a positional-sufficiency condition on the loss, both external frameworks. No load-bearing step collapses by the paper's own equations to a self-citation, an ansatz smuggled via prior work, or a prediction that is statistically forced by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard mathematical tools (MDS, Hellinger distance) plus two domain assumptions stated in the abstract; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption Mild and verifiable conditions for the Positional Separation Theorem
    Invoked explicitly for the theorem to hold at every global minimiser.
  • domain assumption Neural Tangent Kernel regime and positional sufficiency condition for the Monotonicity Conjecture
    Used in Appendix A for the five-lemma proof across MLM, sequence classification, and general losses.

pith-pipeline@v0.9.0 · 5593 in / 1538 out tokens · 63992 ms · 2026-05-10T18:57:58.800595+00:00 · methodology
