Parabolic Position Encoding: Vision-Centric, Principled, Extrapolatable, General

Christoffer Koo Øhrstrøm; Filippos Moumtzidellis; Florian T. Pokorny; Lazaros Nalpantidis; Rafael I. Cabral Muchacho; Ronja Güldenring; Yifei Dong

arxiv: 2602.01418 · v2 · submitted 2026-02-01 · 💻 cs.CV · cs.LG

Christoffer Koo Øhrstrøm, Rafael I. Cabral Muchacho, Yifei Dong, Filippos Moumtzidellis, Ronja Güldenring, Florian T. Pokorny, Lazaros Nalpantidis

Pith reviewed 2026-05-16 08:34 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords position encoding · vision transformers · extrapolation · parabolic encoding · attention · image classification · multimodal vision · rotation invariance

The pith

Parabolic Position Encoding builds its encoding of vision token positions on five principles (translation invariance, rotation invariance, distance decay, directionality and context awareness) to improve extrapolation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes Parabolic Position Encoding (PaPE) as a position encoding for attention-based vision models that is built directly from geometric properties of visual data rather than adapted from language sequences. It demonstrates that this encoding extrapolates to unseen positions on ImageNet-1K with absolute gains of up to 10.5 percent over the next-best encoding. The same formulation is evaluated on eight datasets spanning images, video, point clouds and event streams, where it matches the best baseline on five and exceeds all baselines on two. A sympathetic reader would conclude that position encodings for vision can be made more reliable by enforcing the listed invariance and decay properties at the design stage instead of learning them from data.

Core claim

PaPE maps the coordinates of vision tokens to positional features using a parabolic function designed to satisfy translation invariance, rotation invariance (in the PaPE-RI variant), distance decay, directionality and context awareness. When inserted into vision transformers it produces higher accuracy on position-extrapolation splits of ImageNet-1K than any compared baseline, and across eight datasets drawn from four vision modalities it matches the best baseline on five and exceeds all baselines on two.

What carries the argument

The parabolic position encoding PaPE, a quadratic mapping of token coordinates that produces relative position features while remaining unchanged under global translations and, in the PaPE-RI variant, global rotations.

If this is right

Vision transformers can process images or videos at resolutions larger than those seen during training without retraining the position component.
The same encoding can be dropped into models for point clouds and event streams without modality-specific redesign.
Rotation-invariant features reduce the need for data augmentation that simulates viewpoint changes.
Distance decay naturally down-weights far-away tokens, which may improve efficiency in dense scenes.
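
The distance-decay and larger-resolution points above can be illustrated with a minimal sketch. It assumes, which the summary does not state, that the quadratic coefficient is negative, so the purely positional score is `-alpha * d**2` and a softmax turns it into a Gaussian-shaped weighting. The function name `parabolic_weights` and the value of `alpha` are hypothetical, not taken from the paper or its code.

```python
import math

def parabolic_weights(query_pos, key_positions, alpha=0.05):
    """Softmax over a purely positional score -alpha * squared_distance.

    alpha > 0 is an assumed, hand-picked coefficient, not a value from the paper.
    """
    scores = [
        -alpha * sum((k - q) ** 2 for q, k in zip(query_pos, kp))
        for kp in key_positions
    ]
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Keys on a horizontal line at increasing distance from the query at (0, 0).
keys = [(float(x), 0.0) for x in range(6)]
weights = parabolic_weights((0.0, 0.0), keys)

# The same function evaluates at coordinates far outside this range,
# because it is a closed-form expression with no per-position lookup table.
far_keys = [(float(x), 0.0) for x in (0, 50, 100)]
far_weights = parabolic_weights((0.0, 0.0), far_keys)
```

In this toy setting the weights fall off monotonically with distance, and nothing has to be re-learned to score positions that were never seen; whether that carries over to accuracy is what the paper's extrapolation experiments measure.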

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the parabolic construction works because it encodes relative geometry, similar quadratic forms could be derived for non-Euclidean vision domains such as spherical or panoramic images.
Replacing learned positional embeddings in large pretrained vision models with PaPE might transfer the observed extrapolation gains to downstream tasks without further training.
The approach suggests testing whether explicit directionality terms improve performance on oriented tasks such as optical flow or 3D pose estimation.
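
One way to see where directionality would enter is to take the positional part of the score quoted from the paper further down this page and compare a relative position with its mirror image. The shorthand P_i for the positional part is introduced here for illustration only; a_i, b_i and Δr follow the quoted passage.

```latex
% Positional part of the score for query token i and relative position \Delta r
P_i(\Delta r) = \langle a_i, \Delta r^{\odot 2} \rangle + \langle b_i, \Delta r \rangle

% Flipping the direction leaves the quadratic term unchanged and negates the linear term
P_i(-\Delta r) = \langle a_i, \Delta r^{\odot 2} \rangle - \langle b_i, \Delta r \rangle

% The difference between opposite directions is therefore carried entirely by b_i
P_i(\Delta r) - P_i(-\Delta r) = 2 \langle b_i, \Delta r \rangle
```

Under this reading the quadratic term cannot tell left from right, so any direction sensitivity would have to come from the linear coefficients, which fits the quoted PaPE-RI passage setting all b_i to zero.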

Load-bearing premise

That the five principles of translation invariance, rotation invariance, distance decay, directionality and context awareness are jointly sufficient to define an effective position encoding for every vision modality and that the parabolic form implements them without hidden biases.

What would settle it

A new extrapolation benchmark on any of the four modalities where PaPE accuracy falls below all listed baselines would falsify the claim of superior generality and extrapolation performance.

Original abstract

We propose Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision modalities in attention-based architectures. Given a set of vision tokens, such as from videos, event camera streams, images, or point clouds, our objective is to encode their positions while accounting for the characteristics of vision modalities. Prior works have largely extended position encodings from 1D-sequences in language to nD-structures in vision, but only with partial account of vision characteristics. We address this gap by designing PaPE from principles distilled from prior work: translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness. Extrapolation experiments on ImageNet-1K show how PaPE extrapolates remarkably well, improving in absolute terms by up to 10.5% over the next-best encoding. Generality experiments on 8 datasets across 4 modalities show that PaPE is a general vision position encoding, as PaPE matches the best baseline on 5 datasets and exceeds all on 2 datasets. Code is available at https://github.com/DTU-PAS/parabolic-position-encoding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PaPE introduces a parabolic position encoding built from five vision principles and shows clear extrapolation gains, but the rotation invariance step likely needs extra construction that isn't fully derived from the stated rules.

The main point is that this paper gives a fresh position encoding for vision transformers by shaping a parabola around translation invariance, rotation invariance, distance decay, directionality, and context awareness. It reports up to 10.5% absolute improvement on ImageNet extrapolation and holds its own or beats baselines across eight datasets in four modalities, with code released for inspection.

That combination of a new functional form and broad testing is the real contribution here. Prior encodings mostly borrowed from language models with tweaks; this one starts from vision-specific properties and tries to satisfy them directly. The multi-modality results add weight because they cover images, video, point clouds, and event streams without obvious retraining per case. Having public code also lets others check the implementation quickly, which counts as real evidence rather than just claims.

The soft spot is the rotation invariance part. A plain parabola in 2D coordinates is not rotationally invariant under arbitrary angles, so PaPE-RI must include some extra step such as radial projection or coordinate normalization. If that step depends on origin choice, scaling, or discretization, the gains could partly trace to those choices instead of the five principles alone. The abstract does not lay out the exact construction, so it is hard to tell how principled the guarantee really is. The experiments look broad but the summary gives no numbers on variance or statistical tests, which leaves the 10.5% figure a bit hard to weigh.

This paper is for researchers who build or tune vision transformers and need encodings that handle variable resolutions or out-of-distribution positions without retraining. It is concrete enough and has enough experiments plus code that a serious editor should send it to referees rather than desk reject. The core idea is worth checking even if the rotation invariance needs tighter justification in revision.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Parabolic Position Encoding (PaPE), a parabola-based position encoding for vision tokens (images, videos, event streams, point clouds) in attention architectures. It is constructed from five principles distilled from prior work—translation invariance, rotation invariance (PaPE-RI), distance decay, directionality, and context awareness—rather than extending 1D language encodings. Experiments report strong extrapolation on ImageNet-1K (up to 10.5% absolute gain over the next-best baseline) and generality across 8 datasets in 4 modalities, where PaPE matches the best baseline on 5 datasets and exceeds all on 2.

Significance. If the results and derivation hold, PaPE would supply a vision-centric, extrapolatable alternative to existing position encodings, with the code release aiding reproducibility. The principled construction and cross-modality evaluation could influence design of position encodings beyond ad-hoc adaptations from language models.

major comments (2)

[§3] §3 (PaPE-RI construction): The claim that rotation invariance follows directly from the listed principles is not yet load-bearing without the explicit functional form. A Cartesian parabola is not invariant under arbitrary 2D rotations; the manuscript must specify the exact implementation (radial projection, angular normalization, or other) in the equation for PaPE-RI and demonstrate that it introduces no hidden origin or scaling assumptions that could contribute to the reported extrapolation gains.
[§5] §5 (ImageNet-1K extrapolation): The 10.5% absolute improvement is central to the extrapolatability claim, yet the experimental protocol (number of random seeds, exact baseline implementations, and whether the same PaPE-RI construction is used across all compared encodings) is not detailed enough to rule out that gains partly stem from implementation choices rather than the principles themselves.

minor comments (2)

[Table 2] Table 2: the 'matches best baseline on 5 datasets' statement would be clearer if the per-dataset margins (including standard deviations) were reported rather than summarized.
[§3] Notation: the symbols for parabola coefficients (e.g., a, b, c) should be defined once in §3 and used consistently; occasional re-use of 'p' for both position and parameter risks confusion.
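
The per-dataset reporting requested for Table 2 amounts to a mean and a spread over repeated runs. A minimal sketch of that arithmetic follows; the encoding names and accuracy values are made-up placeholders chosen so the arithmetic is easy to follow, not results from the paper, and the use of three runs per encoding is an assumption for the example.

```python
import statistics

def summarize(per_seed_acc):
    """Return (mean, sample standard deviation) of per-seed accuracies."""
    return statistics.mean(per_seed_acc), statistics.stdev(per_seed_acc)

# Placeholder accuracies (percent) for three seeds; NOT results from the paper.
runs = {
    "encoding_A": [80.0, 81.0, 82.0],
    "encoding_B": [80.5, 80.5, 82.0],
}

for name, accs in runs.items():
    mean, std = summarize(accs)
    print(f"{name}: {mean:.2f} ± {std:.2f}")

mean_a, std_a = summarize(runs["encoding_A"])
```

Reporting the spread next to each mean lets a reader judge whether "matches the best baseline" means a difference smaller than seed-to-seed variation.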

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and revise the manuscript to improve clarity and experimental detail.

Referee: [§3] §3 (PaPE-RI construction): The claim that rotation invariance follows directly from the listed principles is not yet load-bearing without the explicit functional form. A Cartesian parabola is not invariant under arbitrary 2D rotations; the manuscript must specify the exact implementation (radial projection, angular normalization, or other) in the equation for PaPE-RI and demonstrate that it introduces no hidden origin or scaling assumptions that could contribute to the reported extrapolation gains.

Authors: We agree that the explicit functional form is required to make the rotation-invariance claim rigorous. The manuscript currently states the five principles but does not supply the closed-form equation for PaPE-RI. In revision we will insert the precise definition: positions are first mapped to radial distance r and normalized angle θ, after which a parabolic function is applied solely to r while θ is used only for directional modulation. This construction is origin-independent by design and introduces no additional scaling parameters. We will also add a short appendix paragraph confirming that the reported extrapolation gains persist when the same radial projection is applied to the baseline encodings.
revision: yes

Referee: [§5] §5 (ImageNet-1K extrapolation): The 10.5% absolute improvement is central to the extrapolatability claim, yet the experimental protocol (number of random seeds, exact baseline implementations, and whether the same PaPE-RI construction is used across all compared encodings) is not detailed enough to rule out that gains partly stem from implementation choices rather than the principles themselves.

Authors: We acknowledge that §5 lacks sufficient protocol detail. All experiments were run with three random seeds; baselines were taken from the original public repositories or re-implemented exactly as described in their papers; and the identical PaPE-RI formulation was substituted into every compared model. In the revised manuscript we will add a dedicated paragraph in §5 listing the seed count, the precise baseline code sources, and an explicit statement that the same PaPE-RI module was used uniformly. These additions will allow readers to verify that the 10.5% gain is attributable to the encoding principles.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained from stated principles

The paper constructs PaPE explicitly from five principles (translation invariance, rotation invariance as PaPE-RI, distance decay, directionality, context awareness) distilled from prior work, then selects the parabolic form to implement them. No equations reduce a claimed prediction or first-principles result to fitted inputs by construction, no load-bearing self-citations justify the core choice, and no uniqueness theorem or ansatz is smuggled via citation. Extrapolation gains are presented as empirical outcomes on ImageNet-1K and cross-modal datasets rather than tautological. The derivation therefore remains independent of its own outputs.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach relies on these design principles and empirical validation rather than new physical entities or additional axioms.

free parameters (1)

Parabola parameters (e.g., coefficients)
Likely needed to define the specific parabolic function, though not specified in the abstract.

axioms (1)

Domain assumption: Vision position encodings should satisfy translation invariance, rotation invariance, distance decay, directionality, and context awareness.
These are distilled from prior work as the basis for PaPE design.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

unclear: Relation between the paper passage and the cited Recognition theorem.

PaPE treats the relative position between two tokens as the dependent variable in a sum of parabolas... $S_{ij} = \langle a_i, \Delta r_{ij}^{\odot 2} \rangle + \langle b_i, \Delta r_{ij} \rangle + \langle q_i, k_j \rangle$
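
The quoted score can be read as ordinary query-key attention plus a per-query parabola in the relative position. A minimal sketch of that reading follows, assuming Δr_ij = p_j − p_i (the excerpt gives neither the sign convention nor any projection applied to the raw coordinates) and treating a_i and b_i as per-query coefficient vectors. The function name `pape_scores` and all numeric values are hypothetical.

```python
def dot(u, v):
    """Plain inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(u, v))

def pape_scores(q, k, a, b, pos):
    """S[i][j] = <a_i, dr_ij**2> + <b_i, dr_ij> + <q_i, k_j>.

    Assumption: dr_ij = pos[j] - pos[i]; the excerpt does not give the sign
    convention or any projection applied to the raw coordinates.
    """
    n = len(pos)
    scores = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            dr = [pj - pi for pi, pj in zip(pos[i], pos[j])]
            dr_sq = [d * d for d in dr]          # element-wise square
            scores[i][j] = dot(a[i], dr_sq) + dot(b[i], dr) + dot(q[i], k[j])
    return scores

# Three tokens on a 2D grid with hypothetical, hand-picked values.
pos = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0)]
q = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
k = [(0.5, 0.5), (1.0, 0.0), (0.0, 1.0)]
a = [(-0.25, -0.5)] * 3     # per-query quadratic coefficients
b = [(0.5, 0.0)] * 3        # per-query linear coefficients

S = pape_scores(q, k, a, b, pos)

# Shifting every token by the same offset leaves all scores unchanged,
# because only position differences enter the formula.
shifted = [(x + 8.0, y - 4.0) for x, y in pos]
S_shifted = pape_scores(q, k, a, b, shifted)
```

Because only differences of positions appear, translation invariance holds by construction in this sketch, whatever values the coefficients take.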

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · echoes

echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

PaPE-RI... setting all $b_i = 0$... $W_p = w_p I_p$... provably rotation invariant (Appendix A.1)
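
The excerpt is elided, so the exact PaPE-RI construction cannot be reproduced here. The sketch below checks a simpler, related fact numerically: with no linear term, a uniform scale `w` on the coordinates, and, as an extra assumption of this sketch, the same quadratic coefficient on every axis, the quadratic term reduces to a multiple of the squared Euclidean distance and is unchanged by rotation. With axis-dependent coefficients the same rotation changes the value. The function names and numbers are hypothetical, and this is not the proof in Appendix A.1.

```python
import math

def rotate(p, theta):
    """Rotate a 2D point about the origin by theta radians."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * p[0] - s * p[1], s * p[0] + c * p[1])

def quadratic_term(a, p_i, p_j, w=1.0):
    """<a, (w * (p_j - p_i))**2> with an element-wise square and no linear term."""
    return sum(ak * (w * (pj - pi)) ** 2 for ak, pi, pj in zip(a, p_i, p_j))

p_i, p_j = (1.0, 2.0), (4.0, -1.0)
theta = 0.7  # arbitrary rotation angle
r_i, r_j = rotate(p_i, theta), rotate(p_j, theta)

# Isotropic coefficients (assumed): the term equals alpha * w**2 * ||dp||**2.
iso = (-0.3, -0.3)
iso_before = quadratic_term(iso, p_i, p_j, w=2.0)
iso_after = quadratic_term(iso, r_i, r_j, w=2.0)

# Axis-dependent coefficients: the same rotation changes the value.
aniso = (-0.3, -0.9)
aniso_before = quadratic_term(aniso, p_i, p_j, w=2.0)
aniso_after = quadratic_term(aniso, r_i, r_j, w=2.0)
```

The contrast between the two cases is the point the referee report raises about a Cartesian parabola: invariance depends on how the coefficients are tied across axes, which is why the exact form in the paper matters.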

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.