
arxiv: 2509.10534 · v3 · pith:NGILMKJHnew · submitted 2025-09-05 · 💻 cs.LG · cs.AI · cs.CL

Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings

classification 💻 cs.LG cs.AI cs.CL
keywords pope, position, rope, what, content, coordinate, embeddings, extrapolation

The attention mechanism in a Transformer architecture matches key to query based on both content -- the what -- and position in a sequence -- the where. We present an analysis indicating that what and where are entangled in the popular RoPE rotary position embedding. This entanglement can impair performance particularly when decisions require independent matches on these two factors. We propose an improvement to RoPE, which we call Polar Coordinate Position Embeddings or PoPE, that eliminates the what-where confound. PoPE is far superior on a diagnostic task requiring indexing solely by position or by content. On autoregressive sequence modeling in music, genomic, and natural language domains, Transformers using PoPE as the positional encoding scheme outperform baselines using RoPE with respect to evaluation loss (perplexity) and downstream task performance. On language modeling, these gains persist across model scale, from 124M to 774M parameters. Crucially, PoPE shows strong zero-shot length extrapolation capabilities compared not only to RoPE but even a method designed for extrapolation, YaRN, which requires additional fine tuning and frequency interpolation.
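
The what-where entanglement described in the abstract can be illustrated with a small numerical sketch. The code below is not the paper's implementation: `rope_score` follows the standard complex-number view of RoPE, while `polar_score` is a hypothetical decoupled variant (content sets only magnitudes, position sets only phases) written for illustration, since the abstract does not spell out the exact PoPE formulation. The helper `best_offset`, the frequencies in `thetas`, and the toy vectors are all assumptions.

```python
import cmath
import math

def rope_score(q, k, m, n, thetas):
    """RoPE-style score. q and k are lists of complex numbers, one per 2D
    feature pair; m and n are the query and key positions."""
    total = 0.0
    for qc, kc, theta in zip(q, k, thetas):
        q_rot = qc * cmath.exp(1j * m * theta)  # rotate query by its position
        k_rot = kc * cmath.exp(1j * n * theta)  # rotate key by its position
        total += (q_rot * k_rot.conjugate()).real
    return total

def polar_score(q_mag, k_mag, m, n, thetas):
    """Hypothetical decoupled score (an assumption, not the paper's formula):
    content contributes non-negative magnitudes, position contributes phases."""
    return sum(a * b * math.cos((m - n) * theta)
               for a, b, theta in zip(q_mag, k_mag, thetas))

def best_offset(score_fn, q, k, thetas, offsets):
    """Return the relative offset d = m - n that maximises the score."""
    return max(offsets, key=lambda d: score_fn(q, k, d, 0, thetas))

thetas = [1.0, 0.1]          # toy rotation frequencies (assumed values)
offsets = range(4)           # candidate query-key distances

# Same query, two keys with identical magnitudes but different content phase.
q = [cmath.rect(1.0, 0.0), cmath.rect(1.0, 0.0)]
k_a = [cmath.rect(1.0, 0.0), cmath.rect(1.0, 0.0)]
k_b = [cmath.rect(1.0, 0.7), cmath.rect(1.0, 0.7)]

rope_peak_a = best_offset(rope_score, q, k_a, thetas, offsets)
rope_peak_b = best_offset(rope_score, q, k_b, thetas, offsets)

# In the decoupled variant only magnitudes carry content.
polar_peak_a = best_offset(polar_score, [1.0, 1.0], [1.0, 1.0], thetas, offsets)
polar_peak_b = best_offset(polar_score, [1.0, 1.0], [0.5, 2.0], thetas, offsets)

print("RoPE preferred offsets:", rope_peak_a, rope_peak_b)
print("Decoupled preferred offsets:", polar_peak_a, polar_peak_b)
```

In this toy setup, changing only the content phase of the key moves the offset that RoPE scores highest, so the position preference depends on content. In the decoupled sketch, content can rescale the score but cannot move its peak away from offset zero.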



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. What-Where Transformer: A Slot-Centric Visual Backbone for Concurrent Representation and Localization

    cs.CV 2026-05 unverdicted novelty 7.0

    The What-Where Transformer achieves explicit what-where separation in a ViT-style backbone via concurrent token and attention-map streams, yielding emergent object discovery from attention maps and better weakly-super...

  2. Short Data, Long Context: Distilling Positional Knowledge in Transformers

    cs.CL 2026-04 unverdicted novelty 6.0

    Long-context retrieval transfers to student models through logit-based distillation on packed short sequences, aided by phase-wise RoPE scaling and observable positional propagation to output logits.