arxiv: 2310.04418 · v2 · pith:42Y3KP3Rnew · submitted 2023-10-06 · 💻 cs.LG

Functional Interpolation for Relative Positions Improves Long Context Transformers

classification 💻 cs.LG
keywords longer · models · position · relative · context · contexts · encoding · fire

Preventing the performance decay of Transformers on inputs longer than those used for training has been an important challenge in extending the context length of these models. Though the Transformer architecture has fundamentally no limits on the input sequence lengths it can process, the choice of position encoding used during training can limit the performance of these models on longer inputs. We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts. We theoretically prove that FIRE can represent some of the popular relative position encodings, such as T5's RPE, ALiBi, and KERPLE. We next empirically show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
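
The abstract names "progressive interpolation" without giving a formula. The sketch below is one possible reading, assuming the form recalled from the paper body rather than anything stated on this page: the attention bias is a learned function applied to a log-transformed relative distance, divided by the log-transformed query position. All names here (`psi`, `normalized_distance`, `fire_bias`, `toy_f`, `c`, `threshold`) are illustrative assumptions, and `toy_f` is a fixed stand-in for what would be a small learned network.

```python
import math


def psi(x, c=1.0):
    """Monotone log transform of a distance; c is an assumed scale (learnable in practice)."""
    return math.log(c * x + 1.0)


def normalized_distance(i, j, c=1.0, threshold=1.0):
    """Map query position i and key position j (0 <= j <= i) to a value in [0, 1].

    Dividing by psi(max(i, threshold)) keeps the input bounded no matter how
    long the sequence is; threshold (assumed > 0) avoids dividing by zero at i = 0.
    """
    if j < 0 or j > i:
        raise ValueError("causal attention assumed: need 0 <= j <= i")
    return psi(i - j, c) / psi(max(i, threshold), c)


def toy_f(x):
    """Hypothetical stand-in for the learned function; here a simple linear penalty."""
    return -2.0 * x


def fire_bias(i, j, f=toy_f, c=1.0, threshold=1.0):
    """Additive attention bias for the (query i, key j) pair under the assumptions above."""
    return f(normalized_distance(i, j, c, threshold))


if __name__ == "__main__":
    # The function input stays in [0, 1] even for positions far beyond a
    # typical training length, which is the point of the normalization.
    for i in (8, 512, 100_000):
        print(i, normalized_distance(i, 0), normalized_distance(i, i - 1))
```

Because the learned function only ever sees inputs in a fixed range, longer sequences interpolate within that range instead of producing unseen distance values; this is a design intuition, not a claim about the authors' exact implementation.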

This paper has not been read by Pith yet.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Group Representational Position Encoding

    cs.LG 2025-12 unverdicted novelty 7.0

    GRAPE unifies RoPE and ALiBi as special cases of group actions on positions, providing a principled design space for positional encodings via SO(d) rotations and GL unipotent transformations.

  2. PJ-RoPE: A Fourier-Jet-Affine Position Space for Relative Attention

    cs.LG 2026-06 unverdicted novelty 6.0

    PJ-RoPE organizes relative-position mechanisms as a learnable Fourier-Jet-Affine space derived from lag-shift dynamics, extending RoPE and ALiBi with explicit jets and sector selection.

  3. Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders

    cs.CL 2026-05 unverdicted novelty 6.0

    Explicitly disentangling semantic and positional streams in a Transformer encoder reveals that absolute positional representations collapse to a 2D document-structure manifold, attention heads specialize by role, and ...

  4. Towards Understanding Self-Pretraining for Sequence Classification

    cs.LG 2026-05 unverdicted novelty 6.0

    Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.

  5. Remember to Forget: Gated Adaptive Positional Encoding

    cs.LG 2026-05 unverdicted novelty 6.0

    GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.

  6. Three-Phase Transformer

    cs.CL 2026-04 unverdicted novelty 6.0

    Three-Phase Transformer partitions hidden states into N cyclic channels with phase-respecting RMSNorm and Givens rotations plus an orthogonal Gabriel's horn DC injection, delivering 7.2% lower perplexity and 1.93x fas...

  7. Gated Linear Attention Transformers with Hardware-Efficient Training

    cs.LG 2023-12 unverdicted novelty 6.0

    Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.

  8. A Measure-Theoretic Analysis of Reasoning: Structural Generalization and Approximation Limits

    cs.LG 2026-05 unverdicted novelty 5.0

    Applies optimal transport to bound OOD generalization error in Transformers via Lipschitz continuity and TC^0 circuit depth lower bounds for Dyck-k backtracking, supported by evaluations on 54 configurations.

  9. Phi-4-reasoning Technical Report

    cs.AI 2025-04 unverdicted novelty 4.0

    A 14B reasoning model trained via supervised fine-tuning on selected prompts and o3-mini traces, plus outcome RL, outperforms larger open models like DeepSeek-R1-Distill-Llama-70B on math, coding, planning and related...