hub Mixed citations

Extending Context Window of Large Language Models via Positional Interpolation

Mixed citation behavior. Most common role is background (65%).

73 Pith papers citing it
Background 65% of classified citations
abstract

We present Position Interpolation (PI), which extends the context window sizes of RoPE-based pretrained LLMs such as LLaMA models to up to 32768 with minimal fine-tuning (within 1000 steps), while demonstrating strong empirical results on various tasks that require long context, including passkey retrieval, language modeling, and long document summarization, from LLaMA 7B to 65B. Meanwhile, models extended by Position Interpolation preserve quality relatively well on tasks within their original context window. To achieve this goal, Position Interpolation linearly down-scales the input position indices to match the original context window size, rather than extrapolating beyond the trained context length, which may lead to catastrophically high attention scores that completely ruin the self-attention mechanism. Our theoretical study shows that the upper bound of interpolation is at least $\sim 600 \times$ smaller than that of extrapolation, further demonstrating its stability. Models extended via Position Interpolation retain their original architecture and can reuse most pre-existing optimization and infrastructure.
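
The linear down-scaling described in the abstract can be illustrated with a minimal sketch. It assumes the standard RoPE angle formula with base 10000; the function names and the window sizes (2048 trained, 8192 extended) are hypothetical values chosen for illustration and are not taken from this page.

```python
def rope_angles(position, head_dim, base=10000.0):
    """Rotation angles RoPE assigns to one (possibly fractional) position.

    Assumes the common formulation theta_i = base ** (-2i / head_dim).
    """
    return [position / base ** (2 * i / head_dim) for i in range(head_dim // 2)]


def interpolate_positions(seq_len, trained_len):
    """Linearly down-scale indices 0..seq_len-1 into the trained range.

    Sequences no longer than trained_len are left unchanged (scale = 1).
    """
    scale = min(1.0, trained_len / seq_len)
    return [m * scale for m in range(seq_len)]


# Hypothetical window sizes, for illustration only.
trained_len = 2048
extended_len = 8192

positions = interpolate_positions(extended_len, trained_len)

# Every scaled index stays inside the range seen during pretraining,
# so RoPE is never evaluated at an extrapolated position.
largest = max(positions)
angles_at_last_token = rope_angles(positions[-1], head_dim=8)
```

With these values the scale factor is 0.25, so token 4 of the extended sequence is encoded at position 1.0, and the last token lands just below 2048 rather than at 8191.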

hub tools

citation-role summary

background 14 · method 4 · baseline 2

citation-polarity summary

representative citing papers

Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

q-bio.QM · 2026-04-09 · unverdicted · novelty 7.0

Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.

Screening Is Enough

cs.LG · 2026-04-01 · unverdicted · novelty 7.0

Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.

Group Representational Position Encoding

cs.LG · 2025-12-08 · unverdicted · novelty 7.0

GRAPE unifies RoPE and ALiBi as special cases of group actions on positions, providing a principled design space for positional encodings via SO(d) rotations and GL unipotent transformations.

ACE: Pluggable Adaptive Context Elasticizer across Agents

cs.AI · 2026-06-30 · unverdicted · novelty 6.0

ACE is a pluggable module that elastically orchestrates historical agent steps as raw, abstract, or dropped to maintain compact yet recoverable context for LLM agents handling long trajectories.

Simulating Human Memory with Language Models

cs.CL · 2026-05-25 · unverdicted · novelty 6.0

Language models show better memory than humans on psychology experiments, but prompting and compaction can make them forget in a more human-like way, yielding better user simulators.

citing papers explorer

Showing 11 of 11 citing papers after filters.

  • Tensor Cache: Eviction-conditioned Associative Memory for Transformers cs.LG · 2026-05-21 · unverdicted · none · ref 49 · internal anchor

    Tensor Cache augments sliding-window attention with an eviction-fed outer-product associative memory and a training correction to improve long-context performance under bounded memory.

  • Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks cs.LG · 2026-05-05 · unverdicted · none · ref 1 · 2 links · internal anchor

    Jordan-RoPE realizes a distance-modulated phase basis via non-semisimple Jordan blocks, generating features such as d e^{iωd} for relative positional encoding.

  • Screening Is Enough cs.LG · 2026-04-01 · unverdicted · none · ref 19 · internal anchor

    Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.

  • Group Representational Position Encoding cs.LG · 2025-12-08 · unverdicted · none · ref 4 · internal anchor

    GRAPE unifies RoPE and ALiBi as special cases of group actions on positions, providing a principled design space for positional encodings via SO(d) rotations and GL unipotent transformations.

  • RoVE: Rotary Value Embeddings Attention for Relative Position-dependent Value Pathways cs.LG · 2026-06-09 · unverdicted · none · ref 4 · internal anchor

    RoVE rotates value embeddings simultaneously with keys in attention to make values position-dependent, reframing RoPE as attentive convolution and reporting gains on long-context tasks in 124M and 354M GPT-2 models.

  • PJ-RoPE: A Fourier-Jet-Affine Position Space for Relative Attention cs.LG · 2026-06-03 · unverdicted · none · ref 6 · internal anchor

    PJ-RoPE organizes relative-position mechanisms as a learnable Fourier-Jet-Affine space derived from lag-shift dynamics, extending RoPE and ALiBi with explicit jets and sector selection.

  • Remember to Forget: Gated Adaptive Positional Encoding cs.LG · 2026-05-11 · unverdicted · none · ref 2 · internal anchor

    GAPE augments RoPE with query- and key-dependent gates to stabilize attention and improve long-context performance in language models.

  • TabICL: A Tabular Foundation Model for In-Context Learning on Large Data cs.LG · 2025-02-08 · unverdicted · none · ref 73 · internal anchor

    TabICL scales in-context learning to large tabular data via column-then-row attention for row embeddings followed by a transformer, matching TabPFNv2 speed and performance while outperforming it and CatBoost on datasets over 10K samples.

  • VIP-COP: Context Optimization for Tabular Foundation Models cs.LG · 2026-05-13 · unverdicted · none · ref 3 · internal anchor

    VIP-COP is a black-box method that optimizes context for tabular foundation models by ranking and selecting high-value samples and features via online KernelSHAP regression, outperforming baselines on large high-dimensional data.

  • Soft-NBCE: Entropy-Weighted Chunk Fusion for Long-Context cs.LG · 2026-05-31 · unverdicted · none · ref 6 · internal anchor

    Soft-NBCE uses temperature-scaled softmax over chunk entropies for soft fusion plus KL-distillation to a full-context teacher, yielding higher F1 on LongBench multi-hop tasks than hard NBCE at O(L^2/n) memory.

  • PRISM: Position-encoded Regressive Inverse Spectral Model for Multilayer Thin-Film Design cs.LG · 2026-05-26 · unverdicted · none · ref 1 · internal anchor

    PRISM is a position-encoded autoregressive transformer that solves the inverse design of multilayer thin films via spectrum prefix conditioning and cumulative-depth RoPE, reporting over 50% MAE reduction versus baselines with fewer parameters.