LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
hub Mixed citations
Extending Context Window of Large Language Models via Positional Interpolation
Mixed citation behavior. Most common role is background (65%).
abstract
We present Position Interpolation (PI) that extends the context window sizes of RoPE-based pretrained LLMs such as LLaMA models to up to 32768 with minimal fine-tuning (within 1000 steps), while demonstrating strong empirical results on various tasks that require long context, including passkey retrieval, language modeling, and long document summarization from LLaMA 7B to 65B. Meanwhile, the extended model by Position Interpolation preserve quality relatively well on tasks within its original context window. To achieve this goal, Position Interpolation linearly down-scales the input position indices to match the original context window size, rather than extrapolating beyond the trained context length which may lead to catastrophically high attention scores that completely ruin the self-attention mechanism. Our theoretical study shows that the upper bound of interpolation is at least $\sim 600 \times$ smaller than that of extrapolation, further demonstrating its stability. Models extended via Position Interpolation retain its original architecture and can reuse most pre-existing optimization and infrastructure.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Autoregressive transformers exhibit measurable cognitive fatigue during extended generation, quantified by the Fatigue Index that predicts degradation (AUROC 0.95) and repetition (rho 0.94).
Persona Attack uses step-by-step memory injections to achieve up to 95% success in making LLMs ignore safety alignments, with effectiveness depending on model memory and instruction combinations.
Tensor Cache augments sliding-window attention with an eviction-fed outer-product associative memory and a training correction to improve long-context performance under bounded memory.
MemLens benchmark shows long-context LVLMs lose accuracy with length while memory agents lose visual fidelity, with multi-session reasoning below 30% for most systems and neither approach solving the task alone.
ExtraVAR enables resolution extrapolation in visual autoregressive models by stage-aware RoPE remapping and entropy-driven attention scaling, suppressing repetition and detail loss.
Jordan-RoPE realizes a distance-modulated phase basis via non-semisimple Jordan blocks, generating features such as d e^{iωd} for relative positional encoding.
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and showing strong masked language modeling results with or without positional embeddings.
Multiscreen replaces softmax attention with screening to provide absolute query-key relevance, resulting in models with 30% fewer parameters that maintain stable performance at long contexts.
SHARP applies a spectrum-aware dynamic RoPE scaling schedule that promotes resolution more strongly in early denoising stages and relaxes it later, outperforming static baselines on quality metrics for remote sensing images.
GRAPE unifies RoPE and ALiBi as special cases of group actions on positions, providing a principled design space for positional encodings via SO(d) rotations and GL unipotent transformations.
Stacked causal self-attention combined with LayerNorm induces recency bias in Transformer decoders, reversing the earlier-token bias seen in attention alone.
KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.
Infini-attention combines compressive memory with masked local attention and long-term linear attention inside each Transformer block to support infinite context length with bounded resources.
LongRoPE extends LLM context windows to 2048k tokens via search for non-uniform positional interpolation, progressive fine-tuning from 256k, and short-context readjustment.
ACE is a pluggable module that elastically orchestrates historical agent steps as raw, abstract, or dropped to maintain compact yet recoverable context for LLM agents handling long trajectories.
LPES uses per-layer scaling factors optimized by a genetic algorithm with Bézier curves to balance attention and improve long-context LLM performance by up to 11.2% on key-value retrieval.
Language models show superior memory to humans on psych experiments but can be adjusted via prompting and compaction to forget more human-like, yielding better user simulators.
SEGA adaptively scales RoPE attention components using spectral-energy guidance from the latent to improve structural coherence and fine details in high-resolution DiT synthesis.
EndPrompt induces long-context generalization in LLaMA models via a two-segment short-sequence construction with terminal positional anchoring, outperforming full fine-tuning and prior methods on RULER and LongBench while using less compute.
Continued pre-training with balanced long-document VQA data extends a 7B LVLM to 128K context, improving long-document VQA by 7.1% and generalizing to 512K without further training.
Attention to goal tokens declines in multi-turn LLM interactions while residual representations often retain decodable goal information, and the gap between these predicts whether goal-conditioned behavior survives.
KV-Fold turns frozen transformers into stable long-context models by folding the KV cache across sequence chunks in repeated forward passes.
EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.
citing papers explorer
-
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
LongBench is the first bilingual multi-task benchmark for long context understanding in LLMs, containing 21 datasets in 6 categories with average lengths of 6711 words (English) and 13386 characters (Chinese).
-
Cognitive Fatigue in Autoregressive Transformers: Formalization and Measurement
Autoregressive transformers exhibit measurable cognitive fatigue during extended generation, quantified by the Fatigue Index that predicts degradation (AUROC 0.95) and repetition (rho 0.94).
-
LayerNorm Induces Recency Bias in Transformer Decoders
Stacked causal self-attention combined with LayerNorm induces recency bias in Transformer decoders, reversing the earlier-token bias seen in attention alone.
-
Semantic Integrity Matters: Benchmarking and Preserving High-Density Reasoning in KV Cache Compression
KV cache compression causes task-dependent degradation in high-density reasoning due to disrupted CoT links; ShotKV mitigates this by preserving few-shot examples as indivisible semantic units through phase separation, delivering 9-18% accuracy gains and 11% latency reduction.
-
LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
LongRoPE extends LLM context windows to 2048k tokens via search for non-uniform positional interpolation, progressive fine-tuning from 256k, and short-context readjustment.
-
Mitigating Position Bias in Transformers via Layer-Specific Positional Embedding Scaling
LPES uses per-layer scaling factors optimized by a genetic algorithm with Bézier curves to balance attention and improve long-context LLM performance by up to 11.2% on key-value retrieval.
-
Simulating Human Memory with Language Models
Language models show superior memory to humans on psych experiments but can be adjusted via prompting and compaction to forget more human-like, yielding better user simulators.
-
EndPrompt: Efficient Long-Context Extension via Terminal Anchoring
EndPrompt induces long-context generalization in LLaMA models via a two-segment short-sequence construction with terminal positional anchoring, outperforming full fine-tuning and prior methods on RULER and LongBench while using less compute.
-
Self-Consolidating Language Models: Continual Knowledge Incorporation from Context
SCoL trains LLMs via meta-reinforcement learning to generate layer-specific update instructions that improve knowledge acquisition and retention from context streams over standard baselines.
-
Stacked from One: Multi-Scale Self-Injection for Context Window Extension
SharedLLM stacks two copies of a short-context LLM so the lower one compresses context into query-aware multi-grained tokens that are injected only at the lowest layers of the upper one, enabling generalization from 8K training to 128K+ inputs.
-
Positional Encoding via Token-Aware Phase Attention
TAPA adds a learnable phase function to attention to preserve long-range token interactions, enabling direct continual pretraining, length extrapolation, lower perplexity, and stronger retrieval than RoPE-style methods.
-
MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.
-
Llemma: An Open Language Model For Mathematics
Continued pretraining of Code Llama on Proof-Pile-2 yields Llemma, an open math-specialized LLM that beats known open base models on MATH and supports tool use plus formal proving out of the box.
-
YaRN: Efficient Context Window Extension of Large Language Models
YaRN extends the context window of RoPE-based LLMs like LLaMA more efficiently than prior methods, using 10x fewer tokens and 2.5x fewer steps while surpassing state-of-the-art performance and enabling extrapolation beyond fine-tuning lengths.
-
A Recipe for Long-Context Reasoning in Large Language Models via On-Policy Optimization and Distillation
Combines GRPO with teacher-guided on-policy distillation and introduces LongBlocks dataset to yield more stable long-context reasoning than either method alone.
-
MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading
MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.
-
E2LLM: Encoder Elongated Large Language Models for Long-Context Understanding and Reasoning
E2LLM uses encoder-based soft prompt compression for long contexts to improve LLM reasoning on tasks like summarization and QA while maintaining efficiency.
-
Context Recycling for Long-Horizon LLM Inference
ContextForge recycles context in long-horizon LLM tasks via query generation, memory retrieval, and synthesis, yielding reduced token use and improved consistency on a 15-turn healthcare benchmark while preserving accuracy.
-
ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools
GLM-4 models rival or exceed GPT-4 on MMLU, GSM8K, MATH, BBH, GPQA, HumanEval, IFEval, long-context tasks, and Chinese alignment while adding autonomous tool use for web, code, and image generation.
-
A Comprehensive Overview of Large Language Models
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.