Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention

Manaal Faruqui; Siddharth Gopal; Tsendsuren Munkhdalai

arXiv: 2404.07143 · v2 · pith:W4NCQJXFnew · submitted 2024-04-10 · 💻 cs.CL · cs.AI · cs.LG · cs.NE

Pith reviewed 2026-05-21 18:10 UTC · model grok-4.3

classification: 💻 cs.CL · cs.AI · cs.LG · cs.NE

keywords: Infini-attention · infinite context · compressive memory · long-context language modeling · Transformer attention · LLM efficiency · streaming inference

The pith

Infini-attention lets Transformer LLMs process arbitrarily long inputs with fixed memory and computation costs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a technique that lets Transformer-based language models handle inputs of unlimited length without the memory footprint or per-segment compute growing with sequence length. The approach adds a compressive memory inside the attention mechanism so that short-range masked attention and long-range linear attention operate within the same block. If the method works as described, models could retain full context across entire books or multi-hour conversations while keeping the cost of each inference step constant. Readers would care because standard attention scales quadratically with sequence length, which forces truncation that discards earlier information and limits real-world use on long documents or extended dialogues.

Core claim

The authors present Infini-attention, which folds a compressive memory into standard attention and simultaneously performs masked local attention together with long-term linear attention inside one Transformer block, achieving bounded memory and computation for infinite-length inputs.

What carries the argument

Infini-attention, which augments vanilla attention with a compressive memory that stores and retrieves information across arbitrarily long histories while combining local and linear attention paths.
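To make the mechanism concrete, here is a minimal single-head NumPy sketch of how one segment might be processed. It is an illustration under stated assumptions, not the authors' implementation: the ELU+1 feature map, the scalar gate logit `beta`, the small `eps` in the denominator, and all function names are assumptions introduced here.

```python
import numpy as np

def elu_plus_one(x):
    # Assumed feature map: ELU(x) + 1, which keeps every feature strictly positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def infini_attention_segment(q, k, v, memory, z, beta, eps=1e-8):
    """One segment, one head. q, k: (n, d_k); v: (n, d_v);
    memory: (d_k, d_v); z: (d_k,); beta: scalar gate logit."""
    n, d_k = q.shape
    # 1) Masked (causal) local attention within the segment.
    scores = q @ k.T / np.sqrt(d_k)
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    a_local = softmax(np.where(future, -np.inf, scores)) @ v
    # 2) Linear-attention read from the memory written by earlier segments.
    sq = elu_plus_one(q)
    a_mem = (sq @ memory) / ((sq @ z)[:, None] + eps)
    # 3) Blend the two paths with a sigmoid gate.
    g = 1.0 / (1.0 + np.exp(-beta))
    out = g * a_mem + (1.0 - g) * a_local
    # 4) Write this segment into the fixed-size memory.
    sk = elu_plus_one(k)
    return out, memory + sk.T @ v, z + sk.sum(axis=0)

def stream(q, k, v, segment_len, beta=0.0):
    # Process a long sequence segment by segment; the state never grows.
    memory = np.zeros((k.shape[1], v.shape[1]))
    z = np.zeros(k.shape[1])
    outs = []
    for start in range(0, len(q), segment_len):
        sl = slice(start, start + segment_len)
        out, memory, z = infini_attention_segment(q[sl], k[sl], v[sl], memory, z, beta)
        outs.append(out)
    return np.concatenate(outs), memory, z
```

The point of the sketch is that `memory` and `z` keep the same shape no matter how many segments pass through `stream`, which is what bounds the state carried across an arbitrarily long input.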

If this is right

1B and 8B parameter models can perform passkey retrieval over 1-million-token contexts.
The same models can summarize entire books up to 500,000 tokens long.
Inference stays fast and supports streaming even as input length grows without bound.
Only a small fixed number of extra parameters are added for the compressive memory.
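A rough way to see what "bounded" means in the list above is to count stored values per attention head. The sketch below assumes a standard KV cache stores one key and one value vector per token, and that the compressive state is a `d_key x d_value` matrix plus a `d_key` normalizer alongside one segment of local keys and values. The dimensions and segment length used in the loop are hypothetical, chosen only for illustration.

```python
def kv_cache_entries(n_tokens, d_key, d_value):
    # Standard attention: one key and one value vector cached per token.
    return n_tokens * (d_key + d_value)

def compressive_memory_entries(d_key, d_value):
    # Assumed compressive state: memory matrix plus normalization vector.
    return d_key * d_value + d_key

def infini_entries(segment_len, d_key, d_value):
    # Local KV for the current segment plus the fixed-size memory.
    return (kv_cache_entries(segment_len, d_key, d_value)
            + compressive_memory_entries(d_key, d_value))

if __name__ == "__main__":
    d_key = d_value = 128      # hypothetical head dimensions
    segment_len = 2048         # hypothetical local segment length
    for n_tokens in (32_000, 500_000, 1_000_000):
        full = kv_cache_entries(n_tokens, d_key, d_value)
        bounded = infini_entries(segment_len, d_key, d_value)
        print(f"{n_tokens:>9} tokens: full KV = {full:>11}, bounded = {bounded}")
```

Under these assumptions the full cache grows linearly with input length while the bounded figure is the same at every length.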

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The technique could be combined with existing context-compression methods to further reduce the memory footprint in practice.
It opens the possibility of training or fine-tuning on full-length documents rather than truncated windows.
Similar compressive-memory ideas might transfer to non-language sequence tasks such as long video or audio modeling.

Load-bearing premise

The compressive memory can preserve every piece of task-relevant information from arbitrarily long sequences without irreversible loss.

What would settle it

Run a retrieval task in which a unique key is placed in the first few tokens of a one-million-token sequence; if the model cannot recover that key after processing the full input, the bounded-memory infinite-context claim fails.
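The test described above can be sketched as a small prompt builder and scorer. Everything here is a hypothetical harness: the filler sentence, the wording of the key line and question, and the function names are assumptions, and the model call itself is left out.

```python
# Hypothetical filler text; any distractor content without digits would do.
FILLER = "The grass is green. The sky is blue. The sun is yellow."

def build_passkey_prompt(n_filler, passkey, key_index=0):
    """Place one line containing the passkey among n_filler distractor lines.
    key_index=0 puts the key at the very start of the sequence."""
    lines = [FILLER] * n_filler
    lines.insert(key_index, f"The pass key is {passkey}. Remember it.")
    lines.append("What is the pass key?")
    return "\n".join(lines)

def is_correct(model_answer, passkey):
    # Score a response as correct if it reproduces the key verbatim.
    return str(passkey) in model_answer
```

To approach the one-million-token setting, `n_filler` would be raised until the tokenized prompt reaches the target length; the exact count depends on the tokenizer, which is not specified here.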

Original abstract

This work introduces an efficient method to scale Transformer-based Large Language Models (LLMs) to infinitely long inputs with bounded memory and computation. A key component in our proposed approach is a new attention technique dubbed Infini-attention. The Infini-attention incorporates a compressive memory into the vanilla attention mechanism and builds in both masked local attention and long-term linear attention mechanisms in a single Transformer block. We demonstrate the effectiveness of our approach on long-context language modeling benchmarks, 1M sequence length passkey context block retrieval and 500K length book summarization tasks with 1B and 8B LLMs. Our approach introduces minimal bounded memory parameters and enables fast streaming inference for LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Infini-attention folds a compressive memory into the attention block to keep context length unbounded while holding memory fixed, and the reported results on 1M retrieval and 500K summarization are the main evidence to weigh.

The main thing to know is that this paper shows a practical way to run transformers on very long inputs without memory growing with length. They add a compressive memory to standard attention and run both masked local attention and long-term linear attention inside the same block. That combination is what lets them claim bounded memory and computation for infinite context, plus streaming inference with only a few extra parameters.

Referee Report

2 major / 2 minor

Summary. The paper introduces Infini-attention, which augments standard attention with a compressive memory module that combines masked local attention and long-term linear attention within each Transformer block. This enables scaling LLMs to arbitrarily long inputs using only bounded memory and compute. Experiments report strong results on long-context language modeling, 1M-token passkey retrieval, and 500K-token book summarization with 1B and 8B models, while adding only a small number of bounded memory parameters and supporting streaming inference.

Significance. If the compressive memory retains task-relevant information without irreversible loss, the work would be significant for efficient long-context modeling, as it offers a practical path to infinite-context LLMs with constant memory overhead and demonstrates results on large models and challenging retrieval/summarization tasks.

major comments (2)

§3.2 (Infini-attention and compressive memory): the central claim of bounded-memory infinite context requires that the fixed-size memory matrix never discards task-critical details from early history. No information-theoretic bound, ablation of the memory-update rule, or scaling curve beyond the tested lengths is provided to support this; the reported 1M retrieval and 500K summarization results therefore do not yet verify the 'retain all task-relevant information' precondition.
§4.3 (1M passkey retrieval experiments): performance is stated as positive, yet the section supplies neither error bars across runs, nor controls that isolate the contribution of the compressive memory versus the local attention component, leaving the robustness of the infinite-context claim under-specified.

minor comments (2)

[§3] Notation for the linear attention and memory update equations could be expanded with an explicit step-by-step derivation to improve reproducibility.
[Abstract] The abstract asserts 'infinitely long inputs' while all reported lengths are finite (1M and 500K); a brief statement on the extrapolation argument would clarify the scope.
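
For orientation on the notation point, the memory read, memory write, and gating steps of a linear-attention compressive memory are usually written as below. This is a sketch in the standard linear-attention form, not a transcription of the paper's §3: the symbols (segment index s, feature map σ, memory M, normalizer z, gate scalar β, segment length N) are assumptions and should be checked against the original.

```latex
% Assumed symbols: s = segment index, \sigma = positive feature map,
% M = memory matrix, z = normalization vector, \beta = learned gate scalar.
\begin{align}
A_{\text{mem}} &= \frac{\sigma(Q)\, M_{s-1}}{\sigma(Q)\, z_{s-1}}
  && \text{(read from memory of earlier segments)} \\
M_s &= M_{s-1} + \sigma(K)^{\top} V
  && \text{(write current segment)} \\
z_s &= z_{s-1} + \sum_{t=1}^{N} \sigma(K_t)
  && \text{(update normalizer)} \\
A &= \operatorname{sigmoid}(\beta) \odot A_{\text{mem}}
   + \bigl(1 - \operatorname{sigmoid}(\beta)\bigr) \odot A_{\text{dot}}
  && \text{(gate memory against local attention)}
\end{align}
```

Here $A_{\text{dot}}$ denotes the output of the masked local dot-product attention for the current segment.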

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions made to strengthen the manuscript's claims regarding bounded-memory infinite context.

Referee: §3.2 (Infini-attention and compressive memory): the central claim of bounded-memory infinite context requires that the fixed-size memory matrix never discards task-critical details from early history. No information-theoretic bound, ablation of the memory-update rule, or scaling curve beyond the tested lengths is provided to support this; the reported 1M retrieval and 500K summarization results therefore do not yet verify the 'retain all task-relevant information' precondition.

Authors: We appreciate the referee's emphasis on rigorously supporting the retention property. A formal information-theoretic bound on retention would indeed strengthen the theoretical foundation but requires substantial new analysis that is beyond the scope of this work. Our design uses a compressive memory updated via linear attention to accumulate information across the full history in bounded space. The 1M-token passkey retrieval task directly tests retention of early-sequence details, and the reported high accuracy provides empirical evidence that critical information is preserved. We have added an ablation study of the memory-update rule and scaling curves for context lengths from 4K to 1M tokens in the revised manuscript.
Revision: partial

Referee: §4.3 (1M passkey retrieval experiments): performance is stated as positive, yet the section supplies neither error bars across runs, nor controls that isolate the contribution of the compressive memory versus the local attention component, leaving the robustness of the infinite-context claim under-specified.

Authors: We agree that error bars and component-isolating controls would improve the robustness assessment. In the revised manuscript we now report mean performance with standard deviations across five independent runs for the 1M passkey task. We have also added ablation experiments that compare the full Infini-attention model against variants without the compressive memory and against local-attention-only baselines, thereby quantifying the contribution of each mechanism to long-context retrieval accuracy.
Revision: yes
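
As an illustration of what an ablation of the memory-update rule could compare, the sketch below contrasts a plain additive write with a delta-rule write that first subtracts what the memory already returns for the incoming keys. Whether the simulated rebuttal has these two rules in mind is not stated on this page; the feature map, `eps`, and function names are assumptions.

```python
import numpy as np

def feature_map(x):
    # Assumed positive feature map: ELU(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def linear_update(memory, z, k, v):
    # Additive write: every (key, value) pair is accumulated as-is.
    sk = feature_map(k)
    return memory + sk.T @ v, z + sk.sum(axis=0)

def delta_update(memory, z, k, v, eps=1e-8):
    # Delta-rule write: store only the part of v the memory does not
    # already retrieve for these keys.
    sk = feature_map(k)
    retrieved = (sk @ memory) / ((sk @ z)[:, None] + eps)
    return memory + sk.T @ (v - retrieved), z + sk.sum(axis=0)
```

With a single repeated key-value pair, the additive rule doubles the stored association on the second write, while the delta rule leaves the memory matrix essentially unchanged.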

standing simulated objections not resolved

A formal information-theoretic bound proving that the fixed-size compressive memory retains all task-critical details without irreversible loss for arbitrarily long inputs.

Circularity Check

0 steps flagged

No significant circularity in Infini-attention proposal

The paper proposes a new architectural mechanism called Infini-attention that combines compressive memory with masked local attention and long-term linear attention inside a single Transformer block. This is presented as an engineering solution for bounded-memory infinite context, with effectiveness demonstrated via empirical results on 1M-token passkey retrieval and 500K book summarization using 1B/8B models. No derivation chain reduces a claimed result to its own inputs by construction, no parameters are fitted to target metrics and then relabeled as predictions, and no load-bearing uniqueness theorems or ansatzes are imported via self-citation. The approach is self-contained against external benchmarks and introduces new components rather than re-deriving performance from prior fitted quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

The central claim depends on the new compressive memory preserving sufficient information across unbounded lengths; no explicit free parameters or invented entities beyond the memory module itself are stated in the abstract.

invented entities (1)

compressive memory · no independent evidence
purpose: store long-term context in bounded size
Introduced as the key addition to vanilla attention to achieve infinite context.

pith-pipeline@v0.9.0 · 5658 in / 1054 out tokens · 43515 ms · 2026-05-21T18:10:25.016546+00:00 · methodology

Forward citations

Cited by 23 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

LSTM-MAS: A Long Short-Term Memory Inspired Multi-Agent System for Long-Context Understanding
cs.CL · 2026-01 · unverdicted · novelty 7.0
LSTM-MAS uses a chained multi-agent architecture modeled on LSTM input, forget, and output gates to improve long-context QA performance and reduce hallucinations compared with prior multi-agent baselines.

MIRIX: Multi-Agent Memory System for LLM-Based Agents
cs.CL · 2025-07 · unverdicted · novelty 7.0
MIRIX introduces a modular multi-agent architecture with Core, Episodic, Semantic, Procedural, Resource, and Knowledge Vault memories that outperforms RAG baselines by 35% on ScreenshotVQA and reaches 85.4% on LOCOMO.

KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
cs.LG · 2026-05 · conditional · novelty 6.0
KV-Fold turns frozen transformers into stable long-context models by folding the KV cache across sequence chunks in repeated forward passes.

A Single-Layer Model Can Do Language Modeling
cs.CL · 2026-05 · unverdicted · novelty 6.0
A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).

Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing
cs.CL · 2026-05 · conditional · novelty 6.0
EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.

Priming: Hybrid State Space Models From Pre-trained Transformers
cs.LG · 2026-05 · unverdicted · novelty 6.0
Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasonin...

The Impossibility Triangle of Long-Context Modeling
cs.CL · 2026-05 · unverdicted · novelty 6.0
No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

Universal Transformers Need Memory: Depth-State Trade-offs in Adaptive Recursive Reasoning
cs.LG · 2026-04 · conditional · novelty 6.0
Memory tokens are required for non-trivial performance in adaptive Universal Transformers on Sudoku-Extreme, with 8-32 tokens yielding stable 57% exact-match accuracy while trading off against ponder depth.

MICA: Multivariate Infini Compressive Attention for Time Series Forecasting
cs.LG · 2026-04 · unverdicted · novelty 6.0
MICA adds linearly scaling compressive cross-channel attention to Transformers, cutting average forecast error by 5.4% and ranking first among multivariate baselines.

MICA: Multivariate Infini Compressive Attention for Time Series Forecasting
cs.LG · 2026-04 · unverdicted · novelty 6.0
MICA adapts infini compressive attention to the channel dimension, enabling scalable cross-channel dependencies in Transformers and cutting forecast error by 5.4% on average versus channel-independent baselines.

HyMem: Hybrid Memory Architecture with Dynamic Retrieval Scheduling
cs.AI · 2026-02 · unverdicted · novelty 6.0
HyMem introduces dual-granular memory storage with a lightweight summary module for fast responses and selective activation of a deep LLM module for complex queries, outperforming full-context baselines by 92.6% lower...

Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression
cs.LG · 2025-11 · unverdicted · novelty 6.0
Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.

SAM 3D: 3Dfy Anything in Images
cs.CV · 2025-11 · unverdicted · novelty 6.0
SAM 3D reconstructs 3D objects from single images with geometry, texture, and pose using human-model annotated data at scale and synthetic-to-real training, achieving 5:1 human preference wins.

Kimi Linear: An Expressive, Efficient Attention Architecture
cs.CL · 2025-10 · unverdicted · novelty 6.0
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

Lizard: An Efficient Linearization Framework for Large Language Models
cs.CL · 2025-07 · unverdicted · novelty 6.0
Lizard linearizes Transformer LLMs via subquadratic attention and adaptive learnable modules, recovering near-original performance while outperforming prior linearization methods on MMLU and associative recall.

PrefixMemory-Tuning: Modernizing Prefix-Tuning by Decoupling the Prefix from Attention
cs.CL · 2025-06 · unverdicted · novelty 6.0
PrefixMemory-Tuning decouples the prefix from attention to overcome performance limits of traditional prefix-tuning and reaches competitive results with modern PEFT methods on LLM adaptation benchmarks.

Titans: Learning to Memorize at Test Time
cs.LG · 2024-12 · unverdicted · novelty 6.0
Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.

A Systematic Failure Analysis of Vision Foundation Models for Open Set Iris Presentation Attack Detection
cs.CV · 2026-05 · accept · novelty 5.0
Vision foundation models transfer across similar iris datasets but fail to generalize to unseen presentation attacks and cross-spectral shifts in open-set PAD.

On Problems of Implicit Context Compression for Software Engineering Agents
cs.SE · 2026-05 · unverdicted · novelty 5.0
In-Context Autoencoder succeeds on single-shot common-knowledge and code tasks but fails on multi-step agentic coding tasks.

MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading
cs.CL · 2026-05 · unverdicted · novelty 5.0
MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.

Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs
cs.CV · 2026-05 · unverdicted · novelty 5.0
Fre-Res compresses video tokens by preserving spatial anchors and representing temporal dynamics with low-frequency residual tokens derived from 1D-DCT on inter-frame residuals, plus a Spatial-Guided Absorber to reinj...

Efficient Reasoning with Hidden Thinking
cs.CL · 2025-01 · unverdicted · novelty 5.0
Heima compresses verbose CoT into hidden thinking tokens via information-theoretic analysis and an adaptive interpreter, claiming maintained or improved zero-shot accuracy on reasoning benchmarks.

Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design
cs.AI · 2026-05 · unverdicted · novelty 4.0
The paper defines Computational Token Economics and introduces the Token Economics Trilemma as a framework for trade-offs in granularity, real-time performance, and optimality, while outlining a research agenda for th...

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages · cited by 22 Pith papers · 17 internal anchors

[1] PaLM 2 Technical Report
Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403.

[2] Neural Machine Translation by Jointly Learning to Align and Translate
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

[3] Longformer: The Long-Document Transformer
Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.

[4] Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.

[5] Extending Context Window of Large Language Models via Positional Interpolation
Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023a. Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. Longlora: Efficient fine-tuning of long-context large language models. arXiv prepri...

[6] Adapting language models to compress contexts
Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. arXiv preprint arXiv:2305.14788.

[7] Generating Long Sequences with Sparse Transformers
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.

[8] Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)
Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289.

[9] Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860.

[10] Longnet: Scaling transformers to 1,000,000,000 tokens
Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, and Furu Wei. Longnet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486.

[11] Data engineering for scaling language models to 128k context
Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. Data engineering for scaling language models to 128k context. arXiv preprint arXiv:2402.10171.

[12] In-context autoencoder for context compression in a large language model
Tao Ge, Jing Hu, Xun Wang, Si-Qing Chen, and Furu Wei. In-context autoencoder for context compression in a large language model. arXiv preprint arXiv:2307.06945.

[13] Neural Turing Machines
Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401.

[14] OLMo: Accelerating the Science of Language Models
Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerating the science of language models. arXiv preprint arXiv:2402.00838.

[15] Transformerfam: Feedback attention is working memory
Dongseong Hwang, Weiran Wang, Zhuoyuan Huo, Khe Chai Sim, and Pedro Moreno Mengibar. Transformerfam: Feedback attention is working memory. arXiv preprint arXiv:2404.09173.

[16] Neural GPUs Learn Algorithms
Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms. arXiv preprint arXiv:1511.08228.

[17] Booksum: A collection of datasets for long-form narrative summarization, 2022
Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, and Dragomir Radev. Booksum: A collection of datasets for long-form narrative summarization. arXiv preprint arXiv:2105.08209.

[18] BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

[19] Ring Attention with Blockwise Transformers for Near-Infinite Context
Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889.

[20] YaRN: Efficient Context Window Extension of Large Language Models
Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071.

[21] Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
Ofir Press, Noah A Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409.

[22] Compressive Transformers for Long-Range Sequence Modelling
Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507.

[23] Parallel context windows improve in-context learning of large language models
Nir Ratner, Yoav Levine, Yonatan Belinkov, Ori Ram, Omri Abend, Ehud Karpas, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. Parallel context windows improve in-context learning of large language models. arXiv preprint arXiv:2212.10947.

[24] Enhancing the transformer with explicit relational encoding for math problem solving
Imanol Schlag, Paul Smolensky, Roland Fernandez, Nebojsa Jojic, Jürgen Schmidhuber, and Jianfeng Gao. Enhancing the transformer with explicit relational encoding for math problem solving. arXiv preprint arXiv:1910.06611.

[25] Learning associative inference using fast weight memory
Imanol Schlag, Tsendsuren Munkhdalai, and Jürgen Schmidhuber. Learning associative inference using fast weight memory. arXiv preprint arXiv:2011.07831.

[26] Efficient attention: Attention with linear complexities
Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Efficient attention: Attention with linear complexities. arXiv preprint arXiv:1812.01243.

[27] Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.

[28] Memorizing transformers
Yuhuai Wu, Markus N Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. arXiv preprint arXiv:2203.08913.

[29] Infllm: Unveiling the intrinsic capacity of llms for understanding extremely long sequences with training-free memory
Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Song Han, and Maosong Sun. Infllm: Unveiling the intrinsic capacity of llms for understanding extremely long sequences with training-free memory. arXiv preprint arXiv:2402.04617.

[30] Efficient Streaming Language Models with Attention Sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.

[31] Primera: Pyramid-based masked sentence pre-training for multi-document summarization
Wen Xiao, Iz Beltagy, Giuseppe Carenini, and Arman Cohan. Primera: Pyramid-based masked sentence pre-training for multi-document summarization. arXiv preprint arXiv:2110.08499.

[32] Effective long-context scaling of foundation models
Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039.

[1] [1]

PaLM 2 Technical Report

Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Neural Machine Translation by Jointly Learning to Align and Translate

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document trans- former. arXiv preprint arXiv:2004.05150,

work page internal anchor Pith review Pith/arXiv arXiv 2004

[4] [4]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

work page 1901

[5] [5]

Extending Context Window of Large Language Models via Positional Interpolation

Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending con- text window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595, 2023a. Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. Longlora: Efficient fine-tuning of long-context large language models. arXiv prepri...

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Adapting language models to compress contexts

Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. arXiv preprint arXiv:2305.14788,

work page arXiv

[7] [7]

Generating Long Sequences with Sparse Transformers

10 Preprint. Under review. Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509,

work page internal anchor Pith review Pith/arXiv arXiv 1904

[8] [8]

Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)

Djork-Arn´e Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (elus). arXiv preprint arXiv:1511.07289,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime Carbonell, Quoc V Le, and Ruslan Salakhut- dinov. Transformer-xl: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860,

work page internal anchor Pith review Pith/arXiv arXiv 1901

[10] [10]

Longnet: Scaling transformers to 1,000,000,000 tokens

Jiayu Ding, Shuming Ma, Li Dong, Xingxing Zhang, Shaohan Huang, Wenhui Wang, Nanning Zheng, and Furu Wei. Longnet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486,

work page arXiv

[11] [11]

Data engineering for scaling language models to 128k context

Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. Data engineering for scaling language models to 128k context. arXiv preprint arXiv:2402.10171,

work page arXiv

[12] [12]

In-context autoencoder for context compression in a large language model

Tao Ge, Jing Hu, Xun Wang, Si-Qing Chen, and Furu Wei. In-context autoencoder for context compression in a large language model. arXiv preprint arXiv:2307.06945,

work page arXiv

[13] [13]

Neural Turing Machines

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401,

[14]

OLMo: Accelerating the Science of Language Models

Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. Olmo: Accelerating the science of language models. arXiv preprint arXiv:2402.00838,

[15]

Transformerfam: Feedback attention is working memory

Dongseong Hwang, Weiran Wang, Zhuoyuan Huo, Khe Chai Sim, and Pedro Moreno Mengibar. Transformerfam: Feedback attention is working memory. arXiv preprint arXiv:2404.09173,

[16]

Neural GPUs Learn Algorithms

Łukasz Kaiser and Ilya Sutskever. Neural gpus learn algorithms. arXiv preprint arXiv:1511.08228,

[17]

Booksum: A collection of datasets for long-form narrative summarization, 2022

Wojciech Kryściński, Nazneen Rajani, Divyansh Agarwal, Caiming Xiong, and Dragomir Radev. Booksum: A collection of datasets for long-form narrative summarization. arXiv preprint arXiv:2105.08209,

[18]

BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Ves Stoyanov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461,

[19]

Ring Attention with Blockwise Transformers for Near-Infinite Context

Hao Liu, Matei Zaharia, and Pieter Abbeel. Ring attention with blockwise transformers for near-infinite context. arXiv preprint arXiv:2310.01889,

[20]

YaRN: Efficient Context Window Extension of Large Language Models

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071,

[21]

Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation

Ofir Press, Noah A Smith, and Mike Lewis. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409,

[22]

Compressive Transformers for Long-Range Sequence Modelling

Jack W Rae, Anna Potapenko, Siddhant M Jayakumar, and Timothy P Lillicrap. Compressive transformers for long-range sequence modelling. arXiv preprint arXiv:1911.05507,

[23]

Parallel context windows improve in-context learning of large language models

Nir Ratner, Yoav Levine, Yonatan Belinkov, Ori Ram, Omri Abend, Ehud Karpas, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. Parallel context windows improve in-context learning of large language models. arXiv preprint arXiv:2212.10947,

[24]

Enhancing the transformer with explicit relational encoding for math problem solving

Imanol Schlag, Paul Smolensky, Roland Fernandez, Nebojsa Jojic, Jürgen Schmidhuber, and Jianfeng Gao. Enhancing the transformer with explicit relational encoding for math problem solving. arXiv preprint arXiv:1910.06611,

[25]

Learning associative inference using fast weight memory

Imanol Schlag, Tsendsuren Munkhdalai, and Jürgen Schmidhuber. Learning associative inference using fast weight memory. arXiv preprint arXiv:2011.07831,

[26]

Efficient attention: Attention with linear complexities

Zhuoran Shen, Mingyuan Zhang, Haiyu Zhao, Shuai Yi, and Hongsheng Li. Efficient attention: Attention with linear complexities. arXiv preprint arXiv:1812.01243,

[27]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

[28]

Memorizing transformers

Yuhuai Wu, Markus N Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. arXiv preprint arXiv:2203.08913,

[29]

Infllm: Unveiling the intrinsic capacity of llms for understanding extremely long sequences with training-free memory

Chaojun Xiao, Pengle Zhang, Xu Han, Guangxuan Xiao, Yankai Lin, Zhengyan Zhang, Zhiyuan Liu, Song Han, and Maosong Sun. Infllm: Unveiling the intrinsic capacity of llms for understanding extremely long sequences with training-free memory. arXiv preprint arXiv:2402.04617,

[30]

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453,

[31]

Primera: Pyramid-based masked sentence pre-training for multi-document summarization

Wen Xiao, Iz Beltagy, Giuseppe Carenini, and Arman Cohan. Primera: Pyramid-based masked sentence pre-training for multi-document summarization. arXiv preprint arXiv:2110.08499,

[32]

Effective long-context scaling of foundation models

Wenhan Xiong, Jingyu Liu, Igor Molybog, Hejia Zhang, Prajjwal Bhargava, Rui Hou, Louis Martin, Rashi Rungta, Karthik Abinav Sankararaman, Barlas Oguz, et al. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039,
