pith. machine review for the scientific record.

arxiv: 2007.14062 · v2 · submitted 2020-07-28 · 💻 cs.LG · cs.CL · stat.ML

Recognition: 1 theorem link

· Lean Theorem

Big Bird: Transformers for Longer Sequences

Authors on Pith · no claims yet

Pith reviewed 2026-05-17 01:49 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · stat.ML
keywords transformers · sparse attention · long sequences · universal approximator · Turing complete · NLP · genomics

The pith

BigBird's sparse attention preserves universal approximation and Turing completeness while scaling transformers to much longer sequences.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a sparse attention mechanism called BigBird that replaces the quadratic full attention in transformers with a combination of global, local, and random attention patterns, reducing the cost to linear in sequence length. It proves that this design remains a universal approximator of sequence functions and is Turing complete, matching the theoretical power of the original quadratic model. Because the mechanism scales linearly, it supports sequences up to eight times longer on the same hardware. This directly improves results on question answering and summarization and opens applications in genomics, where long contexts matter.
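The linear-versus-quadratic gap can be made concrete with a back-of-envelope count of attended query-key pairs. This is a sketch under assumed sizes: the `window`, `num_global`, and `num_random` values below are illustrative placeholders, not the paper's actual configuration.

```python
# Back-of-envelope count of attended query-key pairs for full vs. sparse
# attention. Window/global/random sizes are illustrative, not the paper's.

def full_pairs(n: int) -> int:
    """Every query attends to every key: quadratic in n."""
    return n * n

def bigbird_pairs(n: int, window: int = 3, num_global: int = 2,
                  num_random: int = 3) -> int:
    """Each row attends to a constant number of keys, plus O(1) global
    rows that attend to the whole sequence: linear in n."""
    per_row = (2 * window + 1) + num_global + num_random
    return n * per_row + num_global * n

for n in (512, 4096):
    print(n, full_pairs(n), bigbird_pairs(n))
# 512:  262144 full vs. 7168 sparse
# 4096: 16777216 full vs. 57344 sparse -- the gap widens as n grows
```

Doubling the sequence length doubles the sparse count but quadruples the full one, which is the whole case for the 8x-longer-context claim.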

Core claim

BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization.

What carries the argument

The sparse attention pattern that mixes a small set of global tokens attending to the whole sequence, local sliding windows, and random attention connections.
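That pattern can be sketched as a boolean attention mask. This is a token-level toy with made-up sizes (`num_global`, `window`, `num_random` are assumptions, not the paper's settings), and it ignores the blockwise implementation the paper actually uses.

```python
import numpy as np

def bigbird_mask(n: int, num_global: int = 2, window: int = 3,
                 num_random: int = 3, seed: int = 0) -> np.ndarray:
    """Boolean attention mask mixing global, local, and random patterns.
    Token-level sketch with illustrative sizes; the paper works blockwise."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n, n), dtype=bool)
    mask[:num_global, :] = True          # global tokens attend everywhere
    mask[:, :num_global] = True          # everyone attends to global tokens
    for i in range(n):                   # sliding window around the diagonal
        mask[i, max(0, i - window):min(n, i + window + 1)] = True
    for i in range(n):                   # a few random keys per query
        mask[i, rng.choice(n, size=num_random, replace=False)] = True
    return mask

mask = bigbird_mask(64)
# Non-global rows attend to at most (2*window + 1) + num_global + num_random
# keys, a constant independent of n.
```

Applying such a mask before the softmax (setting masked logits to -inf) is the standard way to realize the sparse pattern without changing the attention formula itself.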

If this is right

  • Sequences up to eight times longer become feasible on comparable hardware.
  • Performance improves on question answering and summarization.
  • New applications become practical for genomics data requiring long contexts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same sparse pattern might be tested on other long-context domains such as video or multi-document reasoning.
  • If the proof holds, similar sparse designs could be applied to other attention-based architectures without losing universality.

Load-bearing premise

The chosen combination of global, local, and random attention tokens is sufficient to retain the expressive power of full attention for the tasks considered.
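One cheap empirical probe of this premise: with even a single global token, the attention graph connects any pair of positions within two hops, which is the connectivity the universality argument leans on. A minimal check, assuming a hypothetical one-global-token plus narrow-window mask (not the paper's construction):

```python
from collections import deque
import numpy as np

def max_shortest_path(mask: np.ndarray) -> int:
    """Longest shortest path in the directed attention graph (BFS per node)."""
    n = mask.shape[0]
    worst = 0
    for s in range(n):
        dist = [-1] * n
        dist[s] = 0
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in np.flatnonzero(mask[u]):
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        worst = max(worst, max(dist))
    return worst

n = 32
mask = np.zeros((n, n), dtype=bool)
mask[0, :] = mask[:, 0] = True             # a single global token
for i in range(n):
    mask[i, max(0, i - 1):i + 2] = True    # narrow local window
print(max_shortest_path(mask))  # → 2: any pair is linked via the global token
```

Short graph diameter is necessary but not sufficient for expressivity, so this probe can refute the premise (if paths were long) but cannot prove it.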

What would settle it

A concrete sequence function or Turing-machine simulation that full attention can realize but BigBird cannot realize even with the global-local-random pattern on sufficiently long inputs.

read the original abstract

Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis reveals some of the benefits of having $O(1)$ global tokens (such as CLS), that attend to the entire sequence as part of the sparse attention mechanism. The proposed sparse attention can handle sequences of length up to 8x of what was previously possible using similar hardware. As a consequence of the capability to handle longer context, BigBird drastically improves performance on various NLP tasks such as question answering and summarization. We also propose novel applications to genomics data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes BigBird, a sparse attention mechanism for Transformers that reduces the quadratic dependency on sequence length to linear. It claims to prove that BigBird is a universal approximator of sequence functions and is Turing complete (preserving these properties of full attention), highlights benefits of O(1) global tokens such as CLS, reports handling sequences up to 8x longer on similar hardware, and shows performance gains on NLP tasks including question answering and summarization along with novel genomics applications.

Significance. If the universality and Turing-completeness claims hold for the specific global+local+random sparsity pattern, the work would be significant for enabling efficient long-sequence modeling in attention-based architectures while retaining expressive power, with direct implications for NLP and sequence tasks in other domains such as genomics.

major comments (2)
  1. [Abstract] The central claim that BigBird 'is a universal approximator of sequence functions and is Turing complete' is asserted via 'theoretical analysis,' but the provided manuscript supplies neither the construction of the sparse pattern nor the proof strategy. This prevents verification of whether the combination of O(1) global tokens, local windows, and random connections actually guarantees the necessary information flow for arbitrary sequence functions rather than a restricted subclass.
  2. [Abstract] The statement that the analysis 'reveals some of the benefits of having O(1) global tokens (such as CLS)' is presented without any indication of how these tokens are formally shown to contribute to universality or to simulate the information propagation of dense attention.
minor comments (1)
  1. [Abstract] The abstract refers to 'the proposed sparse attention' and 'the capability to handle longer context' but does not define the exact sparsity pattern or the baseline hardware constraints used for the 8x claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for raising these points about the presentation of our theoretical claims. The full manuscript contains the requested construction and proofs in the main body and appendix; we address each comment below and will revise the abstract for clarity.

read point-by-point responses
  1. Referee: [Abstract] The central claim that BigBird 'is a universal approximator of sequence functions and is Turing complete' is asserted via 'theoretical analysis,' but the provided manuscript supplies neither the construction of the sparse pattern nor the proof strategy. This prevents verification of whether the combination of O(1) global tokens, local windows, and random connections actually guarantees the necessary information flow for arbitrary sequence functions rather than a restricted subclass.

    Authors: The construction of the sparse attention pattern (a fixed combination of global tokens, local sliding windows, and random connections) is given in Section 3. The proof that this specific pattern remains a universal approximator and Turing complete appears in Section 4, with complete proofs in the appendix. The argument proceeds by showing that the pattern preserves sufficient connectivity for information to propagate across arbitrary distances in a constant number of layers, thereby simulating the expressive power of dense attention. We will add a parenthetical reference to these sections in the abstract. revision: partial

  2. Referee: [Abstract] The statement that the analysis 'reveals some of the benefits of having O(1) global tokens (such as CLS)' is presented without any indication of how these tokens are formally shown to contribute to universality or to simulate the information propagation of dense attention.

    Authors: Section 4 formalizes the role of the O(1) global tokens: each token attends to a constant number of global tokens that in turn attend to the full sequence, which is shown to be sufficient for the required long-range information flow in the universality proof. This is the precise mechanism by which the sparse pattern recovers the expressive power of dense attention. We will insert a short clause in the abstract pointing to this analysis. revision: partial

Circularity Check

0 steps flagged

Theoretical claims of universality and Turing completeness presented independently

full rationale

The abstract states that BigBird preserves universal approximation and Turing completeness of full attention via theoretical analysis, with benefits of O(1) global tokens noted. No equations, derivations, self-citations, or reductions to fitted parameters are supplied in the available text. The claims are asserted as independent theoretical results rather than holding by definition of the sparse attention inputs, so no circular step is flagged.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the assumption that the particular sparse attention graph (global + local + random) preserves the approximation and computability properties of dense attention; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption Standard properties of attention mechanisms and sequence-to-sequence functions hold under the sparse mask.
    Invoked when claiming universal approximation and Turing completeness for the sparse variant.

pith-pipeline@v0.9.0 · 5482 in / 1151 out tokens · 21312 ms · 2026-05-17T01:49:52.952447+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

    cs.LG 2026-05 conditional novelty 7.0

    MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

  2. Retrieval from Within: An Intrinsic Capability of Attention-Based Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Attention-based models can intrinsically retrieve and reuse pre-encoded evidence chunks via decoder attention queries, unifying retrieval with generation and outperforming external RAG pipelines on QA benchmarks.

  3. Measuring Investor Learning in Private Markets: A Sequential LLM-Bayesian Analysis of Expert Network Calls

    cs.CE 2025-12 unverdicted novelty 7.0

    A new LLM-Bayesian framework extracts investor beliefs from expert calls, showing these calls raise investment probability by 7-9 points and improve simulated portfolio returns by 15%.

  4. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

    cs.LG 2021-01 accept novelty 7.0

    Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

  5. Longformer: The Long-Document Transformer

    cs.CL 2020-04 accept novelty 7.0

    Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.

  6. Retrieval from Within: An Intrinsic Capability of Attention-Based Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.

  7. HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

    cs.LG 2026-04 unverdicted novelty 6.0

    HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

  8. Attention to Mamba: A Recipe for Cross-Architecture Distillation

    cs.CL 2026-04 unverdicted novelty 6.0

    A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.

  9. LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling

    cs.CL 2026-03 unverdicted novelty 6.0

    LPC-SM is a hybrid architecture separating local attention, persistent memory, predictive correction, and control with ONT for memory writes, showing loss reductions on 158M-parameter models up to 4096-token contexts.

  10. Gated Linear Attention Transformers with Hardware-Efficient Training

    cs.LG 2023-12 unverdicted novelty 6.0

    Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.

  11. ST-MoE: Designing Stable and Transferable Sparse Expert Models

    cs.CL 2022-02 unverdicted novelty 6.0

    ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost ...

  12. Deformable DETR: Deformable Transformers for End-to-End Object Detection

    cs.CV 2020-10 accept novelty 6.0

    Deformable DETR achieves higher accuracy than DETR, especially on small objects, while converging in one-tenth the training epochs by using sparse deformable attention on image features.

  13. Kaczmarz Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...

  14. StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    cs.LG 2026-05 accept novelty 5.0

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  15. Automatic Reflection Level Classification in Hungarian Student Essays

    cs.CL 2026-05 unverdicted novelty 5.0

    Classical machine learning models outperform Hungarian transformers slightly in overall performance (71% vs 68% average score) for classifying reflection levels in student essays, though transformers handle rare class...

  16. Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction

    cs.MM 2026-04 unverdicted novelty 5.0

    A new joint spatio-temporal enlargement model for micro-video popularity prediction using frame scoring for long sequences and a topology-aware memory bank for unbounded historical associations.

  17. Sessa: Selective State Space Attention

    cs.LG 2026-04 unverdicted novelty 5.0

    Sessa integrates attention within recurrent paths to achieve power-law memory tails and flexible non-decaying selective retrieval, outperforming baselines on long-context tasks.

  18. SG-UniBuc-NLP at SemEval-2026 Task 6: Multi-Head RoBERTa with Chunking for Long-Context Evasion Detection

    cs.CL 2026-04 unverdicted novelty 3.0

    A multi-head RoBERTa model with overlapping chunking and max-pooling achieves Macro-F1 of 0.80 on 3-way clarity classification and 0.51 on 9-way evasion strategy detection, ranking 11th in both subtasks of SemEval-202...