{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2020:L4DYAMP4CO2UVHDPGSULJ2LCZG","short_pith_number":"pith:L4DYAMP4","schema_version":"1.0","canonical_sha256":"5f078031fc13b54a9c6f34a8b4e962c9ba684e49e63756b2309e9e3feca98726","source":{"kind":"arxiv","id":"2007.14062","version":2},"attestation_state":"computed","paper":{"title":"Big Bird: Transformers for Longer Sequences","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"BigBird's sparse attention preserves universal approximation and Turing completeness while scaling transformers to much longer sequences.","cross_cats":["cs.CL","stat.ML"],"primary_cat":"cs.LG","authors_text":"Amr Ahmed, Anirudh Ravula, Avinava Dubey, Chris Alberti, Guru Guruganesh, Joshua Ainslie, Li Yang, Manzil Zaheer, Philip Pham, Qifan Wang, Santiago Ontanon","submitted_at":"2020-07-28T08:34:04Z","abstract_excerpt":"Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis re"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":false,"formal_links_present":true},"canonical_record":{"source":{"id":"2007.14062","kind":"arxiv","version":2},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.LG","submitted_at":"2020-07-28T08:34:04Z","cross_cats_sorted":["cs.CL","stat.ML"],"title_canon_sha256":"fe93aca2c56e0c722572fa55194f449de0bdf4351c41094903d025e445c17514","abstract_canon_sha256":"715aa306ef7fc43efc29c6b8f31619884065434b7f7310d8bce8aba1b16c8446"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:45.914063Z","signature_b64":"AnL1cMm6eigkbbnK5WhK7aW29OXyAWv+IgiH2+XS0Qkm8+H+noUAHEF7Qaw50sQFN+rDJaPYAkkWZMl/17KWAw==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"5f078031fc13b54a9c6f34a8b4e962c9ba684e49e63756b2309e9e3feca98726","last_reissued_at":"2026-05-17T23:38:45.913475Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:45.913475Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"Big Bird: Transformers for Longer Sequences","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"BigBird's sparse attention preserves universal approximation and Turing completeness while scaling transformers to much longer sequences.","cross_cats":["cs.CL","stat.ML"],"primary_cat":"cs.LG","authors_text":"Amr Ahmed, Anirudh Ravula, Avinava Dubey, Chris Alberti, Guru Guruganesh, Joshua Ainslie, Li Yang, Manzil Zaheer, Philip Pham, Qifan Wang, Santiago Ontanon","submitted_at":"2020-07-28T08:34:04Z","abstract_excerpt":"Transformers-based models, such as BERT, have been one of the most successful deep learning models for NLP. Unfortunately, one of their core limitations is the quadratic dependency (mainly in terms of memory) on the sequence length due to their full attention mechanism. To remedy this, we propose, BigBird, a sparse attention mechanism that reduces this quadratic dependency to linear. We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model. Along the way, our theoretical analysis re"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The chosen combination of global, local, and random attention tokens is sufficient to retain the expressive power of full attention for the tasks considered.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"BigBird replaces full attention in Transformers with a sparse pattern that achieves linear complexity while remaining a universal approximator and Turing complete.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"BigBird's sparse attention preserves universal approximation and Turing completeness while scaling transformers to much longer sequences.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"a43d586c044c5b0640e66c8f8ff2fc137ca2ae8d106081e8e9b85beb0f6edd1e"},"source":{"id":"2007.14062","kind":"arxiv","version":2},"verdict":{"id":"f3c96626-36fe-4079-9630-4fb0f76dd032","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T01:49:52.952447Z","strongest_claim":"We show that BigBird is a universal approximator of sequence functions and is Turing complete, thereby preserving these properties of the quadratic, full attention model.","one_line_summary":"BigBird replaces full attention in Transformers with a sparse pattern that achieves linear complexity while remaining a universal approximator and Turing complete.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The chosen combination of global, local, and random attention tokens is sufficient to retain the expressive power of full attention for the tasks considered.","pith_extraction_headline":"BigBird's sparse attention preserves universal approximation and Turing completeness while scaling transformers to much longer sequences."},"references":{"count":0,"sample":[],"resolved_work":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57","internal_anchors":0},"formal_canon":{"evidence_count":1,"snapshot_sha256":"c27924985aba7b2fe8a9a869dcd1a43127386e0ddea32c1434d04c96b90a4484"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2007.14062","created_at":"2026-05-17T23:38:45.913565+00:00"},{"alias_kind":"arxiv_version","alias_value":"2007.14062v2","created_at":"2026-05-17T23:38:45.913565+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2007.14062","created_at":"2026-05-17T23:38:45.913565+00:00"},{"alias_kind":"pith_short_12","alias_value":"L4DYAMP4CO2U","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_16","alias_value":"L4DYAMP4CO2UVHDP","created_at":"2026-05-18T12:33:33.725879+00:00"},{"alias_kind":"pith_short_8","alias_value":"L4DYAMP4","created_at":"2026-05-18T12:33:33.725879+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":20,"internal_anchor_count":20,"sample":[{"citing_arxiv_id":"2605.21042","citing_title":"Dynamic Video Generation: Shaping Video Generation Across Time and Space","ref_index":42,"is_internal_anchor":true},{"citing_arxiv_id":"2605.18807","citing_title":"Block-Based Double Decoders","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2512.20900","citing_title":"Measuring Investor Learning in Private Markets: A Sequential LLM-Bayesian Analysis of Expert Network Calls","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.03263","citing_title":"LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2312.06635","citing_title":"Gated Linear Attention Transformers with Hardware-Efficient Training","ref_index":103,"is_internal_anchor":true},{"citing_arxiv_id":"2604.14191","citing_title":"Attention to Mamba: A Recipe for Cross-Architecture Distillation","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2101.03961","citing_title":"Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2202.08906","citing_title":"ST-MoE: Designing Stable and Transferable Sparse Expert Models","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08587","citing_title":"Kaczmarz Linear Attention","ref_index":50,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22442","citing_title":"HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05806","citing_title":"Retrieval from Within: An Intrinsic Capability of Attention-Based Models","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2010.04159","citing_title":"Deformable DETR: Deformable Transformers for End-to-End Object Detection","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05806","citing_title":"Retrieval from Within: An Intrinsic Capability of Attention-Based Models","ref_index":40,"is_internal_anchor":true},{"citing_arxiv_id":"2605.07363","citing_title":"MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference","ref_index":30,"is_internal_anchor":true},{"citing_arxiv_id":"2004.05150","citing_title":"Longformer: The Long-Document Transformer","ref_index":130,"is_internal_anchor":true},{"citing_arxiv_id":"2604.18580","citing_title":"Sessa: Selective State Space Attention","ref_index":6,"is_internal_anchor":true},{"citing_arxiv_id":"2604.20311","citing_title":"Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction","ref_index":72,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02568","citing_title":"StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k","ref_index":35,"is_internal_anchor":true},{"citing_arxiv_id":"2604.26375","citing_title":"SG-UniBuc-NLP at SemEval-2026 Task 6: Multi-Head RoBERTa with Chunking for Long-Context Evasion Detection","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2605.02402","citing_title":"Automatic Reflection Level Classification in Hungarian Student Essays","ref_index":55,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":1,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/L4DYAMP4CO2UVHDPGSULJ2LCZG","json":"https://pith.science/pith/L4DYAMP4CO2UVHDPGSULJ2LCZG.json","graph_json":"https://pith.science/api/pith-number/L4DYAMP4CO2UVHDPGSULJ2LCZG/graph.json","events_json":"https://pith.science/api/pith-number/L4DYAMP4CO2UVHDPGSULJ2LCZG/events.json","paper":"https://pith.science/paper/L4DYAMP4"},"agent_actions":{"view_html":"https://pith.science/pith/L4DYAMP4CO2UVHDPGSULJ2LCZG","download_json":"https://pith.science/pith/L4DYAMP4CO2UVHDPGSULJ2LCZG.json","view_paper":"https://pith.science/paper/L4DYAMP4","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2007.14062&json=true","fetch_graph":"https://pith.science/api/pith-number/L4DYAMP4CO2UVHDPGSULJ2LCZG/graph.json","fetch_events":"https://pith.science/api/pith-number/L4DYAMP4CO2UVHDPGSULJ2LCZG/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/L4DYAMP4CO2UVHDPGSULJ2LCZG/action/timestamp_anchor","attest_storage":"https://pith.science/pith/L4DYAMP4CO2UVHDPGSULJ2LCZG/action/storage_attestation","attest_author":"https://pith.science/pith/L4DYAMP4CO2UVHDPGSULJ2LCZG/action/author_attestation","sign_citation":"https://pith.science/pith/L4DYAMP4CO2UVHDPGSULJ2LCZG/action/citation_signature","submit_replication":"https://pith.science/pith/L4DYAMP4CO2UVHDPGSULJ2LCZG/action/replication_record"}},"created_at":"2026-05-17T23:38:45.913565+00:00","updated_at":"2026-05-17T23:38:45.913565+00:00"}