{"work":{"id":"4150b761-b8bf-4d9b-a2f8-cb2d1b73d378","openalex_id":null,"doi":null,"arxiv_id":"2111.00396","raw_key":null,"title":"Efficiently Modeling Long Sequences with Structured State Spaces","authors":null,"authors_text":"Albert Gu, Karan Goel, Christopher R\\'e","year":2021,"venue":"cs.LG","abstract":"A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of $10000$ or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) \\( x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t) \\), and showed that for appropriate choices of the state matrix \\( A \\), this system could handle long-range dependencies mathematically and empirically. However, this method has prohibitive computation and memory requirements, rendering it infeasible as a general sequence modeling solution. We propose the Structured State Space sequence model (S4) based on a new parameterization for the SSM, and show that it can be computed much more efficiently than prior approaches while preserving their theoretical strengths. Our technique involves conditioning \\( A \\) with a low-rank correction, allowing it to be diagonalized stably and reducing the SSM to the well-studied computation of a Cauchy kernel. S4 achieves strong empirical results across a diverse range of established benchmarks, including (i) 91\\% accuracy on sequential CIFAR-10 with no data augmentation or auxiliary losses, on par with a larger 2-D ResNet, (ii) substantially closing the gap to Transformers on image and language modeling tasks, while performing generation $60\\times$ faster (iii) SoTA on every task from the Long Range Arena benchmark, including solving the challenging Path-X task of length 16k that all prior work fails on, while being as efficient as all competitors.","external_url":"https://arxiv.org/abs/2111.00396","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T07:36:41.903329+00:00","pith_arxiv_id":"2111.00396","created_at":"2026-05-08T21:39:23.462644+00:00","updated_at":"2026-05-25T07:36:41.903329+00:00","title_quality_ok":true,"display_title":"Efficiently Modeling Long Sequences with Structured State Spaces","render_title":"Efficiently Modeling Long Sequences with Structured State Spaces"},"hub":{"state":{"work_id":"4150b761-b8bf-4d9b-a2f8-cb2d1b73d378","tier":"super_hub","tier_reason":"100+ Pith inbound or 10,000+ external citations","pith_inbound_count":110,"external_cited_by_count":null,"distinct_field_count":20,"first_pith_cited_at":"2023-07-17T16:40:01+00:00","last_pith_cited_at":"2026-05-20T15:36:20+00:00","author_build_status":"needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-04T08:27:26.436116+00:00","tier_text":"super_hub"},"tier":"super_hub","role_counts":[{"context_role":"background","n":20},{"context_role":"method","n":5},{"context_role":"baseline","n":1}],"polarity_counts":[{"context_polarity":"background","n":20},{"context_polarity":"use_method","n":5},{"context_polarity":"baseline","n":1}],"runs":{"ask_index":{"job_type":"ask_index","status":"succeeded","result":{"title":"Efficiently Modeling Long Sequences with Structured State Spaces","claims":[{"claim_text":"A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of $10000$ or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) \\( x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t) \\), and showed that for appropriate choices of the ","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"uncontrollable components in Transformers. •Composable Systems: Designing modular, predictable LLM control architectures. Recent work such as LiSeCo explores activation-level control and provides theoretical guarantees [30], but its effectiveness is limited by the need for supervised probes andrepresentativetrainingdata.Structuredstatespacemod- els (SSMs) [240] have recently emerged as efficient alter- natives to Transformers, capturing complex dependencies Page 16 of 28 via linear, well-charact","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"org/abs/2312.00752. [5] Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Re. Hippo: Recurrent memory with optimal polynomial projections, 2020. URLhttps://arxiv.org/abs/2008.07669. [6] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces, 2022. URLhttps://arxiv.org/abs/2111.00396. [7] Ankit Gupta, Albert Gu, and Jonathan Berant. Diagonal state spaces are as effective as structured state spaces, 2022. URLhttps://arxiv.org/abs/22","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"(O(n2)) with respect to the input size n. This leads to high computational demands during both training and in- ference. Consequently, these limitations restrict their prac- 1 arXiv:2605.08073v1 [cs.CV] 8 May 2026 APREPRINT- MAY11, 2026 tical effectiveness in high-resolution image reconstruction applications. Recently, the State Space Model (SSM) [12] has garnered significant attention in Natural Language Processing (NLP) and high-level vision tasks [61, 33, 52] for its innovative, highly effici","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling 2 Related Work Hybrid long-context models trained from scratch.Recent work has explored training hybrid architectures from scratch that combine softmax attention with more efficient sequence modeling primitives such as state-space models (SSMs) or linear attention to overcome the quadratic cost of attention. Foundational approaches include S4 [17] and Mamba [16], as well as alternative long-context mechanisms such as RetNet [ 42","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"[17] Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Im- plicit neural representations with periodic activation functions. Advances in neural information processing systems, 33:7462-7473, 2020. [18] Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. Hippo: Recurrent memory with optimal polynomial projections. Advances in neural information processing systems, 33:1474-1487, 2020. [19] Albert Gu, Karan Goel, and Christopher Ré. Efficiently m","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"14052. [11] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. InFirst Conference on Language Modeling, 2024. URL https://openreview.net/forum? id=tEYskw1VY2. [12] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces, 2022. URLhttps://arxiv.org/abs/2111.00396. [13] Han Guo, Songlin Yang, Tarushii Goel, Eric P. Xing, Tri Dao, and Yoon Kim. Log-linear attention, 2026. URLhttps://arxiv.org/abs/2506.04761. 1","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Efficiently Modeling Long Sequences with Structured State Spaces because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (19 contexts).","role_counts":[{"n":19,"context_role":"background"},{"n":5,"context_role":"method"},{"n":1,"context_role":"baseline"}]},"error":null,"updated_at":"2026-05-21T22:23:10.221878+00:00"},"author_expand":{"job_type":"author_expand","status":"succeeded","result":{"authors_linked":[{"id":"4a59877d-50ae-4a8f-808a-3bdb071ca137","orcid":null,"display_name":"Albert Gu"},{"id":"56dc8228-729c-4d0d-9a5c-e3c9d28510b9","orcid":null,"display_name":"Karan Goel"},{"id":"e9d678aa-1f9b-413a-82e3-a11d00600301","orcid":null,"display_name":"Christopher R\\'e"}]},"error":null,"updated_at":"2026-05-21T22:23:10.387204+00:00"},"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T11:59:59.697731+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"Mamba: Linear-Time Sequence Modeling with Selective State Spaces","work_id":"4ee75248-1199-492c-a52f-6661e0f4adff","shared_citers":28},{"title":"Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality","work_id":"d8eba076-0449-4f6a-aae1-5a7260677f0f","shared_citers":14},{"title":"Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model","work_id":"bd81352e-a64f-4720-9f76-ddda0ea9af83","shared_citers":10},{"title":"Gated Delta Networks: Improving Mamba2 with Delta Rule","work_id":"884939d3-e283-4625-bff4-b7e0e4cc2a6e","shared_citers":9},{"title":"Retentive Network: A Successor to Transformer for Large Language Models","work_id":"5b0449ac-92b0-41f2-8b4f-586c2b5a08b6","shared_citers":9},{"title":"Simplified state space layers for sequence modeling","work_id":"d4f90830-6ceb-4c9e-b206-eb32ac063f45","shared_citers":9},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":8},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":6},{"title":"Hungry hungry hippos: Towards language modeling with state space models","work_id":"d5653b0c-f12c-4141-9343-d65df1fb4214","shared_citers":6},{"title":"Longformer: The Long-Document Transformer","work_id":"abea7a44-6668-4de7-aab6-f53a6e5aa088","shared_citers":6},{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":6},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":5},{"title":"Efficient Streaming Language Models with Attention Sinks","work_id":"a8d25452-c237-48c9-88a4-682717c3979a","shared_citers":5},{"title":"Fast Transformer Decoding: One Write-Head is All You Need","work_id":"160ea164-b1d4-4adb-8ccb-a4655d8a0bb4","shared_citers":5},{"title":"Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free","work_id":"35cc586b-44f1-4948-a84b-866e8335e649","shared_citers":5},{"title":"Generating Long Sequences with Sparse Transformers","work_id":"c5b81688-45ee-4a9a-b095-e6290f45cb6c","shared_citers":5},{"title":"Jamba: A Hybrid Transformer-Mamba Language Model","work_id":"129df0fe-8a66-4077-8991-3557cfa38274","shared_citers":5},{"title":"Rethinking Attention with Performers","work_id":"4c26d308-8b72-4a98-8e73-950617a75f50","shared_citers":5},{"title":"RWKV-7 “Goose” with expressive dynamic state evolution","work_id":"daf20308-c801-4745-a147-a1f1de4368e0","shared_citers":5},{"title":"An empirical study of mamba-based language models","work_id":"7a48323c-5291-43c0-93ea-ac7a32dd7fef","shared_citers":4},{"title":"arXiv preprint arXiv:2401.04722 (2024)","work_id":"0615c428-0818-4497-8890-1ff89b52d6eb","shared_citers":4},{"title":"arXiv preprint arXiv:2603.15569 , year=","work_id":"74e489b6-3a26-4e84-9578-7f3d90119238","shared_citers":4},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":4},{"title":"Gated linear attention transformers with hardware-efficient training","work_id":"65a18a30-6e80-4b64-a026-bb0368e38872","shared_citers":4}],"time_series":[{"n":1,"year":2023},{"n":1,"year":2024},{"n":1,"year":2025},{"n":52,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T12:10:05.773295+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T11:59:53.293012+00:00"},"role_polarity":{"job_type":"role_polarity","status":"succeeded","result":{"title":"Efficiently Modeling Long Sequences with Structured State Spaces","claims":[{"claim_text":"A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of $10000$ or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) \\( x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t) \\), and showed that for appropriate choices of the ","claim_type":"abstract","evidence_strength":"source_metadata"},{"claim_text":"uncontrollable components in Transformers. •Composable Systems: Designing modular, predictable LLM control architectures. Recent work such as LiSeCo explores activation-level control and provides theoretical guarantees [30], but its effectiveness is limited by the need for supervised probes andrepresentativetrainingdata.Structuredstatespacemod- els (SSMs) [240] have recently emerged as efficient alter- natives to Transformers, capturing complex dependencies Page 16 of 28 via linear, well-charact","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"org/abs/2312.00752. [5] Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Re. Hippo: Recurrent memory with optimal polynomial projections, 2020. URLhttps://arxiv.org/abs/2008.07669. [6] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces, 2022. URLhttps://arxiv.org/abs/2111.00396. [7] Ankit Gupta, Albert Gu, and Jonathan Berant. Diagonal state spaces are as effective as structured state spaces, 2022. URLhttps://arxiv.org/abs/22","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"(O(n2)) with respect to the input size n. This leads to high computational demands during both training and in- ference. Consequently, these limitations restrict their prac- 1 arXiv:2605.08073v1 [cs.CV] 8 May 2026 APREPRINT- MAY11, 2026 tical effectiveness in high-resolution image reconstruction applications. Recently, the State Space Model (SSM) [12] has garnered significant attention in Natural Language Processing (NLP) and high-level vision tasks [61, 33, 52] for its innovative, highly effici","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling 2 Related Work Hybrid long-context models trained from scratch.Recent work has explored training hybrid architectures from scratch that combine softmax attention with more efficient sequence modeling primitives such as state-space models (SSMs) or linear attention to overcome the quadratic cost of attention. Foundational approaches include S4 [17] and Mamba [16], as well as alternative long-context mechanisms such as RetNet [ 42","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"[17] Vincent Sitzmann, Julien Martel, Alexander Bergman, David Lindell, and Gordon Wetzstein. Im- plicit neural representations with periodic activation functions. Advances in neural information processing systems, 33:7462-7473, 2020. [18] Albert Gu, Tri Dao, Stefano Ermon, Atri Rudra, and Christopher Ré. Hippo: Recurrent memory with optimal polynomial projections. Advances in neural information processing systems, 33:1474-1487, 2020. [19] Albert Gu, Karan Goel, and Christopher Ré. Efficiently m","claim_type":"method","confidence":0.9,"evidence_strength":"citation_context"},{"claim_text":"14052. [11] Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. InFirst Conference on Language Modeling, 2024. URL https://openreview.net/forum? id=tEYskw1VY2. [12] Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces, 2022. URLhttps://arxiv.org/abs/2111.00396. [13] Han Guo, Songlin Yang, Tarushii Goel, Eric P. Xing, Tri Dao, and Yoon Kim. Log-linear attention, 2026. URLhttps://arxiv.org/abs/2506.04761. 1","claim_type":"background","confidence":0.9,"evidence_strength":"citation_context"}],"why_cited":"Pith tracks Efficiently Modeling Long Sequences with Structured State Spaces because it crossed a citation-hub threshold. Current citing contexts most often use it as background evidence (19 contexts).","role_counts":[{"n":19,"context_role":"background"},{"n":5,"context_role":"method"},{"n":1,"context_role":"baseline"}]},"error":null,"updated_at":"2026-05-21T22:23:10.218829+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Efficiently Modeling Long Sequences with Structured State Spaces","claims":[{"claim_text":"A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of $10000$ or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) \\( x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t) \\), and showed that for appropriate choices of the ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Efficiently Modeling Long Sequences with Structured State Spaces because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T12:09:56.551749+00:00"}},"summary":{"title":"Efficiently Modeling Long Sequences with Structured State Spaces","claims":[{"claim_text":"A central goal of sequence modeling is designing a single principled model that can address sequence data across a range of modalities and tasks, particularly on long-range dependencies. Although conventional models including RNNs, CNNs, and Transformers have specialized variants for capturing long dependencies, they still struggle to scale to very long sequences of $10000$ or more steps. A promising recent approach proposed modeling sequences by simulating the fundamental state space model (SSM) \\( x'(t) = Ax(t) + Bu(t), y(t) = Cx(t) + Du(t) \\), and showed that for appropriate choices of the ","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Efficiently Modeling Long Sequences with Structured State Spaces because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"Mamba: Linear-Time Sequence Modeling with Selective State Spaces","work_id":"4ee75248-1199-492c-a52f-6661e0f4adff","shared_citers":28},{"title":"Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality","work_id":"d8eba076-0449-4f6a-aae1-5a7260677f0f","shared_citers":14},{"title":"Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model","work_id":"bd81352e-a64f-4720-9f76-ddda0ea9af83","shared_citers":10},{"title":"Gated Delta Networks: Improving Mamba2 with Delta Rule","work_id":"884939d3-e283-4625-bff4-b7e0e4cc2a6e","shared_citers":9},{"title":"Retentive Network: A Successor to Transformer for Large Language Models","work_id":"5b0449ac-92b0-41f2-8b4f-586c2b5a08b6","shared_citers":9},{"title":"Simplified state space layers for sequence modeling","work_id":"d4f90830-6ceb-4c9e-b206-eb32ac063f45","shared_citers":9},{"title":"Decoupled Weight Decay Regularization","work_id":"07ef7360-d385-4033-83f7-8384a6325204","shared_citers":8},{"title":"An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale","work_id":"e96730e3-129b-4db6-b981-15ab7932e297","shared_citers":6},{"title":"Hungry hungry hippos: Towards language modeling with state space models","work_id":"d5653b0c-f12c-4141-9343-d65df1fb4214","shared_citers":6},{"title":"Longformer: The Long-Document Transformer","work_id":"abea7a44-6668-4de7-aab6-f53a6e5aa088","shared_citers":6},{"title":"Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge","work_id":"28ea1282-d657-4c61-a83c-f1249be6d6b1","shared_citers":6},{"title":"Adam: A Method for Stochastic Optimization","work_id":"1910796d-9b52-4683-bf5c-de9632c1028b","shared_citers":5},{"title":"Efficient Streaming Language Models with Attention Sinks","work_id":"a8d25452-c237-48c9-88a4-682717c3979a","shared_citers":5},{"title":"Fast Transformer Decoding: One Write-Head is All You Need","work_id":"160ea164-b1d4-4adb-8ccb-a4655d8a0bb4","shared_citers":5},{"title":"Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free","work_id":"35cc586b-44f1-4948-a84b-866e8335e649","shared_citers":5},{"title":"Generating Long Sequences with Sparse Transformers","work_id":"c5b81688-45ee-4a9a-b095-e6290f45cb6c","shared_citers":5},{"title":"Jamba: A Hybrid Transformer-Mamba Language Model","work_id":"129df0fe-8a66-4077-8991-3557cfa38274","shared_citers":5},{"title":"Rethinking Attention with Performers","work_id":"4c26d308-8b72-4a98-8e73-950617a75f50","shared_citers":5},{"title":"RWKV-7 “Goose” with expressive dynamic state evolution","work_id":"daf20308-c801-4745-a147-a1f1de4368e0","shared_citers":5},{"title":"An empirical study of mamba-based language models","work_id":"7a48323c-5291-43c0-93ea-ac7a32dd7fef","shared_citers":4},{"title":"arXiv preprint arXiv:2401.04722 (2024)","work_id":"0615c428-0818-4497-8890-1ff89b52d6eb","shared_citers":4},{"title":"arXiv preprint arXiv:2603.15569 , year=","work_id":"74e489b6-3a26-4e84-9578-7f3d90119238","shared_citers":4},{"title":"DeepSeek-V3 Technical Report","work_id":"57d2791d-2219-4c31-a077-afc04b12a75c","shared_citers":4},{"title":"Gated linear attention transformers with hardware-efficient training","work_id":"65a18a30-6e80-4b64-a026-bb0368e38872","shared_citers":4}],"time_series":[{"n":1,"year":2023},{"n":1,"year":2024},{"n":1,"year":2025},{"n":52,"year":2026}],"dependency_candidates":[]},"authors":[{"id":"4a59877d-50ae-4a8f-808a-3bdb071ca137","orcid":null,"display_name":"Albert Gu","source":"manual","import_confidence":0.72},{"id":"e9d678aa-1f9b-413a-82e3-a11d00600301","orcid":null,"display_name":"Christopher R\\'e","source":"manual","import_confidence":0.72},{"id":"56dc8228-729c-4d0d-9a5c-e3c9d28510b9","orcid":null,"display_name":"Karan Goel","source":"manual","import_confidence":0.72}]}}