Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu · 2024 · cs.LG · arXiv 2402.19427

Canonical reference. 53 Pith papers cite this work; of the classified citations, 75% cite it as background.

abstract

Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.

citation-role summary

background: 10 · baseline: 1 · method: 1

citation-polarity summary

background: 9 · baseline: 1 · unclear: 1 · use method: 1

representative citing papers

When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models

cs.LG · 2026-03-22 · conditional · novelty 8.0

Content-based routing succeeds only when models provide bidirectional context and perform pairwise comparisons, with bidirectional Mamba plus rank-1 projection reaching 99.7% precision at linear inference cost.

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

cs.LG · 2024-07-05 · conditional · novelty 8.0

TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to keep reducing perplexity as context length grows, unlike Mamba.

Morphing into Hybrid Attention Models

cs.CL · 2026-06-29 · unverdicted · novelty 7.0

FlashMorph formulates hybrid layer selection as budget-constrained optimization, trains per-layer gates on synthetic retrieval data with linearization regularization, then discretizes and distills to produce efficient hybrid architectures.

CARVE: Content-Aware Recurrent with Value Efficiency for Chunk-Parallel Linear Attention

cs.CL · 2026-06-25 · conditional · novelty 7.0

CARVE introduces key-axis content-aware gating and value-efficient scalar writes in recurrent linear attention, outperforming GDN-2 on perplexity and retrieval tasks while cutting parameters and memory.

LoopMoE: Unifying Iterative Computation with Mixture-of-Experts for Language Modeling

cs.LG · 2026-06-03 · unverdicted · novelty 7.0

LoopMoE is a looped MoE language model that outperforms matched vanilla MoE on 8 of 9 downstream benchmarks at 3B scale and continues to outperform at 9B scale under strictly controlled budgets.

Forget Attention: Importance-Aware Attention Is All You Need

cs.AI · 2026-06-01 · unverdicted · novelty 7.0

SISA adds an SSM importance term inside the attention score and runs the full operation as one SDPA call on augmented Q/K vectors, reporting better LAMBADA and perfect NIAH at small scale.

Trading Complexity for Expressivity Through Structured Generalized Linear Token Mixing

cs.LG · 2026-05-29 · unverdicted · novelty 7.0

Presents a structured generalized linear token mixing framework that extends recurrence equations to multiple past states, enabling new patterns with provable complexity-expressivity trade-offs for causal generation.

Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference

cs.CL · 2026-05-25 · unverdicted · novelty 7.0

A sleep mechanism with N offline recurrent passes consolidates context into fast weights, improving performance on reasoning tasks where standard transformers fail.

How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate for larger t.

Scalable Memristive-Friendly Reservoir Computing for Time Series Classification

cs.NE · 2026-04-21 · unverdicted · novelty 7.0

MARS parallel reservoirs achieve up to 21x training speedups and outperform LRU, S5, and Mamba on long sequence benchmarks while remaining gradient-free and compact.

On the Expressive Power and Limitations of Multi-Layer SSMs

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

Multi-layer SSMs cannot perform certain compositional tasks, and offline CoT adds little power, but online CoT makes them equivalent to streaming algorithms and makes width equivalent to precision.

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

cs.LG · 2024-05-31 · unverdicted · novelty 7.0

Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

The Context-Ready Transformer

cs.CL · 2026-06-25 · unverdicted · novelty 6.0

Context-ready transformer adds a correction network to pre-contextualize tokens in a D-layer block, turning the model recurrent for inference while allowing K-step unrolled parallel training, with reported gains over standard transformers.

Architecture-Aware Reinforcement Learning Makes Sliding-Window Attention Competitive in Math Reasoning

cs.AI · 2026-06-10 · unverdicted · novelty 6.0

Reinforcement learning after SFT conversion narrows the performance gap between sliding-window attention and full self-attention on math reasoning benchmarks while preserving linear complexity.

Free Parametrization of L_2-Bounded Structured State-Space Controllers for Nonlinear Control with Stability Guarantees

eess.SY · 2026-06-09 · unverdicted · novelty 6.0

A new free parametrization of L2-bounded LTI systems creates L2RU SSM layers that enforce stability by design, allowing unconstrained optimization of nonlinear controllers with guarantees via the small-gain theorem.

Titans-as-a-Layer: Test-Time Memory for Conversational Speech Emotion Recognition

cs.LG · 2026-06-07 · unverdicted · novelty 6.0

Introduces a Memory-as-a-Layer adapter that writes dialogue history into neural memory and reads it as a residual update to improve conversational speech emotion recognition on audio LLMs.

Blurry Window Attention

cs.LG · 2026-05-31 · unverdicted · novelty 6.0

Blurry Window Attention stores a frequency window and reconstructs blurry KV history via Dirichlet kernel interpolation, achieving 8x better state efficiency than sliding window attention on the MQAR synthetic task.

Harmonic: Hierarchical State Space Models for Efficient Long-Context Language Modeling

cs.CL · 2026-05-30 · unverdicted · novelty 6.0

Hierarchical SSM architecture Harmonic outperforms Transformers and Mamba on long-context language modeling up to 64K tokens and removes RoPE limits at 1B scale while maintaining O(L) compute.

Memory by Design: Probabilistic Sequence Layers

stat.ML · 2026-05-29 · unverdicted · novelty 6.0

The design-model framework unifies sub-quadratic sequence models as Bayesian filters and introduces a covariance-tracking Bayesian Layer that improves retrieval robustness beyond training regimes on MQAR and RULER benchmarks.

Interdomain Attention: Beyond Token-Level Key-Value Memory

cs.LG · 2026-05-23 · unverdicted · novelty 6.0

Interdomain Attention integrates SSMs into attention via finite feature maps and basis projections to enable query-conditioned attention over fixed states, showing gains over SSM baselines and matching softmax at 1.3B scale with length-flat scaling.

Towards Understanding Self-Pretraining for Sequence Classification

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.

Flash PD-SSM: Memory-Optimized Structured Sparse State-Space Models

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

Flash PD-SSM achieves FSA-level expressivity by discretely selecting one matrix from a trainable set of structured sparse transition matrices at each time step while preserving the runtime and memory efficiency of standard structured SSMs.

The Routing and Filtering Structure of Attention

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Attention decomposes into low-rank routing and symmetric filtering; disentangled S-D attention reveals a spectral cascade allowing early-layer linearization at under 5% perplexity cost.

A Single-Layer Model Can Do Language Modeling

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).

citing papers explorer

Morphing into Hybrid Attention Models

cs.CL · 2026-06-29 · unverdicted · none · ref 15 · internal anchor

FlashMorph formulates hybrid layer selection as budget-constrained optimization, trains per-layer gates on synthetic retrieval data with linearization regularization, then discretizes and distills to produce efficient hybrid architectures.

CARVE: Content-Aware Recurrent with Value Efficiency for Chunk-Parallel Linear Attention

cs.CL · 2026-06-25 · conditional · none · ref 8 · internal anchor

CARVE introduces key-axis content-aware gating and value-efficient scalar writes in recurrent linear attention, outperforming GDN-2 on perplexity and retrieval tasks while cutting parameters and memory.

Do Language Models Need Sleep? Offline Recurrence for Improved Online Inference

cs.CL · 2026-05-25 · unverdicted · none · ref 18 · internal anchor

A sleep mechanism with N offline recurrent passes consolidates context into fast weights, improving performance on reasoning tasks where standard transformers fail.

The Context-Ready Transformer

cs.CL · 2026-06-25 · unverdicted · none · ref 20 · internal anchor

Context-ready transformer adds a correction network to pre-contextualize tokens in a D-layer block, turning the model recurrent for inference while allowing K-step unrolled parallel training, with reported gains over standard transformers.

Harmonic: Hierarchical State Space Models for Efficient Long-Context Language Modeling

cs.CL · 2026-05-30 · unverdicted · none · ref 4 · internal anchor

Hierarchical SSM architecture Harmonic outperforms Transformers and Mamba on long-context language modeling up to 64K tokens and removes RoPE limits at 1B scale while maintaining O(L) compute.

A Single-Layer Model Can Do Language Modeling

cs.CL · 2026-05-11 · unverdicted · none · ref 3 · internal anchor

A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).

The Impossibility Triangle of Long-Context Modeling

cs.CL · 2026-05-06 · unverdicted · none · ref 9 · internal anchor

No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space

cs.CL · 2026-04-06 · unverdicted · none · ref 101 · internal anchor

PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.

LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling

cs.CL · 2026-03-12 · unverdicted · none · ref 2 · internal anchor

LPC-SM is a hybrid architecture separating local attention, persistent memory, predictive correction, and control with ONT for memory writes, showing loss reductions on 158M-parameter models up to 4096-token contexts.

MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading

cs.CL · 2026-05-11 · unverdicted · none · ref 23 · internal anchor

MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.
