Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, et al. · 2024 · cs.LG · arXiv 2402.19427

Canonical reference. 75% of citing Pith papers cite this work as background.

37 Pith papers citing it

abstract

Recurrent neural networks (RNNs) have fast inference and scale efficiently on long sequences, but they are difficult to train and hard to scale. We propose Hawk, an RNN with gated linear recurrences, and Griffin, a hybrid model that mixes gated linear recurrences with local attention. Hawk exceeds the reported performance of Mamba on downstream tasks, while Griffin matches the performance of Llama-2 despite being trained on over 6 times fewer tokens. We also show that Griffin can extrapolate on sequences significantly longer than those seen during training. Our models match the hardware efficiency of Transformers during training, and during inference they have lower latency and significantly higher throughput. We scale Griffin up to 14B parameters, and explain how to shard our models for efficient distributed training.
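
The two ingredients named in the abstract, a gated linear recurrence and local attention, can be illustrated with a minimal sketch. This is a generic toy version under stated assumptions, not the paper's actual layer: the function names, the scalar hidden state, the sigmoid gate, and the `(1 - a_t)` input scaling are all hypothetical choices made for illustration.

```python
import math


def sigmoid(z):
    """Logistic function, used here as the gate nonlinearity (an assumption)."""
    return 1.0 / (1.0 + math.exp(-z))


def gated_linear_recurrence(xs, w_gate, b_gate):
    """Toy scalar recurrence: h_t = a_t * h_{t-1} + (1 - a_t) * x_t.

    The gate a_t = sigmoid(w_gate * x_t + b_gate) depends only on the
    current input, so the update stays linear in the previous state.
    """
    h = 0.0
    hs = []
    for x in xs:
        a = sigmoid(w_gate * x + b_gate)  # input-dependent decay in (0, 1)
        h = a * h + (1.0 - a) * x         # linear in h, gated by the input
        hs.append(h)
    return hs


def local_attention_mask(seq_len, window):
    """mask[i][j] is True when position i may attend to position j.

    Attention is causal (j <= i) and restricted to the most recent
    `window` positions, including the current one.
    """
    return [[0 <= i - j < window for j in range(seq_len)]
            for i in range(seq_len)]


if __name__ == "__main__":
    print(gated_linear_recurrence([1.0, 1.0, 1.0], w_gate=0.0, b_gate=0.0))
    for row in local_attention_mask(4, window=2):
        print(row)
```

Because the recurrence keeps a fixed-size state and the mask bounds how far back each position looks, per-token inference cost in a sketch like this does not grow with sequence length, which is the kind of property the abstract's latency and throughput claims rely on.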

citation-role summary

background: 10 · baseline: 1 · method: 1

citation-polarity summary

background: 9 · baseline: 1 · unclear: 1 · use method: 1

representative citing papers

When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models

cs.LG · 2026-03-22 · conditional · novelty 8.0

Content-based routing succeeds only when models provide bidirectional context and perform pairwise comparisons, with bidirectional Mamba plus rank-1 projection reaching 99.7% precision at linear inference cost.

Learning to (Learn at Test Time): RNNs with Expressive Hidden States

cs.LG · 2024-07-05 · conditional · novelty 8.0

TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.

How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences

cs.LG · 2026-05-06 · unverdicted · novelty 7.0

In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate for larger t.

Scalable Memristive-Friendly Reservoir Computing for Time Series Classification

cs.NE · 2026-04-21 · unverdicted · novelty 7.0

MARS parallel reservoirs achieve up to 21x training speedups and outperform LRU, S5, and Mamba on long sequence benchmarks while remaining gradient-free and compact.

On the Expressive Power and Limitations of Multi-Layer SSMs

cs.LG · 2026-04-16 · unverdicted · novelty 7.0

Multi-layer SSMs cannot perform certain compositional tasks, offline CoT adds little power, but online CoT equates them to streaming algorithms and equates width with precision.

Selective Rotary Position Embedding

cs.CL · 2025-11-21 · unverdicted · novelty 7.0

Selective RoPE adds input-dependent rotations to generalize RoPE, showing implicit positional structure in softmax attention and improving performance on language modeling, copying, state tracking, and retrieval when added to gated transformers.

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

cs.LG · 2024-05-31 · unverdicted · novelty 7.0

Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

Towards Understanding Self-Pretraining for Sequence Classification

cs.LG · 2026-05-20 · unverdicted · novelty 6.0

Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.

Flash PD-SSM: Memory-Optimized Structured Sparse State-Space Models

cs.LG · 2026-05-18 · unverdicted · novelty 6.0

Flash PD-SSM achieves FSA-level expressivity by discretely selecting one matrix from a trainable set of structured sparse transition matrices at each time step while preserving the runtime and memory efficiency of standard structured SSMs.

The Routing and Filtering Structure of Attention

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Attention decomposes into low-rank routing and symmetric filtering; disentangled S-D attention reveals a spectral cascade allowing early-layer linearization at under 5% perplexity cost.

A Single-Layer Model Can Do Language Modeling

cs.CL · 2026-05-11 · unverdicted · novelty 6.0

A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).

Priming: Hybrid State Space Models From Pre-trained Transformers

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasoning at 32B scale.

A Robust Foundation Model for Conservation Laws: Injecting Context into Flux Neural Operators via Recurrent Vision Transformers

cs.LG · 2026-05-06 · unverdicted · novelty 6.0

A recurrent Vision Transformer hypernetwork injects context into Flux Neural Operators to infer and solve unseen conservation laws while preserving robustness and long-time stability.

The Impossibility Triangle of Long-Context Modeling

cs.CL · 2026-05-06 · unverdicted · novelty 6.0

No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

cs.LG · 2026-04-24 · unverdicted · novelty 6.0

HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

Optimal Decay Spectra for Linear Recurrences

cs.LG · 2026-04-08 · unverdicted · novelty 6.0

PoST reparameterizes decay spectra in linear recurrences with geometric log-spacing and position-adaptive scaling to achieve O(exp(-cN/log t)) decay, improving zero-shot language modeling and long-context retrieval across Mamba-2, RWKV-7 and similar models.

Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space

cs.CL · 2026-04-06 · unverdicted · novelty 6.0

PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.

Mambalaya: Einsum-Based Fusion Optimizations on State-Space Models

cs.AR · 2026-04-04 · unverdicted · novelty 6.0

Mambalaya delivers 4.9x prefill and 1.9x generation speedups on Mamba layers over prior accelerators by systematically fusing inter-Einsum operations.

LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling

cs.CL · 2026-03-12 · unverdicted · novelty 6.0

LPC-SM is a hybrid architecture separating local attention, persistent memory, predictive correction, and control with ONT for memory writes, showing loss reductions on 158M-parameter models up to 4096-token contexts.

RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

cs.LG · 2026-02-20 · unverdicted · novelty 6.0 · 2 refs

RAT+ pretrains a dense recurrent-augmented attention model once and enables flexible switching to dilated or hybrid sparse attention at inference after short adaptation, with small accuracy loss at high dilation factors.

When to Think Fast and Slow? AMOR: Adaptive Entropy Gate for Hybrid Models

cs.AI · 2026-01-22 · unverdicted · novelty 6.0

AMOR uses output entropy to gate attention in recurrent hybrids, matching full attention performance at roughly 22% attention invocations across 180M-1.5B models.

Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

cs.LG · 2025-11-26 · unverdicted · novelty 6.0

Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.

Short window attention enables long-term memorization

cs.LG · 2025-09-29 · unverdicted · novelty 6.0

Short sliding windows in hybrid attention-xLSTM models boost long-context performance by encouraging long-term memory use, and stochastic window sizing improves both short and long tasks.

SpikingBrain: Spiking Brain-inspired Large Models

cs.LG · 2025-09-05 · unverdicted · novelty 6.0

SpikingBrain-7B and SpikingBrain-76B achieve Transformer-comparable performance after continual pre-training on 150B tokens, with over 100x TTFT speedup on 4M-token sequences and 69.15% sparsity from event-driven spiking.

citing papers explorer

When Does Content-Based Routing Work? Representation Requirements for Selective Attention in Hybrid Sequence Models
cs.LG · 2026-03-22 · conditional · none · ref 19 · internal anchor
Content-based routing succeeds only when models provide bidirectional context and perform pairwise comparisons, with bidirectional Mamba plus rank-1 projection reaching 99.7% precision at linear inference cost.

Learning to (Learn at Test Time): RNNs with Expressive Hidden States
cs.LG · 2024-07-05 · conditional · none · ref 18 · internal anchor
TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.

How Long Does Infinite Width Last? Signal Propagation in Long-Range Linear Recurrences
cs.LG · 2026-05-06 · unverdicted · none · ref 8 · internal anchor
In linear recurrent models, infinite-width signal propagation remains accurate only for depths t much smaller than sqrt(width n), with a critical regime at t ~ c sqrt(n) where finite-width effects emerge and dominate for larger t.

Scalable Memristive-Friendly Reservoir Computing for Time Series Classification
cs.NE · 2026-04-21 · unverdicted · none · ref 13 · internal anchor
MARS parallel reservoirs achieve up to 21x training speedups and outperform LRU, S5, and Mamba on long sequence benchmarks while remaining gradient-free and compact.

On the Expressive Power and Limitations of Multi-Layer SSMs
cs.LG · 2026-04-16 · unverdicted · none · ref 1 · internal anchor
Multi-layer SSMs cannot perform certain compositional tasks, offline CoT adds little power, but online CoT equates them to streaming algorithms and equates width with precision.

Selective Rotary Position Embedding
cs.CL · 2025-11-21 · unverdicted · none · ref 14 · internal anchor
Selective RoPE adds input-dependent rotations to generalize RoPE, showing implicit positional structure in softmax attention and improving performance on language modeling, copying, state tracking, and retrieval when added to gated transformers.

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
cs.LG · 2024-05-31 · unverdicted · none · ref 29 · internal anchor
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

Towards Understanding Self-Pretraining for Sequence Classification
cs.LG · 2026-05-20 · unverdicted · none · ref 101 · internal anchor
Self-pretraining improves Transformer sequence classification by enabling learning of proximity-biased attention from positional encodings that label supervision alone cannot easily acquire from random starts.

Flash PD-SSM: Memory-Optimized Structured Sparse State-Space Models
cs.LG · 2026-05-18 · unverdicted · none · ref 12 · internal anchor
Flash PD-SSM achieves FSA-level expressivity by discretely selecting one matrix from a trainable set of structured sparse transition matrices at each time step while preserving the runtime and memory efficiency of standard structured SSMs.

The Routing and Filtering Structure of Attention
cs.LG · 2026-05-12 · unverdicted · none · ref 4 · internal anchor
Attention decomposes into low-rank routing and symmetric filtering; disentangled S-D attention reveals a spectral cascade allowing early-layer linearization at under 5% perplexity cost.

A Single-Layer Model Can Do Language Modeling
cs.CL · 2026-05-11 · unverdicted · none · ref 3 · internal anchor
A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).

Priming: Hybrid State Space Models From Pre-trained Transformers
cs.LG · 2026-05-08 · unverdicted · none · ref 20 · internal anchor
Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasoning at 32B scale.

A Robust Foundation Model for Conservation Laws: Injecting Context into Flux Neural Operators via Recurrent Vision Transformers
cs.LG · 2026-05-06 · unverdicted · none · ref 4 · internal anchor
A recurrent Vision Transformer hypernetwork injects context into Flux Neural Operators to infer and solve unseen conservation laws while preserving robustness and long-time stability.

The Impossibility Triangle of Long-Context Modeling
cs.CL · 2026-05-06 · unverdicted · none · ref 9 · internal anchor
No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
cs.LG · 2026-04-24 · unverdicted · none · ref 5 · internal anchor
HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

Optimal Decay Spectra for Linear Recurrences
cs.LG · 2026-04-08 · unverdicted · none · ref 4 · internal anchor
PoST reparameterizes decay spectra in linear recurrences with geometric log-spacing and position-adaptive scaling to achieve O(exp(-cN/log t)) decay, improving zero-shot language modeling and long-context retrieval across Mamba-2, RWKV-7 and similar models.

Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
cs.CL · 2026-04-06 · unverdicted · none · ref 101 · internal anchor
PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.

Mambalaya: Einsum-Based Fusion Optimizations on State-Space Models
cs.AR · 2026-04-04 · unverdicted · none · ref 18 · internal anchor
Mambalaya delivers 4.9x prefill and 1.9x generation speedups on Mamba layers over prior accelerators by systematically fusing inter-Einsum operations.

LPC-SM: Local Predictive Coding and Sparse Memory for Long-Context Language Modeling
cs.CL · 2026-03-12 · unverdicted · none · ref 2 · internal anchor
LPC-SM is a hybrid architecture separating local attention, persistent memory, predictive correction, and control with ONT for memory writes, showing loss reductions on 158M-parameter models up to 4096-token contexts.

RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
cs.LG · 2026-02-20 · unverdicted · none · ref 8 · 2 links · internal anchor
RAT+ pretrains a dense recurrent-augmented attention model once and enables flexible switching to dilated or hybrid sparse attention at inference after short adaptation, with small accuracy loss at high dilation factors.

When to Think Fast and Slow? AMOR: Adaptive Entropy Gate for Hybrid Models
cs.AI · 2026-01-22 · unverdicted · none · ref 6 · internal anchor
AMOR uses output entropy to gate attention in recurrent hybrids, matching full attention performance at roughly 22% attention invocations across 180M-1.5B models.

Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression
cs.LG · 2025-11-26 · unverdicted · none · ref 11 · internal anchor
Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.

Short window attention enables long-term memorization
cs.LG · 2025-09-29 · unverdicted · none · ref 9 · internal anchor
Short sliding windows in hybrid attention-xLSTM models boost long-context performance by encouraging long-term memory use, and stochastic window sizing improves both short and long tasks.

SpikingBrain: Spiking Brain-inspired Large Models
cs.LG · 2025-09-05 · unverdicted · none · ref 7 · internal anchor
SpikingBrain-7B and SpikingBrain-76B achieve Transformer-comparable performance after continual pre-training on 150B tokens, with over 100x TTFT speedup on 4M-token sequences and 69.15% sparsity from event-driven spiking.

Lizard: An Efficient Linearization Framework for Large Language Models
cs.CL · 2025-07-11 · unverdicted · none · ref 6 · internal anchor
Lizard linearizes Transformer LLMs via subquadratic attention and adaptive learnable modules, recovering near-original performance while outperforming prior linearization methods on MMLU and associative recall.

MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent
cs.CL · 2025-07-03 · unverdicted · none · ref 39 · internal anchor
MemAgent uses multi-conversation RL to train a memory agent that reads text in segments and overwrites memory, extrapolating from 8K training to 3.5M token QA with under 5% loss and 95%+ on 512K RULER.

Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
cs.CL · 2025-02-16 · unverdicted · none · ref 58 · internal anchor
NSA is a hardware-aligned sparse attention mechanism that enables end-to-end trainable long-context modeling by combining coarse token compression with fine-grained selection.

Titans: Learning to Memorize at Test Time
cs.LG · 2024-12-31 · unverdicted · none · ref 29 · internal anchor
Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.

LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation
cs.CL · 2024-10-17 · unverdicted · none · ref 10 · internal anchor
LightTransfer identifies lazy layers in LLMs like LLaMA and replaces their attention with streaming attention to form hybrid models, delivering up to 2.17x throughput with under 1.5% drop on LongBench and strong results on reasoning benchmarks.

An Empirical Study of Mamba-based Language Models
cs.LG · 2024-06-12 · accept · none · ref 13 · internal anchor
An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.

HexagonalWarriorMamba: Superior Threshold-Dependent Multi-label Classification of 12-Lead ECG Cardiac Abnormalities
cs.CV · 2026-05-18 · unverdicted · none · ref 30 · internal anchor
HexagonalWarriorMamba applies a hierarchical Mamba architecture with 2D selective scanning to 12-lead ECGs treated as single-channel images, outperforming prior methods on threshold-dependent metrics for the 26-label PhysioNet 2021 multi-label task.

MemReread: Enhancing Agentic Long-Context Reasoning via Memory-Guided Rereading
cs.CL · 2026-05-11 · unverdicted · none · ref 23 · internal anchor
MemReread improves agent long-context reasoning by triggering rereading on insufficient final memory to recover discarded indirect facts, outperforming baselines at linear complexity.

Kaczmarz Linear Attention
cs.LG · 2026-05-09 · unverdicted · none · ref 10 · internal anchor
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.

MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
cs.LG · 2026-05-07 · unverdicted · none · ref 4 · internal anchor
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

SpikingBrain2.0: Brain-Inspired Foundation Models for Efficient Long-Context and Cross-Platform Inference
cs.LG · 2026-04-24 · unverdicted · none · ref 11 · internal anchor
SpikingBrain2.0 is a 5B hybrid spiking-Transformer that recovers most base model performance while delivering 10x TTFT speedup at 4M context and supporting over 10M tokens on limited GPUs via dual sparse attention and dual quantization paths.

TAPNext++: What's Next for Tracking Any Point (TAP)?
cs.CV · 2026-04-12 · unverdicted · none · ref 6 · internal anchor
TAPNext++ trains recurrent transformers on 1024-frame sequences with geometric augmentations and occluded-point supervision to achieve new state-of-the-art point tracking on long videos while adding a re-detection metric.

Hybrid Architectures for Language Models: Systematic Analysis and Design Insights
cs.CL · 2025-10-06 · unverdicted · none · ref 11 · internal anchor
This work systematically compares inter-layer and intra-layer hybridization strategies for combining self-attention and Mamba-style state space models, evaluating them on language modeling, downstream tasks, long-context performance, scaling, and efficiency to derive optimal design recipes.