hub Canonical reference

Zamba: A compact 7B SSM hybrid model

· 2024 · arXiv 2405.16712

Canonical reference. 80% of citing Pith papers cite this work as background.

20 Pith papers citing it

Background 80% of classified citations

read on arXiv browse 20 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 4 other 1

citation-polarity summary

background 4 unclear 1

representative citing papers

Mixture of Layers with Hybrid Attention

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

Mixture of Layers replaces monolithic transformer blocks with routed thin parallel blocks using hybrid attention that combines a shared softmax block for global context with Gated DeltaNet linear attention in the routed blocks.

Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.

Hidden State Poisoning Attacks against Mamba-based Language Models

cs.CL · 2026-01-05 · unverdicted · novelty 7.0

Short input phrases can irreversibly overwrite hidden states in Mamba models, impairing information retrieval on a new benchmark while leaving pure Transformer models unaffected.

Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

cs.LG · 2024-05-31 · unverdicted · novelty 7.0

Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.

The Routing and Filtering Structure of Attention

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Attention decomposes into low-rank routing and symmetric filtering; disentangled S-D attention reveals a spectral cascade allowing early-layer linearization at under 5% perplexity cost.

Priming: Hybrid State Space Models From Pre-trained Transformers

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasoning at 32B scale.

ZAYA1-8B Technical Report

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

cs.CL · 2026-04-27 · unverdicted · novelty 6.0

HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.

HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

cs.LG · 2026-04-24 · unverdicted · novelty 6.0

HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer

cs.CV · 2026-04-07 · unverdicted · novelty 6.0

PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.

When to Think Fast and Slow? AMOR: Adaptive Entropy Gate for Hybrid Models

cs.AI · 2026-01-22 · unverdicted · novelty 6.0

AMOR uses output entropy to gate attention in recurrent hybrids, matching full attention performance at roughly 22% attention invocations across 180M-1.5B models.

Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

cs.LG · 2025-11-26 · unverdicted · novelty 6.0

Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.

SpikingBrain: Spiking Brain-inspired Large Models

cs.LG · 2025-09-05 · unverdicted · novelty 6.0

SpikingBrain-7B and SpikingBrain-76B achieve Transformer-comparable performance after continual pre-training on 150B tokens, with over 100x TTFT speedup on 4M-token sequences and 69.15% sparsity from event-driven spiking.

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

cs.CL · 2025-06-16 · unverdicted · novelty 6.0

MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.

DataComp-LM: In search of the next generation of training sets for language models

cs.LG · 2024-06-17 · unverdicted · novelty 6.0

DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.

An Empirical Study of Mamba-based Language Models

cs.LG · 2024-06-12 · accept · novelty 6.0

An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.

A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models

cs.LG · 2026-04-15 · unverdicted · novelty 5.0

KL divergence provides a superior forward-only metric for identifying quantization-sensitive parts in SSM-Transformer hybrids, outperforming MSE and SQNR and supporting practical mixed-precision deployment on edge devices.

An Efficient Self-Supervised Framework for Long-Sequence EEG Modeling

cs.LG · 2025-02-25 · unverdicted · novelty 5.0

EEGM2 is a Mamba-2 integrated self-supervised model for EEG that claims linear complexity and state-of-the-art performance on long-sequence modeling and classification tasks.

The Hyperscale Lottery: How State-Space Models Have Sacrificed Edge Efficiency

cs.AR · 2026-04-09 · unverdicted · novelty 4.0

Mamba-3 architectural changes optimized for hyperscale GPUs cause 28% higher edge latency at 880M parameters and 48% at 15M parameters compared to earlier versions.

Advancing Intelligent Sequence Modeling: Evolution, Trade-offs, and Applications of State- Space Architectures from S4 to Mamba

cs.LG · 2025-03-22 · unverdicted · novelty 0.0

A survey tracing the evolution of state-space models like S4 and Mamba, their efficiency trade-offs, and applications in NLP, vision, and other domains.

citing papers explorer

Showing 20 of 20 citing papers.

Mixture of Layers with Hybrid Attention cs.LG · 2026-05-10 · unverdicted · none · ref 4
Mixture of Layers replaces monolithic transformer blocks with routed thin parallel blocks using hybrid attention that combines a shared softmax block for global context with Gated DeltaNet linear attention in the routed blocks.
Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control cs.LG · 2026-05-08 · unverdicted · none · ref 13
Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.
Hidden State Poisoning Attacks against Mamba-based Language Models cs.CL · 2026-01-05 · unverdicted · none · ref 6
Short input phrases can irreversibly overwrite hidden states in Mamba models, impairing information retrieval on a new benchmark while leaving pure Transformer models unaffected.
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality cs.LG · 2024-05-31 · unverdicted · none · ref 37
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
The Routing and Filtering Structure of Attention cs.LG · 2026-05-12 · unverdicted · none · ref 5
Attention decomposes into low-rank routing and symmetric filtering; disentangled S-D attention reveals a spectral cascade allowing early-layer linearization at under 5% perplexity cost.
Priming: Hybrid State Space Models From Pre-trained Transformers cs.LG · 2026-05-08 · unverdicted · none · ref 26
Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasoning at 32B scale.
ZAYA1-8B Technical Report cs.AI · 2026-05-06 · unverdicted · none · ref 114
ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling cs.CL · 2026-04-27 · unverdicted · none · ref 15
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models cs.LG · 2026-04-24 · unverdicted · none · ref 6
HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.
PoM: A Linear-Time Replacement for Attention with the Polynomial Mixer cs.CV · 2026-04-07 · unverdicted · none · ref 25
PoM is a new linear-complexity token mixer using learned polynomials that matches attention performance in transformers while enabling efficient long-sequence processing.
When to Think Fast and Slow? AMOR: Adaptive Entropy Gate for Hybrid Models cs.AI · 2026-01-22 · unverdicted · none · ref 8
AMOR uses output entropy to gate attention in recurrent hybrids, matching full attention performance at roughly 22% attention invocations across 180M-1.5B models.
Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression cs.LG · 2025-11-26 · unverdicted · none · ref 17
Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.
SpikingBrain: Spiking Brain-inspired Large Models cs.LG · 2025-09-05 · unverdicted · none · ref 10
SpikingBrain-7B and SpikingBrain-76B achieve Transformer-comparable performance after continual pre-training on 150B tokens, with over 100x TTFT speedup on 4M-token sequences and 69.15% sparsity from event-driven spiking.
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention cs.CL · 2025-06-16 · unverdicted · none · ref 10
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.
DataComp-LM: In search of the next generation of training sets for language models cs.LG · 2024-06-17 · unverdicted · none · ref 75
DCLM-Baseline dataset lets a 7B model reach 64% 5-shot MMLU accuracy after 2.6T tokens, beating prior open-data models by 6.6 points on MMLU with 40% less compute.
An Empirical Study of Mamba-based Language Models cs.LG · 2024-06-12 · accept · none · ref 15
An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.
A KL Lens on Quantization: Fast, Forward-Only Sensitivity for Mixed-Precision SSM-Transformer Models cs.LG · 2026-04-15 · unverdicted · none · ref 7
KL divergence provides a superior forward-only metric for identifying quantization-sensitive parts in SSM-Transformer hybrids, outperforming MSE and SQNR and supporting practical mixed-precision deployment on edge devices.
An Efficient Self-Supervised Framework for Long-Sequence EEG Modeling cs.LG · 2025-02-25 · unverdicted · none · ref 20
EEGM2 is a Mamba-2 integrated self-supervised model for EEG that claims linear complexity and state-of-the-art performance on long-sequence modeling and classification tasks.
The Hyperscale Lottery: How State-Space Models Have Sacrificed Edge Efficiency cs.AR · 2026-04-09 · unverdicted · none · ref 6
Mamba-3 architectural changes optimized for hyperscale GPUs cause 28% higher edge latency at 880M parameters and 48% at 15M parameters compared to earlier versions.
Advancing Intelligent Sequence Modeling: Evolution, Trade-offs, and Applications of State- Space Architectures from S4 to Mamba cs.LG · 2025-03-22 · unverdicted · none · ref 67
A survey tracing the evolution of state-space models like S4 and Mamba, their efficiency trade-offs, and applications in NLP, vision, and other domains.

Zamba: A compact 7B SSM hybrid model

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer