pith. sign in

hub Canonical reference

Zamba: A compact 7B SSM hybrid model

Canonical reference. 80% of citing Pith papers cite this work as background.

20 Pith papers citing it
Background 80% of classified citations

hub tools

citation-role summary

background 4 other 1

citation-polarity summary

polarities

background 4 unclear 1

representative citing papers

Mixture of Layers with Hybrid Attention

cs.LG · 2026-05-10 · unverdicted · novelty 7.0

Mixture of Layers replaces monolithic transformer blocks with routed thin parallel blocks using hybrid attention that combines a shared softmax block for global context with Gated DeltaNet linear attention in the routed blocks.

The Routing and Filtering Structure of Attention

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

Attention decomposes into low-rank routing and symmetric filtering; disentangled S-D attention reveals a spectral cascade allowing early-layer linearization at under 5% perplexity cost.

Priming: Hybrid State Space Models From Pre-trained Transformers

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

Priming transfers knowledge from pre-trained Transformers to hybrid SSM-attention models, recovering performance with minimal additional tokens and showing Gated KalmaNet outperforming Mamba-2 on long-context reasoning at 32B scale.

ZAYA1-8B Technical Report

cs.AI · 2026-05-06 · unverdicted · novelty 6.0

ZAYA1-8B is a reasoning MoE model with 700M active parameters that matches larger models on math and coding benchmarks and reaches 91.9% on AIME'25 via Markovian RSA test-time compute.

Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling

cs.CL · 2026-04-27 · unverdicted · novelty 6.0

HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.

SpikingBrain: Spiking Brain-inspired Large Models

cs.LG · 2025-09-05 · unverdicted · novelty 6.0

SpikingBrain-7B and SpikingBrain-76B achieve Transformer-comparable performance after continual pre-training on 150B tokens, with over 100x TTFT speedup on 4M-token sequences and 69.15% sparsity from event-driven spiking.

MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention

cs.CL · 2025-06-16 · unverdicted · novelty 6.0

MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training in three weeks on 512 GPUs.

An Empirical Study of Mamba-based Language Models

cs.LG · 2024-06-12 · accept · novelty 6.0

An 8B Mamba-2-Hybrid with 43% Mamba-2, 7% attention, and 50% MLP layers exceeds an 8B Transformer by 2.65 points on average across 12 tasks and matches it on 23 long-context tasks while enabling up to 8x faster inference.

citing papers explorer

Showing 20 of 20 citing papers.