Flash PD-SSM achieves FSA-level expressivity by discretely selecting one matrix from a trainable set of structured sparse transition matrices at each time step while preserving the runtime and memory efficiency of standard structured SSMs.
TransXSSM: A Hybrid
4 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 1polarities
background 1representative citing papers
Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
citing papers explorer
-
Flash PD-SSM: Memory-Optimized Structured Sparse State-Space Models
Flash PD-SSM achieves FSA-level expressivity by discretely selecting one matrix from a trainable set of structured sparse transition matrices at each time step while preserving the runtime and memory efficiency of standard structured SSMs.
-
RT-Transformer: The Transformer Block as a Spherical State Estimator
Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.
-
Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.
- Robust Filter Attention: Self-Attention as Precision-Weighted State Estimation