TransXSSM: A hybrid transformer state space model with unified rotary position embedding
4 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: background (1)
Citation polarities: background (1)
Citing papers
- Flash PD-SSM: Memory-Optimized Structured Sparse State-Space Models
  Flash PD-SSM achieves FSA-level expressivity by discretely selecting one matrix from a trainable set of structured sparse transition matrices at each time step while preserving the runtime and memory efficiency of standard structured SSMs.
- The Transformer as a Polar State Estimator
  The standard Transformer block arises as a first-order approximation to a polar state estimator on the hypersphere, with a Polar Transformer retaining higher-order terms.
- Long-Context Aware Upcycling: A New Frontier for Hybrid LLM Scaling
  HyLo upcycles Transformer LLMs into hybrids with MLA and Mamba2/Gated DeltaNet blocks via staged training and distillation, extending context to 2M tokens and outperforming prior upcycled hybrids on long-context benchmarks.