TTT layers treat the hidden state as a trainable model updated at test time, allowing linear-complexity sequence models to scale perplexity reduction with context length unlike Mamba.
hub Mixed citations
xlstm: Extended long short-term memory
Mixed citation behavior. Most common role is background (50%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
RULER shows most long-context LMs drop sharply in performance on complex tasks as length and difficulty increase, with only half maintaining results at 32K tokens.
PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
MARS parallel reservoirs achieve up to 21x training speedups and outperform LRU, S5, and Mamba on long sequence benchmarks while remaining gradient-free and compact.
FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
A 130M-parameter 1-layer GPN achieves FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34).
PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
ContractShield achieves 89% Hamming score and 91% F1-score for five vulnerability types in obfuscated smart contracts via hierarchical cross-modal fusion of semantic, temporal, and structural features with only 1-3% performance drop.
CoGate-LSTM adds prototype-guided cosine feature-space gating to a character-level BiLSTM with multi-source embeddings and focal loss, reaching 0.881 macro-F1 on Jigsaw toxic comments while using 7.3M parameters and outperforming fine-tuned BERT by 6.9 points on minority labels.
Short sliding windows in hybrid attention-xLSTM models boost long-context performance by encouraging long-term memory use, and stochastic window sizing improves both short and long tasks.
Titans combine attention for current context with a learnable neural memory for long-term history, achieving better performance and scaling to over 2M-token contexts on language, reasoning, genomics, and time-series tasks.
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
PR-LSTM replaces linear recurrence with recursive gated merging over a balanced binary tree to achieve log-depth parallelism without restricting transitions to linear or associative forms.
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.
DCL-xLSTM neural network detects lensed GW events with AUC over 0.99 using training on PM and SIS lens models in the millihertz band.
Gated DeltaNet integrates gating and delta rules into linear transformers, outperforming Mamba2 and DeltaNet on language modeling, reasoning, retrieval, and long-context tasks.
AGA3DNet improves 3D brain MRI subtype classification by feeding anatomy-guided Gaussian priors derived from radiology reports into a 3D CNN and multi-view xLSTM architecture.
Engineering report detailing HPC infrastructure, software choices, and performance measurements for training a 7B LLM using 3D parallelism on JUWELS Booster.
Position paper claiming that distributed training across massive edge devices can overcome data depletion and centralized compute monopolies in LLM scaling.
citing papers explorer
-
Parallel Scan Recurrent Neural Quantum States for Scalable Variational Monte Carlo
PSR-NQS makes recurrent neural quantum states scalable for variational Monte Carlo by using parallel scan recurrence, reaching accurate results on 52x52 two-dimensional lattices.
-
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Transformers and SSMs are unified through structured state space duality, producing a 2-8X faster Mamba-2 model that remains competitive with Transformers.
-
Phase-Associative Memory: Sequence Modeling in Complex Hilbert Space
PAM, a complex-valued associative memory model, exhibits steeper power-law scaling in loss and perplexity than a matched real-valued baseline when trained on WikiText-103 from 5M to 100M parameters.
-
Gated Linear Attention Transformers with Hardware-Efficient Training
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
-
Kaczmarz Linear Attention
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.