LongSpike integrates fractional-order state-space modeling into spiking neural networks, enabling better long-sequence performance than prior SNNs on LRA, WikiText-103, and Speech Commands benchmarks while retaining sparse computation.
hub Canonical reference
Reformer: The Efficient Transformer
Canonical reference. 75% of citing Pith papers cite this work as background.
abstract
Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O($L^2$) to O($L\log L$), where $L$ is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.
Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.
CSAP computes attention at the deepest scale and propagates the maps to shallower stages, bypassing per-scale query-key computations to cut decoder FLOPs while preserving multi-scale performance and beating SegNeXt-Tiny on ADE20K, Cityscapes, and COCO-Stuff.
MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.
Exact Flow Linear Attention derives a closed-form exact update for delta-rule linear attention from continuous-time dynamics, removing Euler discretization error while preserving linear complexity and structure.
SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.
LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.
RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
SWM framework uses LLMs to model social belief dynamics from events via temporal pattern mining and ELBO optimization, outperforming time-series models on a new 12k-point benchmark from Kalshi and Polymarket prediction markets.
CHIAR-Former routes tokens via spectral entropy to DCT mixing or attention, yielding 35-40% FLOP savings at 400M parameters with modest perplexity increase on WikiText-103.
Q-K=V projection sharing in transformers matches standard QKV performance with 50% KV cache reduction and combines with GQA/MQA for up to 96.9% reduction across vision and language tasks.
A task-aware policy learned via reinforcement learning allocates high-resolution pixels on dual-stream sensors in real time, outperforming fixed or non-predictive baselines under tight pixel budgets in both simulation and 200 MP hardware tests.
Tensor Memory augments Transformers with a constant-size 3D voxel grid using differentiable soft writes at predicted locations, local interaction, and gated recurrent dynamics to decouple memory capacity from sequence length.
TSCOMP is the first large-scale benchmark that deconstructs deep multivariate time series forecasters into fine-grained components, builds a corpus of over 20,000 evaluations, and shows that corpus-driven component selection outperforms state-of-the-art holistic models.
Empirical scaling study of ECG models finds SSL scales robustly while ResNets show 1.3-2.5x better parameter efficiency and SSL up to 16x better data efficiency than supervised baselines on out-of-distribution tasks.
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.
STNPs extend TNPs with a spectral aggregator that estimates context spectra, forms spectral mixtures, and injects task-adaptive frequency features to better handle periodicity.
No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.
DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.
GAMMA-Net combines Graph Attention Networks and multi-axis Mamba to outperform prior models in long-horizon traffic forecasting, with up to 16.25% lower MAE on benchmarks like METR-LA and PEMS datasets.
SiameseNorm is a two-stream architecture that reconciles Pre-Norm and Post-Norm in Transformers by coupling streams via shared residual blocks, yielding performance gains with maintained stability on language, vision, and diffusion models.
citing papers explorer
-
Fast Cross-Operator Optimization of Attention Dataflow
MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.