Reformer: The Efficient Transformer

2020 · cs.LG · arXiv 2001.04451

Canonical reference. 60 Pith papers cite this work; 75% of the classified citations use it as background.

abstract

Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from $O(L^2)$ to $O(L \log L)$, where $L$ is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
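The two techniques named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names (`rev_forward`, `rev_inverse`, `make_rotation`, `lsh_bucket`) are hypothetical, `f` and `g` are toy stand-ins for the attention and feed-forward sublayers, and the hashing shown is one common angular LSH scheme (a random projection followed by an argmax over the projections and their negations), assumed here for illustration.

```python
import math
import random


def rev_forward(x1, x2, f, g):
    """Reversible residual block: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def rev_inverse(y1, y2, f, g):
    """Recompute the block inputs from its outputs, so the inputs
    do not have to be kept in memory for the backward pass."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


def make_rotation(dim, n_buckets, rng):
    """Random Gaussian projection with n_buckets // 2 directions (assumed scheme)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_buckets // 2)]


def lsh_bucket(x, rotation):
    """Angular LSH: project x, then take the argmax over [proj, -proj]."""
    proj = [sum(xi * ri for xi, ri in zip(x, row)) for row in rotation]
    scores = proj + [-p for p in proj]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy sublayers; in a Transformer these would be attention and feed-forward.
    f = lambda v: [math.tanh(t) for t in v]
    g = lambda v: [0.5 * t for t in v]

    x1, x2 = [0.1, -0.2, 0.3], [1.0, 2.0, -3.0]
    y1, y2 = rev_forward(x1, x2, f, g)
    print("recovered inputs:", rev_inverse(y1, y2, f, g))

    rng = random.Random(0)
    rotation = make_rotation(dim=4, n_buckets=8, rng=rng)
    vectors = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(6)]
    # Attention would then be restricted to positions that share a bucket.
    print("buckets:", [lsh_bucket(v, rotation) for v in vectors])
```

Because `rev_inverse` rebuilds each block's inputs from its outputs, only the final layer's activations need to be kept during training. The bucket assignment depends only on the direction of a vector, so vectors pointing in similar directions tend to land in the same bucket.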

citation-role summary

background 9 · baseline 1 · dataset 1 · method 1

citation-polarity summary

background 9 · baseline 1 · unclear 1 · use dataset 1

representative citing papers

LongSpike: Fractional Order Spiking State Space Models for Efficient Long Sequence Learning

cs.LG · 2026-06-11 · unverdicted · novelty 7.0

LongSpike integrates fractional-order state-space modeling into spiking neural networks, enabling better long-sequence performance than prior SNNs on LRA, WikiText-103, and Speech Commands benchmarks while retaining sparse computation.

TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals

cs.CR · 2026-05-08 · unverdicted · novelty 7.0

TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.

Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.

Cross-Stage Attention Propagation for Efficient Semantic Segmentation

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

CSAP computes attention at the deepest scale and propagates the maps to shallower stages, bypassing per-scale query-key computations to cut decoder FLOPs while preserving multi-scale performance and beating SegNeXt-Tiny on ADE20K, Cityscapes, and COCO-Stuff.

Fast Cross-Operator Optimization of Attention Dataflow

cs.AR · 2026-04-03 · unverdicted · novelty 7.0

MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.

Exact Flow Linear Attention: Exact Solution from Continuous-Time Dynamics

cs.LG · 2025-12-14 · unverdicted · novelty 7.0

Exact Flow Linear Attention derives a closed-form exact update for delta-rule linear attention from continuous-time dynamics, removing Euler discretization error while preserving linear complexity and structure.

SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

cs.AI · 2025-11-05 · unverdicted · novelty 7.0

SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

cs.CL · 2024-10-14 · unverdicted · novelty 7.0

LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.

RWKV: Reinventing RNNs for the Transformer Era

cs.CL · 2023-05-22 · unverdicted · novelty 7.0

RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

cs.LG · 2021-01-11 · accept · novelty 7.0

Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

Building Social World Models with Large Language Models

cs.SI · 2026-06-09 · unverdicted · novelty 6.0

SWM framework uses LLMs to model social belief dynamics from events via temporal pattern mining and ELBO optimization, outperforming time-series models on a new 12k-point benchmark from Kalshi and Polymarket prediction markets.

Chiaroscuro Attention: Spending Compute in the Dark

cs.CL · 2026-06-06 · unverdicted · novelty 6.0

CHIAR-Former routes tokens via spectral entropy to DCT mixing or attention, yielding 35-40% FLOP savings at 400M parameters with modest perplexity increase on WikiText-103.

Do Transformers Need Three Projections? Systematic Study of QKV Variants

cs.LG · 2026-06-01 · conditional · novelty 6.0

Q-K=V projection sharing in transformers matches standard QKV performance with 50% KV cache reduction and combines with GQA/MQA for up to 96.9% reduction across vision and language tasks.

Policy-based Foveated Imaging and Perception

cs.CV · 2026-06-01 · unverdicted · novelty 6.0

A task-aware policy learned via reinforcement learning allocates high-resolution pixels on dual-stream sensors in real time, outperforming fixed or non-predictive baselines under tight pixel budgets in both simulation and 200 MP hardware tests.

Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

cs.CV · 2026-05-26 · unverdicted · novelty 6.0

Tensor Memory augments Transformers with a constant-size 3D voxel grid using differentiable soft writes at predicted locations, local interaction, and gated recurrent dynamics to decouple memory capacity from sequence length.

Beyond Holistic Models: Systematic Component-level Benchmarking of Deep Multivariate Time-Series Forecasting

cs.LG · 2026-05-26 · unverdicted · novelty 6.0

TSCOMP is the first large-scale benchmark that deconstructs deep multivariate time series forecasters into fine-grained components, builds a corpus of over 20,000 evaluations, and shows that corpus-driven component selection outperforms state-of-the-art holistic models.

How Do Electrocardiogram Models Scale?

cs.LG · 2026-05-17 · conditional · novelty 6.0

Empirical scaling study of ECG models finds SSL scales robustly while ResNets show 1.3-2.5x better parameter efficiency and SSL up to 16x better data efficiency than supervised baselines on out-of-distribution tasks.

Elastic Attention Cores for Scalable Vision Transformers

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.

Spectral Transformer Neural Processes

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

STNPs extend TNPs with a spectral aggregator that estimates context spectra, forms spectral mixtures, and injects task-adaptive frequency features to better handle periodicity.

The Impossibility Triangle of Long-Context Modeling

cs.CL · 2026-05-06 · unverdicted · novelty 6.0

No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

cs.LG · 2026-04-24 · unverdicted · novelty 6.0

HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.

GAMMA-Net: Adaptive Long-Horizon Traffic Spatio-Temporal Forecasting Model based on Interleaved Graph Attention and Multi-Axis Mamba

cs.AI · 2026-04-18 · unverdicted · novelty 6.0

GAMMA-Net combines Graph Attention Networks and multi-axis Mamba to outperform prior models in long-horizon traffic forecasting, with up to 16.25% lower MAE on benchmarks like METR-LA and PEMS datasets.

SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

cs.LG · 2026-02-08 · unverdicted · novelty 6.0

SiameseNorm is a two-stream architecture that reconciles Pre-Norm and Post-Norm in Transformers by coupling streams via shared residual blocks, yielding performance gains with maintained stability on language, vision, and diffusion models.
