Reformer: The Efficient Transformer

2020 · cs.LG · arXiv 2001.04451

Canonical reference. 75% of citing Pith papers cite this work as background.

49 Pith papers citing it

abstract

Large Transformer models routinely achieve state-of-the-art results on a number of tasks but training these models can be prohibitively costly, especially on long sequences. We introduce two techniques to improve the efficiency of Transformers. For one, we replace dot-product attention by one that uses locality-sensitive hashing, changing its complexity from O($L^2$) to O($L\log L$), where $L$ is the length of the sequence. Furthermore, we use reversible residual layers instead of the standard residuals, which allows storing activations only once in the training process instead of $N$ times, where $N$ is the number of layers. The resulting model, the Reformer, performs on par with Transformer models while being much more memory-efficient and much faster on long sequences.
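The two techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it uses plain Python lists, and the helper names (`reversible_forward`, `reversible_inverse`, `make_projections`, `lsh_bucket`) and the toy sub-layers `f` and `g` are hypothetical. The first pair of functions shows why a reversible residual block lets inputs be recomputed from outputs rather than stored per layer; the last two show angular locality-sensitive hashing, where a vector's bucket is the argmax over its random projections and their negations, so attention can be restricted to vectors sharing a bucket.

```python
import math
import random


def reversible_forward(x1, x2, f, g):
    """Reversible residual block: y1 = x1 + f(x2), y2 = x2 + g(y1)."""
    y1 = [a + b for a, b in zip(x1, f(x2))]
    y2 = [a + b for a, b in zip(x2, g(y1))]
    return y1, y2


def reversible_inverse(y1, y2, f, g):
    """Recover the block inputs from its outputs, so they need not be stored."""
    x2 = [a - b for a, b in zip(y2, g(y1))]
    x1 = [a - b for a, b in zip(y1, f(x2))]
    return x1, x2


def make_projections(dim, n_buckets, seed=0):
    """Random Gaussian projection vectors; n_buckets is assumed to be even."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(n_buckets // 2)]


def lsh_bucket(vec, projections):
    """Angular LSH: bucket index = argmax over [xR ; -xR]."""
    rotated = [sum(v * r for v, r in zip(vec, row)) for row in projections]
    scores = rotated + [-s for s in rotated]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == "__main__":
    # Toy sub-layers standing in for attention and feed-forward (assumptions).
    f = lambda v: [math.tanh(t) for t in v]
    g = lambda v: [0.5 * t for t in v]

    y1, y2 = reversible_forward([1.0, -2.0], [0.5, 3.0], f, g)
    print(reversible_inverse(y1, y2, f, g))  # approximately the original inputs

    proj = make_projections(dim=4, n_buckets=8)
    print(lsh_bucket([0.3, -1.2, 0.7, 2.0], proj))
```

Because the bucket depends only on a vector's direction, rescaling a vector by a positive factor leaves its bucket unchanged. In a full model the queries would be sorted by bucket and attention computed within chunks of the sorted sequence, which is where the O($L\log L$) cost in the abstract comes from.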

citation-role summary

background: 9 · baseline: 1 · dataset: 1 · method: 1

citation-polarity summary

background: 9 · baseline: 1 · unclear: 1 · use dataset: 1

representative citing papers

TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals

cs.CR · 2026-05-08 · unverdicted · novelty 7.0

TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.

Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

cs.LG · 2026-04-22 · unverdicted · novelty 7.0

Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.

Cross-Stage Attention Propagation for Efficient Semantic Segmentation

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

CSAP computes attention at the deepest scale and propagates the maps to shallower stages, bypassing per-scale query-key computations to cut decoder FLOPs while preserving multi-scale performance and beating SegNeXt-Tiny on ADE20K, Cityscapes, and COCO-Stuff.

Fast Cross-Operator Optimization of Attention Dataflow

cs.AR · 2026-04-03 · unverdicted · novelty 7.0

MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.

Exact Flow Linear Attention: Exact Solution from Continuous-Time Dynamics

cs.LG · 2025-12-14 · unverdicted · novelty 7.0

Exact Flow Linear Attention derives a closed-form exact update for delta-rule linear attention from continuous-time dynamics, removing Euler discretization error while preserving linear complexity and structure.

SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators

cs.AI · 2025-11-05 · unverdicted · novelty 7.0

SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

cs.CL · 2024-10-14 · unverdicted · novelty 7.0

LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.

RWKV: Reinventing RNNs for the Transformer Era

cs.CL · 2023-05-22 · unverdicted · novelty 7.0

RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.

Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity

cs.LG · 2021-01-11 · accept · novelty 7.0

Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.

Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers

cs.CV · 2026-05-26 · unverdicted · novelty 6.0

Tensor Memory augments Transformers with a constant-size 3D voxel grid using differentiable soft writes at predicted locations, local interaction, and gated recurrent dynamics to decouple memory capacity from sequence length.

Beyond Holistic Models: Systematic Component-level Benchmarking of Deep Multivariate Time-Series Forecasting

cs.LG · 2026-05-26 · unverdicted · novelty 6.0

TSCOMP is the first large-scale benchmark that deconstructs deep multivariate time series forecasters into fine-grained components, builds a corpus of over 20,000 evaluations, and shows that corpus-driven component selection outperforms state-of-the-art holistic models.

How Do Electrocardiogram Models Scale?

cs.LG · 2026-05-17 · conditional · novelty 6.0

Empirical scaling study of ECG models finds SSL scales robustly while ResNets show 1.3-2.5x better parameter efficiency and SSL up to 16x better data efficiency than supervised baselines on out-of-distribution tasks.

Elastic Attention Cores for Scalable Vision Transformers

cs.CV · 2026-05-12 · unverdicted · novelty 6.0

VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.

Spectral Transformer Neural Processes

cs.LG · 2026-05-10 · unverdicted · novelty 6.0

STNPs extend TNPs with a spectral aggregator that estimates context spectra, forms spectral mixtures, and injects task-adaptive frequency features to better handle periodicity.

The Impossibility Triangle of Long-Context Modeling

cs.CL · 2026-05-06 · unverdicted · novelty 6.0

No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.

HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

cs.LG · 2026-04-24 · unverdicted · novelty 6.0

HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

cs.CL · 2026-04-21 · unverdicted · novelty 6.0

DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.

GAMMA-Net: Adaptive Long-Horizon Traffic Spatio-Temporal Forecasting Model based on Interleaved Graph Attention and Multi-Axis Mamba

cs.AI · 2026-04-18 · unverdicted · novelty 6.0

GAMMA-Net combines Graph Attention Networks and multi-axis Mamba to outperform prior models in long-horizon traffic forecasting, with up to 16.25% lower MAE on benchmarks like METR-LA and PEMS datasets.

SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

cs.LG · 2026-02-08 · unverdicted · novelty 6.0

SiameseNorm is a two-stream architecture that reconciles Pre-Norm and Post-Norm in Transformers by coupling streams via shared residual blocks, yielding performance gains with maintained stability on language, vision, and diffusion models.

Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression

cs.LG · 2025-11-26 · unverdicted · novelty 6.0

Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.

Kimi Linear: An Expressive, Efficient Attention Architecture

cs.CL · 2025-10-30 · unverdicted · novelty 6.0

Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.

MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training

cs.CL · 2025-10-21 · conditional · novelty 6.0

MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without accuracy loss on long-context benchmarks.

MoBA: Mixture of Block Attention for Long-Context LLMs

cs.LG · 2025-02-18 · unverdicted · novelty 6.0

MoBA routes attention over blocks via MoE-style gating to enable dynamic, bias-light long-context attention that matches full attention performance at lower cost.

Gated Linear Attention Transformers with Hardware-Efficient Training

cs.LG · 2023-12-11 · unverdicted · novelty 6.0

Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.

citing papers explorer

Showing 49 of 49 citing papers.

TENNOR: Trustworthy Execution for Neural Networks through Obliviousness and Retrievals cs.CR · 2026-05-08 · unverdicted · none · ref 58 · internal anchor
TENNOR enables efficient private training of wide neural networks in TEEs by recasting sparsification as doubly oblivious LSH retrievals and introducing MP-WTA to cut hash table memory by 50x while preserving accuracy.
Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling cs.LG · 2026-04-22 · unverdicted · none · ref 13 · internal anchor
Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.
Cross-Stage Attention Propagation for Efficient Semantic Segmentation cs.CV · 2026-04-07 · unverdicted · none · ref 13 · internal anchor
CSAP computes attention at the deepest scale and propagates the maps to shallower stages, bypassing per-scale query-key computations to cut decoder FLOPs while preserving multi-scale performance and beating SegNeXt-Tiny on ADE20K, Cityscapes, and COCO-Stuff.
Fast Cross-Operator Optimization of Attention Dataflow cs.AR · 2026-04-03 · unverdicted · none · ref 39 · internal anchor
MMEE encodes dataflow decisions in matrix form for fast exhaustive search, delivering 40-69% lower latency and energy use than prior methods while running 64-343x faster.
Exact Flow Linear Attention: Exact Solution from Continuous-Time Dynamics cs.LG · 2025-12-14 · unverdicted · none · ref 14 · internal anchor
Exact Flow Linear Attention derives a closed-form exact update for delta-rule linear attention from continuous-time dynamics, removing Euler discretization error while preserving linear complexity and structure.
SnapStream: Efficient Long Sequence Decoding on Dataflow Accelerators cs.AI · 2025-11-05 · unverdicted · none · ref 13 · internal anchor
SnapStream deploys sparse KV attention in a production inference system on dataflow accelerators, delivering 4x on-chip memory savings for DeepSeek-671B at 128k context with up to 1832 tokens/sec and minimal accuracy loss on LongBench-v2, AIME24, and LiveCodeBench.
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory cs.CL · 2024-10-14 · unverdicted · none · ref 76 · internal anchor
LongMemEval benchmarks long-term memory in chat assistants, revealing 30% accuracy drops across sustained interactions and proposing indexing-retrieval-reading optimizations that boost performance.
RWKV: Reinventing RNNs for the Transformer Era cs.CL · 2023-05-22 · unverdicted · none · ref 9 · internal anchor
RWKV uses a linear attention mechanism to deliver Transformer-level performance with RNN-style inference efficiency, demonstrated at up to 14 billion parameters.
Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity cs.LG · 2021-01-11 · accept · none · ref 18 · internal anchor
Switch Transformers use top-1 expert routing in a Mixture of Experts setup to scale to trillion-parameter language models with constant compute and up to 4x speedup over T5-XXL.
Tensor Memory: Fixed-Size Recurrent State for Long-Horizon Transformers cs.CV · 2026-05-26 · unverdicted · none · ref 19 · internal anchor
Tensor Memory augments Transformers with a constant-size 3D voxel grid using differentiable soft writes at predicted locations, local interaction, and gated recurrent dynamics to decouple memory capacity from sequence length.
Beyond Holistic Models: Systematic Component-level Benchmarking of Deep Multivariate Time-Series Forecasting cs.LG · 2026-05-26 · unverdicted · none · ref 28 · internal anchor
TSCOMP is the first large-scale benchmark that deconstructs deep multivariate time series forecasters into fine-grained components, builds a corpus of over 20,000 evaluations, and shows that corpus-driven component selection outperforms state-of-the-art holistic models.
How Do Electrocardiogram Models Scale? cs.LG · 2026-05-17 · conditional · none · ref 12 · internal anchor
Empirical scaling study of ECG models finds SSL scales robustly while ResNets show 1.3-2.5x better parameter efficiency and SSL up to 16x better data efficiency than supervised baselines on out-of-distribution tasks.
Elastic Attention Cores for Scalable Vision Transformers cs.CV · 2026-05-12 · unverdicted · none · ref 60 · internal anchor
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintaining competitive performance.
Spectral Transformer Neural Processes cs.LG · 2026-05-10 · unverdicted · none · ref 24 · internal anchor
STNPs extend TNPs with a spectral aggregator that estimates context spectra, forms spectral mixtures, and injects task-adaptive frequency features to better handle periodicity.
The Impossibility Triangle of Long-Context Modeling cs.CL · 2026-05-06 · unverdicted · none · ref 16 · internal anchor
No model can achieve efficiency, compactness, and recall capacity scaling with sequence length at once, as any two imply a strict bound of O(poly(d)/log V) on recallable facts.
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models cs.LG · 2026-04-24 · unverdicted · none · ref 23 · internal anchor
HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.
DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing cs.CL · 2026-04-21 · unverdicted · none · ref 29 · internal anchor
DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.
GAMMA-Net: Adaptive Long-Horizon Traffic Spatio-Temporal Forecasting Model based on Interleaved Graph Attention and Multi-Axis Mamba cs.AI · 2026-04-18 · unverdicted · none · ref 7 · internal anchor
GAMMA-Net combines Graph Attention Networks and multi-axis Mamba to outperform prior models in long-horizon traffic forecasting, with up to 16.25% lower MAE on benchmarks like METR-LA and PEMS datasets.
SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm cs.LG · 2026-02-08 · unverdicted · none · ref 14 · internal anchor
SiameseNorm is a two-stream architecture that reconciles Pre-Norm and Post-Norm in Transformers by coupling streams via shared residual blocks, yielding performance gains with maintained stability on language, vision, and diffusion models.
Gated KalmaNet: A Fading Memory Layer Through Test-Time Ridge Regression cs.LG · 2025-11-26 · unverdicted · none · ref 30 · internal anchor
Gated KalmaNet uses exact Kalman gain computation with adaptive gating and Chebyshev iteration to improve SSM performance on long-context tasks over prior approximations like DeltaNet.
Kimi Linear: An Expressive, Efficient Attention Architecture cs.CL · 2025-10-30 · unverdicted · none · ref 52 · internal anchor
Kimi Linear hybridizes linear attention with a new KDA module to beat full attention on tasks while slashing KV cache by 75% and speeding decoding up to 6x.
MTraining: Distributed Dynamic Sparse Attention for Efficient Ultra-Long Context Training cs.CL · 2025-10-21 · conditional · none · ref 64 · internal anchor
MTraining scales LLM training to 512K-token contexts on 32 A100 GPUs by integrating dynamic sparse training patterns with balanced and hierarchical sparse ring attention, achieving up to 6x throughput gains without accuracy loss on long-context benchmarks.
MoBA: Mixture of Block Attention for Long-Context LLMs cs.LG · 2025-02-18 · unverdicted · none · ref 15 · internal anchor
MoBA routes attention over blocks via MoE-style gating to enable dynamic, bias-light long-context attention that matches full attention performance at lower cost.
Gated Linear Attention Transformers with Hardware-Efficient Training cs.LG · 2023-12-11 · unverdicted · none · ref 44 · internal anchor
Gated linear attention Transformers achieve competitive language modeling results with linear-time inference, superior length generalization, and higher training throughput than Mamba.
MemGPT: Towards LLMs as Operating Systems cs.AI · 2023-10-12 · unverdicted · none · ref 12 · internal anchor
MemGPT uses OS-inspired virtual context management to extend LLM context windows for large document analysis and long-term multi-session chat.
H$_2$O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models cs.LG · 2023-06-24 · unverdicted · none · ref 7 · internal anchor
H2O evicts non-heavy-hitter tokens from the KV cache using a dynamic submodular policy, retaining recent and frequent-co-occurrence tokens to reduce memory while preserving accuracy.
PaLM: Scaling Language Modeling with Pathways cs.CL · 2022-04-05 · accept · none · ref 70 · internal anchor
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
ST-MoE: Designing Stable and Transferable Sparse Expert Models cs.CL · 2022-02-17 · unverdicted · none · ref 66 · internal anchor
ST-MoE introduces stability techniques for sparse expert models, allowing a 269B-parameter model to achieve state-of-the-art transfer learning results across reasoning, summarization, and QA tasks at the compute cost of a 32B dense model.
Aligning AI With Shared Human Values cs.CY · 2020-08-05 · conditional · none · ref 16 · internal anchor
Introduces ETHICS benchmark showing current language models have promising but incomplete ability to predict basic human ethical judgments on text scenarios.
HuggingFace's Transformers: State-of-the-art Natural Language Processing cs.CL · 2019-10-09 · accept · none · ref 161 · internal anchor
Hugging Face releases an open-source Python library that supplies a unified API and pretrained weights for major Transformer architectures used in natural language processing.
Online Irregular Multivariate Time Series Forecasting via Uncertainty-Driven Dual-Expert Calibration cs.LG · 2026-05-27 · unverdicted · none · ref 12 · internal anchor
Under-Cali is an uncertainty-driven dual-expert calibration framework for online adaptation in irregular multivariate time series forecasting that freezes the base model.
Kaczmarz Linear Attention cs.LG · 2026-05-09 · unverdicted · none · ref 19 · internal anchor
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack performance.
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k cs.LG · 2026-05-04 · accept · none · ref 17 · internal anchor
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
Learning Fingerprints for Medical Time Series with Redundancy-Constrained Information Maximization cs.LG · 2026-04-30 · unverdicted · none · ref 29 · internal anchor
A self-supervised method learns a fixed set of disentangled fingerprint tokens from medical time series by combining reconstruction loss with a total coding rate diversity penalty, framed as a disentangled rate-distortion problem.
Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models cs.LG · 2026-04-24 · unverdicted · none · ref 63 · internal anchor
Toeplitz MLP Mixers replace attention with masked Toeplitz multiplications for sub-quadratic complexity while retaining more sequence information and outperforming on copying and in-context tasks.
climt-paraformer: Stable Emulation of Convective Parameterization using a Temporal Memory-aware Transformer physics.ao-ph · 2026-04-22 · unverdicted · none · ref 41 · internal anchor
A temporal memory-aware Transformer emulator for the Emanuel convective parameterization shows lower offline errors and 10-year stability in single-column model tests compared to memory-less MLP and LSTM baselines.
PestVL-Net: Enabling Multimodal Pest Learning via Fine-grained Vision-Language Interaction cs.CV · 2026-04-19 · unverdicted · none · ref 19 · internal anchor
PestVL-Net combines an RWKV visual backbone with saliency-guided window partitioning and MLLM-derived linguistic priors via multimodal chain-of-thought to enable fine-grained multimodal pest recognition on dedicated datasets.
MedMamba: Recasting Mamba for Medical Time Series Classification eess.SP · 2026-04-17 · unverdicted · none · ref 32 · internal anchor
MedMamba introduces a principle-guided bidirectional multi-scale Mamba model that outperforms prior methods on EEG, ECG, and activity classification benchmarks while delivering 4.6x inference speedup.
Deja Vu in Plots: Leveraging Cross-Session Evidence with Retrieval-Augmented LLMs for Live Streaming Risk Assessment cs.AI · 2026-01-22 · unverdicted · none · ref 18 · internal anchor
CS-VAR uses an LLM to reason over cross-session behavioral evidence and transfer insights to a small model for efficient, structured live streaming risk assessment with claimed SOTA results.
Spatially Aware Linear Transformer (SAL-T) for Particle Jet Tagging cs.LG · 2025-10-24 · unverdicted · none · ref 13 · internal anchor
SAL-T enhances the linformer with spatially aware kinematic partitioning and convolutions to match full-attention transformer performance on jet tagging while keeping linear complexity and lower latency.
Bandwidth-Aware LLM Inference on Heterogeneous Many-Core Supercomputers cs.DC · 2026-05-25 · unverdicted · none · ref 27 · internal anchor
THInfer achieves 62-84% higher throughput than GPU baselines for Llama 7B-30B models on MT-3000 through bandwidth-focused co-design, and runs 70B models where GPU frameworks fail.
RAFNet: Region-Aware Fusion Network for Pansharpening cs.CV · 2026-05-04 · unverdicted · none · ref 36 · internal anchor
RAFNet uses wavelet-based directional separation, K-means regional clustering, and clustered sparse attention to create adaptive kernels and efficient frequency aggregation, outperforming prior pansharpening networks on benchmark datasets.
State Space Models for Bioacoustics: A Comparative Evaluation with Transformers cs.SD · 2025-12-03 · unverdicted · none · ref 20 · internal anchor
BioMamba matches Transformer performance on bioacoustics tasks while using significantly less VRAM.
AOI: Context-Aware Multi-Agent Operations via Dynamic Scheduling and Hierarchical Memory Compression cs.MA · 2025-12-15 · unverdicted · none · ref 7 · internal anchor
AOI is a multi-agent system that dynamically schedules operations and compresses context hierarchically to achieve 72% compression while preserving 93% critical information and cutting repair times by 34%.
Positional Encoding in Transformer-Based Time Series Models: A Survey cs.LG · 2025-02-17 · unverdicted · none · ref 41 · internal anchor
A survey of positional encoding methods in transformer-based time series models that evaluates fixed, learnable, relative, and hybrid approaches on classification tasks and links effectiveness to data characteristics.
A Survey on Efficient Inference for Large Language Models cs.CL · 2024-04-22 · accept · none · ref 159 · internal anchor
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
Spectral Priors vs. Attention: Investigating the Utility of Attention Mechanisms in EEG-Based Diagnosis cs.LG · 2026-05-14 · unreviewed · ref 8 · internal anchor
Understand and Accelerate Memory Processing Pipeline for Large Language Model Inference cs.DC · 2026-03-30 · unreviewed · ref 12 · internal anchor
LIFT: A Novel Framework for Enhancing Long-Context Understanding of LLMs via Long Input Fine-Tuning cs.CL · 2025-02-20 · unreviewed · ref 4 · internal anchor
