Reformer: The efficient transformer

Nikita Kitaev, Lukasz Kaiser, Anselm Levskaya · 2020

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

method 1

citation-polarity summary

background 1

representative citing papers

RelFlexformer: Efficient Attention 3D-Transformers for Integrable Relative Positional Encodings

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

RelFlexformers enable flexible integrable 3D RPE in attention via NU-FFT, generalizing prior methods to heterogeneous token positions with O(L log L) complexity.

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

cs.CL · 2026-05-18 · unverdicted · novelty 6.0

DashAttention introduces differentiable adaptive sparse hierarchical attention via α-entmax block selection, achieving full-attention accuracy at 75% sparsity with improved Pareto performance over NSA and InfLLMv2.

Scalable Spatiotemporal Inference with Biased Scan Attention Transformer Neural Processes

cs.LG · 2025-06-10 · unverdicted · novelty 6.0

BSA-TNP is a new neural process model with KRBlocks and biased scan attention that claims to match top accuracy while scaling inference to over 1M points in under a minute on a single GPU and supporting translation invariance.

Simplified State Space Layers for Sequence Modeling

cs.LG · 2022-08-09 · accept · novelty 6.0

S5 uses a single MIMO state space model with S4-derived initialization to match S4 efficiency and reach 87.4% average accuracy on the Long Range Arena benchmark.

citing papers explorer

RelFlexformer: Efficient Attention 3D-Transformers for Integrable Relative Positional Encodings
cs.LG · 2026-05-11 · unverdicted · none · ref 31

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention
cs.CL · 2026-05-18 · unverdicted · none · ref 42

Scalable Spatiotemporal Inference with Biased Scan Attention Transformer Neural Processes
cs.LG · 2025-06-10 · unverdicted · none · ref 24

Simplified State Space Layers for Sequence Modeling
cs.LG · 2022-08-09 · accept · none · ref 117
