arXiv: 2009.14794 · v4 · submitted 2020-09-30 · cs.LG · cs.CL · stat.ML

Rethinking Attention with Performers

Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, David Belanger, Lucy Colwell, Adrian Weller

Pith reviewed 2026-05-12 09:11 UTC · model grok-4.3

classification: cs.LG, cs.CL, stat.ML

keywords: transformers, attention mechanisms, random features, linear complexity, kernel approximation, efficient models, FAVOR+

The pith

Performers approximate full softmax attention in Transformers using linear time and space with provable accuracy guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Performers as Transformer architectures that estimate regular full-rank softmax attention without quadratic costs. It relies on a new FAVOR+ estimator based on positive orthogonal random features to achieve this linear scaling while preserving theoretical properties like unbiased estimation and uniform convergence. A reader would care because the method works without assuming sparsity or low-rank structure in the data, opening attention models to longer sequences on standard hardware. The authors demonstrate this on tasks from pixel prediction to protein sequences and show competitive results against other efficient attention variants.

Core claim

What carries the argument

FAVOR+ (Fast Attention Via positive Orthogonal Random features), an estimator that uses random features to construct an unbiased or nearly-unbiased approximation to the softmax attention matrix.
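As a rough illustration of what a positive orthogonal random feature map can look like, the NumPy sketch below estimates the softmax kernel exp(q·k) for a single query/key pair. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`orthogonal_gaussian`, `positive_features`), the QR-based block construction, and the toy vectors are illustrative choices, and the usual 1/sqrt(d) scaling is assumed to be folded into q and k.

```python
import numpy as np

def orthogonal_gaussian(m, d, rng):
    """Return an (m, d) matrix whose rows are orthogonal within each block of
    d rows, with row norms redrawn so each row is marginally N(0, I_d)."""
    blocks = []
    for _ in range(-(-m // d)):  # ceil(m / d) independent blocks
        q, r = np.linalg.qr(rng.standard_normal((d, d)))
        q = q * np.sign(np.diag(r))  # sign fix so q is uniformly (Haar) distributed
        blocks.append(q.T)
    w = np.concatenate(blocks, axis=0)[:m]
    # Redraw row lengths from the norm distribution of a d-dimensional Gaussian.
    norms = np.linalg.norm(rng.standard_normal((m, d)), axis=1)
    return w * norms[:, None]

def positive_features(x, w):
    """phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m); every entry is > 0."""
    m = w.shape[0]
    return np.exp(w @ x - 0.5 * (x @ x)) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 8, 2048  # illustrative sizes, not taken from the paper
w = orthogonal_gaussian(m, d, rng)

q = np.full(d, 0.2)            # hypothetical query
k = np.linspace(-0.3, 0.3, d)  # hypothetical key

exact = np.exp(q @ k)
estimate = positive_features(q, w) @ positive_features(k, w)
print(f"exact={exact:.4f}  estimate={estimate:.4f}")
```

Because every feature is an exponential, the estimate is always positive, which is the property the name "positive random features" refers to; the orthogonal blocks are the part that is meant to lower variance relative to independent draws.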

Load-bearing premise

The random-feature approximation must stay close enough to the true attention matrix that downstream task performance does not degrade noticeably.

What would settle it

Training identical Performer and standard Transformer models on a large-scale task and observing that the Performer version achieves substantially lower accuracy would indicate the approximation fails to preserve performance.

Original abstract

We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
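The linear space and time claim in the abstract comes from reordering matrix products once the kernel is factorized through a feature map. The sketch below is an illustration under assumptions, not the paper's code: it uses independent Gaussian projections instead of the orthogonal ones in FAVOR+, covers only non-causal attention, and the names `softmax_attention`, `positive_features`, and `linear_attention` are hypothetical.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Reference quadratic attention: materializes the (n, n) matrix."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    return (A / A.sum(axis=1, keepdims=True)) @ V

def positive_features(X, W):
    """Row-wise map phi(x) = exp(W x - |x|^2 / 2) / sqrt(m)."""
    m = W.shape[0]
    sq = 0.5 * np.sum(X**2, axis=1, keepdims=True)
    return np.exp(X @ W.T - sq) / np.sqrt(m)

def linear_attention(Q, K, V, W):
    """Random-feature attention that never forms an (n, n) matrix."""
    d = Q.shape[-1]
    # Split the 1/sqrt(d) scaling evenly between queries and keys.
    Qp = positive_features(Q / d**0.25, W)  # (n, m)
    Kp = positive_features(K / d**0.25, W)  # (n, m)
    KV = Kp.T @ V                           # (m, d_v): cost O(n m d_v)
    normalizer = Qp @ Kp.sum(axis=0)        # (n,): approximate row sums
    return (Qp @ KV) / normalizer[:, None]

rng = np.random.default_rng(0)
n, d, m = 64, 16, 1024  # illustrative sizes
Q, K, V = (0.5 * rng.standard_normal((n, d)) for _ in range(3))
W = rng.standard_normal((m, d))  # assumption: iid, not orthogonal, projections

exact = softmax_attention(Q, K, V)
approx = linear_attention(Q, K, V, W)
print("mean abs error:", np.abs(exact - approx).mean())
```

The key step is computing `Kp.T @ V` first: the cost then grows with n·m·d rather than n², at the price of an approximation error governed by the number of features m.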

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Performers give a linear-time softmax attention approximation with actual proofs and competitive results on text, pixels, and proteins.

This paper's main contribution is a random feature method called FAVOR+ that approximates softmax attention in linear time with unbiased estimates and convergence guarantees, and the tests show it works well enough to be competitive with full attention on real tasks.

The new part is the specific choice of positive orthogonal random features for the attention kernel. Previous random feature approaches for kernels existed, but adapting them here with orthogonality and positivity gives better properties for attention, like lower variance. The authors provide proofs for the unbiasedness and bounds on the estimation error that hold uniformly. This is stronger than many heuristic linear attention methods. They also use it to run large-scale comparisons between softmax and other kernels, which regular Transformers couldn't do because of the quadratic cost.

The experiments are a strength too. They cover image pixels, text, and proteins, and Performers hold their own against both dense and other efficient attention variants. The theory and practice line up reasonably. On the downside, the approximation still has variance that depends on the number of random features, so for very high precision you pay in compute. The paper doesn't explore the absolute worst-case sequences where the approximation might fail, but the reported results suggest it's stable in practice. No obvious circular reasoning or unfalsifiable parts.

This is relevant for anyone building models that need long-range dependencies without quadratic blowup. Readers working on efficient ML architectures or kernel methods in deep learning will get something concrete from it. The work shows clear thinking on the problem and engages with the literature on attention approximations. It deserves peer review. The combination of new method, theory, and empirical validation is enough to warrant detailed comments from experts.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces Performers, a class of Transformer architectures that approximate standard full-rank softmax attention via the FAVOR+ estimator (Fast Attention Via positive Orthogonal Random features). FAVOR+ constructs an unbiased or nearly-unbiased approximation to the attention matrix using positive orthogonal random features, achieving linear time and space complexity without sparsity or low-rank assumptions. The paper supplies explicit proofs of unbiasedness, variance bounds that decrease with the number of features, and uniform convergence results. Experiments on pixel prediction, text, and protein modeling tasks demonstrate that the approximation error remains small enough for end-to-end performance to remain competitive with dense Transformers and other efficient attention baselines, while also enabling large-scale comparisons across different kernelizable attention mechanisms.

Significance. If the central claims hold, the work is significant because it supplies a theoretically grounded route to linear-complexity attention that does not rely on structural priors. The explicit unbiasedness proofs, variance bounds, and uniform convergence results (without sparsity or low-rank assumptions) constitute a clear strength, as does the empirical demonstration that the approximation preserves downstream performance across diverse modalities. The ability to evaluate multiple attention kernels at scale is a useful byproduct for the broader field.

major comments (2)

[§3.2] §3.2, the FAVOR+ construction: the variance bound is stated to decrease with feature count m, yet the dependence on sequence length n and the maximum attention weight is not made fully explicit in the main theorem; this makes it difficult to predict the m required for a target error on sequences of length 10^4–10^5 without additional calculation.
[Tables 2 and 4] Table 4 (protein modeling) and Table 2 (text): the reported perplexity or accuracy gaps versus the dense baseline are small, but the tables do not include standard deviations over multiple random seeds or feature initializations; this leaves open whether the observed competitiveness is robust to the stochasticity of the random-feature estimator.
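Both major comments concern how the estimator's error behaves as m varies and as the random features are redrawn. A toy diagnostic along those lines can be sketched as follows; it is not the paper's experimental protocol, and the data, sizes, seeds, and function names are all assumptions chosen for illustration. It uses independent Gaussian projections and deliberately materializes the n × n matrices, which is only feasible for small n.

```python
import numpy as np

def exact_attention_matrix(Q, K):
    """Row-normalized exp(q . k); any 1/sqrt(d) scaling is assumed folded into Q, K."""
    scores = Q @ K.T
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    return A / A.sum(axis=1, keepdims=True)

def approx_attention_matrix(Q, K, m, seed):
    """Positive-random-feature approximation using m features drawn from `seed`."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((m, Q.shape[1]))

    def phi(X):
        return np.exp(X @ W.T - 0.5 * np.sum(X**2, axis=1, keepdims=True))

    A = phi(Q) @ phi(K).T  # the 1/m factor cancels in the row normalization
    return A / A.sum(axis=1, keepdims=True)

def error_stats(Q, K, m, seeds):
    """Mean and standard deviation over seeds of the relative Frobenius error."""
    A = exact_attention_matrix(Q, K)
    errs = [
        np.linalg.norm(approx_attention_matrix(Q, K, m, s) - A) / np.linalg.norm(A)
        for s in seeds
    ]
    return float(np.mean(errs)), float(np.std(errs))

data_rng = np.random.default_rng(0)
Q = 0.3 * data_rng.standard_normal((32, 8))  # hypothetical queries
K = 0.3 * data_rng.standard_normal((32, 8))  # hypothetical keys

stats = {m: error_stats(Q, K, m, seeds=range(8)) for m in (16, 64, 256, 1024)}
for m, (mean, std) in stats.items():
    print(f"m={m:5d}  mean rel. error={mean:.4f}  std over seeds={std:.4f}")
```

Reporting the spread over seeds next to the mean is the kind of information the second major comment asks for, here at the level of the attention matrix rather than end-task metrics.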

minor comments (3)

[§3.1] The notation for the positive orthogonal random features (Definition 1) mixes matrix and vector indexing; a single consistent notation would improve readability.
[Figure 3] Figure 3 (attention matrix visualizations) would benefit from an additional panel showing the absolute error |A - Â| for a representative long sequence to make the uniform-convergence claim visually concrete.
[Abstract] The abstract claims 'provable accuracy' while the body distinguishes unbiased from nearly-unbiased regimes; a short clarifying sentence in the abstract would align the two.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the manuscript and the recommendation for minor revision. We address the two major comments point by point below.

Referee: [§3.2] §3.2, the FAVOR+ construction: the variance bound is stated to decrease with feature count m, yet the dependence on sequence length n and the maximum attention weight is not made fully explicit in the main theorem; this makes it difficult to predict the m required for a target error on sequences of length 10^4–10^5 without additional calculation.

Authors: We appreciate this observation. While Theorem 3.2 in §3.2 highlights the decrease in variance with m, the full dependence on n and the maximum attention weight is derived explicitly in the proof (Appendix, Theorem 3.3). To address the concern, we will revise the main text in §3.2 to state the complete bound and include a direct reference to the appendix. This will allow readers to compute the required m for sequences of length 10^4–10^5 without additional derivation.
revision: yes

Referee: [Tables 2 and 4] Table 4 (protein modeling) and Table 2 (text): the reported perplexity or accuracy gaps versus the dense baseline are small, but the tables do not include standard deviations over multiple random seeds or feature initializations; this leaves open whether the observed competitiveness is robust to the stochasticity of the random-feature estimator.

Authors: We agree that standard deviations over multiple seeds would strengthen the presentation of robustness. Due to the substantial computational cost of training large models on these datasets, we performed single runs per configuration. However, the variance of the FAVOR+ estimator is theoretically controlled and decreases with m (see §3.2), and our choice of m ensures the approximation error is negligible compared to the observed gaps. In the revision we will add a discussion of these variance bounds in the experimental section to contextualize the results. For the smaller-scale pixel tasks we will also report standard deviations from additional runs.
revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The paper's core contribution is the FAVOR+ estimator for approximating softmax attention via positive orthogonal random features. It supplies explicit unbiasedness proofs, variance bounds decreasing with feature count, and uniform convergence results derived from kernel approximation theory. These are not reductions to fitted parameters, self-citations, or ansatzes smuggled from prior author work; the construction is presented as novel and externally grounded. Experiments confirm the approximation suffices for competitive performance, but the theoretical claims stand independently of the empirical results. This matches the default expectation of a self-contained derivation with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the mathematical validity of positive orthogonal random features as an unbiased estimator for the softmax kernel; no explicit free parameters or invented entities are stated in the abstract.

axioms (1)

domain assumption: Positive orthogonal random features yield an unbiased or nearly-unbiased estimator of the softmax attention kernel.
This is the core premise enabling the linear-complexity approximation and provable accuracy claims.
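For the unbiased case, this assumption can be checked in a few lines using the Gaussian moment generating function, E[exp(ωᵀz)] = exp(‖z‖²/2) for ω ~ N(0, I). The derivation below is a sketch for a single random feature with standard Gaussian ω; the notation (q, k, ω) is chosen here for illustration and may differ from the paper's.

```latex
% Assumption: \omega \sim \mathcal{N}(0, I_d); q, k \in \mathbb{R}^d are a (scaled) query and key.
\begin{align*}
\mathbb{E}_{\omega}\!\left[ e^{\omega^\top q - \frac{\|q\|^2}{2}} \, e^{\omega^\top k - \frac{\|k\|^2}{2}} \right]
  &= e^{-\frac{\|q\|^2 + \|k\|^2}{2}} \; \mathbb{E}_{\omega}\!\left[ e^{\omega^\top (q + k)} \right] \\
  &= e^{-\frac{\|q\|^2 + \|k\|^2}{2}} \; e^{\frac{\|q + k\|^2}{2}} \\
  &= e^{q^\top k}.
\end{align*}
```

Averaging m such features keeps the expectation unchanged. Making the ω vectors orthogonal couples them, but as long as each one remains marginally Gaussian the same calculation applies to every term of the average.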

pith-pipeline@v0.9.0 · 5542 in / 1292 out tokens · 47462 ms · 2026-05-12T09:11:12.722504+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

RoFormer: Enhanced Transformer with Rotary Position Embedding
cs.CL 2021-04 accept novelty 8.0

RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.
QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling
cs.LG 2026-05 unverdicted novelty 7.0

QLAM extends state-space models with quantum superposition in the hidden state for linear-time long-sequence modeling and reports consistent gains over RNN and transformer baselines on sequential image tasks.
Complex-Valued Phase-Coherent Transformer
cs.LG 2026-05 unverdicted novelty 7.0

PCT replaces softmax token competition with a smooth phase-preserving gate on normalized complex similarities, yielding stronger generalization on long-range and phase-sensitive benchmarks than both real and complex T...
Retrieval from Within: An Intrinsic Capability of Attention-Based Models
cs.LG 2026-05 unverdicted novelty 7.0

Attention-based models can intrinsically retrieve and reuse pre-encoded evidence chunks via decoder attention queries, unifying retrieval with generation and outperforming external RAG pipelines on QA benchmarks.
FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning
cs.LG 2026-05 unverdicted novelty 7.0

FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connect...
Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling
cs.LG 2026-04 unverdicted novelty 7.0

Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.
SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection
cs.CV 2026-05 unverdicted novelty 6.0

SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.
Elastic Attention Cores for Scalable Vision Transformers
cs.CV 2026-05 unverdicted novelty 6.0

VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...
Search Your Block Floating Point Scales!
cs.LG 2026-05 unverdicted novelty 6.0

ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
Compute Where it Counts: Self Optimizing Language Models
cs.LG 2026-05 unverdicted novelty 6.0

SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or r...
Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing
cs.CL 2026-05 conditional novelty 6.0

EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.
RT-Transformer: The Transformer Block as a Spherical State Estimator
cs.LG 2026-05 unverdicted novelty 6.0

Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.
Practical Wi-Fi-based Motion Recognition Under Variable Traffic Patterns
cs.LG 2026-05 unverdicted novelty 6.0

A sampling-rate-versatile transformer network with dynamic augmentation achieves stable high accuracy for Wi-Fi-based motion and gesture recognition across variable sampling rates and traffic patterns.
Training Transformers for KV Cache Compressibility
cs.LG 2026-05 unverdicted novelty 6.0

Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.
Training Transformers for KV Cache Compressibility
cs.LG 2026-05 unverdicted novelty 6.0

KV compressibility is a property of learned transformer representations that can be improved by training with KV sparsification, leading to better quality-budget tradeoffs in downstream compression for retrieval, QA, ...
Retrieval from Within: An Intrinsic Capability of Attention-Based Models
cs.LG 2026-05 unverdicted novelty 6.0

Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.
Linear-Time Global Visual Modeling without Explicit Attention
cs.CV 2026-05 unverdicted novelty 6.0

Dynamic parameterization of standard layers can replace explicit attention for linear-time global visual modeling.
GateMOT: Q-Gated Attention for Dense Object Tracking
cs.CV 2026-04 unverdicted novelty 6.0

GateMOT proposes Q-Gated Attention to enable linear-complexity, spatially aware attention for state-of-the-art dense object tracking on benchmarks like BEE24.
Reducing Detail Hallucinations in Long-Context Regulatory Understanding via Targeted Preference Optimization
cs.SI 2026-04 unverdicted novelty 6.0

DetailDPO cuts detail-level hallucination errors in LLMs on long regulatory documents by 42-61% using targeted contrastive pairs on a new 13,000-pair benchmark.
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
cs.LG 2026-04 unverdicted novelty 6.0

HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.
MICA: Multivariate Infini Compressive Attention for Time Series Forecasting
cs.LG 2026-04 unverdicted novelty 6.0

MICA adapts infini compressive attention to the channel dimension, enabling scalable cross-channel dependencies in Transformers and cutting forecast error by 5.4% on average versus channel-independent baselines.
MICA: Multivariate Infini Compressive Attention for Time Series Forecasting
cs.LG 2026-04 unverdicted novelty 6.0

MICA adds linearly scaling compressive cross-channel attention to Transformers, cutting average forecast error by 5.4% and ranking first among multivariate baselines.
Attention to Mamba: A Recipe for Cross-Architecture Distillation
cs.CL 2026-04 unverdicted novelty 6.0

A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.
YOLOv12: Attention-Centric Real-Time Object Detectors
cs.CV 2025-02 unverdicted novelty 6.0

YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
cs.LG 2023-09 accept novelty 6.0

DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
Token Merging: Your ViT But Faster
cs.CV 2022-10 unverdicted novelty 6.0

Token Merging (ToMe) doubles the throughput of large Vision Transformers on images, video, and audio by merging similar tokens with a fast matching algorithm, incurring only 0.2-0.4% accuracy loss.
PaLM: Scaling Language Modeling with Pathways
cs.CL 2022-04 accept novelty 6.0

PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
Rethinking Random Transformers as Adaptive Sequence Smoothers for Sleep Staging
cs.LG 2026-05 unverdicted novelty 5.0

Randomly initialized Transformers act as adaptive sequence smoothers for sleep staging via a Random Attention Prior Kernel, with gains mainly from inductive bias rather than training.
One for All: A Non-Linear Transformer can Enable Cross-Domain Generalization for In-Context Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 5.0

Non-linear transformers enable cross-domain generalization in in-context RL by representing value functions from different domains with shared weights inside a shared RKHS.
Kaczmarz Linear Attention
cs.LG 2026-05 unverdicted novelty 5.0

Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
cs.LG 2026-05 unverdicted novelty 5.0

MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
cs.LG 2026-05 accept novelty 5.0

Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity
cs.CL 2026-04 unverdicted novelty 5.0

Fixed-width and decay-based attention mechanisms inspired by working memory improve Transformer grammatical accuracy and human alignment under limited training data.
Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction
cs.MM 2026-04 unverdicted novelty 5.0

A new joint spatio-temporal enlargement model for micro-video popularity prediction using frame scoring for long sequences and a topology-aware memory bank for unbounded historical associations.
FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control
cs.LG 2026-04 unverdicted novelty 5.0

FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.
NFTDELTA: Detecting Permission Control Vulnerabilities in NFT Contracts through Multi-View Learning
cs.CR 2026-04 unverdicted novelty 5.0

NFTDELTA detects permission control vulnerabilities in NFT contracts by combining sequence and graph views of function CFGs, reporting 241 confirmed issues across 795 collections with 97.92% average precision.
VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation
cs.LG 2026-04 unverdicted novelty 5.0

VFA optimizes Flash Attention by pre-computing global max approximations from key blocks and reordering traversal to reduce vector bottlenecks while preserving exact computation.
A Geometric Algebra-informed NeRF Framework for Generalizable Wireless Channel Prediction
cs.NI 2026-04 unverdicted novelty 5.0

GAI-NeRF combines geometric algebra attention and an adaptive ray tracing module inside a NeRF model to deliver more accurate and generalizable wireless channel predictions across varied indoor environments.
Image Classification via Random Dilated Convolution with Multi-Branch Feature Extraction and Context Excitation
cs.CV 2026-04 unverdicted novelty 3.0

RDCNet reports state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof by combining random dilated convolutions with multi-branch and attention modules.

Reference graph

Works this paper leans on

161 extracted references · 161 canonical work pages · cited by 36 Pith papers · 5 internal anchors


work page
[49]

Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA,

Krzysztof Marcin Choromanski and Mark Rowland and Adrian Weller , title =. Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA,

work page 2017
[50]

6th International Conference on Learning Representations,

Krzysztof Choromanski and Carlton Downey and Byron Boots , title =. 6th International Conference on Learning Representations,. 2018 , url =

work page 2018
[51]

Orthogonal Estimation of

Mark Rowland and Jiri Hron and Yunhao Tang and Krzysztof Choromanski and Tam. Orthogonal Estimation of. The 22nd International Conference on Artificial Intelligence and Statistics,. 2019 , url =

work page 2019
[52]

Turner and Adrian Weller , title =

Krzysztof Choromanski and Mark Rowland and Vikas Sindhwani and Richard E. Turner and Adrian Weller , title =. Proceedings of the 35th International Conference on Machine Learning,. 2018 , url =

work page 2018
[53]

2010 , url =

A sparse Johnson: Lindenstrauss transform , booktitle =. 2010 , url =. doi:10.1145/1806689.1806737 , timestamp =

work page doi:10.1145/1806689.1806737 2010
[54]

2013 , url =

Nir Ailon and Edo Liberty , title =. 2013 , url =. doi:10.1145/2483699.2483701 , timestamp =

work page doi:10.1145/2483699.2483701 2013
[55]

Pure and Applied Chemistry , volume=

Nomenclature and symbolism for amino acids and peptides , author=. Pure and Applied Chemistry , volume=

work page
[56]

2009 , url =

Nir Ailon and Bernard Chazelle , title =. 2009 , url =

work page 2009
[57]

Cormen and Charles E

Thomas H. Cormen and Charles E. Leiserson and Ronald L. Rivest and Clifford Stein , title =. 2009 , url =

work page 2009
[58]

Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel , booktitle =

Yao. Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel , booktitle =. 2019 , url =

work page 2019
[59]

Ali Rahimi and Benjamin Recht , title =. Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007 , pages =. 2007 , url =

work page 2007
[60]

Monte Carlo Meth

Bouhari Arouna , title =. Monte Carlo Meth. and Appl. , volume =. 2004 , url =

work page 2004
[61]

2019 , eprint=

Control variate selection for Monte Carlo integration , author=. 2019 , eprint=

work page 2019
[62]

Monte Carlo integration with a growing number of control variates , journal =

Fran. Monte Carlo integration with a growing number of control variates , journal =. 2019 , url =

work page 2019
[63]

Zemel and Ruslan Salakhutdinov and Raquel Urtasun and Antonio Torralba and Sanja Fidler , title =

Yukun Zhu and Ryan Kiros and Richard S. Zemel and Ruslan Salakhutdinov and Raquel Urtasun and Antonio Torralba and Sanja Fidler , title =. 2015. 2015 , url =

work page 2015
[64]

2014 , url =

Uniform Distribution and Quasi-Monte Carlo Methods - Discrepancy, Integration and Applications , series =. 2014 , url =. doi:10.1515/9783110317930 , isbn =

work page doi:10.1515/9783110317930 2014
[65]

Kuo and Ian H

Josef Dick and Frances Y. Kuo and Ian H. Sloan , title =. Acta Numer. , volume =. 2013 , url =

work page 2013
[66]

CoRR , volume =

Josef Dick and Michael Feischl , title =. CoRR , volume =. 2020 , url =

work page 2020
[67]

Sliced and Radon Wasserstein Barycenters of Measures , journal =

Nicolas Bonneel and Julien Rabin and Gabriel Peyr. Sliced and Radon Wasserstein Barycenters of Measures , journal =. 2015 , url =

work page 2015
[68]

The 22nd International Conference on Artificial Intelligence and Statistics,

Krzysztof Choromanski and Aldo Pacchiano and Jeffrey Pennington and Yunhao Tang , title =. The 22nd International Conference on Artificial Intelligence and Statistics,. 2019 , url =

work page 2019
[69]

Proceedings of the 36th International Conference on Machine Learning,

Krzysztof Choromanski and Mark Rowland and Wenyu Chen and Adrian Weller , title =. Proceedings of the 36th International Conference on Machine Learning,. 2019 , url =

work page 2019
[70]

Geometrically Coupled Monte Carlo Sampling , booktitle =

Mark Rowland and Krzysztof Choromanski and Fran. Geometrically Coupled Monte Carlo Sampling , booktitle =. 2018 , url =

work page 2018
[71]

Extensions of Lipschitz maps into a Hilbert space , author=

work page
[72]

Random Struct

Matousek, Jir\'. Random Struct. Algorithms , keywords =. doi:http://dx.doi.org/10.1002/rsa.20218 , interhash =

work page doi:10.1002/rsa.20218
[73]

CoRR , volume =

Wenbo Gao and Laura Graesser and Krzysztof Choromanski and Xingyou Song and Nevena Lazic and Pannag Sanketi and Vikas Sindhwani and Navdeep Jaitly , title =. CoRR , volume =. 2020 , url =

work page 2020
[74]

CoRR , volume =

Xingyou Song and Yuxiang Yang and Krzysztof Choromanski and Ken Caluwaerts and Wenbo Gao and Chelsea Finn and Jie Tan , title =. CoRR , volume =. 2020 , url =

work page 2020
[75]

2019 , url =

Jiqing Wu and Zhiwu Huang and Dinesh Acharya and Wen Li and Janine Thoma and Danda Pani Paudel and Luc Van Gool , title =. 2019 , url =

work page 2019
[76]

The Geometry of Random Features , booktitle =

Krzysztof Choromanski and Mark Rowland and Tam. The Geometry of Random Features , booktitle =. 2018 , url =

work page 2018
[77]

Proceedings of the 33nd International Conference on Machine Learning,

Krzysztof Choromanski and Vikas Sindhwani , title =. Proceedings of the 33nd International Conference on Machine Learning,. 2016 , url =

work page 2016
[78]

On the exponential inequalities for negatively dependent random variables , volume =

Sung, Soo , year =. On the exponential inequalities for negatively dependent random variables , volume =. Journal of Mathematical Analysis and Applications - J MATH ANAL APPL , doi =

work page
[79]

Towards a theory of negative dependence , author=

work page
[80]

2016 , eprint=

Monte Carlo with Determinantal Point Processes , author=. 2016 , eprint=

work page 2016
[81]

Foundations and Trends in Machine Learning , volume =

Alex Kulesza and Ben Taskar , title =. Foundations and Trends in Machine Learning , volume =. 2012 , url =

work page 2012
[82]

On two ways to use determinantal point processes for Monte Carlo integration , booktitle =

Guillaume Gautier and R. On two ways to use determinantal point processes for Monte Carlo integration , booktitle =. 2019 , url =

work page 2019
[83]

Structured Monte Carlo Sampling for Nonisotropic Distributions via Determinantal Point Processes , journal =

Krzysztof Choromanski and Aldo Pacchiano and Jack Parker. Structured Monte Carlo Sampling for Nonisotropic Distributions via Determinantal Point Processes , journal =. 2019 , url =

work page 2019
[84]

Wasserstein Generative Adversarial Networks , booktitle =

Mart. Wasserstein Generative Adversarial Networks , booktitle =. 2017 , url =

work page 2017

Showing first 80 references.