pith. machine review for the scientific record.

arxiv: 2009.14794 · v4 · submitted 2020-09-30 · 💻 cs.LG · cs.CL · stat.ML

Rethinking Attention with Performers

Pith reviewed 2026-05-12 09:11 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · stat.ML
keywords transformers · attention mechanisms · random features · linear complexity · kernel approximation · efficient models · FAVOR+

The pith

Performers approximate full softmax attention in Transformers using linear time and space with provable accuracy guarantees.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Performers as Transformer architectures that estimate regular full-rank softmax attention without quadratic costs. It relies on a new FAVOR+ estimator based on positive orthogonal random features to achieve this linear scaling while preserving theoretical properties like unbiased estimation and uniform convergence. A reader would care because the method works without assuming sparsity or low-rank structure in the data, opening attention models to longer sequences on standard hardware. The authors demonstrate this on tasks from pixel prediction to protein sequences and show competitive results against other efficient attention variants.
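The linear-scaling claim rests on a reordering of matrix products that any kernelized attention permits. The following is a minimal NumPy sketch of that reordering, not the paper's code. It assumes queries and keys have already been mapped to nonnegative feature matrices; the names `q_feat`, `k_feat`, `quadratic_attention` and `linear_attention` are illustrative and do not come from the paper.

```python
import numpy as np

def quadratic_attention(q_feat, k_feat, v):
    """Materialise the n x n attention matrix, then normalise rows: O(n^2) memory."""
    a = q_feat @ k_feat.T                       # (n, n)
    return (a @ v) / a.sum(axis=1, keepdims=True)

def linear_attention(q_feat, k_feat, v):
    """Same result via associativity, never forming the n x n matrix."""
    kv = k_feat.T @ v                           # (m, d), cost O(n m d)
    normaliser = q_feat @ k_feat.sum(axis=0)    # (n,), equals the row sums of a
    return (q_feat @ kv) / normaliser[:, None]

rng = np.random.default_rng(0)
n, m, d = 512, 64, 16                           # sequence length, features, value dim
q_feat = rng.random((n, m)) + 1e-3              # stand-in nonnegative features (assumption)
k_feat = rng.random((n, m)) + 1e-3
v = rng.standard_normal((n, d))

out_quadratic = quadratic_attention(q_feat, k_feat, v)
out_linear = linear_attention(q_feat, k_feat, v)
max_gap = float(np.max(np.abs(out_quadratic - out_linear)))
print(f"max |quadratic - linear| = {max_gap:.2e}")
```

Both functions compute the same output up to floating-point rounding. The second one touches only arrays of size n x m, m x d and n x d, which is where the linear dependence on sequence length comes from when m is held fixed.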

Core claim

We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods.

What carries the argument

FAVOR+, the Fast Attention Via positive Orthogonal Random features estimator that constructs an unbiased or nearly-unbiased approximation to the softmax attention matrix using random features.
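As a rough illustration of the estimator described above, the sketch below builds positive random features for the softmax kernel and compares the resulting attention matrix with exact softmax attention on a small example. It is a sketch under stated assumptions, not the authors' implementation: the function names, the block-orthogonal construction via QR with a sign fix, and the problem sizes are all choices made here for illustration.

```python
import numpy as np

def orthogonal_gaussian_features(m, d, rng):
    """Draw m projection vectors in R^d, orthogonal within each block of d rows."""
    blocks = []
    for _ in range(-(-m // d)):                 # ceil(m / d) blocks
        g = rng.standard_normal((d, d))
        q, r = np.linalg.qr(g)
        q = q * np.sign(np.diag(r))             # sign fix so q is uniformly distributed
        blocks.append(q.T)                      # rows are orthonormal
    w = np.vstack(blocks)[:m]
    # Rescale each row to the norm of an independent Gaussian vector,
    # so every row is marginally N(0, I_d).
    norms = np.linalg.norm(rng.standard_normal((m, d)), axis=1)
    return w * norms[:, None]

def positive_features(x, w):
    """phi(x)_i = exp(w_i . x - |x|^2 / 2) / sqrt(m); every entry is > 0."""
    half_sq_norm = 0.5 * np.sum(x * x, axis=1, keepdims=True)
    return np.exp(x @ w.T - half_sq_norm) / np.sqrt(w.shape[0])

def softmax_attention_matrix(q, k):
    """Exact row-normalised softmax attention, used as the reference."""
    scores = q @ k.T / np.sqrt(q.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    return weights / weights.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, m = 64, 16, 512                           # illustrative sizes (assumption)
q = 0.5 * rng.standard_normal((n, d))
k = 0.5 * rng.standard_normal((n, d))

w = orthogonal_gaussian_features(m, d, rng)
scale = d ** -0.25                              # splits the 1/sqrt(d) between q and k
q_feat = positive_features(q * scale, w)
k_feat = positive_features(k * scale, w)

approx = q_feat @ k_feat.T                      # materialised here only to measure error
approx = approx / approx.sum(axis=1, keepdims=True)
exact = softmax_attention_matrix(q, k)
mean_abs_error = float(np.mean(np.abs(exact - approx)))
print(f"mean |A - A_hat| = {mean_abs_error:.5f}")
```

Because every feature is strictly positive, the row normalisers cannot vanish or change sign. The error measured here depends on the input norms and on m, so the figure printed for this toy setup should not be read as representative of the paper's experiments.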

Load-bearing premise

The random-feature approximation must stay close enough to the true attention matrix that downstream task performance does not degrade noticeably.

What would settle it

Training identical Performer and standard Transformer models on a large-scale task and observing that the Performer version achieves substantially lower accuracy would indicate the approximation fails to preserve performance.

Original abstract

We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces Performers, a class of Transformer architectures that approximate standard full-rank softmax attention via the FAVOR+ estimator (Fast Attention Via positive Orthogonal Random features). FAVOR+ constructs an unbiased or nearly-unbiased approximation to the attention matrix using positive orthogonal random features, achieving linear time and space complexity without sparsity or low-rank assumptions. The paper supplies explicit proofs of unbiasedness, variance bounds that decrease with the number of features, and uniform convergence results. Experiments on pixel prediction, text, and protein modeling tasks demonstrate that the approximation error remains small enough for end-to-end performance to remain competitive with dense Transformers and other efficient attention baselines, while also enabling large-scale comparisons across different kernelizable attention mechanisms.

Significance. If the central claims hold, the work is significant because it supplies a theoretically grounded route to linear-complexity attention that does not rely on structural priors. The explicit unbiasedness proofs, variance bounds, and uniform convergence results (without sparsity or low-rank assumptions) constitute a clear strength, as does the empirical demonstration that the approximation preserves downstream performance across diverse modalities. The ability to evaluate multiple attention kernels at scale is a useful byproduct for the broader field.

major comments (2)
  1. [§3.2] §3.2, the FAVOR+ construction: the variance bound is stated to decrease with feature count m, yet the dependence on sequence length n and the maximum attention weight is not made fully explicit in the main theorem; this makes it difficult to predict the m required for a target error on sequences of length 10^4–10^5 without additional calculation.
  2. [Tables 2 and 4] Table 4 (protein modeling) and Table 2 (text): the reported perplexity or accuracy gaps versus the dense baseline are small, but the tables do not include standard deviations over multiple random seeds or feature initializations; this leaves open whether the observed competitiveness is robust to the stochasticity of the random-feature estimator.
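Both major comments concern how the estimator's randomness behaves as the feature count m grows. The kind of seed-level check the referee asks for can be sketched on a single kernel entry, as below. The setup is an assumption made for illustration: i.i.d. Gaussian features rather than orthogonal ones, a fixed hypothetical pair of vectors `x` and `y`, and 500 independent draws per m. The `predicted` column uses the relative standard deviation sqrt((exp(|x+y|^2) - 1) / m), which follows from a direct moment calculation for i.i.d. Gaussian features; it is not a restatement of the paper's theorem and says nothing about the dependence on sequence length n.

```python
import numpy as np

def estimate_softmax_kernel(x, y, m, trials, rng):
    """Return `trials` independent m-feature estimates of exp(x . y)."""
    w = rng.standard_normal((trials, m, x.shape[0]))
    phi_x = np.exp(w @ x - 0.5 * x @ x)         # (trials, m), strictly positive
    phi_y = np.exp(w @ y - 0.5 * y @ y)
    return np.mean(phi_x * phi_y, axis=1)

d, trials = 8, 500
x = np.full(d, 0.3)                             # hypothetical query vector
y = np.full(d, 0.2)                             # hypothetical key vector
exact = float(np.exp(x @ y))
z_sq = float((x + y) @ (x + y))

rng = np.random.default_rng(0)
feature_counts = [16, 64, 256, 1024]
report = []
for m in feature_counts:
    estimates = estimate_softmax_kernel(x, y, m, trials, rng)
    rel_mean = float(estimates.mean() / exact)
    rel_std = float(estimates.std(ddof=1) / exact)
    predicted = float(np.sqrt((np.exp(z_sq) - 1.0) / m))
    report.append((m, rel_mean, rel_std, predicted))
    print(f"m={m:5d}  mean/exact={rel_mean:.3f}  "
          f"std/exact={rel_std:.3f}  predicted={predicted:.3f}")
```

A table of this shape, produced for the trained models rather than for a single kernel entry, is the sort of evidence that would show whether the reported gaps exceed seed-to-seed variation.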
minor comments (3)
  1. [§3.1] The notation for the positive orthogonal random features (Definition 1) mixes matrix and vector indexing; a single consistent notation would improve readability.
  2. [Figure 3] Figure 3 (attention matrix visualizations) would benefit from an additional panel showing the absolute error |A - Â| for a representative long sequence to make the uniform-convergence claim visually concrete.
  3. [Abstract] The abstract claims 'provable accuracy' while the body distinguishes unbiased from nearly-unbiased regimes; a short clarifying sentence in the abstract would align the two.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the manuscript and the recommendation for minor revision. We address the two major comments point by point below.

Point-by-point responses
  1. Referee: [§3.2] §3.2, the FAVOR+ construction: the variance bound is stated to decrease with feature count m, yet the dependence on sequence length n and the maximum attention weight is not made fully explicit in the main theorem; this makes it difficult to predict the m required for a target error on sequences of length 10^4–10^5 without additional calculation.

    Authors: We appreciate this observation. While Theorem 3.2 in §3.2 highlights the decrease in variance with m, the full dependence on n and the maximum attention weight is derived explicitly in the proof (Appendix, Theorem 3.3). To address the concern, we will revise the main text in §3.2 to state the complete bound and include a direct reference to the appendix. This will allow readers to compute the required m for sequences of length 10^4–10^5 without additional derivation.

    revision: yes

  2. Referee: [Tables 2 and 4] Table 4 (protein modeling) and Table 2 (text): the reported perplexity or accuracy gaps versus the dense baseline are small, but the tables do not include standard deviations over multiple random seeds or feature initializations; this leaves open whether the observed competitiveness is robust to the stochasticity of the random-feature estimator.

    Authors: We agree that standard deviations over multiple seeds would strengthen the presentation of robustness. Due to the substantial computational cost of training large models on these datasets, we performed single runs per configuration. However, the variance of the FAVOR+ estimator is theoretically controlled and decreases with m (see §3.2), and our choice of m ensures the approximation error is negligible compared to the observed gaps. In the revision we will add a discussion of these variance bounds in the experimental section to contextualize the results. For the smaller-scale pixel tasks we will also report standard deviations from additional runs.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper's core contribution is the FAVOR+ estimator for approximating softmax attention via positive orthogonal random features. It supplies explicit unbiasedness proofs, variance bounds decreasing with feature count, and uniform convergence results derived from kernel approximation theory. These are not reductions to fitted parameters, self-citations, or ansatzes smuggled from prior author work; the construction is presented as novel and externally grounded. Experiments confirm the approximation suffices for competitive performance, but the theoretical claims stand independently of the empirical results. This matches the default expectation of a self-contained derivation with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the mathematical validity of positive orthogonal random features as an unbiased estimator for the softmax kernel; no explicit free parameters or invented entities are stated in the abstract.

axioms (1)
  • domain assumption: Positive orthogonal random features yield an unbiased or nearly-unbiased estimator of the softmax attention kernel
    This is the core premise enabling the linear-complexity approximation and provable accuracy claims.
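For the unbiased case, the premise can be checked with a short calculation. The derivation below is a sketch that assumes each projection vector w is marginally distributed as N(0, I_d), which holds for i.i.d. Gaussian features and for orthogonal features whose rows are rescaled to Gaussian norms. The symbol \varphi_w is notation introduced here and need not match the paper's.

```latex
% Assumption: w ~ N(0, I_d); x, y are the (already rescaled) query and key vectors.
\begin{align*}
\mathbb{E}_{w}\!\left[e^{w^\top z}\right] &= e^{\|z\|^2/2}
  \quad \text{(Gaussian moment generating function)} \\
\varphi_w(u) &:= \exp\!\left(w^\top u - \tfrac{1}{2}\|u\|^2\right) > 0 \\
\mathbb{E}_{w}\!\left[\varphi_w(x)\,\varphi_w(y)\right]
  &= e^{-(\|x\|^2+\|y\|^2)/2}\;\mathbb{E}_{w}\!\left[e^{w^\top (x+y)}\right] \\
  &= e^{-(\|x\|^2+\|y\|^2)/2}\; e^{\|x+y\|^2/2}
   = e^{x^\top y}
\end{align*}
```

Averaging \varphi_w(x)\varphi_w(y) over m draws of w therefore gives an unbiased estimate of the unnormalised softmax kernel. Orthogonality across draws changes the variance of that average but not its mean, since the mean depends only on the marginal distribution of each w.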

pith-pipeline@v0.9.0 · 5542 in / 1292 out tokens · 47462 ms · 2026-05-12T09:11:12.722504+00:00 · methodology

Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RoFormer: Enhanced Transformer with Rotary Position Embedding

    cs.CL 2021-04 accept novelty 8.0

    RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.

  2. QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling

    cs.LG 2026-05 unverdicted novelty 7.0

    QLAM extends state-space models with quantum superposition in the hidden state for linear-time long-sequence modeling and reports consistent gains over RNN and transformer baselines on sequential image tasks.

  3. Complex-Valued Phase-Coherent Transformer

    cs.LG 2026-05 unverdicted novelty 7.0

    PCT replaces softmax token competition with a smooth phase-preserving gate on normalized complex similarities, yielding stronger generalization on long-range and phase-sensitive benchmarks than both real and complex T...

  4. Retrieval from Within: An Intrinsic Capability of Attention-Based Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Attention-based models can intrinsically retrieve and reuse pre-encoded evidence chunks via decoder attention queries, unifying retrieval with generation and outperforming external RAG pipelines on QA benchmarks.

  5. FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connect...

  6. Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling

    cs.LG 2026-04 unverdicted novelty 7.0

    Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.

  7. SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection

    cs.CV 2026-05 unverdicted novelty 6.0

    SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.

  8. Elastic Attention Cores for Scalable Vision Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...

  9. Search Your Block Floating Point Scales!

    cs.LG 2026-05 unverdicted novelty 6.0

    ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.

  10. Compute Where it Counts: Self Optimizing Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or r...

  11. Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing

    cs.CL 2026-05 conditional novelty 6.0

    EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.

  12. RT-Transformer: The Transformer Block as a Spherical State Estimator

    cs.LG 2026-05 unverdicted novelty 6.0

    Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.

  13. Practical Wi-Fi-based Motion Recognition Under Variable Traffic Patterns

    cs.LG 2026-05 unverdicted novelty 6.0

    A sampling-rate-versatile transformer network with dynamic augmentation achieves stable high accuracy for Wi-Fi-based motion and gesture recognition across variable sampling rates and traffic patterns.

  14. Training Transformers for KV Cache Compressibility

    cs.LG 2026-05 unverdicted novelty 6.0

    Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.

  15. Training Transformers for KV Cache Compressibility

    cs.LG 2026-05 unverdicted novelty 6.0

    KV compressibility is a property of learned transformer representations that can be improved by training with KV sparsification, leading to better quality-budget tradeoffs in downstream compression for retrieval, QA, ...

  16. Retrieval from Within: An Intrinsic Capability of Attention-Based Models

    cs.LG 2026-05 unverdicted novelty 6.0

    Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.

  17. Linear-Time Global Visual Modeling without Explicit Attention

    cs.CV 2026-05 unverdicted novelty 6.0

    Dynamic parameterization of standard layers can replace explicit attention for linear-time global visual modeling.

  18. GateMOT: Q-Gated Attention for Dense Object Tracking

    cs.CV 2026-04 unverdicted novelty 6.0

    GateMOT proposes Q-Gated Attention to enable linear-complexity, spatially aware attention for state-of-the-art dense object tracking on benchmarks like BEE24.

  19. Reducing Detail Hallucinations in Long-Context Regulatory Understanding via Targeted Preference Optimization

    cs.SI 2026-04 unverdicted novelty 6.0

    DetailDPO cuts detail-level hallucination errors in LLMs on long regulatory documents by 42-61% using targeted contrastive pairs on a new 13,000-pair benchmark.

  20. HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models

    cs.LG 2026-04 unverdicted novelty 6.0

    HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.

  21. MICA: Multivariate Infini Compressive Attention for Time Series Forecasting

    cs.LG 2026-04 unverdicted novelty 6.0

    MICA adapts infini compressive attention to the channel dimension, enabling scalable cross-channel dependencies in Transformers and cutting forecast error by 5.4% on average versus channel-independent baselines.

  22. MICA: Multivariate Infini Compressive Attention for Time Series Forecasting

    cs.LG 2026-04 unverdicted novelty 6.0

    MICA adds linearly scaling compressive cross-channel attention to Transformers, cutting average forecast error by 5.4% and ranking first among multivariate baselines.

  23. Attention to Mamba: A Recipe for Cross-Architecture Distillation

    cs.CL 2026-04 unverdicted novelty 6.0

    A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.

  24. YOLOv12: Attention-Centric Real-Time Object Detectors

    cs.CV 2025-02 unverdicted novelty 6.0

    YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.

  25. DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models

    cs.LG 2023-09 accept novelty 6.0

    DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.

  26. Token Merging: Your ViT But Faster

    cs.CV 2022-10 unverdicted novelty 6.0

    Token Merging (ToMe) doubles the throughput of large Vision Transformers on images, video, and audio by merging similar tokens with a fast matching algorithm, incurring only 0.2-0.4% accuracy loss.

  27. PaLM: Scaling Language Modeling with Pathways

    cs.CL 2022-04 accept novelty 6.0

    PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.

  28. Rethinking Random Transformers as Adaptive Sequence Smoothers for Sleep Staging

    cs.LG 2026-05 unverdicted novelty 5.0

    Randomly initialized Transformers act as adaptive sequence smoothers for sleep staging via a Random Attention Prior Kernel, with gains mainly from inductive bias rather than training.

  29. One for All: A Non-Linear Transformer can Enable Cross-Domain Generalization for In-Context Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 5.0

    Non-linear transformers enable cross-domain generalization in in-context RL by representing value functions from different domains with shared weights inside a shared RKHS.

  30. Kaczmarz Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...

  31. MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

    cs.LG 2026-05 unverdicted novelty 5.0

    MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.

  32. StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k

    cs.LG 2026-05 accept novelty 5.0

    Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.

  33. Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity

    cs.CL 2026-04 unverdicted novelty 5.0

    Fixed-width and decay-based attention mechanisms inspired by working memory improve Transformer grammatical accuracy and human alignment under limited training data.

  34. Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction

    cs.MM 2026-04 unverdicted novelty 5.0

    A new joint spatio-temporal enlargement model for micro-video popularity prediction using frame scoring for long sequences and a topology-aware memory bank for unbounded historical associations.

  35. FG²-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control

    cs.LG 2026-04 unverdicted novelty 5.0

    FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.

  36. NFTDELTA: Detecting Permission Control Vulnerabilities in NFT Contracts through Multi-View Learning

    cs.CR 2026-04 unverdicted novelty 5.0

    NFTDELTA detects permission control vulnerabilities in NFT contracts by combining sequence and graph views of function CFGs, reporting 241 confirmed issues across 795 collections with 97.92% average precision.

  37. VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation

    cs.LG 2026-04 unverdicted novelty 5.0

    VFA optimizes Flash Attention by pre-computing global max approximations from key blocks and reordering traversal to reduce vector bottlenecks while preserving exact computation.

  38. A Geometric Algebra-informed NeRF Framework for Generalizable Wireless Channel Prediction

    cs.NI 2026-04 unverdicted novelty 5.0

    GAI-NeRF combines geometric algebra attention and an adaptive ray tracing module inside a NeRF model to deliver more accurate and generalizable wireless channel predictions across varied indoor environments.

  39. Image Classification via Random Dilated Convolution with Multi-Branch Feature Extraction and Context Excitation

    cs.CV 2026-04 unverdicted novelty 3.0

    RDCNet reports state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof by combining random dilated convolutions with multi-branch and attention modules.

Reference graph

Works this paper leans on

161 extracted references · 161 canonical work pages · cited by 36 Pith papers · 5 internal anchors


  50. [54]

    2013 , url =

    Nir Ailon and Edo Liberty , title =. 2013 , url =. doi:10.1145/2483699.2483701 , timestamp =

  51. [55]

    Pure and Applied Chemistry , volume=

    Nomenclature and symbolism for amino acids and peptides , author=. Pure and Applied Chemistry , volume=

  52. [56]

    2009 , url =

    Nir Ailon and Bernard Chazelle , title =. 2009 , url =

  53. [57]

    Cormen and Charles E

    Thomas H. Cormen and Charles E. Leiserson and Ronald L. Rivest and Clifford Stein , title =. 2009 , url =

  54. [58]

    Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel , booktitle =

    Yao. Transformer Dissection: An Unified Understanding for Transformer's Attention via the Lens of Kernel , booktitle =. 2019 , url =

  55. [59]

    Ali Rahimi and Benjamin Recht , title =. Advances in Neural Information Processing Systems 20, Proceedings of the Twenty-First Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 3-6, 2007 , pages =. 2007 , url =

  56. [60]

    Monte Carlo Meth

    Bouhari Arouna , title =. Monte Carlo Meth. and Appl. , volume =. 2004 , url =

  57. [61]

    2019 , eprint=

    Control variate selection for Monte Carlo integration , author=. 2019 , eprint=

  58. [62]

    Monte Carlo integration with a growing number of control variates , journal =

    Fran. Monte Carlo integration with a growing number of control variates , journal =. 2019 , url =

  59. [63]

    Zemel and Ruslan Salakhutdinov and Raquel Urtasun and Antonio Torralba and Sanja Fidler , title =

    Yukun Zhu and Ryan Kiros and Richard S. Zemel and Ruslan Salakhutdinov and Raquel Urtasun and Antonio Torralba and Sanja Fidler , title =. 2015. 2015 , url =

  60. [64]

    2014 , url =

    Uniform Distribution and Quasi-Monte Carlo Methods - Discrepancy, Integration and Applications , series =. 2014 , url =. doi:10.1515/9783110317930 , isbn =

  61. [65]

    Kuo and Ian H

    Josef Dick and Frances Y. Kuo and Ian H. Sloan , title =. Acta Numer. , volume =. 2013 , url =

  62. [66]

    CoRR , volume =

    Josef Dick and Michael Feischl , title =. CoRR , volume =. 2020 , url =

  63. [67]

    Sliced and Radon Wasserstein Barycenters of Measures , journal =

    Nicolas Bonneel and Julien Rabin and Gabriel Peyr. Sliced and Radon Wasserstein Barycenters of Measures , journal =. 2015 , url =

  64. [68]

    The 22nd International Conference on Artificial Intelligence and Statistics,

    Krzysztof Choromanski and Aldo Pacchiano and Jeffrey Pennington and Yunhao Tang , title =. The 22nd International Conference on Artificial Intelligence and Statistics,. 2019 , url =

  65. [69]

    Proceedings of the 36th International Conference on Machine Learning,

    Krzysztof Choromanski and Mark Rowland and Wenyu Chen and Adrian Weller , title =. Proceedings of the 36th International Conference on Machine Learning,. 2019 , url =

  66. [70]

    Geometrically Coupled Monte Carlo Sampling , booktitle =

    Mark Rowland and Krzysztof Choromanski and Fran. Geometrically Coupled Monte Carlo Sampling , booktitle =. 2018 , url =

  67. [71]

    Extensions of Lipschitz maps into a Hilbert space , author=

  68. [72]

    Random Struct

    Matousek, Jir\'. Random Struct. Algorithms , keywords =. doi:http://dx.doi.org/10.1002/rsa.20218 , interhash =

  69. [73]

    CoRR , volume =

    Wenbo Gao and Laura Graesser and Krzysztof Choromanski and Xingyou Song and Nevena Lazic and Pannag Sanketi and Vikas Sindhwani and Navdeep Jaitly , title =. CoRR , volume =. 2020 , url =

  70. [74]

    CoRR , volume =

    Xingyou Song and Yuxiang Yang and Krzysztof Choromanski and Ken Caluwaerts and Wenbo Gao and Chelsea Finn and Jie Tan , title =. CoRR , volume =. 2020 , url =

  71. [75]

    2019 , url =

    Jiqing Wu and Zhiwu Huang and Dinesh Acharya and Wen Li and Janine Thoma and Danda Pani Paudel and Luc Van Gool , title =. 2019 , url =

  72. [76]

    The Geometry of Random Features , booktitle =

    Krzysztof Choromanski and Mark Rowland and Tam. The Geometry of Random Features , booktitle =. 2018 , url =

  73. [77]

    Proceedings of the 33nd International Conference on Machine Learning,

    Krzysztof Choromanski and Vikas Sindhwani , title =. Proceedings of the 33nd International Conference on Machine Learning,. 2016 , url =

  74. [78]

    On the exponential inequalities for negatively dependent random variables , volume =

    Sung, Soo , year =. On the exponential inequalities for negatively dependent random variables , volume =. Journal of Mathematical Analysis and Applications - J MATH ANAL APPL , doi =

  75. [79]

    Towards a theory of negative dependence , author=

  76. [80]

    2016 , eprint=

    Monte Carlo with Determinantal Point Processes , author=. 2016 , eprint=

  77. [81]

    Foundations and Trends in Machine Learning , volume =

    Alex Kulesza and Ben Taskar , title =. Foundations and Trends in Machine Learning , volume =. 2012 , url =

  78. [82]

    On two ways to use determinantal point processes for Monte Carlo integration , booktitle =

    Guillaume Gautier and R. On two ways to use determinantal point processes for Monte Carlo integration , booktitle =. 2019 , url =

  79. [83]

    Structured Monte Carlo Sampling for Nonisotropic Distributions via Determinantal Point Processes , journal =

    Krzysztof Choromanski and Aldo Pacchiano and Jack Parker. Structured Monte Carlo Sampling for Nonisotropic Distributions via Determinantal Point Processes , journal =. 2019 , url =

  80. [84]

    Wasserstein Generative Adversarial Networks , booktitle =

    Mart. Wasserstein Generative Adversarial Networks , booktitle =. 2017 , url =
