Rethinking Attention with Performers
Pith reviewed 2026-05-12 09:11 UTC · model grok-4.3
The pith
Performers approximate full softmax attention in Transformers using linear time and space with provable accuracy guarantees.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods.
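The linear-versus-quadratic claim comes down to reordering a matrix product once attention is expressed through feature maps. The following is a minimal NumPy sketch of that reordering, not the paper's implementation. It assumes generic positive feature matrices (`q_feat` and `k_feat` are hypothetical stand-ins for whatever feature map is used) and non-causal attention.

```python
import numpy as np

def quadratic_attention(q_feat, k_feat, v):
    """Reference path: materialises the full n x n attention matrix."""
    scores = q_feat @ k_feat.T                          # (n, n), O(n^2) memory
    return (scores @ v) / scores.sum(axis=1, keepdims=True)

def linear_attention(q_feat, k_feat, v):
    """Same result via associativity, never forming an n x n matrix."""
    kv = k_feat.T @ v                                   # (m, d_v)
    k_sum = k_feat.sum(axis=0)                          # (m,)
    normaliser = q_feat @ k_sum                         # (n,), equals the row sums of scores
    return (q_feat @ kv) / normaliser[:, None]

rng = np.random.default_rng(0)
n, m, d_v = 512, 32, 16                                 # illustrative sizes only
q_feat = rng.uniform(0.1, 1.0, size=(n, m))             # positive stand-in features
k_feat = rng.uniform(0.1, 1.0, size=(n, m))
v = rng.normal(size=(n, d_v))

out_quadratic = quadratic_attention(q_feat, k_feat, v)
out_linear = linear_attention(q_feat, k_feat, v)
max_gap = float(np.max(np.abs(out_quadratic - out_linear)))
```

Both functions return the same values up to floating-point error. Only the second avoids the n x n intermediate, so its cost grows linearly in n for fixed m.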
What carries the argument
FAVOR+, the Fast Attention Via positive Orthogonal Random features estimator that constructs an unbiased or nearly-unbiased approximation to the softmax attention matrix using random features.
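To make the estimator concrete, here is a minimal sketch of positive random features for the softmax kernel exp(q·k). It assumes i.i.d. Gaussian projections and leaves out the orthogonalisation step, so it illustrates the idea rather than reproducing FAVOR+. The function name `positive_random_features` and all sizes are illustrative choices.

```python
import numpy as np

def positive_random_features(x, w):
    """Map rows of x (n, d) to strictly positive features (n, m) using projections w (m, d)."""
    m = w.shape[0]
    half_sq_norm = 0.5 * np.sum(x * x, axis=1, keepdims=True)
    return np.exp(x @ w.T - half_sq_norm) / np.sqrt(m)

rng = np.random.default_rng(0)
d, m = 8, 4096
q = 0.2 * rng.normal(size=(1, d))
k = 0.2 * rng.normal(size=(1, d))
w = rng.normal(size=(m, d))          # i.i.d. Gaussian rows; orthogonalisation is omitted here

phi_q = positive_random_features(q, w)
phi_k = positive_random_features(k, w)
exact = float(np.exp(q @ k.T)[0, 0])             # softmax kernel value exp(q . k)
approx = float((phi_q @ phi_k.T)[0, 0])          # Monte Carlo estimate from m features
rel_error = abs(approx - exact) / exact
```

Every feature is an exponential and therefore positive, which keeps the estimated attention weights and their normaliser positive as well.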
Load-bearing premise
The random-feature approximation must stay close enough to the true attention matrix that downstream task performance does not degrade noticeably.
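One way to probe this premise on synthetic data is to compare the approximate attention matrix with the exact one. The sketch below assumes i.i.d. Gaussian projections, small-norm inputs, and a relative Frobenius error as the metric. These are all illustrative choices and not the paper's evaluation protocol.

```python
import numpy as np

def softmax_attention_matrix(q, k):
    """Exact row-normalised softmax attention matrix, O(n^2)."""
    logits = (q @ k.T) / np.sqrt(q.shape[1])
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def random_feature_attention_matrix(q, k, w):
    """Approximate attention matrix built from positive random features."""
    scale = q.shape[1] ** 0.25                           # splits the 1/sqrt(d) factor between q and k

    def features(x):
        x = x / scale
        # The 1/sqrt(m) factor is dropped because it cancels in the row normalisation.
        return np.exp(x @ w.T - 0.5 * np.sum(x * x, axis=1, keepdims=True))

    kernel = features(q) @ features(k).T                 # formed here only to measure the error
    return kernel / kernel.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n, d, m = 64, 16, 2048
q = 0.5 * rng.normal(size=(n, d))
k = 0.5 * rng.normal(size=(n, d))
w = rng.normal(size=(m, d))

exact = softmax_attention_matrix(q, k)
approx = random_feature_attention_matrix(q, k, w)
rel_error = float(np.linalg.norm(exact - approx) / np.linalg.norm(exact))
```

A small `rel_error` on toy inputs says nothing directly about downstream task accuracy. It only shows how the matrix-level gap can be measured.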
What would settle it
Training identical Performer and standard Transformer models on a large-scale task and observing that the Performer version achieves substantially lower accuracy would indicate the approximation fails to preserve performance.
Original abstract
We introduce Performers, Transformer architectures which can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, but using only linear (as opposed to quadratic) space and time complexity, without relying on any priors such as sparsity or low-rankness. To approximate softmax attention-kernels, Performers use a novel Fast Attention Via positive Orthogonal Random features approach (FAVOR+), which may be of independent interest for scalable kernel methods. FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks, beyond the reach of regular Transformers, and investigate optimal attention-kernels. Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform convergence and low estimation variance. We tested Performers on a rich set of tasks stretching from pixel-prediction through text models to protein sequence modeling. We demonstrate competitive results with other examined efficient sparse and dense attention methods, showcasing effectiveness of the novel attention-learning paradigm leveraged by Performers.
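The abstract's remark about kernelizable attention beyond softmax can be illustrated by swapping the feature nonlinearity. The sketch below uses ReLU purely as an example. The helper name `generalized_attention`, the `eps` offset, and the sizes are assumptions made for illustration, and the resulting kernel is not an approximation of softmax.

```python
import numpy as np

def generalized_attention(q, k, v, w, feature_fn, eps=1e-6):
    """Linear-time attention for the kernel induced by feature_fn (not softmax in general)."""
    q_feat = feature_fn(q @ w.T) + eps                   # (n, m); eps keeps the normaliser positive
    k_feat = feature_fn(k @ w.T) + eps
    kv = k_feat.T @ v                                    # (m, d_v)
    normaliser = q_feat @ k_feat.sum(axis=0)             # (n,)
    return (q_feat @ kv) / normaliser[:, None]

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
n, d, m = 128, 16, 64
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
w = rng.normal(size=(m, d))
out = generalized_attention(q, k, v, w, relu)
```

Because the features are positive and the weights are normalised, each output row is a convex combination of the rows of `v`, whichever nonlinearity is plugged in.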
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Performers, a class of Transformer architectures that approximate standard full-rank softmax attention via the FAVOR+ estimator (Fast Attention Via positive Orthogonal Random features). FAVOR+ constructs an unbiased or nearly-unbiased approximation to the attention matrix using positive orthogonal random features, achieving linear time and space complexity without sparsity or low-rank assumptions. The paper supplies explicit proofs of unbiasedness, variance bounds that decrease with the number of features, and uniform convergence results. Experiments on pixel prediction, text, and protein modeling tasks demonstrate that the approximation error remains small enough for end-to-end performance to remain competitive with dense Transformers and other efficient attention baselines, while also enabling large-scale comparisons across different kernelizable attention mechanisms.
Significance. If the central claims hold, the work is significant because it supplies a theoretically grounded route to linear-complexity attention that does not rely on structural priors. The explicit unbiasedness proofs, variance bounds, and uniform convergence results (without sparsity or low-rank assumptions) constitute a clear strength, as does the empirical demonstration that the approximation preserves downstream performance across diverse modalities. The ability to evaluate multiple attention kernels at scale is a useful byproduct for the broader field.
major comments (2)
- [§3.2] §3.2, the FAVOR+ construction: the variance bound is stated to decrease with feature count m, yet the dependence on sequence length n and the maximum attention weight is not made fully explicit in the main theorem; this makes it difficult to predict the m required for a target error on sequences of length 10^4–10^5 without additional calculation.
- [Tables 2 and 4] Table 4 (protein modeling) and Table 2 (text): the reported perplexity or accuracy gaps versus the dense baseline are small, but the tables do not include standard deviations over multiple random seeds or feature initializations; this leaves open whether the observed competitiveness is robust to the stochasticity of the random-feature estimator.
minor comments (3)
- [§3.1] The notation for the positive orthogonal random features (Definition 1) mixes matrix and vector indexing; a single consistent notation would improve readability.
- [Figure 3] Figure 3 (attention matrix visualizations) would benefit from an additional panel showing the absolute error |A - Â| for a representative long sequence to make the uniform-convergence claim visually concrete.
- [Abstract] The abstract claims 'provable accuracy' while the body distinguishes unbiased from nearly-unbiased regimes; a short clarifying sentence in the abstract would align the two.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the manuscript and the recommendation for minor revision. We address the two major comments point by point below.
-
Referee: [§3.2] §3.2, the FAVOR+ construction: the variance bound is stated to decrease with feature count m, yet the dependence on sequence length n and the maximum attention weight is not made fully explicit in the main theorem; this makes it difficult to predict the m required for a target error on sequences of length 10^4–10^5 without additional calculation.
Authors: We appreciate this observation. While Theorem 3.2 in §3.2 highlights the decrease in variance with m, the full dependence on n and the maximum attention weight is derived explicitly in the proof (Appendix, Theorem 3.3). To address the concern, we will revise the main text in §3.2 to state the complete bound and include a direct reference to the appendix. This will allow readers to compute the required m for sequences of length 10^4–10^5 without additional derivation.
revision: yes
-
Referee: [Tables 2 and 4] Table 4 (protein modeling) and Table 2 (text): the reported perplexity or accuracy gaps versus the dense baseline are small, but the tables do not include standard deviations over multiple random seeds or feature initializations; this leaves open whether the observed competitiveness is robust to the stochasticity of the random-feature estimator.
Authors: We agree that standard deviations over multiple seeds would strengthen the presentation of robustness. Due to the substantial computational cost of training large models on these datasets, we performed single runs per configuration. However, the variance of the FAVOR+ estimator is theoretically controlled and decreases with m (see §3.2), and our choice of m ensures the approximation error is negligible compared to the observed gaps. In the revision we will add a discussion of these variance bounds in the experimental section to contextualize the results. For the smaller-scale pixel tasks we will also report standard deviations from additional runs.
revision: partial
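Both exchanges turn on how the estimator's spread depends on m and on the random draw. A rough empirical check at the level of a single kernel value could look like the sketch below. It assumes i.i.d. Gaussian features, a toy query/key pair, and 200 seeds. The helper `kernel_estimate` is hypothetical, and this is not an experiment from the paper.

```python
import numpy as np

def kernel_estimate(q, k, m, seed):
    """One draw of the positive-random-feature estimate of exp(q . k)."""
    w = np.random.default_rng(seed).normal(size=(m, q.shape[0]))
    phi_q = np.exp(w @ q - 0.5 * q @ q)
    phi_k = np.exp(w @ k - 0.5 * k @ k)
    return float(phi_q @ phi_k) / m

data_rng = np.random.default_rng(123)
d = 8
q = 0.3 * data_rng.normal(size=d)
k = 0.3 * data_rng.normal(size=d)
exact = float(np.exp(q @ k))

seeds = range(200)
summary = {}
for m in (16, 64, 256, 1024):
    estimates = np.array([kernel_estimate(q, k, m, s) for s in seeds])
    # Mean and sample standard deviation over feature seeds for this m.
    summary[m] = (float(estimates.mean()), float(estimates.std(ddof=1)))

for m, (mean, std) in summary.items():
    print(f"m={m:5d}  mean={mean:.4f}  std={std:.4f}  exact={exact:.4f}")
```

Reporting the per-m standard deviation over seeds is the kernel-level analogue of the per-table standard deviations the referee asks for. It does not replace multi-seed training runs.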
Circularity Check
No significant circularity detected
The paper's core contribution is the FAVOR+ estimator for approximating softmax attention via positive orthogonal random features. It supplies explicit unbiasedness proofs, variance bounds decreasing with feature count, and uniform convergence results derived from kernel approximation theory. These are not reductions to fitted parameters, self-citations, or ansatzes smuggled from prior author work; the construction is presented as novel and externally grounded. Experiments confirm the approximation suffices for competitive performance, but the theoretical claims stand independently of the empirical results. This matches the default expectation of a self-contained derivation with no load-bearing circular steps.
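For readers unfamiliar with orthogonal random features, one common way to draw them is to orthogonalise Gaussian blocks with a QR decomposition and then restore Gaussian-like row norms. The sketch below follows that recipe under stated assumptions. The function name is hypothetical, and the paper's exact sampling scheme may differ.

```python
import numpy as np

def orthogonal_gaussian_projections(m, d, rng):
    """Stack blocks of d mutually orthogonal rows, each rescaled to a Gaussian-like norm."""
    blocks = []
    for _ in range(-(-m // d)):                          # ceil(m / d) blocks
        g = rng.normal(size=(d, d))
        q_mat, r_mat = np.linalg.qr(g)
        q_mat = q_mat * np.sign(np.diag(r_mat))          # sign fix so q_mat is Haar-distributed
        blocks.append(q_mat.T)                           # rows are orthonormal
    directions = np.vstack(blocks)[:m]
    # Rescale each row to the norm of an independent d-dimensional Gaussian vector.
    norms = np.linalg.norm(rng.normal(size=(m, d)), axis=1, keepdims=True)
    return directions * norms

rng = np.random.default_rng(0)
d, m = 8, 16
w = orthogonal_gaussian_projections(m, d, rng)
gram_first_block = w[:d] @ w[:d].T
off_diagonal = gram_first_block - np.diag(np.diag(gram_first_block))
max_off_diagonal = float(np.max(np.abs(off_diagonal)))
```

Rows are exactly orthogonal within a block and independent across blocks. When m exceeds d, orthogonality therefore holds only block by block.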
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Positive orthogonal random features yield an unbiased or nearly-unbiased estimator of the softmax attention kernel.
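The unbiased half of this assumption can be checked with a short Gaussian-integral identity. The derivation below is a standard calculation for $\omega \sim \mathcal{N}(0, I_d)$, offered as a sketch and not as the paper's own proof. It does not address the nearly-unbiased regime or the variance effect of orthogonal coupling.

```latex
\begin{align*}
\mathbb{E}_{\omega}\!\left[ e^{\omega^\top (q + k)} \right]
  &= e^{\|q + k\|^2 / 2}, \\
\mathbb{E}_{\omega}\!\left[ e^{\omega^\top q - \|q\|^2/2}\, e^{\omega^\top k - \|k\|^2/2} \right]
  &= e^{\|q + k\|^2/2 \,-\, \|q\|^2/2 \,-\, \|k\|^2/2}
   = e^{q^\top k}.
\end{align*}
```

By linearity of expectation, averaging this product over m projections stays unbiased as long as each projection is marginally $\mathcal{N}(0, I_d)$, whether or not the projections are independent of one another.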
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced
Relation between the paper passage and the cited Recognition theorem: unclear
FAVOR+ can be also used to efficiently model kernelizable attention mechanisms beyond softmax. This representational power is crucial to accurately compare softmax with other kernels for the first time on large-scale tasks
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 39 Pith papers
-
RoFormer: Enhanced Transformer with Rotary Position Embedding
RoFormer introduces rotary position embeddings that encode absolute positions via rotation matrices and relative dependencies in attention, outperforming prior position methods on long text classification tasks.
-
QLAM: A Quantum Long-Attention Memory Approach to Long-Sequence Token Modeling
QLAM extends state-space models with quantum superposition in the hidden state for linear-time long-sequence modeling and reports consistent gains over RNN and transformer baselines on sequential image tasks.
-
Complex-Valued Phase-Coherent Transformer
PCT replaces softmax token competition with a smooth phase-preserving gate on normalized complex similarities, yielding stronger generalization on long-range and phase-sensitive benchmarks than both real and complex T...
-
Retrieval from Within: An Intrinsic Capability of Attention-Based Models
Attention-based models can intrinsically retrieve and reuse pre-encoded evidence chunks via decoder attention queries, unifying retrieval with generation and outperforming external RAG pipelines on QA benchmarks.
-
FLUID: Continuous-Time Hyperconnected Sparse Transformer for Sink-Free Learning
FLUID is a continuous-time transformer using Liquid Attention Networks to model attention as stable ODE solutions that interpolate between discrete SDPA and CT-RNNs, with an explicit sink gate and liquid hyper-connect...
-
Stream-CQSA: Avoiding Out-of-Memory in Attention Computation via Flexible Workload Scheduling
Stream-CQSA uses CQS-based decomposition to stream exact attention computations for billion-token sequences on limited-memory hardware.
-
SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection
SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.
-
Elastic Attention Cores for Scalable Vision Transformers
VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...
-
Search Your Block Floating Point Scales!
ScaleSearch optimizes block floating point scales via fine-grained search to cut quantization error by 27% for NVFP4, improving PTQ by up to 15 points on MATH500 for Qwen3-8B and attention PPL by 0.77 on Llama 3.1 70B.
-
Compute Where it Counts: Self Optimizing Language Models
SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or r...
-
Where Does Long-Context Supervision Actually Go? Effective-Context Exposure Balancing
EXACT re-allocates training supervision by inverse frequency of long effective-context targets, improving NoLiMa and RULER scores by 5-18 points on Qwen and LLaMA models without degrading standard QA or reasoning.
-
RT-Transformer: The Transformer Block as a Spherical State Estimator
Transformer components arise as the natural solution to precision-weighted directional state estimation on the hypersphere.
-
Practical Wi-Fi-based Motion Recognition Under Variable Traffic Patterns
A sampling-rate-versatile transformer network with dynamic augmentation achieves stable high accuracy for Wi-Fi-based motion and gesture recognition across variable sampling rates and traffic patterns.
-
Training Transformers for KV Cache Compressibility
Training transformers with KV sparsification during continued pretraining produces representations that admit better post-hoc KV cache compression, improving quality under memory budgets for long-context tasks.
-
Training Transformers for KV Cache Compressibility
KV compressibility is a property of learned transformer representations that can be improved by training with KV sparsification, leading to better quality-budget tradeoffs in downstream compression for retrieval, QA, ...
-
Retrieval from Within: An Intrinsic Capability of Attention-Based Models
Attention-based models can retrieve evidence intrinsically by using decoder attention to score and reuse their own pre-encoded chunks, outperforming separate retrieval pipelines on QA benchmarks.
-
Linear-Time Global Visual Modeling without Explicit Attention
Dynamic parameterization of standard layers can replace explicit attention for linear-time global visual modeling.
-
GateMOT: Q-Gated Attention for Dense Object Tracking
GateMOT proposes Q-Gated Attention to enable linear-complexity, spatially aware attention for state-of-the-art dense object tracking on benchmarks like BEE24.
-
Reducing Detail Hallucinations in Long-Context Regulatory Understanding via Targeted Preference Optimization
DetailDPO cuts detail-level hallucination errors in LLMs on long regulatory documents by 42-61% using targeted contrastive pairs on a new 13,000-pair benchmark.
-
HubRouter: A Pluggable Sub-Quadratic Routing Primitive for Hybrid Sequence Models
HubRouter is a sub-quadratic routing primitive using learned hubs that replaces attention layers in hybrid models while delivering competitive perplexity and large throughput gains.
-
MICA: Multivariate Infini Compressive Attention for Time Series Forecasting
MICA adapts infini compressive attention to the channel dimension, enabling scalable cross-channel dependencies in Transformers and cutting forecast error by 5.4% on average versus channel-independent baselines.
-
MICA: Multivariate Infini Compressive Attention for Time Series Forecasting
MICA adds linearly scaling compressive cross-channel attention to Transformers, cutting average forecast error by 5.4% and ranking first among multivariate baselines.
-
Attention to Mamba: A Recipe for Cross-Architecture Distillation
A two-stage distillation recipe converts a Pythia-1B Transformer into a Mamba model that preserves performance with perplexity 14.11 versus the teacher's 13.86.
-
YOLOv12: Attention-Centric Real-Time Object Detectors
YOLOv12 is a new attention-based real-time object detector that reports higher accuracy than YOLOv10, YOLOv11, and RT-DETR variants at comparable or better speed and efficiency.
-
DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models
DeepSpeed-Ulysses keeps communication volume constant for sequence-parallel attention when sequence length and device count scale together, delivering 2.5x faster training on 4x longer sequences than prior SOTA.
-
Token Merging: Your ViT But Faster
Token Merging (ToMe) doubles the throughput of large Vision Transformers on images, video, and audio by merging similar tokens with a fast matching algorithm, incurring only 0.2-0.4% accuracy loss.
-
PaLM: Scaling Language Modeling with Pathways
PaLM 540B demonstrates continued scaling benefits by setting new few-shot SOTA results on hundreds of benchmarks and outperforming humans on BIG-bench.
-
Rethinking Random Transformers as Adaptive Sequence Smoothers for Sleep Staging
Randomly initialized Transformers act as adaptive sequence smoothers for sleep staging via a Random Attention Prior Kernel, with gains mainly from inductive bias rather than training.
-
One for All: A Non-Linear Transformer can Enable Cross-Domain Generalization for In-Context Reinforcement Learning
Non-linear transformers enable cross-domain generalization in in-context RL by representing value functions from different domains with shared weights inside a shared RKHS.
-
Kaczmarz Linear Attention
Kaczmarz Linear Attention replaces the empirical coefficient in Gated DeltaNet with a key-norm-normalized step size derived from the online regression objective, yielding lower perplexity and better needle-in-haystack...
-
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
MDN parallelizes stepwise momentum for delta linear attention using geometric reordering and dynamical systems analysis, yielding performance gains over Mamba2 and GDN on 400M and 1.3B models.
-
StreamIndex: Memory-Bounded Compressed Sparse Attention via Streaming Top-k
Chunked streaming top-k enables CSA indexer execution at 1M sequence length with 6.21 GB peak memory and >=0.998 recall on synthetic V4-shaped inputs.
-
Working Memory Constraints Scaffold Learning in Transformers under Data Scarcity
Fixed-width and decay-based attention mechanisms inspired by working memory improve Transformer grammatical accuracy and human alignment under limited training data.
-
Seeing Further and Wider: Joint Spatio-Temporal Enlargement for Micro-Video Popularity Prediction
A new joint spatio-temporal enlargement model for micro-video popularity prediction using frame scoring for long sequences and a topology-aware memory bank for unbounded historical associations.
-
FG$^2$-GDN: Enhancing Long-Context Gated Delta Networks with Doubly Fine-Grained Control
FG²-GDN replaces the scalar beta in the delta update with a channel-wise vector and decouples key/value scaling to improve recall over prior GDN and KDA models.
-
NFTDELTA: Detecting Permission Control Vulnerabilities in NFT Contracts through Multi-View Learning
NFTDELTA detects permission control vulnerabilities in NFT contracts by combining sequence and graph views of function CFGs, reporting 241 confirmed issues across 795 collections with 97.92% average precision.
-
VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation
VFA optimizes Flash Attention by pre-computing global max approximations from key blocks and reordering traversal to reduce vector bottlenecks while preserving exact computation.
-
A Geometric Algebra-informed NeRF Framework for Generalizable Wireless Channel Prediction
GAI-NeRF combines geometric algebra attention and an adaptive ray tracing module inside a NeRF model to deliver more accurate and generalizable wireless channel predictions across varied indoor environments.
-
Image Classification via Random Dilated Convolution with Multi-Branch Feature Extraction and Context Excitation
RDCNet reports state-of-the-art accuracy on CIFAR-10, CIFAR-100, SVHN, Imagenette, and Imagewoof by combining random dilated convolutions with multi-branch and attention modules.