pith. sign in

arxiv: 2510.04008 · v5 · submitted 2025-10-05 · 💻 cs.LG · cs.AI

RACE Attention: A Strictly Linear-Time Attention Layer for Training on Outrageously Large Contexts

Pith reviewed 2026-05-18 10:02 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords linear attentionlong context traininglocality sensitive hashingrandom projectionstransformer efficiencysequence modeling
0
0 comments X

The pith

RACE Attention approximates softmax attention using sharpened angular similarity and soft LSH to achieve strictly linear time in sequence length.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RACE Attention as an alternative to standard softmax attention in transformers. Standard attention scales quadratically with sequence length, limiting its use on very long contexts. RACE Attention replaces the exponential kernel with a sharpened angular similarity and uses Gaussian random projections along with soft locality-sensitive hashing to approximate the attention output without ever building the full matrix. This results in linear scaling in both sequence length and embedding dimension. Experiments show it matches or exceeds baseline performance on language modeling and classification tasks up to 64K tokens while enabling processing of millions of tokens on existing hardware.

Core claim

RACE Attention is a kernel-inspired method that approximates attention outputs via Gaussian random projections and soft Locality-Sensitive Hashing (LSH) after replacing the exponential kernel with a sharpened angular similarity, yielding a strictly linear-time attention layer that avoids construction of the full attention matrix.

What carries the argument

Repeated Arrays-of-Count Estimators (RACE) that combine Gaussian random projections with soft LSH to estimate attention scores using sharpened angular similarity.

If this is right

  • RACE Attention matches or outperforms strong baselines on language modeling, masked language modeling, and text/image classification up to 64K sequence length.
  • It reduces wall-clock time and memory usage compared to quadratic attention.
  • It enables a single forward-backward pass on up to 12 million tokens on an NVIDIA GH200 GPU.
  • It enables processing up to 75 million tokens on an Intel Xeon Gold 5220R CPU in a single pass.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This approach could allow training large language models on contexts far beyond current limits without specialized hardware.
  • Integration with other linear attention methods might further improve efficiency or accuracy.
  • Testing on multimodal tasks with even longer sequences could reveal additional benefits or limitations of the approximation.

Load-bearing premise

The sharpened angular similarity combined with Gaussian random projections and soft LSH produces attention outputs of sufficient quality to match or exceed softmax attention on downstream tasks without harmful approximation artifacts.

What would settle it

Demonstrating that on a controlled long-context task with sequence lengths exceeding 1 million tokens, the downstream performance of RACE Attention falls below that of softmax attention due to accumulated approximation errors.

Figures

Figures reproduced from arXiv: 2510.04008 by Agniva Chowdhury, Amar Kanakamedala, Anshumali Shrivastava, Ekam Singh, Evan Tu, Sahil Joshi.

Figure 1
Figure 1. Figure 1: This figure demonstrates the difference between the linear complexity of RACE Attention and the quadratic complexity of Softmax Attention mechanism. Specifically, we highlight how the final representation o5 is computed under Softmax versus RACE. In Softmax, the entire fifth column of the attention score matrix is required. In contrast, RACE does not require the full matrix; instead, it aggregates statisti… view at source ↗
Figure 2
Figure 2. Figure 2: Comparison of Softmax and Angular kernels at different sharpening levels γ. As γ (or non-linearity) increases, Angular transitions from flat similarity scores to a sharper distribution, recovering behavior similar to the exponential in the Softmax. 3.2 The Final Algorithm At a high level, RACE does not approximate the N ×N score matrix (which would remain quadratic). Instead, it sketches the sufficient sta… view at source ↗
Figure 3
Figure 3. Figure 3: A rigorous scaling stress-test across hardware. The top row shows GPU scaling results; the bottom row shows CPU scaling results. We run a single forward-backward pass configured with 1 batch, 4 heads, and an embedding dimension of 128. Linformer and Performer use the same low-rank/feature dimension as in [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: A rigorous scaling stress-test (including FlashAttention) across hardware. Plots (a)–(b) use loga￾rithmic axes. RACE is evaluated with (P=2, L=2, M=1) throughout; Linformer and Performer use the same low-rank/feature dimension as in [PITH_FULL_IMAGE:figures/full_fig_p010_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: A rigorous scaling test for algo￾rithmic comparison between FlashAttention on GPU vs. RACE Attention on CPU While GPUs are obviously significantly faster than CPUs for the same algorithm, if we compare FlashAttention on GPU vs. RACE Attention on CPU, we observe the real power of algorithmic acceleration [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: RACE Attention pipeline from the inputs Q, K, V ∈ RN×d to the output O ∈ RN×d : queries/keys are soft-hashed into R buckets across L tables and M ensembles, keys/values form per-bucket summaries, and each query mixes the matched summaries to produce O. 27 [PITH_FULL_IMAGE:figures/full_fig_p027_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: An intuitive schematic of how RACE Attention runs with L hash tables and R buckets per table. Similarity between Queries and Keys is highest if they both hash to same buckets across all hash tables. 28 [PITH_FULL_IMAGE:figures/full_fig_p028_7.png] view at source ↗
read the original abstract

Softmax Attention has a quadratic time complexity in sequence length, which becomes prohibitive to run at long contexts, even with highly optimized GPU kernels. For example, FlashAttention-2/3 (exact, GPU-optimized implementations of Softmax Attention) cannot complete a single forward-backward pass of a single attention layer once the context exceeds ~4 million tokens on an NVIDIA GH200 (96 GB). We introduce Repeated Arrays-of-Count Estimators (RACE) Attention, a kernel-inspired alternative to Softmax Attention that is strictly linear in sequence length and embedding size. RACE Attention replaces the exponential kernel with a sharpened angular similarity, and approximates attention outputs via Gaussian random projections and soft Locality-Sensitive Hashing (LSH), avoiding construction of the full attention matrix. Across language modeling, masked language modeling, and text/image classification, RACE Attention matches or outperforms strong baselines up to 64K seqeuence length while reducing wall-clock time and memory usage. In addition, we conduct a controlled scaling study on a single attention layer and demonstrate processing of up to 12 million tokens on an NVIDIA GH200 GPU and 75 million tokens on an Intel Xeon Gold 5220R CPU in a single forward-backward pass, which is well beyond the capabilities of current state-of-the-art attention implementations. RACE Attention thus offers a practical and theoretically grounded mechanism for long-context training on today's hardware. We release our code at https://github.com/sahiljoshi515/RACE_Attention.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript presents RACE Attention, a new attention layer that achieves strictly linear time complexity in sequence length by replacing the softmax kernel with a sharpened angular similarity and approximating the attention computation using Gaussian random projections combined with soft Locality-Sensitive Hashing (LSH) in a Repeated Arrays-of-Count Estimators (RACE) framework. The authors report that this approach matches or exceeds the performance of standard attention on language modeling, masked LM, and classification tasks for sequences up to 64K tokens, while enabling single-layer forward-backward passes on up to 12M tokens on GPU and 75M on CPU.

Significance. Should the approximation errors prove to be well-controlled and not to degrade representation quality or training stability at extreme lengths, the result would be of high significance. It directly addresses the quadratic scaling barrier in transformer attention, potentially allowing models to train on contexts that are currently infeasible. The empirical scaling demonstrations and code release strengthen the practical contribution.

major comments (3)
  1. §3 (RACE Attention Mechanism): The description of the RACE estimator via Gaussian projections and soft LSH lacks any derivation or bound on the approximation error, bias, or variance as a function of sequence length L and the number of projections/hash functions. Without this, it is difficult to assess whether the estimator remains faithful to softmax attention at 10M+ token scales, which is load-bearing for the central claim of unaffected downstream training dynamics.
  2. §4 (Experiments): The controlled scaling study is limited to a single attention layer. It is unclear if approximation artifacts compound across the multiple layers of a full transformer stack, potentially affecting the validity of extrapolating these results to complete model training on outrageously large contexts.
  3. Abstract and §3: The method is described as 'theoretically grounded,' yet the text provides no formal analysis or proof of properties such as unbiasedness or convergence rates for the combined sharpened angular similarity and soft LSH estimator.
minor comments (3)
  1. Abstract: Typo: 'seqeuence length' should read 'sequence length'.
  2. Throughout: The number of free parameters (e.g., number of Gaussian projections and hash functions) should be explicitly listed and their sensitivity analyzed in the main text or appendix.
  3. Related Work: A more detailed comparison to prior LSH-based methods such as Reformer would help clarify the novelty of the sharpened angular similarity and repeated arrays-of-counts approach.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the constructive feedback on our work. We address the major comments in detail below and outline the revisions we plan to make to strengthen the manuscript.

read point-by-point responses
  1. Referee: §3 (RACE Attention Mechanism): The description of the RACE estimator via Gaussian projections and soft LSH lacks any derivation or bound on the approximation error, bias, or variance as a function of sequence length L and the number of projections/hash functions. Without this, it is difficult to assess whether the estimator remains faithful to softmax attention at 10M+ token scales, which is load-bearing for the central claim of unaffected downstream training dynamics.

    Authors: We thank the referee for highlighting this important aspect. While the RACE framework leverages well-studied Gaussian projections for estimating angular similarities and soft LSH for efficient bucketing, our manuscript indeed prioritizes the algorithmic presentation and empirical results over a full theoretical error analysis. We agree that providing bounds on bias and variance would better support the claims at extreme scales. In the revision, we will add a theoretical analysis subsection in §3 that derives the expected error using properties of random projections and discusses variance reduction with increasing hash functions. We will also include empirical plots of approximation error versus sequence length to demonstrate control at large L. revision: yes

  2. Referee: §4 (Experiments): The controlled scaling study is limited to a single attention layer. It is unclear if approximation artifacts compound across the multiple layers of a full transformer stack, potentially affecting the validity of extrapolating these results to complete model training on outrageously large contexts.

    Authors: The referee is correct that our scaling experiments isolate the attention mechanism in a single layer to clearly demonstrate the linear scaling and extreme context capabilities. We acknowledge the potential for error accumulation in deeper models. To address this, we will include additional experiments in the revised manuscript using a multi-layer transformer (e.g., 4-6 layers) on long-context tasks, showing that training dynamics remain stable and performance is maintained. We will also add a discussion on why the per-layer approximation errors do not compound destructively based on our observations. revision: yes

  3. Referee: Abstract and §3: The method is described as 'theoretically grounded,' yet the text provides no formal analysis or proof of properties such as unbiasedness or convergence rates for the combined sharpened angular similarity and soft LSH estimator.

    Authors: We used 'theoretically grounded' to indicate that the approach is built on established theoretical tools from random matrix theory and locality-sensitive hashing, which provide guarantees for individual components. However, we accept that this phrasing may imply more formal analysis than is currently present for the integrated estimator. In the revision, we will temper the language in the abstract and §3, replacing it with 'inspired by theoretical results on kernel approximation' or similar, and add references to supporting theory. We do not claim new proofs of unbiasedness or specific convergence rates in this work, as the focus is on the practical linear-time attention implementation. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in RACE Attention derivation

full rationale

The paper presents RACE Attention as a direct construction: it replaces the exponential kernel with sharpened angular similarity and approximates outputs using Gaussian random projections plus soft LSH, avoiding the full matrix. This is a new estimator built from standard random-projection and LSH primitives rather than any fitted parameter renamed as a prediction, self-definitional equation, or load-bearing self-citation chain. The scaling claims rest on the linear-time properties of the chosen approximations, which are independent of the target result and externally verifiable via the released code. No step reduces the claimed linear-time behavior or performance equivalence to an input by construction.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach rests on the assumption that random projections and soft LSH can faithfully approximate the desired attention distribution; no explicit free parameters or invented entities are named in the abstract, but the number of projections and hash functions are typical tunable elements in such methods.

free parameters (1)
  • number of Gaussian projections and hash functions
    Standard hyperparameter in random projection and LSH methods; likely selected to trade off approximation quality against speed and memory.
axioms (1)
  • domain assumption Sharpened angular similarity preserves the essential ranking and weighting properties needed for effective attention.
    Invoked when replacing the exponential kernel; the abstract treats this substitution as valid without further proof.

pith-pipeline@v0.9.0 · 5825 in / 1426 out tokens · 35570 ms · 2026-05-18T10:02:56.809359+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Elastic Attention Cores for Scalable Vision Transformers

    cs.CV 2026-05 unverdicted novelty 6.0

    VECA learns effective visual representations using core-periphery attention where patches interact exclusively via a resolution-invariant set of learned core embeddings, achieving linear O(N) complexity while maintain...

  2. SOCKET: SOft Collision Kernel EsTimator for Sparse Attention

    cs.LG 2026-02 unverdicted novelty 5.0

    SOCKET replaces hard LSH bucket matches with soft probabilistic collision aggregation to enable efficient, high-quality token selection for sparse attention, matching or exceeding prior methods with up to 1.5x through...

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · cited by 2 Pith papers · 8 internal anchors

  1. [1]

    On the Fine-Grained Complexity of Empirical Risk Minimization: Kernel Methods and Neural Networks

    A. Backurs, P. Indyk, and L. Schmidt. On the fine-grained complexity of empirical risk mini- mization: Kernel methods and neural networks.arXiv preprint arXiv:1704.02958, 2017

  2. [2]

    Neural Machine Translation by Jointly Learning to Align and Translate

    D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate.arXiv preprint arXiv:1409.0473, 2014

  3. [3]

    Longformer: The Long-Document Transformer

    I. Beltagy, M. E. Peters, and A. Cohan. Longformer: The long-document transformer.arXiv preprint arXiv:2004.05150, 2020

  4. [4]

    M. S. Charikar. Similarity estimation techniques from rounding algorithms. InProceedings of the 34th Annual ACM Symposium on Theory of Computing (STOC’02), pages 380–388. ACM, 2002. doi: 10.1145/509907.509965

  5. [5]

    The Role of Downflows in Establishing Solar Near-Surface Shear

    B. Chen and A. Shrivastava. Densified winner take all (wta) hashing for sparse datasets. InProceedings of the 34th Conference on Uncertainty in Artificial Intelligence (UAI), 2018. arXiv:1810.00115

  6. [6]

    Choromanski, V

    K. Choromanski, V. Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlós, P. Hawkins, J. Davis, A. Mohiuddin, Ł. Kaiser, D. Belanger, L. Colwell, and A. Weller. Rethinking attention with performers. InInternational Conference on Learning Representations, 2021

  7. [7]

    K. M. Choromanski, M. Rowland, and A. Weller. The unreasonable effectiveness of structured random orthogonal embeddings. In I. Guyon, U. von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. V. N. Vishwanathan, and R. Garnett, editors,Advances in Neural Information Processing Systems (NeurIPS) 30. Curran Associates, Inc., 2017. 12

  8. [8]

    Coleman and A

    B. Coleman and A. Shrivastava. Sub-linear RACE sketches for approximate kernel density es- timation on streaming data. InWWW ’20: The Web Conference 2020, Taipei, Taiwan, April 20–24, 2020, pages 1739–1749. ACM / IW3C2, 2020. doi: 10.1145/3366423.3380244

  9. [9]

    Coleman, R

    B. Coleman, R. Baraniuk, and A. Shrivastava. Sub-linear memory sketches for near neighbor search on streaming data. In H. Daumé III and A. Singh, editors,Proceedings of the 37th In- ternational Conference on Machine Learning, volume 119 ofProceedings of Machine Learning Research, pages 2089–2099. PMLR, 13–18 Jul 2020

  10. [10]

    Flashattention: Fastandmemory-efficientexact attention with io-awareness

    T.Dao, D.Y.Fu, S.Ermon, A.Rudra, andC.Ré. Flashattention: Fastandmemory-efficientexact attention with io-awareness. InAdvances in Neural Information Processing Systems (NeurIPS), 2022

  11. [11]

    Dehghani, S

    M. Dehghani, S. Gouws, O. Vinyals, J. Uszkoreit, and L. Kaiser. Universal transformers. InPro- ceedings of the 7th International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 2019. OpenReview.net

  12. [12]

    Dosovitskiy, L

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. InInternational Conference on Learning Representations (ICLR), 2021

  13. [13]

    TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

    R. Eldan and Y. Li. Tinystories: How small can language models be and still speak coherent english?arXiv preprint arXiv:2305.07759, 2023

  14. [14]

    I. Han, R. Jayaram, A. Karbasi, V. Mirrokni, D. P. Woodruff, and A. Zandieh. Hyperattention: Long-context attention in near-linear time.arXiv preprint arXiv:2310.05869, 2023

  15. [15]

    Katharopoulos, A

    A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are rnns: Fast autoregres- sive transformers with linear attention. InProceedings of the 37th International Conference on Machine Learning. PMLR, 2020

  16. [16]

    Kitaev, Ł

    N. Kitaev, Ł. Kaiser, and A. Levskaya. Reformer: The efficient transformer. InInternational Conference on Learning Representations, 2020

  17. [17]

    Krizhevsky

    A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009

  18. [18]

    Luo and A

    C. Luo and A. Shrivastava. Arrays of (locality-sensitive) count estimators (ace): Anomaly de- tection on the edge. InProceedings of the 2018 World Wide Web Conference, WWW ’18, pages 1439–1448, New York, NY, USA, 2018. ACM. doi: 10.1145/3178876.3186056

  19. [19]

    H. Luo, S. Zhang, M. Lei, and L. Xie. Simplified self-attention for transformer-based end-to-end speech recognition.CoRR, abs/2005.10463, 2020

  20. [20]

    A. L. Maas, R. E. Daly, P. T. Pham, D. Huang, A. Y. Ng, and C. Potts. Learning word vec- tors for sentiment analysis. InProceedings of the 49th Annual Meeting of the Association for Computational Linguistics (ACL), pages 142–150, 2011

  21. [21]

    M. P. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of english: The penn treebank.Computational Linguistics, 19(2):313–330, 1993

  22. [22]

    Regularizing and Optimizing LSTM Language Models

    S. Merity, N. S. Keskar, and R. Socher. Regularizing and optimizing lstm language models.arXiv preprint arXiv:1708.02182, 2017

  23. [23]

    Parmar, A

    N. Parmar, A. Vaswani, J. Uszkoreit, L. Kaiser, N. Shazeer, A. Ku, and D. Tran. Image trans- former. InProceedings of the 35th International Conference on Machine Learning (ICML), vol- 13 ume 80 ofProceedings of Machine Learning Research, pages 4052–4061, Stockholm, Sweden, 2018. PMLR

  24. [24]

    H. Peng, N. Pappas, D. Yogatama, R. Schwartz, N. A. Smith, and L. Kong. Random feature attention. InInternational Conference on Learning Representations (ICLR), 2021

  25. [25]

    Z. Qin, W. Sun, H. Deng, D. Li, Y. Wei, B. Lv, J. Yan, L. Kong, and Y. Zhong. cosFormer: Rethinking softmax in attention.arXiv preprint arXiv:2202.08791, 2022

  26. [26]

    Rahimi and B

    A. Rahimi and B. Recht. Random features for large-scale kernel machines. InAdvances in Neural Information Processing Systems 20 (NeurIPS 2007), pages 1177–1184, Vancouver, British Columbia, Canada, 2007. Curran Associates, Inc

  27. [27]

    Rajpurkar, J

    P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100,000+ questions for machine comprehension of text. InProceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2383–2392, 2016

  28. [28]

    Socher, A

    R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Ng, and C. Potts. Recursive deep models for semantic compositionality over a sentiment treebank. InProceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1631–1642, 2013

  29. [29]

    Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, and D.Metzler. Longrangearena: Abenchmarkforefficienttransformers. InInternational Conference on Learning Representations (ICLR), 2021

  30. [30]

    J. A. Tropp. An introduction to matrix concentration inequalities.Foundations and Trends®in Machine Learning, 8(1-2):1–230, 2015

  31. [31]

    Vaswani, N

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polo- sukhin. Attention is all you need. InAdvances in Neural Information Processing Systems, vol- ume 30, 2017

  32. [32]

    S. Wang, B. Z. Li, M. Khabsa, H. Fang, and H. Ma. Linformer: Self-attention with linear complexity.arXiv preprint arXiv:2006.04768, 2020

  33. [33]

    H. Xiao, K. Rasul, and R. Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms.arXiv preprint arXiv:1708.07747, 2017

  34. [34]

    Xiong, Z

    Y. Xiong, Z. Zeng, R. Chakraborty, M. Tan, G. Fung, Y. Li, and V. Singh. Nyströmformer: A nyström-based algorithm for approximating self-attention.arXiv preprint arXiv:2102.03902, 2021

  35. [35]

    Yagnik, D

    J. Yagnik, D. Strelow, D. A. Ross, and R. Lin. The power of comparative reasoning. In2011 International Conference on Computer Vision (ICCV), pages 2431–2438. IEEE, 2011. doi: 10. 1109/ICCV.2011.6126540

  36. [36]

    Zaheer, G

    M. Zaheer, G. Guruganesh, A. Dubey, J. Ainslie, C. Alberti, S. Ontañón, P. Pham, A. Ravula, Q. Wang, L. Yang, and A. Ahmed. Big bird: Transformers for longer sequences. InAdvances in Neural Information Processing Systems, volume 33, 2020

  37. [37]

    Zhang, J

    X. Zhang, J. Zhao, and Y. LeCun. Character-level convolutional networks for text classification. InAdvances in Neural Information Processing Systems, volume 28, 2015. 14 Appendix Table 13:Experiment Setup and Hyperparameters Dataset Task Hyperparameters PTB Language ModelingN=128; layers=1; heads=2;d=128; batch=16; lr=6e −4;β’s=(0.9, 0.999);ϵ=1e−8; wd=0.1...

  38. [38]

    ˆS(ℓ)= Φ (ℓ) Q ( Φ (ℓ) K )⊤

  39. [39]

    Each row ofΦ (ℓ) Q andΦ (ℓ) K is a probability vector; hence∥Φ(ℓ) Q∥F,∥Φ(ℓ) K∥F≤ √ N

  40. [40]

    16 Proof.Eachϕ (ℓ)(x)is a softmax overR= 2 P corners, so entries are nonnegative and sum to1

    Consequently∥ˆS(ℓ)∥F≤N. 16 Proof.Eachϕ (ℓ)(x)is a softmax overR= 2 P corners, so entries are nonnegative and sum to1. Item (1) is by definition of ˆS(ℓ) ij . For (2), every rowpsatisfies∥p∥2≤∥p∥1 = 1, hence∥Φ(ℓ) Q∥2 F = ∑ i∥ϕ(ℓ)(Qi)∥2 2≤N(and similarly forΦ(ℓ) K ). Item (3) follows from∥AB∥F≤∥A∥F∥B∥F. Having controlled the feature-induced matrix norms, we...

  41. [41]

    Proof.(1)By definition,X (ℓ)= ˆS(ℓ)−E[ˆS(ℓ)], henceE[X (ℓ)] =E[ ˆS(ℓ)]−E[ˆS(ℓ)] = 0

    With v:= max     L∑ ℓ=1 E [( 1 LX (ℓ) )( 1 LX (ℓ) )⊤] 2 ,  L∑ ℓ=1 E [( 1 LX (ℓ) )⊤( 1 LX (ℓ) )] 2   , we havev≤4N2/L. Proof.(1)By definition,X (ℓ)= ˆS(ℓ)−E[ˆS(ℓ)], henceE[X (ℓ)] =E[ ˆS(ℓ)]−E[ˆS(ℓ)] = 0. (2)By Lemma 3(3) we have∥ˆS(ℓ)∥2≤∥ˆS(ℓ)∥F≤N. By convexity of the spectral norm, ∥E[ˆS(ℓ)]∥2≤E [ ∥ˆS(ℓ)∥2 ] ≤N. Therefore, by the...

  42. [42]

    21 Proof.(1) Row-sum control.Since ˆS=S+ ∆and ˆD= diag( ˆS1), we rewrite E:= ˆD−D= diag ( ( ˆS−S)1 ) = diag(∆1)

    If∥E∥2≤smin/2, then∥ˆD−1∥2≤2/smin. 21 Proof.(1) Row-sum control.Since ˆS=S+ ∆and ˆD= diag( ˆS1), we rewrite E:= ˆD−D= diag ( ( ˆS−S)1 ) = diag(∆1). Hence each diagonal entry isEii = (∆1) i = ∑N j=1 ∆ ij, so ∥E∥2 = max i |Eii|= max i ⏐⏐(∆1) i ⏐⏐ ≤max i N∑ j=1 |∆ij|=∥∆∥∞. For the second inequality, by Cauchy–Schwarz on each rowi, N∑ j=1 |∆ij| ≤ √ N ( N∑ j=1...