arXiv: 2411.12502 · v4 · submitted 2024-11-19 · 💻 cs.LG · cs.AI · stat.ML

Transformer Neural Processes - Kernel Regression

Pith reviewed 2026-05-23 08:31 UTC · model grok-4.3

classification: 💻 cs.LG · cs.AI · stat.ML
keywords: neural processes · kernel regression · transformer attention · scan attention · deep kernel attention · scalable models · meta-learning

The pith

TNP-KR scales Neural Processes to 100K context points using a kernel regression block and new attention mechanisms while achieving state-of-the-art results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops Transformer Neural Process - Kernel Regression (TNP-KR) to address the quadratic complexity bottleneck in modern Neural Processes. It introduces a Kernel Regression Block with reduced complexity, a kernel-based attention bias, scan attention for memory efficiency and invariance, and deep kernel attention for linear complexity. These allow fast inference on very large datasets. The variants outperform or match top methods on benchmarks in meta regression, Bayesian optimization, image completion, and epidemiology. If correct, this makes accurate stochastic process modeling practical for big data applications.

Core claim

The central claim is that TNP-KR, built from the KRBlock, a kernel-based attention bias, SA, and DKA, keeps the posterior predictive modeling of NPs while reducing complexity to O(n_c^2 + n_c n_t) with the KRBlock, or O(n_c) with DKA. This enables inference with 100K context points on over 1M test points in under a minute and delivers performance that matches or exceeds the state of the art on the benchmarks.
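To make the scaling argument concrete, the short sketch below counts attention-score entries at the sizes quoted in the claim. It rests on two assumptions that this page does not spell out: that the O(n^2) baseline attends over all n = n_c + n_t points jointly, and that the n_c^2 and n_c n_t terms correspond to context self-attention and test-to-context attention. The function names are illustrative only.

```python
def full_attention_pairs(n_c: int, n_t: int) -> int:
    """Score entries when every point attends to every point (assumed baseline)."""
    n = n_c + n_t
    return n * n


def krblock_pairs(n_c: int, n_t: int) -> int:
    """Score entries for the n_c^2 + n_c * n_t cost stated in the claim."""
    return n_c * n_c + n_c * n_t


n_c, n_t = 100_000, 1_000_000  # sizes quoted in the claim
full = full_attention_pairs(n_c, n_t)
kr = krblock_pairs(n_c, n_t)
ratio = full / kr
print(f"full: {full:.3e}, KRBlock: {kr:.3e}, ratio: {ratio:.1f}")
# prints "full: 1.210e+12, KRBlock: 1.100e+11, ratio: 11.0"
```

Under these assumptions the saving at this size is about an order of magnitude in score entries. The count says nothing about wall-clock time, which also depends on memory traffic and implementation.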

What carries the argument

The Kernel Regression Block (KRBlock) is a transformer block with O(n_c^2 + n_c n_t) complexity that incorporates a kernel-based attention bias. It is paired with either scan attention (SA) or deep kernel attention (DKA).
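One way to read the "kernel regression" in the block's name is as attention in which each test point attends only to context points and the scores carry a distance-based bias. The sketch below is a minimal illustration of that reading, not the paper's implementation: `rbf_bias` and `lengthscale` are hypothetical, the learned query-key scores are set to zero, and inputs are one-dimensional. With those simplifications the attention output reduces to classical Nadaraya-Watson kernel regression.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def rbf_bias(x_query, x_key, lengthscale=1.0):
    """Hypothetical log-RBF bias: more negative as the points move apart."""
    return -((x_query - x_key) ** 2) / (2.0 * lengthscale ** 2)


def kernel_regression_attention(x_test, x_ctx, y_ctx, lengthscale=1.0):
    """Each test point attends to context points only (n_t * n_c scores)."""
    preds = []
    for xq in x_test:
        # Learned query-key scores are omitted (treated as zero), so the
        # bias alone sets the weights: Nadaraya-Watson kernel regression.
        weights = softmax([rbf_bias(xq, xk, lengthscale) for xk in x_ctx])
        preds.append(sum(w * y for w, y in zip(weights, y_ctx)))
    return preds


x_ctx = [0.0, 1.0, 2.0, 3.0]
y_ctx = [0.0, 1.0, 4.0, 9.0]
preds = kernel_regression_attention([1.0, 2.5], x_ctx, y_ctx, lengthscale=0.3)
print(preds)
```

Because the weights are a softmax, every prediction is a convex combination of the context targets. In a trained model the learned scores would be added to the bias rather than dropped.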

If this is right

  • TNP-KR with DKA outperforms its Performer counterpart on nearly every benchmark.
  • TNP-KR with SA achieves state-of-the-art results on meta regression, Bayesian optimization, image completion, and epidemiology benchmarks.
  • Both variants can perform inference with 100K context points on over 1M test points in under a minute on a single 24GB GPU.
  • The enhancements preserve the modeling fidelity of standard Neural Processes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This could make Neural Processes viable for real-time large-scale applications where GPs were previously too slow.
  • The translation invariance property of SA might improve performance on spatially structured data.
  • Future work could test these mechanisms on even larger scales or different modalities like time series.
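The translation-invariance point can be illustrated with a toy check: if attention scores depend on inputs only through differences between locations, then shifting every input by the same offset leaves the attention weights unchanged. The sketch assumes a squared-distance bias and no learned scores; `attention_weights` is a hypothetical helper, not something taken from the paper.

```python
import math


def attention_weights(x_query, x_ctx, lengthscale=1.0):
    """Softmax weights from a bias that depends only on x_query - x_key."""
    scores = [-((x_query - xk) ** 2) / (2.0 * lengthscale ** 2) for xk in x_ctx]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


x_ctx = [0.0, 0.5, 2.0]
shift = 8.0  # arbitrary offset applied to every input location

original = attention_weights(1.0, x_ctx)
shifted = attention_weights(1.0 + shift, [x + shift for x in x_ctx])

# Largest change in any attention weight caused by the shift.
max_gap = max(abs(a - b) for a, b in zip(original, shifted))
print(max_gap)
```

If absolute coordinates also entered the scores, for example through learned embeddings of the raw locations, the two weight vectors would generally differ.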

Load-bearing premise

The kernel-based attention bias together with the scan attention and deep kernel attention mechanisms preserve the modeling fidelity of standard Neural Processes while delivering the stated complexity reductions and benchmark improvements.

What would settle it

Running the benchmarks and finding that TNP-KR does not outperform the baselines or cannot process 100K context points in the claimed time on a 24GB GPU would disprove the claims.

Figures

Figures reproduced from arXiv: 2411.12502 by Daniel Jenson, Elizaveta Semenova, Jhonathan Navott, Makkunda Sharma, Mengyan Zhang, Seth Flaxman.

Figure 1. … and the pseudocode is presented in Algorithm 1. The time and space complexity for all models is in …
Figure 2. TNP-KR Scan on a SIR 64x64 sample. From the left, the panels are Task, Uncertainty, Prediction, and Ground Truth. For the Task, Prediction, and Ground Truth panels, blue represents susceptible individuals, magenta represents infected individuals, and green represents recovered individuals. Uncertainty is measured with a heatmap ranging from dark purple (low uncertainty) to bright red (high uncertainty). (B…
Figure 3. 1D GP RBF samples.
Figure 4. 1D GP Periodic samples.
Figure 5. 1D GP Matérn 3/2 samples.
Figure 6. An example of one of the highest regrets for TNP-KR: SCAN. Red is proposed minimum and green is actual minimum.
Figure 7. Another example of one of the highest regrets for TNP-KR: SCAN. Red is proposed minimum and green is actual minimum.
Figure 8. 2D GP RBF samples. The panels from left to right are Task, Uncertainty, Prediction, and Ground Truth. Uncertainty is measured with a heatmap ranging from dark blue (low uncertainty) to bright red (high uncertainty).
Figure 9. MNIST samples. The panels from left to right are Task, Uncertainty, Prediction, and Ground Truth. Uncertainty is measured with a heatmap ranging from dark blue (low uncertainty) to bright red (high uncertainty).
Figure 10. CelebA samples. The panels from left to right are Task, Uncertainty, Prediction, and Ground Truth. Uncertainty is measured with a heatmap ranging from dark blue (low uncertainty) to bright red (high uncertainty).
Figure 11. CIFAR-10 samples. The panels from left to right are Task, Uncertainty, Prediction, and Ground Truth. Uncertainty is measured with a heatmap ranging from dark blue (low uncertainty) to bright red (high uncertainty).
Figure 12. SIR 64x64 samples. The panels from left to right are Task, Uncertainty, Prediction, and Ground Truth. Uncertainty is measured with a heatmap ranging from dark blue (low uncertainty) to bright red (high uncertainty).
Figure 13. SIR 128x128 samples for models trained on 64x64 images. The panels from left to right are Task, Uncertainty, Prediction, and Ground Truth. Uncertainty is measured with a heatmap ranging from dark blue (low uncertainty) to bright red (high uncertainty).
Original abstract

Neural Processes (NPs) are a rapidly evolving class of models designed to directly model the posterior predictive distribution of stochastic processes. Originally developed as a scalable alternative to Gaussian Processes (GPs), which are limited by $O(n^3)$ runtime complexity, the most accurate modern NPs can often rival GPs but still suffer from an $O(n^2)$ bottleneck due to their attention mechanism. We introduce the Transformer Neural Process - Kernel Regression (TNP-KR), a scalable NP featuring: (1) a Kernel Regression Block (KRBlock), a simple, extensible, and parameter efficient transformer block with complexity $O(n_c^2 + n_c n_t)$, where $n_c$ and $n_t$ are the number of context and test points, respectively; (2) a kernel-based attention bias; and (3) two novel attention mechanisms: scan attention (SA), a memory-efficient scan-based attention that when paired with a kernel-based bias can make TNP-KR translation invariant, and deep kernel attention (DKA), a Performer-style attention that implicitly incoporates a distance bias and further reduces complexity to $O(n_c)$. These enhancements enable both TNP-KR variants to perform inference with 100K context points on over 1M test points in under a minute on a single 24GB GPU. On benchmarks spanning meta regression, Bayesian optimization, image completion, and epidemiology, TNP-KR with DKA outperforms its Performer counterpart on nearly every benchmark, while TNP-KR with SA achieves state-of-the-art results.
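The abstract describes scan attention only as memory-efficient and scan-based. A common way to get that property is an online softmax that streams over chunks of keys while carrying a running maximum, normalizer, and weighted sum, so the full score row never has to be held at once. The sketch below assumes SA belongs to that family; it handles a single query with scalar values, and the names `scan_attention` and `chunk_size` are illustrative rather than the authors'.

```python
import math


def naive_attention(scores, values):
    """Reference: softmax over all scores at once."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return sum(e * v for e, v in zip(exps, values)) / total


def scan_attention(scores, values, chunk_size=2):
    """Stream over key chunks, carrying (running max, normalizer, weighted sum)."""
    run_max, norm, acc = -math.inf, 0.0, 0.0
    for start in range(0, len(scores), chunk_size):
        s_chunk = scores[start:start + chunk_size]
        v_chunk = values[start:start + chunk_size]
        new_max = max(run_max, max(s_chunk))
        # Rescale what has been accumulated so far to the new reference max.
        scale = math.exp(run_max - new_max)
        norm *= scale
        acc *= scale
        for s, v in zip(s_chunk, v_chunk):
            w = math.exp(s - new_max)
            norm += w
            acc += w * v
        run_max = new_max
    return acc / norm


scores = [0.1, 2.0, -1.0, 3.5, 0.7]
values = [1.0, -2.0, 0.5, 4.0, 3.0]
gap = abs(scan_attention(scores, values) - naive_attention(scores, values))
print(gap)
```

The carried state has a fixed size regardless of how many keys are processed, which is where the memory saving comes from. The arithmetic cost is unchanged: every score is still computed once.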

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Transformer Neural Process - Kernel Regression (TNP-KR), a scalable variant of Neural Processes. It proposes a Kernel Regression Block (KRBlock) with O(n_c² + n_c n_t) complexity, a kernel-based attention bias, scan attention (SA) that achieves translation invariance when paired with the bias, and deep kernel attention (DKA) that reduces complexity to O(n_c) while implicitly incorporating a distance bias. The authors claim that TNP-KR with DKA outperforms a Performer baseline on nearly all benchmarks and that TNP-KR with SA reaches state-of-the-art results across meta-regression, Bayesian optimization, image completion, and epidemiology tasks, while enabling inference on 100K context points and over 1M test points in under a minute on a single 24GB GPU.
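The O(n_c) figure for DKA is easier to picture with a generic linear-attention sketch: when attention weights factor as an inner product of feature maps, the context can be summarized once in a single pass and each query then costs time independent of n_c. The code below illustrates that general Performer-style idea only. The fixed `feature_map` is a hypothetical stand-in for whatever learned map the paper uses, and no attempt is made to reproduce the implicit distance bias.

```python
import math


def feature_map(x):
    """Hypothetical positive feature map standing in for a learned network."""
    return [math.exp(-(x - c) ** 2) for c in (-1.0, 0.0, 1.0)]


def summarize_context(x_ctx, y_ctx):
    """One pass over the context: O(n_c) time, summary size independent of n_c."""
    dim = len(feature_map(0.0))
    kv = [0.0] * dim  # sum of phi(x_i) * y_i
    k = [0.0] * dim   # sum of phi(x_i)
    for x, y in zip(x_ctx, y_ctx):
        phi = feature_map(x)
        for j in range(dim):
            kv[j] += phi[j] * y
            k[j] += phi[j]
    return kv, k


def predict(x_query, kv, k):
    """Per-query cost depends on the feature dimension, not on n_c."""
    phi_q = feature_map(x_query)
    num = sum(p * a for p, a in zip(phi_q, kv))
    den = sum(p * b for p, b in zip(phi_q, k))
    return num / den


def predict_quadratic(x_query, x_ctx, y_ctx):
    """Same estimator written with explicit pairwise weights, for comparison."""
    phi_q = feature_map(x_query)
    weights = [sum(p * q for p, q in zip(phi_q, feature_map(x))) for x in x_ctx]
    return sum(w * y for w, y in zip(weights, y_ctx)) / sum(weights)


x_ctx = [-1.5, -0.2, 0.4, 1.3]
y_ctx = [2.0, 0.5, -1.0, 3.0]
kv, k = summarize_context(x_ctx, y_ctx)
linear = predict(0.1, kv, k)
quadratic = predict_quadratic(0.1, x_ctx, y_ctx)
print(linear, quadratic)
```

Both routes compute the same weighted average; the linear one simply reorders the sums so that the pairwise weight matrix is never formed.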

Significance. If the modeling fidelity of the original Neural Process posterior is preserved, the work would offer a practical advance in scalable stochastic process modeling by combining transformer efficiency with kernel-inspired inductive biases, potentially enabling larger-scale applications than current O(n²) attention-based NPs while matching or exceeding GP-level accuracy on the reported tasks.

major comments (2)
  1. [Abstract / Methods (KRBlock, SA, DKA descriptions)] The central claim that TNP-KR variants remain faithful Neural Processes (i.e., model the same posterior predictive distribution) while delivering the stated complexity reductions rests on unverified assumptions about KRBlock, the kernel bias, SA, and DKA. No derivation or analysis is provided showing that the modified attention outputs are consistent with the original NP conditional distribution; deviations could produce benchmark gains through altered inductive bias rather than improved scalability.
  2. [Abstract / Experiments] The headline performance claims (DKA outperforming Performer baseline; SA reaching SOTA) require empirical verification that the new mechanisms preserve NP posterior fidelity. Without such verification, it is impossible to distinguish whether reported improvements stem from better modeling or from the changed attention bias and approximation.
minor comments (2)
  1. [Abstract] Typo: 'incoporates' should be 'incorporates'.
  2. [Abstract] The abstract states complexities (O(n_c² + n_c n_t), O(n_c)) but does not reference the corresponding equations or pseudocode that would allow direct verification of the claimed scaling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments regarding the fidelity of TNP-KR to the Neural Process framework. We address each major comment below with clarifications on the manuscript's claims and scope.

  1. Referee: [Abstract / Methods (KRBlock, SA, DKA descriptions)] The central claim that TNP-KR variants remain faithful Neural Processes (i.e., model the same posterior predictive distribution) while delivering the stated complexity reductions rests on unverified assumptions about KRBlock, the kernel bias, SA, and DKA. No derivation or analysis is provided showing that the modified attention outputs are consistent with the original NP conditional distribution; deviations could produce benchmark gains through altered inductive bias rather than improved scalability.

    Authors: The manuscript introduces TNP-KR explicitly as 'a scalable variant of Neural Processes' and 'a scalable NP', with the goal of modeling the posterior predictive distribution of stochastic processes using transformer-based mechanisms that incorporate kernel-inspired inductive biases. It does not claim that the KRBlock, kernel attention bias, scan attention, or deep kernel attention produce exactly the same conditional distribution as any specific prior NP model. The original NP literature itself employs attention-based approximations to the conditional without formal equivalence to GPs. The presented work focuses on achieving the stated complexity reductions while maintaining the NP objective of direct posterior predictive modeling, as evidenced by the architecture and training procedure. We therefore disagree that the claims rest on unverified assumptions of exact distributional equivalence. revision: no

  2. Referee: [Abstract / Experiments] The headline performance claims (DKA outperforming Performer baseline; SA reaching SOTA) require empirical verification that the new mechanisms preserve NP posterior fidelity. Without such verification, it is impossible to distinguish whether reported improvements stem from better modeling or from the changed attention bias and approximation.

    Authors: The headline claims are substantiated by the experimental results on meta-regression, Bayesian optimization, image completion, and epidemiology benchmarks, where TNP-KR with DKA outperforms the Performer baseline on nearly every task and TNP-KR with SA reaches state-of-the-art performance, all under consistent evaluation settings from prior NP work. These tasks directly assess the quality of the modeled posterior predictive distribution via standard metrics. While the manuscript does not include separate experiments (such as explicit posterior calibration or likelihood comparisons isolating the attention changes), the consistent empirical gains across diverse tasks support that the mechanisms improve modeling effectiveness within the NP paradigm. We can add a clarifying sentence in the discussion section noting the scope of the claims. revision: partial
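The "explicit posterior calibration or likelihood comparisons" mentioned in this response could take a form like the sketch below, which computes mean Gaussian negative log-likelihood and the empirical coverage of a central predictive interval. It assumes Gaussian predictive marginals, which this page does not confirm for TNP-KR, and the toy numbers are made up for illustration; they are not results from the paper.

```python
import math


def gaussian_nll(y, mean, std):
    """Negative log-likelihood of y under N(mean, std^2)."""
    var = std * std
    return 0.5 * math.log(2.0 * math.pi * var) + (y - mean) ** 2 / (2.0 * var)


def mean_nll(ys, means, stds):
    """Average NLL over a set of test targets."""
    return sum(gaussian_nll(y, m, s) for y, m, s in zip(ys, means, stds)) / len(ys)


def coverage(ys, means, stds, z=1.96):
    """Fraction of targets inside the central interval mean +/- z * std."""
    hits = sum(1 for y, m, s in zip(ys, means, stds) if abs(y - m) <= z * s)
    return hits / len(ys)


# Toy predictions (made-up numbers, not results from the paper).
ys = [0.1, -0.4, 1.2, 3.0]
means = [0.0, 0.0, 1.0, 1.0]
stds = [1.0, 0.5, 0.5, 0.5]

print(mean_nll(ys, means, stds), coverage(ys, means, stds))
```

For a well-calibrated Gaussian predictive, coverage at z = 1.96 should sit near 0.95 on a large test set. Comparing both numbers with and without a given attention change would be one way to isolate its effect.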

Circularity Check

0 steps flagged

No significant circularity; new architecture and mechanisms defined independently of their claimed outputs

The paper introduces TNP-KR via explicitly defined components (the KRBlock with stated O(n_c² + n_c n_t) complexity, a kernel-based attention bias, and the two attention mechanisms SA and DKA) whose properties and benchmark performance are presented as empirical engineering results rather than derived predictions. No equations, definitions, or central claims reduce by construction to fitted parameters, self-citations, or prior inputs; the posterior fidelity and complexity reductions are asserted as design outcomes verified on external benchmarks, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no free parameters, axioms, or invented entities are identifiable. The paper introduces new architectural blocks but provides insufficient detail for ledger entries.

pith-pipeline@v0.9.0 · 5833 in / 1071 out tokens · 36409 ms · 2026-05-23T08:31:34.190685+00:00 · methodology
