Transformer Neural Processes - Kernel Regression

Daniel Jenson; Elizaveta Semenova; Jhonathan Navott; Makkunda Sharma; Mengyan Zhang; Seth Flaxman

arXiv: 2411.12502 · v4 · submitted 2024-11-19 · cs.LG · cs.AI · stat.ML

Pith reviewed 2026-05-23 08:31 UTC · model grok-4.3

classification: cs.LG · cs.AI · stat.ML

keywords: neural processes, kernel regression, transformer attention, scan attention, deep kernel attention, scalable models, meta-learning

The pith

TNP-KR scales Neural Processes to 100K context points using a kernel regression block and new attention mechanisms while achieving state-of-the-art results.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops Transformer Neural Process - Kernel Regression (TNP-KR) to address the quadratic complexity bottleneck in modern Neural Processes. It introduces a Kernel Regression Block with reduced complexity, a kernel-based attention bias, scan attention for memory efficiency and invariance, and deep kernel attention for linear complexity. These allow fast inference on very large datasets. The variants outperform or match top methods on benchmarks in meta regression, Bayesian optimization, image completion, and epidemiology. If correct, this makes accurate stochastic process modeling practical for big data applications.

Core claim

The central claim is that TNP-KR with its KRBlock, kernel bias, SA, and DKA maintains the posterior predictive modeling of NPs but reduces complexity to O(n_c^2 + n_c n_t) or O(n_c), enabling inference with 100K context points on over 1M test points in under a minute, and delivering superior or SOTA performance on the benchmarks.

What carries the argument

The Kernel Regression Block (KRBlock) is a transformer block with O(n_c^2 + n_c n_t) complexity that incorporates a kernel-based attention bias. It is paired with either scan attention (SA) or deep kernel attention (DKA).
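A minimal NumPy sketch of how a distance-based bias can enter the attention logits, and where the two cost terms come from. This page does not give the paper's actual KRBlock equations, so the RBF-style bias, the `lengthscale` parameter, and the function names `rbf_bias` and `biased_attention` are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def rbf_bias(xq, xk, lengthscale=1.0):
    """Hypothetical kernel bias: -||xq_i - xk_j||^2 / (2 * lengthscale^2)."""
    diff = xq[:, None, :] - xk[None, :, :]
    return -np.sum(diff ** 2, axis=-1) / (2.0 * lengthscale ** 2)

def biased_attention(q, k, v, bias):
    """Softmax attention with an additive bias on the logits."""
    logits = q @ k.T / np.sqrt(q.shape[-1]) + bias
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(logits)
    w = w / w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
n_c, n_t, d = 6, 4, 8
xc, xt = rng.normal(size=(n_c, 1)), rng.normal(size=(n_t, 1))  # input locations
hc, ht = rng.normal(size=(n_c, d)), rng.normal(size=(n_t, d))  # token embeddings

# Context attends to context: an n_c x n_c score matrix.
hc_new = biased_attention(hc, hc, hc, rbf_bias(xc, xc))
# Test points attend to context only: an n_t x n_c score matrix.
# (For simplicity this sketch uses the pre-update context embeddings.)
ht_new = biased_attention(ht, hc, hc, rbf_bias(xt, xc))
```

Because test points never attend to each other in this sketch, there is no n_t x n_t term, which is one way to arrive at an O(n_c^2 + n_c n_t) cost.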

If this is right

TNP-KR with DKA outperforms its Performer counterpart on nearly every benchmark.
TNP-KR with SA achieves state-of-the-art results on meta regression, Bayesian optimization, image completion, and epidemiology benchmarks.
Both variants can perform inference with 100K context points on over 1M test points in under a minute on a single 24GB GPU.
The enhancements preserve the modeling fidelity of standard Neural Processes.
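A back-of-the-envelope check on the memory side of the scaling claim, assuming float32 scores and a single attention head (both assumptions of this sketch, not details taken from the paper). It counts the entries of the context-to-context and test-to-context score matrices at the stated sizes.

```python
n_c, n_t = 100_000, 1_000_000   # context and test points from the claim
bytes_per_float32 = 4           # assumed precision

context_scores = n_c * n_c      # context-to-context entries
cross_scores = n_c * n_t        # test-to-context entries

context_gb = context_scores * bytes_per_float32 / 1e9
cross_gb = cross_scores * bytes_per_float32 / 1e9
print(f"{context_gb:.0f} GB, {cross_gb:.0f} GB")  # prints "40 GB, 400 GB"
```

Under these assumptions neither matrix fits in 24 GB if stored in full, which is consistent with the emphasis on memory-efficient and linear-complexity attention.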

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

This could make Neural Processes viable for real-time large-scale applications where GPs were previously too slow.
The translation invariance property of SA might improve performance on spatially structured data.
Future work could test these mechanisms on even larger scales or different modalities like time series.
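The translation-invariance point can be illustrated with a small check: a bias that depends only on differences between input locations does not change when every location is shifted by the same vector. The RBF form and the name `rbf_bias` are assumptions made for this sketch; the paper's bias may take a different form.

```python
import numpy as np

def rbf_bias(xq, xk, lengthscale=1.0):
    """Hypothetical bias that depends only on pairwise differences xq_i - xk_j."""
    diff = xq[:, None, :] - xk[None, :, :]
    return -np.sum(diff ** 2, axis=-1) / (2.0 * lengthscale ** 2)

rng = np.random.default_rng(0)
xc = rng.normal(size=(5, 2))      # context locations
xt = rng.normal(size=(3, 2))      # test locations
shift = np.array([10.0, -3.0])    # same translation applied to all points

original = rbf_bias(xt, xc)
shifted = rbf_bias(xt + shift, xc + shift)
invariant = np.allclose(original, shifted)
```

The bias alone being invariant does not make a whole model invariant; the remaining inputs to attention would also have to avoid absolute positions.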

Load-bearing premise

The kernel-based attention bias together with the scan attention and deep kernel attention mechanisms preserve the modeling fidelity of standard Neural Processes while delivering the stated complexity reductions and benchmark improvements.

What would settle it

Running the benchmarks and finding that TNP-KR does not outperform the baselines or cannot process 100K context points in the claimed time on a 24GB GPU would disprove the claims.

Figures

Figures reproduced from arXiv: 2411.12502 by Daniel Jenson, Elizaveta Semenova, Jhonathan Navott, Makkunda Sharma, Mengyan Zhang, Seth Flaxman.

**Figure 2.** TNP-KR Scan on a SIR 64x64 sample. From the left, the panels are Task, Uncertainty, Prediction, and Ground Truth. For the Task, Prediction, and Ground Truth panels, blue represents susceptible individuals, magenta represents infected individuals, and green represents recovered individuals. Uncertainty is measured with a heatmap ranging from dark purple (low uncertainty) to bright red (high uncertainty). (B…

**Figure 3.** 1D GP RBF Samples.

**Figure 4.** 1D GP Periodic Samples.

**Figure 5.** 1D GP Matérn 3/2 Samples.

**Figure 6.** An example of one of the highest regrets for TNP-KR: SCAN. Red is proposed minimum and green is actual minimum.

**Figure 7.** Another example of one of the highest regrets for TNP-KR: SCAN. Red is proposed minimum and green is actual minimum.

**Figure 8.** 2D GP RBF samples. The panels from left to right are Task, Uncertainty, Prediction, and Ground Truth. Uncertainty is measured with a heatmap ranging from dark blue (low uncertainty) to bright red (high uncertainty).

**Figure 9.** MNIST samples. The panels from left to right are Task, Uncertainty, Prediction, and Ground Truth. Uncertainty is measured with a heatmap ranging from dark blue (low uncertainty) to bright red (high uncertainty).

**Figure 10.** CelebA samples. The panels from left to right are Task, Uncertainty, Prediction, and Ground Truth. Uncertainty is measured with a heatmap ranging from dark blue (low uncertainty) to bright red (high uncertainty).

**Figure 11.** CIFAR-10 samples. The panels from left to right are Task, Uncertainty, Prediction, and Ground Truth. Uncertainty is measured with a heatmap ranging from dark blue (low uncertainty) to bright red (high uncertainty).

**Figure 12.** SIR 64x64 samples. The panels from left to right are Task, Uncertainty, Prediction, and Ground Truth. Uncertainty is measured with a heatmap ranging from dark blue (low uncertainty) to bright red (high uncertainty).

**Figure 13.** SIR 128x128 samples for models trained on 64x64 images. The panels from left to right are Task, Uncertainty, Prediction, and Ground Truth. Uncertainty is measured with a heatmap ranging from dark blue (low uncertainty) to bright red (high uncertainty).

Original abstract

Neural Processes (NPs) are a rapidly evolving class of models designed to directly model the posterior predictive distribution of stochastic processes. Originally developed as a scalable alternative to Gaussian Processes (GPs), which are limited by $O(n^3)$ runtime complexity, the most accurate modern NPs can often rival GPs but still suffer from an $O(n^2)$ bottleneck due to their attention mechanism. We introduce the Transformer Neural Process - Kernel Regression (TNP-KR), a scalable NP featuring: (1) a Kernel Regression Block (KRBlock), a simple, extensible, and parameter efficient transformer block with complexity $O(n_c^2 + n_c n_t)$, where $n_c$ and $n_t$ are the number of context and test points, respectively; (2) a kernel-based attention bias; and (3) two novel attention mechanisms: scan attention (SA), a memory-efficient scan-based attention that when paired with a kernel-based bias can make TNP-KR translation invariant, and deep kernel attention (DKA), a Performer-style attention that implicitly incoporates a distance bias and further reduces complexity to $O(n_c)$. These enhancements enable both TNP-KR variants to perform inference with 100K context points on over 1M test points in under a minute on a single 24GB GPU. On benchmarks spanning meta regression, Bayesian optimization, image completion, and epidemiology, TNP-KR with DKA outperforms its Performer counterpart on nearly every benchmark, while TNP-KR with SA achieves state-of-the-art results.
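One well-known route to memory-efficient exact attention is to stream over key/value chunks while keeping a running softmax, as in Rabe and Staats (reference [27]). Whether scan attention uses exactly this recurrence is an assumption of the sketch below; the function names and the `chunk` parameter are hypothetical. The code is plain NumPy, with a Python loop standing in for what would be a `jax.lax.scan` in JAX.

```python
import numpy as np

def chunked_attention(q, k, v, chunk=4):
    """Exact softmax attention computed one key/value chunk at a time."""
    n_q, d = q.shape
    running_max = np.full((n_q, 1), -np.inf)   # running row-wise max of logits
    denom = np.zeros((n_q, 1))                 # running softmax denominator
    numer = np.zeros((n_q, v.shape[-1]))       # running weighted sum of values
    for start in range(0, k.shape[0], chunk):
        k_blk, v_blk = k[start:start + chunk], v[start:start + chunk]
        logits = q @ k_blk.T / np.sqrt(d)      # only an (n_q, chunk) block
        new_max = np.maximum(running_max, logits.max(axis=-1, keepdims=True))
        scale = np.exp(running_max - new_max)  # rescale previously accumulated sums
        p = np.exp(logits - new_max)
        denom = denom * scale + p.sum(axis=-1, keepdims=True)
        numer = numer * scale + p @ v_blk
        running_max = new_max
    return numer / denom

def full_attention(q, k, v):
    """Reference implementation that materializes the full score matrix."""
    logits = q @ k.T / np.sqrt(q.shape[-1])
    w = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))
k = rng.normal(size=(10, 8))
v = rng.normal(size=(10, 5))
matches = np.allclose(chunked_attention(q, k, v), full_attention(q, k, v))
```

The chunked version holds only an n_q x chunk block of scores at any time, so peak memory no longer grows with the product of query and key counts, while the arithmetic cost stays the same.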

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

TNP-KR adds a kernel-regression transformer block, kernel bias, scan attention, and deep kernel attention to Neural Processes, delivering claimed complexity cuts and benchmark gains, but the preservation of posterior fidelity remains unshown.

The paper's main move is to replace standard attention in Neural Processes with a KRBlock that costs O(n_c² + n_c n_t), add a kernel-based attention bias, and introduce scan attention for translation invariance plus deep kernel attention that reaches O(n_c) while keeping a distance bias. These are presented as concrete additions that let the model run 100K context points against over 1M test points in under a minute on one 24GB GPU. On the listed benchmarks the DKA version beats its Performer counterpart on most tasks and the SA version hits SOTA across meta-regression, Bayesian optimization, image completion, and epidemiology. That scaling number and the breadth of tasks are the parts that would interest people who already use NPs but hit the quadratic wall. The constructions themselves look like genuine extensions beyond the cited prior NP literature.

The soft spot is the one flagged in the stress test. The abstract asserts that the new bias and attention variants keep the modeling fidelity of standard NPs while cutting cost, yet no equations or derivation are visible to confirm that the attention outputs remain consistent with the original conditional distribution. If the modifications change the inductive bias in ways that explain the gains, the efficiency story is still useful but the claim of retaining NP advantages needs separate checking. Error bars and full experimental details are also absent from what is shown.

This work is aimed at researchers who build scalable probabilistic models for meta-learning and need attention mechanisms that respect kernel structure. A reader already working on Neural Processes or kernel approximations would get direct value from the specific block and attention designs. It deserves peer review because the complexity claims are testable and the empirical scope is wide enough to evaluate even if the theoretical grounding requires more work.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Transformer Neural Process - Kernel Regression (TNP-KR), a scalable variant of Neural Processes. It proposes a Kernel Regression Block (KRBlock) with O(n_c² + n_c n_t) complexity, a kernel-based attention bias, scan attention (SA) that achieves translation invariance when paired with the bias, and deep kernel attention (DKA) that reduces complexity to O(n_c) while implicitly incorporating a distance bias. The authors claim that TNP-KR with DKA outperforms a Performer baseline on nearly all benchmarks and that TNP-KR with SA reaches state-of-the-art results across meta-regression, Bayesian optimization, image completion, and epidemiology tasks, while enabling inference on 100K context points and over 1M test points in under a minute on a single 24GB GPU.
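The linear-complexity idea behind Performer-style attention can be sketched as follows: once the similarity is written as a product of feature maps, the keys and values are summarized once and every query reuses that summary. The sketch uses the simple positive feature map elu(x) + 1 as a stand-in; it is neither the Performer's random-feature map nor the paper's deep kernel, and all function names are hypothetical.

```python
import numpy as np

def feature_map(x):
    """Positive feature map elu(x) + 1 (an illustrative stand-in)."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v):
    """Attention that never forms the (n_q, n_k) score matrix."""
    phi_q, phi_k = feature_map(q), feature_map(k)
    kv = phi_k.T @ v              # (d, d_v) summary: one pass over the keys
    z = phi_k.sum(axis=0)         # (d,) normalizer statistics
    return (phi_q @ kv) / (phi_q @ z)[:, None]

def quadratic_reference(q, k, v):
    """Same result computed with the explicit (n_q, n_k) score matrix."""
    scores = feature_map(q) @ feature_map(k).T
    return (scores / scores.sum(axis=-1, keepdims=True)) @ v

rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(12, 8))
v = rng.normal(size=(12, 3))
same = np.allclose(linear_attention(q, k, v), quadratic_reference(q, k, v))
```

The summary `kv` has a size that depends only on the feature dimensions, so the cost of building it grows linearly with the number of context points.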

Significance. If the modeling fidelity of the original Neural Process posterior is preserved, the work would offer a practical advance in scalable stochastic process modeling by combining transformer efficiency with kernel-inspired inductive biases, potentially enabling larger-scale applications than current O(n²) attention-based NPs while matching or exceeding GP-level accuracy on the reported tasks.

major comments (2)

[Abstract / Methods (KRBlock, SA, DKA descriptions)] The central claim that TNP-KR variants remain faithful Neural Processes (i.e., model the same posterior predictive distribution) while delivering the stated complexity reductions rests on unverified assumptions about KRBlock, the kernel bias, SA, and DKA. No derivation or analysis is provided showing that the modified attention outputs are consistent with the original NP conditional distribution; deviations could produce benchmark gains through altered inductive bias rather than improved scalability.
[Abstract / Experiments] The headline performance claims (DKA outperforming Performer baseline; SA reaching SOTA) require empirical verification that the new mechanisms preserve NP posterior fidelity. Without such verification, it is impossible to distinguish whether reported improvements stem from better modeling or from the changed attention bias and approximation.

minor comments (2)

[Abstract] Typo: 'incoporates' should be 'incorporates'.
[Abstract] The abstract states complexities (O(n_c² + n_c n_t), O(n_c)) but does not reference the corresponding equations or pseudocode that would allow direct verification of the claimed scaling.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments regarding the fidelity of TNP-KR to the Neural Process framework. We address each major comment below with clarifications on the manuscript's claims and scope.

Referee: [Abstract / Methods (KRBlock, SA, DKA descriptions)] The central claim that TNP-KR variants remain faithful Neural Processes (i.e., model the same posterior predictive distribution) while delivering the stated complexity reductions rests on unverified assumptions about KRBlock, the kernel bias, SA, and DKA. No derivation or analysis is provided showing that the modified attention outputs are consistent with the original NP conditional distribution; deviations could produce benchmark gains through altered inductive bias rather than improved scalability.

Authors: The manuscript introduces TNP-KR explicitly as 'a scalable variant of Neural Processes' and 'a scalable NP', with the goal of modeling the posterior predictive distribution of stochastic processes using transformer-based mechanisms that incorporate kernel-inspired inductive biases. It does not claim that the KRBlock, kernel attention bias, scan attention, or deep kernel attention produce exactly the same conditional distribution as any specific prior NP model. The original NP literature itself employs attention-based approximations to the conditional without formal equivalence to GPs. The presented work focuses on achieving the stated complexity reductions while maintaining the NP objective of direct posterior predictive modeling, as evidenced by the architecture and training procedure. We therefore disagree that the claims rest on unverified assumptions of exact distributional equivalence.

revision: no

Referee: [Abstract / Experiments] The headline performance claims (DKA outperforming Performer baseline; SA reaching SOTA) require empirical verification that the new mechanisms preserve NP posterior fidelity. Without such verification, it is impossible to distinguish whether reported improvements stem from better modeling or from the changed attention bias and approximation.

Authors: The headline claims are substantiated by the experimental results on meta-regression, Bayesian optimization, image completion, and epidemiology benchmarks, where TNP-KR with DKA outperforms the Performer baseline on nearly every task and TNP-KR with SA reaches state-of-the-art performance, all under consistent evaluation settings from prior NP work. These tasks directly assess the quality of the modeled posterior predictive distribution via standard metrics. While the manuscript does not include separate experiments (such as explicit posterior calibration or likelihood comparisons isolating the attention changes), the consistent empirical gains across diverse tasks support that the mechanisms improve modeling effectiveness within the NP paradigm. We can add a clarifying sentence in the discussion section noting the scope of the claims.

revision: partial
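The calibration and likelihood checks mentioned in this exchange could look roughly like the sketch below. It assumes the model emits an independent Gaussian mean and standard deviation per test point, which is an assumption of this example and not something the page confirms; the data values and function names are made up for illustration.

```python
import math

def gaussian_nll(y, mean, std):
    """Average negative log-likelihood under independent Gaussian marginals."""
    total = 0.0
    for yi, mi, si in zip(y, mean, std):
        total += 0.5 * math.log(2 * math.pi * si ** 2) + (yi - mi) ** 2 / (2 * si ** 2)
    return total / len(y)

def coverage(y, mean, std, z=1.96):
    """Fraction of targets inside the central ~95% predictive interval."""
    inside = sum(abs(yi - mi) <= z * si for yi, mi, si in zip(y, mean, std))
    return inside / len(y)

# Toy targets and predictions (illustrative values only).
y = [0.1, -0.4, 1.2, 0.8]
mean = [0.0, -0.5, 1.0, 2.0]
std = [0.2, 0.2, 0.3, 0.3]

nll = gaussian_nll(y, mean, std)
cov = coverage(y, mean, std)   # 3 of 4 targets fall inside their interval
```

Comparing these two numbers between a baseline and a variant that differs only in its attention mechanism would be one way to separate calibration effects from raw accuracy gains.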

Circularity Check

0 steps flagged

No significant circularity; new architecture and mechanisms defined independently of their claimed outputs

The paper introduces TNP-KR via three explicitly defined contributions (the KRBlock with stated O(n_c² + n_c n_t) complexity, a kernel-based attention bias, and the two attention mechanisms SA and DKA) whose properties and benchmark performance are presented as empirical engineering results rather than derived predictions. No equations, definitions, or central claims reduce by construction to fitted parameters, self-citations, or prior inputs; the posterior fidelity and complexity reductions are asserted as design outcomes verified on external benchmarks, making the derivation self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no free parameters, axioms, or invented entities are identifiable. The paper introduces new architectural blocks but provides insufficient detail for ledger entries.

pith-pipeline@v0.9.0 · 5833 in / 1071 out tokens · 36409 ms · 2026-05-23T08:31:34.190685+00:00 · methodology

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages · 4 internal anchors

[2] Ashman, M., Diaconu, C., Kim, J., Sivaraya, L., Markou, S., Requeima, J., Bruinsma, W. P., and Turner, R. E. Translation equivariant transformer neural processes. In Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of...

[3] Ashman, M., Diaconu, C., Langezaal, E., Weller, A., and Turner, R. E. Gridded transformer neural processes for large unstructured spatio-temporal data, 2024b. URL https://arxiv.org/abs/2410.06731

[4] Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. arXiv:2004.05150, 2020

[5] Bhojanapalli, S., Chakrabarti, A., Veit, A., Lukasik, M., Jain, H., Liu, F., Chang, Y.-W., and Kumar, S. Leveraging redundancy in attention with reuse transformers, 2022. URL https://openreview.net/forum?id=V37YFd_fFgN

[6] Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., VanderPlas, J., Wanderman-Milne, S., and Zhang, Q. JAX: composable transformations of Python+NumPy programs, 2018. URL http://github.com/jax-ml/jax

[7] Bruinsma, W. P., Markou, S., Requeima, J., Foong, A. Y. K., Andersson, T. R., Vaughan, A., Buonomo, A., Hosking, J. S., and Turner, R. E. Autoregressive conditional neural processes, 2023. URL https://arxiv.org/abs/2303.14468

[8] Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers, 2019. URL https://arxiv.org/abs/1904.10509

[9] Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger, D. B., Colwell, L. J., and Weller, A. Rethinking attention with performers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Ua6zuk0WRH

[10] Dao, T. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024

[11] Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022

[12] Feng, L., Hajimirsadeghi, H., Bengio, Y., and Ahmed, M. O. Efficient queries transformer neural processes. URL https://openreview.net/forum?id=_3FyT_W1DW

[13] Feng, L., Hajimirsadeghi, H., Bengio, Y., and Ahmed, M. O. Latent bottlenecked attentive neural processes. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=yIxtevizEA

[14] Feng, L., Tung, F., Hajimirsadeghi, H., Bengio, Y., and Ahmed, M. O. Memory efficient neural processes via constant memory attention block, 2024. URL https://openreview.net/forum?id=I0gwsdSgsk

[15] Garnelo, M., Rosenbaum, D., Maddison, C., Ramalho, T., Saxton, D., Shanahan, M., Teh, Y. W., Rezende, D., and Eslami, S. M. A. Conditional neural processes. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1704-1713. PMLR, 10-15 Jul 2018a. ...

[16] Garnelo, M., Schwarz, J., Rosenbaum, D., Viola, F., Rezende, D. J., Eslami, S. M. A., and Teh, Y. W. Neural processes, 2018b. URL https://arxiv.org/abs/1807.01622

[17] Gordon, J., Bruinsma, W. P., Foong, A. Y. K., Requeima, J., Dubois, Y., and Turner, R. E. Convolutional conditional neural processes. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Skey4eBYPS

[18] Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., and Carreira, J. Perceiver: General perception with iterative attention. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 4651-4664. PMLR, 18-24 Jul 2021. URL https://proceedi...

[19] Kim, H., Mnih, A., Schwarz, J., Garnelo, M., Eslami, A., Rosenbaum, D., Vinyals, O., and Teh, Y. W. Attentive neural processes. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SkE6PjC9KX

[20] Kitaev, N., Kaiser, L., and Levskaya, A. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgNKkHtvB

[21] Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., and Teh, Y. W. Set transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the 36th International Conference on Machine Learning, pp. 3744-3753, 2019

[22] Lee, J., Lee, Y., Kim, J., Yang, E., Hwang, S. J., and Teh, Y. W. Bootstrapping neural processes. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 6606-6615. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/492114f69...

[23] Nadaraya, E. A. On estimating regression. Theory of Probability & Its Applications, 9(1):141-142, 1964

[24] Nguyen, T. and Grover, A. Transformer neural processes: Uncertainty-aware meta learning via sequence modeling. arXiv preprint arXiv:2207.04179, 2022

[25] Press, O., Smith, N., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=R8sQPpGCv0

[27] Rabe, M. N. and Staats, C. Self-attention does not need O(n^2) memory, 2022b. URL https://arxiv.org/abs/2112.05682

[28] Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=tVConYid20

[29] Shen, Z., Zhang, M., Zhao, H., Yi, S., and Li, H. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 3531-3539, 2021. doi:10.1109/WACV48630.2021.00356. URL https://openaccess.thecvf.com/content/WACV2021/papers/Shen_Efficient_Attention_Attention_With_Lin...

[30] Tay, Y., Bahri, D., Yang, L., Metzler, D., and Juan, D.-C. Sparse sinkhorn attention, 2020. URL https://arxiv.org/abs/2002.11296

[31] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proc...

[32] Vaughan, A., Markou, S., Tebbutt, W., Requeima, J., Bruinsma, W. P., Andersson, T. R., Herzog, M., Lane, N. D., Chantry, M., Hosking, J. S., and Turner, R. E. Aardvark weather: end-to-end data-driven weather forecasting, 2024. URL https://arxiv.org/abs/2404.00411

[33] Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. Linformer: Self-attention with linear complexity, 2020. URL https://arxiv.org/abs/2006.04768

[34] Wennberg, U. and Henter, G. E. The case for translation-invariant self-attention in transformer-based language models. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Pape...

[35] Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Zhang, H., Lan, Y., Wang, L., and Liu, T.-Y. On layer normalization in the transformer architecture, 2020. URL https://openreview.net/forum?id=B1x8anVFPr

[36] Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., and Liu, T.-Y. Do transformers really perform badly for graph representation? In Thirty-Fifth Conference on Neural Information Processing Systems, 2021a. URL https://openreview.net/forum?id=OeWooOxFwDa

[37] Ying, C., Ke, G., He, D., and Liu, T.-Y. Lazyformer: Self attention with lazy update, 2021b. URL https://arxiv.org/abs/2102.12702

[38] Zaheer, M., Reddi, S., Sachan, D., Kale, S., and Kumar, S. Adaptive methods for nonconvex optimization. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/fil...

[39] Zaheer, M., Guruganesh, G., Dubey, K. A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., et al. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33, 2020

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page

[2] [2]

P., and Turner, R

Ashman, M., Diaconu, C., Kim, J., Sivaraya, L., Markou, S., Requeima, J., Bruinsma, W. P., and Turner, R. E. Translation equivariant transformer neural processes. In Salakhutdinov, R., Kolter, Z., Heller, K., Weller, A., Oliver, N., Scarlett, J., and Berkenkamp, F. (eds.), Proceedings of the 41st International Conference on Machine Learning, volume 235 of...

work page 1924

[3] [3]

Ashman, M., Diaconu, C., Langezaal, E., Weller, A., and Turner, R. E. Gridded transformer neural processes for large unstructured spatio-temporal data, 2024 b . URL https://arxiv.org/abs/2410.06731

work page arXiv 2024

[4] [4]

Longformer: The Long-Document Transformer

Beltagy, I., Peters, M. E., and Cohan, A. Longformer: The long-document transformer. arXiv:2004.05150, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2004

[5] [5]

Leveraging redundancy in attention with reuse transformers, 2022

Bhojanapalli, S., Chakrabarti, A., Veit, A., Lukasik, M., Jain, H., Liu, F., Chang, Y.-W., and Kumar, S. Leveraging redundancy in attention with reuse transformers, 2022. URL https://openreview.net/forum?id=V37YFd_fFgN

work page 2022

[6] [6]

J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., Vander P las, J., Wanderman- M ilne, S., and Zhang, Q

Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., Maclaurin, D., Necula, G., Paszke, A., Vander P las, J., Wanderman- M ilne, S., and Zhang, Q. JAX : composable transformations of P ython+ N um P y programs, 2018. URL http://github.com/jax-ml/jax

work page 2018

[7] [7]

P., Markou, S., Requiema, J., Foong, A

Bruinsma, W. P., Markou, S., Requiema, J., Foong, A. Y. K., Andersson, T. R., Vaughan, A., Buonomo, A., Hosking, J. S., and Turner, R. E. Autoregressive conditional neural processes, 2023. URL https://arxiv.org/abs/2303.14468

work page arXiv 2023

[8] [8]

Generating Long Sequences with Sparse Transformers

Child, R., Gray, S., Radford, A., and Sutskever, I. Generating long sequences with sparse transformers, 2019. URL https://arxiv.org/abs/1904.10509

work page internal anchor Pith review Pith/arXiv arXiv 2019

[9] [9]

M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J

Choromanski, K. M., Likhosherstov, V., Dohan, D., Song, X., Gane, A., Sarlos, T., Hawkins, P., Davis, J. Q., Mohiuddin, A., Kaiser, L., Belanger, D. B., Colwell, L. J., and Weller, A. Rethinking attention with performers. In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=Ua6zuk0WRH

[10] [10]

FlashAttention-2: Faster attention with better parallelism and work partitioning

Dao, T. FlashAttention-2: Faster attention with better parallelism and work partitioning. In International Conference on Learning Representations (ICLR), 2024

[11] [11]

Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Ré, C. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. In Advances in Neural Information Processing Systems (NeurIPS), 2022

[12] [12]

Feng, L., Hajimirsadeghi, H., Bengio, Y., and Ahmed, M. O. Efficient queries transformer neural processes. URL https://openreview.net/forum?id=_3FyT_W1DW

[13] [13]

Feng, L., Hajimirsadeghi, H., Bengio, Y., and Ahmed, M. O. Latent bottlenecked attentive neural processes. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=yIxtevizEA

[14] [14]

Feng, L., Tung, F., Hajimirsadeghi, H., Bengio, Y., and Ahmed, M. O. Memory efficient neural processes via constant memory attention block, 2024. URL https://openreview.net/forum?id=I0gwsdSgsk

[15] [15]

Garnelo, M., Rosenbaum, D., Maddison, C., Ramalho, T., Saxton, D., Shanahan, M., Teh, Y. W., Rezende, D., and Eslami, S. M. A. Conditional neural processes. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1704--1713. PMLR, 10--15 Jul 2018a ....

[16] [16]

Neural Processes

Garnelo, M., Schwarz, J., Rosenbaum, D., Viola, F., Rezende, D. J., Eslami, S. M. A., and Teh, Y. W. Neural processes, 2018b. URL https://arxiv.org/abs/1807.01622

[17] [17]

Gordon, J., Bruinsma, W. P., Foong, A. Y. K., Requeima, J., Dubois, Y., and Turner, R. E. Convolutional conditional neural processes. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=Skey4eBYPS

[18] [18]

Perceiver: General perception with iterative attention

Jaegle, A., Gimeno, F., Brock, A., Vinyals, O., Zisserman, A., and Carreira, J. Perceiver: General perception with iterative attention. In Meila, M. and Zhang, T. (eds.), Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pp. 4651--4664. PMLR, 18--24 Jul 2021. URL https://proceedi...

[19] [19]

Kim, H., Mnih, A., Schwarz, J., Garnelo, M., Eslami, A., Rosenbaum, D., Vinyals, O., and Teh, Y. W. Attentive neural processes. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SkE6PjC9KX

[20] [20]

Reformer: The efficient transformer

Kitaev, N., Kaiser, L., and Levskaya, A. Reformer: The efficient transformer. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=rkgNKkHtvB

[21] [21]

Lee, J., Lee, Y., Kim, J., Kosiorek, A., Choi, S., and Teh, Y. W. Set transformer: A framework for attention-based permutation-invariant neural networks. In Proceedings of the 36th International Conference on Machine Learning, pp. 3744--3753, 2019

[22] [22]

Lee, J., Lee, Y., Kim, J., Yang, E., Hwang, S. J., and Teh, Y. W. Bootstrapping neural processes. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp. 6606--6615. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/492114f69...

[23] [23]

Nadaraya, E. A. On estimating regression. Theory of Probability & Its Applications, 9(1):141--142, 1964

[24] [24]

Nguyen, T. and Grover, A. Transformer neural processes: Uncertainty-aware meta learning via sequence modeling. arXiv preprint arXiv:2207.04179, 2022

[25] [25]

Train short, test long: Attention with linear biases enables input length extrapolation

Press, O., Smith, N., and Lewis, M. Train short, test long: Attention with linear biases enables input length extrapolation. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=R8sQPpGCv0

[26] [27]

Rabe, M. N. and Staats, C. Self-attention does not need O(n^2) memory, 2022b. URL https://arxiv.org/abs/2112.05682

[27] [28]

FlashAttention-3: Fast and accurate attention with asynchrony and low-precision

Shah, J., Bikshandi, G., Zhang, Y., Thakkar, V., Ramani, P., and Dao, T. FlashAttention-3: Fast and accurate attention with asynchrony and low-precision. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=tVConYid20

[28] [29]

Efficient attention: Attention with linear complexities

Shen, Z., Zhang, M., Zhao, H., Yi, S., and Li, H. Efficient attention: Attention with linear complexities. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp.\ 3531--3539, 2021. doi:10.1109/WACV48630.2021.00356. URL https://openaccess.thecvf.com/content/WACV2021/papers/Shen_Efficient_Attention_Attention_With_Lin...

[29] [30]

Sparse sinkhorn attention, 2020

Tay, Y., Bahri, D., Yang, L., Metzler, D., and Juan, D.-C. Sparse sinkhorn attention, 2020. URL https://arxiv.org/abs/2002.11296

[30] [31]

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proc...

[31] [32]

Vaughan, A., Markou, S., Tebbutt, W., Requeima, J., Bruinsma, W. P., Andersson, T. R., Herzog, M., Lane, N. D., Chantry, M., Hosking, J. S., and Turner, R. E. Aardvark weather: end-to-end data-driven weather forecasting, 2024. URL https://arxiv.org/abs/2404.00411

[32] [33]

Linformer: Self-Attention with Linear Complexity

Wang, S., Li, B. Z., Khabsa, M., Fang, H., and Ma, H. Linformer: Self-attention with linear complexity, 2020. URL https://arxiv.org/abs/2006.04768

[33] [34]

Wennberg, U. and Henter, G. E. The case for translation-invariant self-attention in transformer-based language models. In Zong, C., Xia, F., Li, W., and Navigli, R. (eds.), Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Pape...

[34] [35]

On layer normalization in the transformer architecture, 2020

Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Zhang, H., Lan, Y., Wang, L., and Liu, T.-Y. On layer normalization in the transformer architecture, 2020. URL https://openreview.net/forum?id=B1x8anVFPr

[35] [36]

Do transformers really perform badly for graph representation?

Ying, C., Cai, T., Luo, S., Zheng, S., Ke, G., He, D., Shen, Y., and Liu, T.-Y. Do transformers really perform badly for graph representation? In Thirty-Fifth Conference on Neural Information Processing Systems, 2021a. URL https://openreview.net/forum?id=OeWooOxFwDa

[36] [37]

Lazyformer: Self attention with lazy update

Ying, C., Ke, G., He, D., and Liu, T.-Y. Lazyformer: Self attention with lazy update, 2021b. URL https://arxiv.org/abs/2102.12702

[37] [38]

Adaptive methods for nonconvex optimization

Zaheer, M., Reddi, S., Sachan, D., Kale, S., and Kumar, S. Adaptive methods for nonconvex optimization. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper_files/paper/2018/fil...

[38] [39]

Zaheer, M., Guruganesh, G., Dubey, K. A., Ainslie, J., Alberti, C., Ontanon, S., Pham, P., Ravula, A., Wang, Q., Yang, L., et al. Big bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33, 2020
