Transformer Neural Processes - Kernel Regression
Pith reviewed 2026-05-23 08:31 UTC · model grok-4.3
The pith
TNP-KR scales Neural Processes to 100K context points using a kernel regression block and new attention mechanisms while achieving state-of-the-art results.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that TNP-KR with its KRBlock, kernel bias, SA, and DKA maintains the posterior predictive modeling of NPs but reduces complexity to O(n_c^2 + n_c n_t) or O(n_c), enabling inference with 100K context points on over 1M test points in under a minute, and delivering superior or SOTA performance on the benchmarks.
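To make the stated scaling concrete, here is a small arithmetic sketch that counts pairwise attention scores for the headline setting (100K context points, 1M test points). It assumes a baseline in which all n_c + n_t points attend to each other, and it ignores constants, heads, and layers. The function names are illustrative and do not come from the paper.

```python
def full_attention_pairs(n_c: int, n_t: int) -> int:
    """Scores when every point attends to every point: (n_c + n_t)^2."""
    n = n_c + n_t
    return n * n


def krblock_style_pairs(n_c: int, n_t: int) -> int:
    """Scores for the stated O(n_c^2 + n_c * n_t) pattern."""
    return n_c * n_c + n_c * n_t


n_c, n_t = 100_000, 1_000_000
full = full_attention_pairs(n_c, n_t)
reduced = krblock_style_pairs(n_c, n_t)
print(full, reduced, full / reduced)
```

Under these assumptions the reduced pattern still involves on the order of 10^11 scores, which is why the memory behaviour of the attention mechanism matters as much as the operation count.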
What carries the argument
The Kernel Regression Block (KRBlock) is a transformer block with O(n_c^2 + n_c n_t) complexity that incorporates a kernel-based attention bias. It is paired with either scan attention (SA) or deep kernel attention (DKA).
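The link between kernel regression and attention can be sketched as follows. This is a generic Nadaraya-Watson estimator written as softmax attention whose logits are a pure distance bias (the log of an RBF kernel). It is not the paper's KRBlock; the function name, the RBF choice, and the `lengthscale` parameter are assumptions made for illustration.

```python
import math


def kernel_regression_attention(x_ctx, y_ctx, x_tgt, lengthscale=1.0):
    """Nadaraya-Watson estimate written as softmax attention.

    The logit for each (target, context) pair is the distance bias
    -(x_t - x_c)^2 / (2 * lengthscale^2), i.e. the log of an RBF kernel.
    Cost is O(n_c * n_t) for n_t targets and n_c context points.
    """
    preds = []
    for xt in x_tgt:
        logits = [-((xt - xc) ** 2) / (2.0 * lengthscale ** 2) for xc in x_ctx]
        m = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - m) for l in logits]
        total = sum(weights)
        preds.append(sum(w * y for w, y in zip(weights, y_ctx)) / total)
    return preds


# A target midway between two context points receives equal weights.
print(kernel_regression_attention([0.0, 2.0], [1.0, 3.0], [1.0]))
```

In a learned model the logits would typically combine a query-key term with such a bias; the sketch keeps only the bias to show the kernel regression limit.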
If this is right
- TNP-KR with DKA outperforms its Performer counterpart on nearly every benchmark.
- TNP-KR with SA achieves state-of-the-art results on meta regression, Bayesian optimization, image completion, and epidemiology benchmarks.
- Both variants can perform inference with 100K context points on over 1M test points in under a minute on a single 24GB GPU.
- The enhancements preserve the modeling fidelity of standard Neural Processes.
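The memory side of the 24GB claim can be illustrated with a generic streaming (online) softmax. Materializing a 100K by 1M score matrix in 32-bit floats would take about 4 × 10^11 bytes (roughly 400 GB), so any method that fits on a 24GB GPU has to avoid storing it. The sketch below is the standard chunked-softmax technique, not necessarily how the paper's scan attention is implemented; the function and variable names are hypothetical, and a non-empty context is assumed.

```python
import math


def streaming_attention(logit_fn, values, query, chunk_size=2):
    """Softmax-weighted average of `values`, visiting the context in chunks.

    Only a running max, a running normalizer, and a running weighted sum are
    kept, so memory does not grow with the number of context points.
    """
    run_max = -math.inf
    run_norm = 0.0
    run_sum = 0.0
    for start in range(0, len(values), chunk_size):
        chunk = range(start, min(start + chunk_size, len(values)))
        logits = [logit_fn(query, i) for i in chunk]
        new_max = max(run_max, max(logits))
        # Rescale the accumulators to the new reference maximum.
        scale = math.exp(run_max - new_max)
        run_norm *= scale
        run_sum *= scale
        for i, l in zip(chunk, logits):
            w = math.exp(l - new_max)
            run_norm += w
            run_sum += w * values[i]
        run_max = new_max
    return run_sum / run_norm


keys = [0.0, 0.5, 1.5, 3.0, 4.0]    # hypothetical context locations
vals = [1.0, 2.0, 0.0, -1.0, 5.0]   # hypothetical context values
rbf_logit = lambda q, i: -((q - keys[i]) ** 2) / 2.0
print(streaming_attention(rbf_logit, vals, query=1.0))
```

The result matches an ordinary softmax over all logits regardless of `chunk_size`; only the peak memory changes.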
Where Pith is reading between the lines
- This could make Neural Processes viable for real-time large-scale applications where GPs were previously too slow.
- The translation invariance property of SA might improve performance on spatially structured data.
- Future work could test these mechanisms on even larger scales or different modalities like time series.
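The translation invariance mentioned above can be checked on a toy case. If attention logits depend only on differences between input locations, shifting every location by the same amount leaves the weights unchanged. This sketch uses a hypothetical RBF distance bias and made-up inputs; it says nothing about how the paper's SA achieves the property.

```python
import math


def distance_bias_weights(x_query, x_ctx, lengthscale=1.0):
    """Softmax weights whose logits depend only on x_query - x_c."""
    logits = [-((x_query - xc) ** 2) / (2.0 * lengthscale ** 2) for xc in x_ctx]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


x_ctx = [0.0, 0.5, 2.0]
shift = 8.0  # hypothetical translation applied to every input location
base = distance_bias_weights(1.0, x_ctx)
moved = distance_bias_weights(1.0 + shift, [x + shift for x in x_ctx])
# Largest change in any attention weight after the shift (expected to be ~0).
print(max(abs(a - b) for a, b in zip(base, moved)))
```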
Load-bearing premise
The kernel-based attention bias together with the scan attention and deep kernel attention mechanisms preserve the modeling fidelity of standard Neural Processes while delivering the stated complexity reductions and benchmark improvements.
What would settle it
Running the benchmarks and finding that TNP-KR does not outperform the baselines or cannot process 100K context points in the claimed time on a 24GB GPU would disprove the claims.
Original abstract
Neural Processes (NPs) are a rapidly evolving class of models designed to directly model the posterior predictive distribution of stochastic processes. Originally developed as a scalable alternative to Gaussian Processes (GPs), which are limited by $O(n^3)$ runtime complexity, the most accurate modern NPs can often rival GPs but still suffer from an $O(n^2)$ bottleneck due to their attention mechanism. We introduce the Transformer Neural Process - Kernel Regression (TNP-KR), a scalable NP featuring: (1) a Kernel Regression Block (KRBlock), a simple, extensible, and parameter efficient transformer block with complexity $O(n_c^2 + n_c n_t)$, where $n_c$ and $n_t$ are the number of context and test points, respectively; (2) a kernel-based attention bias; and (3) two novel attention mechanisms: scan attention (SA), a memory-efficient scan-based attention that when paired with a kernel-based bias can make TNP-KR translation invariant, and deep kernel attention (DKA), a Performer-style attention that implicitly incoporates a distance bias and further reduces complexity to $O(n_c)$. These enhancements enable both TNP-KR variants to perform inference with 100K context points on over 1M test points in under a minute on a single 24GB GPU. On benchmarks spanning meta regression, Bayesian optimization, image completion, and epidemiology, TNP-KR with DKA outperforms its Performer counterpart on nearly every benchmark, while TNP-KR with SA achieves state-of-the-art results.
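The linear-in-context cost attributed to Performer-style attention can be sketched with a generic kernelized attention. The context is summarized once in a single O(n_c) pass, and each query then reads that fixed-size summary. This is a minimal illustration of the general technique, not the paper's DKA: the feature map `phi`, the `CENTERS`, and all function names are hypothetical, and a real model would learn the features.

```python
import math

CENTERS = [-1.0, 0.0, 1.0]  # hypothetical feature centers


def phi(x):
    """Positive feature map, so the normalizer below is always positive."""
    return [math.exp(-((x - c) ** 2)) for c in CENTERS]


def summarize_context(x_ctx, y_ctx):
    """One O(n_c) pass: accumulate sum phi(x_c) * y_c and sum phi(x_c)."""
    kv = [0.0] * len(CENTERS)
    k = [0.0] * len(CENTERS)
    for xc, yc in zip(x_ctx, y_ctx):
        for j, f in enumerate(phi(xc)):
            kv[j] += f * yc
            k[j] += f
    return kv, k


def linear_attention(x_query, summary):
    """Per-query cost is independent of n_c once the summary exists."""
    kv, k = summary
    q = phi(x_query)
    num = sum(a * b for a, b in zip(q, kv))
    den = sum(a * b for a, b in zip(q, k))
    return num / den


summary = summarize_context([-0.5, 0.2, 1.3], [1.0, -2.0, 0.5])
print(linear_attention(0.4, summary))
```

The output equals a weighted average of the context values with weights phi(x_q) · phi(x_c), computed without ever forming the query-by-context weight matrix.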
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Transformer Neural Process - Kernel Regression (TNP-KR), a scalable variant of Neural Processes. It proposes a Kernel Regression Block (KRBlock) with O(n_c² + n_c n_t) complexity, a kernel-based attention bias, scan attention (SA) that achieves translation invariance when paired with the bias, and deep kernel attention (DKA) that reduces complexity to O(n_c) while implicitly incorporating a distance bias. The authors claim that TNP-KR with DKA outperforms a Performer baseline on nearly all benchmarks and that TNP-KR with SA reaches state-of-the-art results across meta-regression, Bayesian optimization, image completion, and epidemiology tasks, while enabling inference on 100K context points and over 1M test points in under a minute on a single 24GB GPU.
Significance. If the modeling fidelity of the original Neural Process posterior is preserved, the work would offer a practical advance in scalable stochastic process modeling by combining transformer efficiency with kernel-inspired inductive biases, potentially enabling larger-scale applications than current O(n²) attention-based NPs while matching or exceeding GP-level accuracy on the reported tasks.
major comments (2)
- [Abstract / Methods (KRBlock, SA, DKA descriptions)] The central claim that TNP-KR variants remain faithful Neural Processes (i.e., model the same posterior predictive distribution) while delivering the stated complexity reductions rests on unverified assumptions about KRBlock, the kernel bias, SA, and DKA. No derivation or analysis is provided showing that the modified attention outputs are consistent with the original NP conditional distribution; deviations could produce benchmark gains through altered inductive bias rather than improved scalability.
- [Abstract / Experiments] The headline performance claims (DKA outperforming Performer baseline; SA reaching SOTA) require empirical verification that the new mechanisms preserve NP posterior fidelity. Without such verification, it is impossible to distinguish whether reported improvements stem from better modeling or from the changed attention bias and approximation.
minor comments (2)
- [Abstract] Typo: 'incoporates' should be 'incorporates'.
- [Abstract] The abstract states complexities (O(n_c² + n_c n_t), O(n_c)) but does not reference the corresponding equations or pseudocode that would allow direct verification of the claimed scaling.
Simulated Author's Rebuttal
We thank the referee for their constructive comments regarding the fidelity of TNP-KR to the Neural Process framework. We address each major comment below with clarifications on the manuscript's claims and scope.
- Referee: [Abstract / Methods (KRBlock, SA, DKA descriptions)] The central claim that TNP-KR variants remain faithful Neural Processes (i.e., model the same posterior predictive distribution) while delivering the stated complexity reductions rests on unverified assumptions about KRBlock, the kernel bias, SA, and DKA. No derivation or analysis is provided showing that the modified attention outputs are consistent with the original NP conditional distribution; deviations could produce benchmark gains through altered inductive bias rather than improved scalability.
  Authors: The manuscript introduces TNP-KR explicitly as 'a scalable variant of Neural Processes' and 'a scalable NP', with the goal of modeling the posterior predictive distribution of stochastic processes using transformer-based mechanisms that incorporate kernel-inspired inductive biases. It does not claim that the KRBlock, kernel attention bias, scan attention, or deep kernel attention produce exactly the same conditional distribution as any specific prior NP model. The original NP literature itself employs attention-based approximations to the conditional without formal equivalence to GPs. The presented work focuses on achieving the stated complexity reductions while maintaining the NP objective of direct posterior predictive modeling, as evidenced by the architecture and training procedure. We therefore disagree that the claims rest on unverified assumptions of exact distributional equivalence.
  revision: no
- Referee: [Abstract / Experiments] The headline performance claims (DKA outperforming Performer baseline; SA reaching SOTA) require empirical verification that the new mechanisms preserve NP posterior fidelity. Without such verification, it is impossible to distinguish whether reported improvements stem from better modeling or from the changed attention bias and approximation.
  Authors: The headline claims are substantiated by the experimental results on meta-regression, Bayesian optimization, image completion, and epidemiology benchmarks, where TNP-KR with DKA outperforms the Performer baseline on nearly every task and TNP-KR with SA reaches state-of-the-art performance, all under consistent evaluation settings from prior NP work. These tasks directly assess the quality of the modeled posterior predictive distribution via standard metrics. While the manuscript does not include separate experiments (such as explicit posterior calibration or likelihood comparisons isolating the attention changes), the consistent empirical gains across diverse tasks support that the mechanisms improve modeling effectiveness within the NP paradigm. We can add a clarifying sentence in the discussion section noting the scope of the claims.
  revision: partial
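The kind of likelihood and calibration check discussed in this exchange could look like the sketch below. It assumes the model outputs an independent Gaussian (mean and standard deviation) per test point, which is a common but not universal choice for Neural Processes; the function names and the 1.96 threshold for a nominal 95% interval are illustrative and not taken from the manuscript.

```python
import math


def gaussian_log_likelihood(y, mean, std):
    """Log density of y under N(mean, std^2)."""
    return -0.5 * math.log(2.0 * math.pi * std ** 2) - (y - mean) ** 2 / (2.0 * std ** 2)


def evaluate_predictive(ys, means, stds, z=1.96):
    """Return (mean log-likelihood, fraction of targets inside mean +/- z*std)."""
    lls = [gaussian_log_likelihood(y, m, s) for y, m, s in zip(ys, means, stds)]
    hits = [abs(y - m) <= z * s for y, m, s in zip(ys, means, stds)]
    return sum(lls) / len(lls), sum(hits) / len(hits)


# Hypothetical targets and predictions, purely for illustration.
mean_ll, coverage = evaluate_predictive([0.0, 3.0], [0.0, 0.0], [1.0, 1.0])
print(mean_ll, coverage)
```

Comparing these two numbers between an exact-attention variant and an approximate one, on the same held-out tasks, is one way to separate modeling gains from changes in calibration.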
Circularity Check
No significant circularity; new architecture and mechanisms defined independently of their claimed outputs
The paper introduces TNP-KR via explicitly defined components (the KRBlock with stated O(n_c² + n_c n_t) complexity, a kernel-based attention bias, SA, and DKA) whose properties and benchmark performance are presented as empirical engineering results rather than derived predictions. No equations, definitions, or central claims reduce by construction to fitted parameters, self-citations, or prior inputs. Posterior fidelity and the complexity reductions are asserted as design outcomes and checked on external benchmarks, so the argument is self-contained.