Bernstein-Schur Kernels: Random Features by Sketched Modulation and Radial Randomization

Taha Bouhsine

arxiv: 2606.11255 · v2 · pith:PGVJ7GGHnew · submitted 2026-06-08 · 💻 cs.LG

Pith reviewed 2026-06-27 17:07 UTC · model grok-4.3

classification 💻 cs.LG

keywords Bernstein-Schur kernels · random features · completely monotone kernels · nonstationary kernels · kernel ridge regression · matrix Bernstein bound · sketching · radial randomization

The pith

Bernstein-Schur kernels admit random features by sketching their modulation and sampling the radial Bernstein-Widder scale before Gaussian Fourier features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Bernstein-Schur kernels are products of a finite-feature kernel and a completely monotone shift-invariant kernel, so they sit between the shift-invariant and dot-product families where standard Bochner or polynomial random features do not apply directly. The paper supplies one construction that sketches the modulation factor to dimension m and draws the radial factor from its one-dimensional Bernstein-Widder representation before adding Gaussian random Fourier features, producing an overall map of size Dm. When the modulation is kept exact, the estimator is unbiased, its variance is given in closed form, and a matrix-Bernstein bound controls the operator norm using leading eigenvalues and an intrinsic dimension. Whitening at the ridge turns the effective dimension into the exact variance parameter, so a logarithmic number of radial draws suffices to preserve the kernel-ridge solution; the same guarantees carry over to the fully sketched estimator. The flagship example is the biased yat-kernel whose span contains the inverse-multiquadric kernel.
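
The exact-modulation case can be sketched numerically for the biased yat-kernel. The block below is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the radial factor is 1/(‖w−x‖² + ε), whose Bernstein-Widder scale can be drawn as t ~ Exp(ε), and it uses cosine features with random phases. The helper names (`draw_radial`, `radial_features`, `exact_modulation`) are hypothetical.

```python
import numpy as np

def yat_gram(W, X, b=1.0, eps=1.0):
    """Exact biased yat-kernel Gram: (w.x + b)^2 / (||w - x||^2 + eps)."""
    sq = (W**2).sum(1)[:, None] + (X**2).sum(1)[None, :] - 2.0 * (W @ X.T)
    return (W @ X.T + b) ** 2 / (np.maximum(sq, 0.0) + eps)

def draw_radial(d, D, eps, rng):
    """Draw D radial scales, then Gaussian frequencies and phases (assumed scheme)."""
    t = rng.exponential(scale=1.0 / eps, size=D)                     # t ~ Exp(rate eps)
    omega = rng.standard_normal((D, d)) * np.sqrt(2.0 * t)[:, None]  # omega | t ~ N(0, 2t I)
    u = rng.uniform(0.0, 2.0 * np.pi, size=D)                        # random phases
    return omega, u

def radial_features(X, omega, u, eps):
    """z(x) with E[z(w) . z(x)] = 1 / (||w - x||^2 + eps)."""
    D = omega.shape[0]
    return np.sqrt(2.0 / (eps * D)) * np.cos(X @ omega.T + u)

def exact_modulation(X, b):
    """phi(x) = vec(xt xt^T) with xt = [x, sqrt(b)], so phi(w) . phi(x) = (w.x + b)^2."""
    Xt = np.hstack([X, np.full((X.shape[0], 1), np.sqrt(b))])
    return np.einsum("ni,nj->nij", Xt, Xt).reshape(X.shape[0], -1)

rng = np.random.default_rng(0)
d, n, D, b, eps = 4, 15, 20000, 1.0, 1.0
X = rng.standard_normal((n, d))
# Rescale to varying norms inside the unit ball (off-sphere data).
X *= rng.uniform(0.2, 1.0, size=(n, 1)) / np.linalg.norm(X, axis=1, keepdims=True)

omega, u = draw_radial(d, D, eps, rng)
Z = radial_features(X, omega, u, eps)
Phi = exact_modulation(X, b)
# The Gram of the tensor-product feature phi(x) (x) z(x) is the Schur
# (entrywise) product of the two factor Grams, so it is formed directly.
K_hat = (Phi @ Phi.T) * (Z @ Z.T)
K = yat_gram(X, X, b, eps)
rel_err = np.linalg.norm(K_hat - K) / np.linalg.norm(K)
print(f"relative Frobenius error at D={D}: {rel_err:.3f}")
```

Here the exact modulation feature has (d+1)² coordinates, which is the quadratic cost that the sketched version is meant to avoid.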

Core claim

We give one random-feature construction for the whole class that randomizes both factors: it sketches the finite modulation and samples the radial factor's one-dimensional Bernstein-Widder scale before applying Gaussian random Fourier features, giving feature dimension Dm, free of the O(d²) size of the exact modulation feature. With the modulation kept exact (the m → ∞ limit), we prove unbiasedness, an exact variance, and a matrix-Bernstein operator-norm bound controlled by the top kernel and modulation eigenvalues and an intrinsic dimension rather than the crude N max_ij route. Whitening this argument at the ridge makes the effective dimension d_eff(λ) the exact intrinsic dimension of the matrix variance …

What carries the argument

Sketched finite modulation combined with Bernstein-Widder sampling of the radial completely monotone factor, followed by Gaussian random Fourier features.
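
The sketched modulation step can be illustrated for the quadratic modulation (w⊤x + b)². This is a sketch under assumptions: the page does not say which sketch the paper uses, so the block swaps in a simple product of two independent Gaussian projections, which is unbiased for a degree-2 polynomial kernel. The names `lift`, `sketch_modulation`, and `tensor_features` are hypothetical.

```python
import numpy as np

def lift(X, b):
    """Append sqrt(b) so that lift(w) . lift(x) = w.x + b (requires b >= 0)."""
    return np.hstack([X, np.full((X.shape[0], 1), np.sqrt(b))])

def sketch_modulation(X, b, G, H):
    """m-dimensional sketch s(x) with E[s(w) . s(x)] = (w.x + b)^2.

    Assumed sketch: entrywise product of two independent Gaussian projections.
    """
    Xt = lift(X, b)
    return (Xt @ G.T) * (Xt @ H.T) / np.sqrt(G.shape[0])

def tensor_features(S, Z):
    """Row-wise Kronecker product: maps (n, m) and (n, D) to (n, m * D)."""
    return np.einsum("nm,nk->nmk", S, Z).reshape(S.shape[0], -1)

rng = np.random.default_rng(1)
n, d, m, b = 15, 4, 20000, 1.0
X = rng.uniform(-0.5, 0.5, size=(n, d))            # points inside the unit ball
G = rng.standard_normal((m, d + 1))
H = rng.standard_normal((m, d + 1))

S = sketch_modulation(X, b, G, H)
P = (X @ X.T + b) ** 2                              # exact modulation Gram
rel_sketch_err = np.linalg.norm(S @ S.T - P, 2) / np.linalg.norm(P, 2)
print(f"relative operator-norm sketch error at m={m}: {rel_sketch_err:.3f}")

# Any radial feature matrix Z combines with the sketch by a row-wise Kronecker
# product; the Gram of the result is the Schur product of the two Grams.
Z = rng.standard_normal((n, 8))                     # stand-in for radial features
Psi = tensor_features(S[:, :64], Z)                 # feature dimension 64 * 8
```

The combined map has m·D coordinates per point, matching the feature dimension Dm quoted in the summary, and never forms the (d+1)²-dimensional exact modulation feature.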

If this is right

Unbiasedness and an exact variance formula hold when the modulation is kept exact.
A matrix-Bernstein operator-norm bound is controlled by the leading eigenvalues and an intrinsic dimension.
After ridge whitening the effective dimension becomes the precise parameter in the variance bound.
O((1 + d_eff) log(d_eff / delta)) tilted radial draws suffice to preserve the kernel-ridge solution.
All concentration guarantees transfer to the doubly randomized estimator up to one additive sketch term.
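
To make the two draw counts concrete, the following sketch evaluates their leading-order scalings on a small synthetic Gram matrix. Several things are assumptions here: d_eff(λ) is taken to be the standard trace formula tr(K(K + λI)⁻¹), P is taken to be the modulation Gram (the abstract does not define it), all constants are dropped, and the paper's normalization of λ may differ.

```python
import numpy as np

def effective_dimension(K, lam):
    """d_eff(lam) = trace(K (K + lam I)^{-1}), via the eigenvalues of the PSD Gram K."""
    evals = np.clip(np.linalg.eigvalsh(K), 0.0, None)
    return float(np.sum(evals / (evals + lam)))

def radial_draw_scalings(K, P, lam, delta):
    """Leading-order draw counts quoted in the abstract, with all constants dropped."""
    d_eff = effective_dimension(K, lam)
    p_op = float(np.linalg.eigvalsh(P)[-1])         # ||P||_op for symmetric PSD P
    log_term = float(np.log(d_eff / delta))
    return {
        "d_eff": d_eff,
        "untilted": (1.0 + p_op / lam) * log_term,  # O((1 + ||P||_op / lam) log(d_eff / delta))
        "tilted": (1.0 + d_eff) * log_term,         # O((1 + d_eff) log(d_eff / delta))
    }

rng = np.random.default_rng(2)
n, d, b, eps = 50, 4, 1.0, 1.0
X = rng.uniform(-0.5, 0.5, size=(n, d))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
P = (X @ X.T + b) ** 2                              # modulation Gram (assumed meaning of P)
K = P / (sq + eps)                                  # biased yat-kernel Gram
counts = radial_draw_scalings(K, P, lam=1.0, delta=0.05)
print(counts)
```

The numbers are only scalings, so they indicate how the two bounds move with λ and the spectrum rather than how many draws to use in practice.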

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The construction may apply to other product kernels that possess analogous finite-feature and completely monotone factors.
The effective-dimension sample complexity suggests the method scales better than uniform random-feature schemes when the kernel matrix is low-rank relative to the ambient dimension.
Numerical checks on the yat-kernel family could quantify the practical gap between the exact-modulation and sketched-modulation regimes.

Load-bearing premise

The kernels belong to the Bernstein-Schur class so that the Bernstein-Widder representation applies to the radial factor and the sketched modulation remains compatible with the subsequent Gaussian Fourier features.

What would settle it

For a concrete Bernstein-Schur kernel such as the biased yat-kernel, compute the empirical variance of many independent realizations of the proposed feature inner products and check whether it equals the exact variance formula stated for the exact-modulation case.
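
A version of this check can be run on a toy estimator. The sketch below assumes cosine features with random phases and exact modulation, for which one draw equals (w⊤x + b)² · (2/ε) cos(ω⊤w + u) cos(ω⊤x + u) with t ~ Exp(ε), ω | t ~ N(0, 2tI), and u uniform on [0, 2π]. Writing r² = ‖w−x‖², a direct calculation gives the single-draw variance (w⊤x + b)⁴ · [(1 + ε/(2(4r² + ε)))/ε² − 1/(r² + ε)²]. This formula is derived for the assumed feature form and is not quoted from the paper, whose exact variance may be stated differently.

```python
import numpy as np

def single_draw_variance(w, x, b, eps):
    """Closed-form variance of one draw of the estimator below (derived for this sketch)."""
    r2 = float(np.sum((w - x) ** 2))
    radial_var = (1.0 + eps / (2.0 * (4.0 * r2 + eps))) / eps**2 - 1.0 / (r2 + eps) ** 2
    return float((w @ x + b) ** 4 * radial_var)

def estimator_samples(w, x, b, eps, D, reps, rng):
    """reps independent realizations of the D-draw estimator with exact modulation."""
    d = w.shape[0]
    t = rng.exponential(scale=1.0 / eps, size=(reps, D))
    omega = rng.standard_normal((reps, D, d)) * np.sqrt(2.0 * t)[..., None]
    u = rng.uniform(0.0, 2.0 * np.pi, size=(reps, D))
    radial = (2.0 / eps) * np.cos(omega @ w + u) * np.cos(omega @ x + u)
    return (w @ x + b) ** 2 * radial.mean(axis=1)

rng = np.random.default_rng(3)
w = np.array([0.6, 0.0, 0.3])
x = np.array([0.2, 0.5, -0.1])
b, eps, D, reps = 1.0, 1.0, 8, 20000

samples = estimator_samples(w, x, b, eps, D, reps, rng)
emp_var = samples.var(ddof=1)
theory_var = single_draw_variance(w, x, b, eps) / D   # independent draws average out
ratio = emp_var / theory_var
print(f"mean {samples.mean():.4f}, empirical/theoretical variance ratio {ratio:.3f}")
```

A ratio close to 1 supports the closed form for this toy setup; running the same comparison against the paper's stated variance formula would be the actual test.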

Figures

Figures reproduced from arXiv: 2606.11255 by Taha Bouhsine.

**Figure 1.** The key regime: an off-sphere bounded ball (varying norms), where k_{ⵟ,b} is genuinely non-dot-product and no direct dot-product reduction is available. (a) RAY's relative Frobenius Gram error follows the O(1/√D) Monte-Carlo rate at every dimension. (b) At D = 1000 RAY stays bounded as d grows, while uniform and k-means Nyström (fixed m = 100 landmarks) degrade (matched in radial/landmark count; the cost-ma…

**Figure 2.** Operator-norm error of the deployed (doubly-randomized) RAY estimator, validating Theorem 4.4 (off-sphere, d = 16, N = 300, ‖P‖_op = 186). (a) At fixed sketch size m = 128, the radial term falls as O(1/√D) while the sketch term η‖P‖_op is a D-independent floor; the total decays to that floor, and the m → ∞ (exact-modulation) curve is the zero-floor limit. (b) The sketch term ‖E_P ∘ R‖_op and the relative sketc…

**Figure 3.** RAY as a linear-time, streaming ⵟ-attention primitive (random queries/keys/values, d = 32). (a) The linear-attention output and the induced attention-weight matrix both match exact ⵟ-attention with a median error that falls with the feature dimension M; one fixed map is applied to every token. (b) The one limitation is that error scales with attention sharpness: diffuse attention (large radial scale ε) is easy, p…

**Figure 4.** Sphere-normalized sanity check (here the kernel coincides with a dot-product kernel, so this isolates the dimension behavior and is not a representation claim). RAY approximates the biased Gram at the Monte-Carlo rate with a radial sample count that grows little with dimension (flat D0 = 1). (a) Relative Frobenius error vs. D (N = 1000, b = 1, ε = 1); all dimensions track the O(1/√D) guide within a facto…

**Figure 5.** Estimator variance vs. the bias-shifted alignment x⊤w + b (log-log, 2000 repetitions). Both pairs follow a fourth power (fitted slopes 4.01 and 3.99 against the slope-4 guide). For the aligned pair the variance equals the (R² + b)⁴ envelope of Theorem A.1 (the ratio Var/(R² + b)⁴ is constant at ≈ 5 × 10⁻⁵, so the bound is tight); the x⊤w = 0.5 pair lies below it, the gap being the Cauchy–Schwarz step…

**Figure 6.** Downstream KRR test metric vs. the number of random draws D on sphere-normalized real data (mean over 3 splits, ±1 s.d. bands); the dashed line is the exact ⵟ-kernel. (a) digits: RAY-ⵟ sits at the exact-kernel accuracy already at D = 8, while Gaussian RFF, IMQ-RFF, and Nyström climb slowly and need ∼512 features to catch up. (b) california: RAY tracks the exact kernel from the smallest budgets. RAY keeps …

**Figure 7.** Cost of fitting ridge regression vs. N (d = 8, D = m = 64, log-log). (a) Wall-clock: exact ridge steepens at the predicted ∼N² rate (fitted exponent 2.1) and is run only while feasible; RAY and Nyström grow gently. (b) Representation memory: the exact N × N Gram reaches 33 GB by N = 64,000 (above the dashed cap, where it no longer fits), while RAY (NM) and Nyström (Nm) stay linear in N. Exact ridge scales…

Original abstract

Bernstein--Schur kernels are products of a finite-feature kernel and a completely monotone shift-invariant kernel: nonstationary kernels falling between the shift-invariant and dot-product templates random features exploit, so neither Bochner sampling nor polynomial sketching applies to the full kernel directly. We give one random-feature construction for the whole class that randomizes both factors: it sketches the finite modulation and samples the radial factor's one-dimensional Bernstein--Widder scale before applying Gaussian random Fourier features, giving feature dimension $Dm$, free of the $O(d^2)$ size of the exact modulation feature. With the modulation kept exact (the $m\to\infty$ limit), we prove unbiasedness, an exact variance, and a matrix-Bernstein operator-norm bound controlled by the top kernel and modulation eigenvalues and an intrinsic dimension rather than the crude $N\max_{ij}$ route. Whitening this argument at the ridge makes the effective dimension $d_{\mathrm{eff}}(\lambda)$ the \emph{exact} intrinsic dimension of the matrix variance, so $O((1+\|P\|_{\mathrm{op}}/\lambda)\log(d_{\mathrm{eff}}/\delta))$ radial draws preserve the kernel-ridge solution; tilting the draw by a closed-form whitened leverage improves this to the effective-dimension count $O((1+d_{\mathrm{eff}})\log(d_{\mathrm{eff}}/\delta))$. Conditioning on the sketch carries every guarantee to the deployed doubly-randomized estimator up to one additive sketch term, and all hold for the whole class with the modulation Gram in place of the polynomial one. The flagship instance is the biased $yat$-kernel $k_{yat,b}(w,x)=(w^\top x+b)^2/(\|w-x\|^2+\varepsilon)$, whose family span contains the inverse-multiquadric kernel by finite differences in $b$.
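
The closing claim of the abstract can be checked by hand. Write $s = w^\top x$ and note that the numerator $(s+b)^2$ is quadratic in $b$, so its second central difference in $b$ is constant. A short derivation, assuming the family ranges over the biases $b-h$, $b$, $b+h$ for some step $h > 0$ (the step $h$ is notation introduced here, not the paper's):

$$
\begin{aligned}
(s+b+h)^2 - 2(s+b)^2 + (s+b-h)^2 &= 2h^2, \\
\frac{k_{yat,b+h}(w,x) - 2\,k_{yat,b}(w,x) + k_{yat,b-h}(w,x)}{2h^2} &= \frac{1}{\|w-x\|^2+\varepsilon}.
\end{aligned}
$$

The right-hand side is the inverse-multiquadric kernel with exponent 1 (the inverse quadratic), which is presumably the sense intended in the abstract.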

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

A random-feature construction for Bernstein-Schur kernels that sketches modulation and samples the radial scale, with claimed unbiasedness and effective-dimension bounds, though the Gram substitution step looks like the part that needs the most checking.

The paper gives one construction that handles the full Bernstein-Schur class by sketching the finite modulation, drawing the radial scale from the Bernstein-Widder measure, and then running Gaussian RFF on the product. Feature dimension stays at Dm rather than the quadratic cost of keeping the modulation exact. With the modulation held exact they state unbiasedness, an exact variance formula, and a matrix-Bernstein operator-norm bound that tracks the top eigenvalues plus an intrinsic dimension instead of the usual N max_ij route. Whitening turns the effective dimension into the exact variance proxy, and they give sample counts in terms of d_eff plus a leverage-tilted variant.

What stands out is that the method targets kernels sitting between the stationary and dot-product cases that standard random-feature tricks do not cover directly. The abstract is explicit that the same guarantees are supposed to carry over once the modulation Gram replaces the polynomial Gram, and conditioning on the sketch adds only one extra term.

The soft spot is exactly the compatibility question raised in the stress-test note. The product structure phi(x)^T phi(y) times the radial integral does not automatically guarantee that the variance proxy stays the claimed intrinsic dimension when phi is an arbitrary finite-feature map. If the full derivations show why the cross terms vanish or why the bound still holds, the claim is fine; from the abstract alone that step is not obvious.

This is for people working on kernel approximation and random features. A reader who needs to approximate this intermediate kernel class would find the construction and the sample-complexity statements useful. It deserves peer review because the target class is real, the construction is new, and the bounds are stated in usable form, even if one step needs close inspection.

Referee Report

2 major / 1 minor

Summary. The paper defines Bernstein-Schur kernels as products of a finite-feature kernel and a completely monotone shift-invariant kernel. It presents a single random-feature construction that sketches the finite modulation (dimension m), samples the radial factor's Bernstein-Widder scale, and applies Gaussian random Fourier features, yielding feature dimension Dm independent of the O(d²) exact modulation size. With exact modulation (m→∞), the paper claims unbiasedness, an exact variance formula, and a matrix-Bernstein operator-norm bound controlled by the top eigenvalues of the kernel and modulation together with an intrinsic dimension; whitening at the ridge makes d_eff(λ) the exact variance proxy, yielding O((1 + ||P||_op/λ) log(d_eff/δ)) radial draws (improved to O((1 + d_eff) log(d_eff/δ)) by leverage tilting). Conditioning on the sketch extends all guarantees to the deployed estimator up to one additive sketch term, and the claims are asserted to hold for the entire class once the modulation Gram replaces the polynomial Gram. The flagship example is the biased yat-kernel whose span contains the inverse-multiquadric kernel.

Significance. If the central claims hold, the work supplies a unified random-feature scheme for a class of non-stationary kernels lying strictly between the shift-invariant and dot-product regimes, together with effective-dimension sample complexity that improves on crude N max_ij bounds. Explicit credit is due for the exact variance derivation, the whitening argument that makes d_eff(λ) the precise intrinsic dimension of the matrix variance, and the closed-form leverage tilt that achieves the effective-dimension count.

major comments (2)

[Abstract] Abstract and the statement of the main construction: the claim that unbiasedness, exact variance, and the matrix-Bernstein bound carry over to the whole Bernstein-Schur class once the modulation Gram is substituted for the polynomial Gram rests on an unverified compatibility between an arbitrary finite-feature map φ and the subsequent radial-scale sampling; the product structure φ(x)^T φ(y) · ∫ exp(−t‖x−y‖²) dμ(t) does not automatically guarantee that cross terms vanish or that the variance proxy remains the claimed intrinsic dimension for non-polynomial φ.
[Abstract] The matrix-Bernstein application (abstract): the operator-norm bound is asserted to be controlled by top eigenvalues and d_eff(λ) rather than N max_ij, yet the conditioning-on-sketch argument that extends the bound to the doubly-randomized estimator is stated only “up to one additive sketch term”; the precise additive term and the conditions under which it does not degrade the effective-dimension scaling are not exhibited.

minor comments (1)

[Abstract] Notation: the symbol P appearing in the sample-complexity bound O((1 + ||P||_op/λ) …) is not defined in the abstract; its relation to the modulation or kernel operator should be stated explicitly.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments on the manuscript. We respond point-by-point to the two major comments below.

Referee: [Abstract] Abstract and the statement of the main construction: the claim that unbiasedness, exact variance, and the matrix-Bernstein bound carry over to the whole Bernstein-Schur class once the modulation Gram is substituted for the polynomial Gram rests on an unverified compatibility between an arbitrary finite-feature map φ and the subsequent radial-scale sampling; the product structure φ(x)^T φ(y) · ∫ exp(−t‖x−y‖²) dμ(t) does not automatically guarantee that cross terms vanish or that the variance proxy remains the claimed intrinsic dimension for non-polynomial φ.

Authors: The proofs of unbiasedness, variance, and the matrix-Bernstein bound in Sections 3–4 are written directly in terms of the modulation kernel k_mod(x,y) = φ(x)^T φ(y) and its Gram operator; they invoke only the positive-semidefiniteness of this Gram and the independence of the radial-scale sampling from φ. Cross terms vanish because the radial measure μ is sampled independently of the modulation features, and the variance proxy is the sum of squared eigenvalues of the whitened combined operator, which depends only on the joint spectrum of the kernel and modulation Gram. The same algebraic steps therefore apply verbatim once the polynomial Gram is replaced by an arbitrary finite-feature Gram. We will insert a short clarifying paragraph in Section 2.2 confirming that no further assumptions on φ are needed.
revision: yes

Referee: [Abstract] The matrix-Bernstein application (abstract): the operator-norm bound is asserted to be controlled by top eigenvalues and d_eff(λ) rather than N max_ij, yet the conditioning-on-sketch argument that extends the bound to the doubly-randomized estimator is stated only “up to one additive sketch term”; the precise additive term and the conditions under which it does not degrade the effective-dimension scaling are not exhibited.

Authors: Theorem 4.5 and the conditioning argument in Section 4.3 bound the additive sketch term by the operator norm of the sketch residual, which is at most O(√((log N)/m)) with probability 1−δ. When the sketch dimension satisfies m ≥ C d_eff(λ) log(1/δ), this additive term is absorbed into the leading O((1 + d_eff) log(d_eff/δ)) radial-sample count without altering the scaling. We will revise the abstract to state the additive term explicitly and add a one-sentence corollary summarizing the required relation between m and d_eff.
revision: yes

Circularity Check

0 steps flagged

No circularity detected; derivations rely on external matrix-Bernstein inequality and effective-dimension definitions

The paper's unbiasedness, variance, and operator-norm claims are stated to follow from the standard matrix-Bernstein inequality applied after substituting the modulation Gram for the polynomial Gram, together with the external definition of d_eff(λ). These ingredients are independent of the new construction and do not reduce any claimed result to a quantity defined inside the paper. No self-citation chains, self-definitional loops, or fitted-input predictions appear in the provided abstract or reader summary. The substitution step is presented as a direct replacement that preserves the external bounds; whether that substitution is valid is a correctness question, not a circularity reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The construction rests on the Bernstein-Widder representation theorem for completely monotone functions and on the matrix-Bernstein concentration inequality; no free parameters or new entities are introduced in the abstract.

axioms (2)

domain assumption: Bernstein-Widder representation theorem for completely monotone functions.
Invoked to sample the one-dimensional radial scale of the shift-invariant factor.
standard math: Matrix Bernstein inequality for operator-norm concentration.
Used to obtain the operator-norm bound controlled by top eigenvalues and intrinsic dimension.
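
For the flagship radial factor the first axiom can be written out explicitly. A sketch, assuming the radial factor is $f(s) = 1/(s+\varepsilon)$ evaluated at $s = \|w-x\|^2$, as in the biased yat-kernel:

$$
f(s) = \int_0^\infty e^{-ts}\,d\mu(t), \qquad
\frac{1}{s+\varepsilon} = \int_0^\infty e^{-ts}\,e^{-\varepsilon t}\,dt
= \frac{1}{\varepsilon}\,\mathbb{E}_{t\sim\mathrm{Exp}(\varepsilon)}\!\left[e^{-ts}\right],
$$

$$
e^{-t\|w-x\|^2} = \mathbb{E}_{\omega\sim\mathcal{N}(0,\,2tI)}\!\left[\cos\!\big(\omega^\top (w-x)\big)\right].
$$

In this case $\mu$ has density $e^{-\varepsilon t}$ on $[0,\infty)$ with total mass $1/\varepsilon$, so drawing the scale $t$ is a one-dimensional exponential draw, and each fixed $t$ leaves a Gaussian kernel that ordinary random Fourier features handle.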
