Sobolev Norm Learning Rates for Conditional Mean Embeddings
Pith reviewed 2026-05-24 13:55 UTC · model grok-4.3
The pith
Conditional mean embeddings admit explicit adaptive convergence rates in the misspecified regime via Sobolev interpolation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We develop novel learning rates for conditional mean embeddings by applying the theory of interpolation for reproducing kernel Hilbert spaces. We derive explicit, adaptive convergence rates for the sample estimator under the misspecified setting, where the target operator is not Hilbert-Schmidt or bounded with respect to the input/output RKHSs. We demonstrate that in certain parameter regimes, we can achieve uniform convergence rates in the output RKHS.
What carries the argument
Interpolation conditions between Sobolev norms on the input and output RKHSs that the target conditional mean embedding must satisfy.
If this is right
- The sample estimator for the conditional mean embedding converges at rates determined by the interpolation parameters rather than by Hilbert-Schmidt or boundedness assumptions.
- Uniform convergence in the output RKHS is attainable when the output-space smoothness exceeds a threshold set by the input-space parameters.
- Conditional mean embeddings become applicable to infinite-dimensional RKHSs and continuous state spaces where prior bounded-operator assumptions fail.
- The rates adapt automatically to the degree of misspecification encoded in the Sobolev relations.
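As a rough illustration of the sample estimator the points above refer to, the sketch below computes the standard regularized empirical weights β(x) = (K + nλI)^{-1} k_X(x) and uses them to estimate E[f(Y) | X = x] as Σ_i β_i(x) f(y_i). The Gaussian kernel, the bandwidth, the function names, and the toy data are all assumptions made for illustration; none of them are taken from the paper.

```python
import numpy as np

def gaussian_gram(a, b, bandwidth=0.5):
    """Gaussian Gram matrix between rows of a and b (kernel choice is an assumption)."""
    sq_dists = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2))

def cme_weights(x_train, x_query, lam, bandwidth=0.5):
    """Weights beta(x) = (K + n*lam*I)^{-1} k_X(x) of the regularized sample estimator."""
    n = x_train.shape[0]
    gram = gaussian_gram(x_train, x_train, bandwidth)    # shape (n, n)
    cross = gaussian_gram(x_train, x_query, bandwidth)   # shape (n, m)
    return np.linalg.solve(gram + n * lam * np.eye(n), cross)

def conditional_expectation(f_values, weights):
    """Estimate E[f(Y) | X = x] as sum_i beta_i(x) * f(y_i) for each query point."""
    return weights.T @ f_values

# Hypothetical toy data: Y = sin(3X) + small Gaussian noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(200, 1))
y = np.sin(3.0 * x[:, 0]) + 0.1 * rng.standard_normal(200)

x_query = np.array([[0.0], [0.5]])
w = cme_weights(x, x_query, lam=1e-3)
est = conditional_expectation(y, w)  # f(y) = y, so this estimates E[Y | X = x]
print(est)
```

With f(y) = y the estimate reduces to kernel ridge regression, which gives a simple sanity check: the two printed values should sit near sin(0) and sin(1.5).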
Where Pith is reading between the lines
- These rates could support more reliable use of conditional mean embeddings inside continuous-state reinforcement learning algorithms that previously avoided them due to unboundedness concerns.
- Kernel selection in practice might be guided by checking whether the expected smoothness of the embedding satisfies the required interpolation relations on held-out data.
- The same interpolation technique could be applied to derive rates for other kernel operators, such as those appearing in kernel-based policy evaluation.
Load-bearing premise
The target conditional mean embedding satisfies specific interpolation conditions between Sobolev norms on the input and output reproducing kernel Hilbert spaces.
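For orientation, the interpolation scale behind this premise can be written out under the usual power-space construction for RKHSs. This is a sketch, assuming (µ_i, e_i) are the eigenpairs of the input kernel's integral operator on L²(ν); the bound B is the constant appearing in the paper passage quoted on this page.

```latex
\[
  H_K^{\beta}
  = \Bigl\{ \sum_{i \ge 1} a_i \, \mu_i^{\beta/2} e_i \;:\; (a_i)_{i \ge 1} \in \ell^2 \Bigr\},
  \qquad
  \Bigl\| \sum_{i \ge 1} a_i \, \mu_i^{\beta/2} e_i \Bigr\|_{H_K^{\beta}}
  = \Bigl( \sum_{i \ge 1} a_i^2 \Bigr)^{1/2},
\]
\[
  \text{premise:} \qquad \bigl\| C^{\beta}_{Y|X} \bigr\| \le B < \infty .
\]
```

Roughly, β = 1 recovers the input RKHS itself and smaller β gives larger spaces approaching L²(ν), so the misspecified regime corresponds to β < 1.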
What would settle it
Run a numerical experiment with a known conditional mean embedding that violates the assumed Sobolev-norm interpolation relations and check whether the predicted convergence rates are observed; mismatch would show the rates do not apply.
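A minimal sketch of how such a check could be organized, assuming the rate quoted from Theorem 5 on this page: λ_n ≍ (log^r n / n)^{1/max{α, β+p}} and an error of order (n / log^r n)^{-(β−γ)/(2 max{α, β+p})}. The constant K and the log(δ^{-1}) factor are dropped, and the function names and parameter values are hypothetical.

```python
import math

def regularization_schedule(n, alpha, beta, p, r=2.0):
    """lambda_n proportional to (log^r n / n)^(1 / max(alpha, beta + p)); constant omitted."""
    return (math.log(n) ** r / n) ** (1.0 / max(alpha, beta + p))

def predicted_rate(n, alpha, beta, gamma, p, r=2.0):
    """(n / log^r n)^(-(beta - gamma) / (2 * max(alpha, beta + p))), without K * log(1/delta)."""
    exponent = (beta - gamma) / (2.0 * max(alpha, beta + p))
    return (n / math.log(n) ** r) ** (-exponent)

def loglog_slope(xs, ys):
    """Least-squares slope of log(ys) against log(xs)."""
    lx = [math.log(v) for v in xs]
    ly = [math.log(v) for v in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den

# Hypothetical parameter values, chosen only to exercise the formulas.
alpha, beta, gamma, p, r = 1.0, 0.8, 0.2, 0.5, 2.0
ns = [10 ** 3, 10 ** 4, 10 ** 5]

effective_n = [n / math.log(n) ** r for n in ns]
errors = [predicted_rate(n, alpha, beta, gamma, p, r) for n in ns]  # stand-in for measured errors
slope = loglog_slope(effective_n, errors)
print(slope, -(beta - gamma) / (2.0 * max(alpha, beta + p)))
```

In a real experiment the `errors` list would hold measured γ-norm errors at each sample size. A fitted slope that drifts away from the predicted exponent, on a problem built to violate the interpolation relations, would be the mismatch described above.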
Original abstract
We develop novel learning rates for conditional mean embeddings by applying the theory of interpolation for reproducing kernel Hilbert spaces (RKHS). We derive explicit, adaptive convergence rates for the sample estimator under the misspecified setting, where the target operator is not Hilbert-Schmidt or bounded with respect to the input/output RKHSs. We demonstrate that in certain parameter regimes, we can achieve uniform convergence rates in the output RKHS. We hope our analyses will allow the much broader application of conditional mean embeddings to more complex ML/RL settings involving infinite-dimensional RKHSs and continuous state spaces.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops novel learning rates for conditional mean embeddings by applying interpolation theory for reproducing kernel Hilbert spaces. It derives explicit, adaptive convergence rates for the sample estimator under the misspecified setting where the target operator is not Hilbert-Schmidt or bounded with respect to the input/output RKHSs, and demonstrates that uniform convergence rates in the output RKHS are achievable in certain parameter regimes.
Significance. If the derived rates hold, the work would meaningfully extend the applicability of conditional mean embeddings to misspecified infinite-dimensional settings common in complex ML and RL tasks with continuous state spaces. The explicit and adaptive character of the rates, obtained via Sobolev-norm interpolation, is a technical strength that addresses a gap in existing analyses.
Major comments (1)
- [Abstract] Abstract and the interpolation-based derivation: the explicit adaptive rates under misspecification are obtained only when the target conditional mean embedding satisfies specific interpolation inequalities between Sobolev norms on the input and output RKHSs; the manuscript supplies no verification, examples, or conditions under which these relations hold for concrete kernels or data-generating processes, rendering the rates conditional rather than generally applicable.
Simulated Author's Rebuttal
We thank the referee for their careful reading of the manuscript and for highlighting an important point regarding the applicability of the derived rates. We address the major comment below.
-
Referee: [Abstract] Abstract and the interpolation-based derivation: the explicit adaptive rates under misspecification are obtained only when the target conditional mean embedding satisfies specific interpolation inequalities between Sobolev norms on the input and output RKHSs; the manuscript supplies no verification, examples, or conditions under which these relations hold for concrete kernels or data-generating processes, rendering the rates conditional rather than generally applicable.
Authors: We agree that the explicit adaptive rates are derived under the assumption that the target conditional mean embedding satisfies the stated interpolation inequalities between Sobolev norms. These inequalities are standard in the theory of interpolation spaces and Sobolev embeddings for RKHSs, and they are satisfied for a range of kernels (e.g., Matérn kernels of sufficient smoothness) and data-generating processes with appropriate regularity. However, the manuscript does not include explicit verification, examples, or sufficient conditions for concrete kernels. In the revised version we will add a dedicated remark (or short subsection) that states verifiable conditions on the kernel and the conditional distribution under which the interpolation inequalities hold, together with two concrete examples (one for a Gaussian kernel and one for a Matérn kernel) that satisfy the assumptions. This will make the scope of the results clearer without altering the main theorems.
Revision: yes
Circularity Check
No circularity: rates derived from standard RKHS interpolation theory under explicit assumptions
The paper applies existing interpolation theory for RKHS to obtain convergence rates for the sample estimator of the conditional mean embedding in the misspecified regime. The key conditions (target operator satisfying specific Sobolev-norm interpolation inequalities between input and output RKHSs) are stated as assumptions required for the rates to hold, rather than quantities derived or fitted within the paper itself. No self-definitional steps, fitted inputs renamed as predictions, or load-bearing self-citations that reduce the central claims to tautologies appear in the abstract or described derivation approach. The result is therefore conditional on external assumptions but self-contained as a derivation from those assumptions plus standard functional-analysis tools.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: We apply the theory of interpolation spaces for RKHS … require that the target 'conditioning function' lie in some intermediate fractional space between the input RKHS and L² … ||C^β_{Y|X}|| ≤ B < ∞
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: Theorem 5 … λ_n ≍ (log^r n / n)^{1/max{α, β+p}} … ||Ĉ_{Y|X} − C_{Y|X}||_γ ≤ K log(δ^{-1}) (n / log^r n)^{-(β−γ)/(2 max{α, β+p})}
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.