Nonparametric Instrumental Regression via Kernel Methods is Minimax Optimal

Arthur Gretton; Dimitri Meunier; Tim Christensen; Zhu Li

arxiv: 2411.19653 · v2 · submitted 2024-11-29 · 📊 stat.ML · cs.LG


Pith reviewed 2026-05-23 16:36 UTC · model grok-4.3

keywords kernel instrumental variables · nonparametric IV regression · minimax optimality · L2 convergence · spectral regularization · ill-posedness · kernel methods

The pith

Kernel instrumental variable regression attains minimax optimal strong L2 rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies the kernel instrumental variable algorithm for nonparametric regression with endogenous regressors. It establishes convergence in the strong L2 norm to the minimum-norm solution in the reproducing kernel Hilbert space, whether or not the structural function is identified. Under eigenvalue decay and source conditions, the derived learning rates are shown to be optimal over fixed smoothness classes by matching lower bounds. The analysis introduces a link condition to measure the ill-posedness induced by the instrument and shows that general spectral regularization improves rates by avoiding saturation.

Core claim

The KIV estimator attains minimax optimal convergence rates in the strong L2 norm for nonparametric instrumental variable regression. These rates are derived under standard eigenvalue-decay and source assumptions and quantified via a link condition that compares the covariance structure of the endogenous regressor with that induced by the instrument. Replacing the first-stage Tikhonov step with general spectral regularization avoids saturation and improves rates for smoother targets. The matching lower bound confirms that instrumental regression induces an unavoidable slowdown relative to ordinary kernel ridge regression.

What carries the argument

The link condition comparing the covariance structure of the endogenous regressor with that induced by the instrument, which quantifies the degree of ill-posedness.
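The Lean-citation section of this page quotes the condition as LINK(γ0,γ1): P_F C_X^γ0 P_F ≼ C_F ≼ C_X^γ1, where ≼ is the positive semi-definite (Loewner) order. The following is a minimal finite-dimensional sketch of what checking such a sandwich means, assuming plain symmetric matrices in place of covariance operators and omitting any constants the paper may allow. The helper names (`psd_power`, `loewner_leq`, `link_holds`) and the toy spectrum are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def psd_power(C, gamma):
    """Fractional power of a symmetric PSD matrix via its eigendecomposition."""
    w, V = np.linalg.eigh(C)
    w = np.clip(w, 0.0, None)  # guard against tiny negative round-off
    return (V * w**gamma) @ V.T

def loewner_leq(A, B, tol=1e-10):
    """True if A ≼ B, i.e. B - A is positive semi-definite up to tol."""
    D = B - A
    return bool(np.linalg.eigvalsh((D + D.T) / 2).min() >= -tol)

def link_holds(C_X, C_F, P_F, gamma0, gamma1):
    """Finite-dimensional analogue of P_F C_X^g0 P_F ≼ C_F ≼ C_X^g1 (constants omitted)."""
    lower = P_F @ psd_power(C_X, gamma0) @ P_F
    upper = psd_power(C_X, gamma1)
    return loewner_leq(lower, C_F) and loewner_leq(C_F, upper)

# Toy example (hypothetical): polynomially decaying spectrum, C_F = C_X^1.5, P_F = identity.
mu = np.arange(1, 51, dtype=float) ** -2.0
C_X = np.diag(mu)
C_F = np.diag(mu ** 1.5)
P_F = np.eye(50)

print(link_holds(C_X, C_F, P_F, gamma0=2.0, gamma1=1.0))
print(link_holds(C_X, C_F, P_F, gamma0=1.0, gamma1=2.0))
```

In this toy setting the eigenvalues are at most 1, so larger powers give smaller matrices: the sandwich holds for (γ0, γ1) = (2, 1) and fails when the exponents are swapped. The gap between γ0 and γ1 is what the page describes as the degree of ill-posedness.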

If this is right

When the structural function is not identified, the estimator converges to the minimum-norm IV solution in the associated reproducing kernel Hilbert space.
Convergence holds in the strong L2 norm rather than only in a weaker pseudo-norm.
General spectral regularization in the first stage avoids saturation and yields improved rates for smoother first-stage targets.
The rates are optimal over fixed smoothness classes and slower than those of ordinary kernel ridge regression.
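Saturation can be made concrete with the filter functions listed in the reference excerpts further down this page. The sketch below assumes the usual qualification criterion, namely how fast sup_x x^ρ |1 - x g_ξ(x)| shrinks as ξ → 0, and estimates the decay exponent numerically. The grid, the range of ξ, the choice ν = 3, and all function names are assumptions made for illustration.

```python
import numpy as np

def tikhonov(x, xi):
    return 1.0 / (x + xi)

def iterated_tikhonov(x, xi, nu=3):
    return ((x + xi) ** nu - xi ** nu) / (x * (x + xi) ** nu)

def truncation(x, xi):
    # Hard thresholding of eigenvalues at level xi (kernel PCR).
    return np.where(x >= xi, 1.0 / x, 0.0)

def gradient_flow(x, xi):
    # (1 - exp(-x / xi)) / x, written with expm1 for accuracy at small x / xi.
    return -np.expm1(-x / xi) / x

def residual_sup(g, xi, rho, grid):
    """sup over the grid of x^rho * |1 - x * g(x)|."""
    return np.max(grid ** rho * np.abs(1.0 - grid * g(grid, xi)))

def decay_exponent(g, rho, xis, grid):
    """Slope of log(residual_sup) against log(xi)."""
    sups = [residual_sup(g, xi, rho, grid) for xi in xis]
    return np.polyfit(np.log(xis), np.log(sups), 1)[0]

grid = np.logspace(-8, 0, 4000)   # spectrum values in (0, 1]
xis = np.logspace(-4, -2, 9)      # regularization levels
for name, g in [("tikhonov", tikhonov),
                ("iterated_tikhonov", iterated_tikhonov),
                ("truncation", truncation),
                ("gradient_flow", gradient_flow)]:
    print(name, round(decay_exponent(g, 2.0, xis, grid), 2))
```

For smoothness ρ = 2 the Tikhonov residual decays only like ξ^1, which is the saturation effect, while the other three filters track ξ^2. This illustrates the mechanism only; it does not reproduce the paper's rates.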

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The degree of ill-posedness quantified by the link condition could guide instrument selection in applied nonparametric problems.
Similar rate analyses might apply to other two-stage kernel estimators that involve an initial inversion step.
Estimating the link condition from data could provide a practical diagnostic for the statistical difficulty of a given instrumental variable problem.

Load-bearing premise

The covariance operators satisfy standard eigenvalue decay and source conditions.

What would settle it

A simulation or dataset where the KIV estimator converges in L2 faster than the derived minimax lower bound under the paper's eigenvalue and source assumptions would disprove optimality.
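In practice such a check would start from an empirical convergence exponent. The sketch below assumes the L2 error has already been measured at several sample sizes and fits the slope on a log-log scale. The function name and the synthetic numbers are hypothetical. A fitted slope from one data-generating process is only suggestive, because a minimax bound concerns the worst case over the whole class.

```python
import numpy as np

def empirical_rate(sample_sizes, l2_errors):
    """Return r such that error is approximately C * n**(-r), by least squares in log-log scale."""
    slope, _ = np.polyfit(np.log(sample_sizes), np.log(l2_errors), 1)
    return -slope

# Synthetic errors (not real results) decaying exactly like n^(-0.3).
ns = np.array([250.0, 500.0, 1000.0, 2000.0, 4000.0])
errs = 3.0 * ns ** -0.3
print(empirical_rate(ns, errs))
```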

Original abstract

We study the kernel instrumental variable (KIV) algorithm, a kernel-based two-stage least-squares method for nonparametric instrumental variable regression. We provide a convergence analysis covering both identified and non-identified regimes: when the structural function is not identified, we show that the KIV estimator converges to the minimum-norm IV solution in the reproducing kernel Hilbert space associated with the kernel. Crucially, we establish convergence in the strong $L_2$ norm, rather than only in a pseudo-norm. We quantify statistical difficulty through a link condition that compares the covariance structure of the endogenous regressor with that induced by the instrument, yielding an interpretable measure of ill-posedness. Under standard eigenvalue-decay and source assumptions, we derive strong $L_2$ learning rates for KIV and prove that they are minimax-optimal over fixed smoothness classes. Finally, we replace the stage-1 Tikhonov step by general spectral regularization, thereby avoiding saturation and improving rates for smoother first-stage targets. The matching lower bound shows that instrumental regression induces an unavoidable slowdown relative to ordinary kernel ridge regression.
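The two-stage structure described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses Tikhonov (ridge) regularization in both stages, Gaussian kernels with a shared lengthscale, and separate samples for stage 1 (`x1`, `z1`) and stage 2 (`z2`, `y2`). All names, the jitter term, and the hyperparameters are hypothetical. The closed form follows from minimizing (1/n) Σ_j (y_j - ⟨h, F̂(z_j)⟩)² + λ‖h‖² over h in the span of the stage-1 features, where F̂ is the ridge estimate of the conditional mean embedding.

```python
import numpy as np

def rbf_kernel(a, b, lengthscale=1.0):
    """Gaussian kernel matrix between the rows of a and b."""
    a = np.asarray(a, dtype=float).reshape(len(a), -1)
    b = np.asarray(b, dtype=float).reshape(len(b), -1)
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * lengthscale ** 2))

def kiv_fit(x1, z1, z2, y2, xi, lam, lengthscale=1.0, jitter=1e-8):
    """Return coefficients alpha with h(x) = sum_i alpha_i k(x1_i, x)."""
    m, n = len(x1), len(y2)
    K_xx = rbf_kernel(x1, x1, lengthscale)    # m x m
    K_zz = rbf_kernel(z1, z1, lengthscale)    # m x m
    K_zz2 = rbf_kernel(z1, z2, lengthscale)   # m x n
    # Stage 1: ridge weights of the conditional mean embedding at the stage-2 instruments.
    B = np.linalg.solve(K_zz + m * xi * np.eye(m), K_zz2)
    W = K_xx @ B                              # W[:, j] = <phi(x1_i), F_hat(z2_j)> over i
    # Stage 2: ridge regression of y2 on the embedded instruments.
    A = W @ W.T + n * lam * K_xx + jitter * np.eye(m)  # jitter is a numerical safeguard only
    return np.linalg.solve(A, W @ np.asarray(y2, dtype=float))

def kiv_predict(alpha, x1, x_new, lengthscale=1.0):
    return rbf_kernel(x_new, x1, lengthscale) @ alpha
```

Replacing the stage-1 solve with another spectral filter applied to the eigendecomposition of `K_zz / m` would give the general spectral variant the abstract mentions; that step is omitted here.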

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

KIV gets strong L2 minimax rates and a link condition for ill-posedness, but the lower bound match needs proof inspection.

The paper shows that kernel instrumental variable regression achieves minimax optimal convergence rates in the strong L2 norm over fixed smoothness classes. It introduces a link condition comparing the covariance of the endogenous regressor to the instrument-induced covariance as a measure of ill-posedness, derives rates under standard eigenvalue decay and source conditions, and proves a matching lower bound that quantifies the slowdown versus ordinary kernel ridge regression. It also handles the non-identified case by converging to the minimum-norm solution in the RKHS and swaps the first-stage Tikhonov step for general spectral regularization to avoid saturation and gain better rates on smoother targets.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes the kernel instrumental variable (KIV) estimator for nonparametric IV regression. It establishes strong L2-norm convergence in both identified and non-identified regimes, introduces a link condition comparing the covariance operator of the endogenous regressor to that induced by the instrument as a measure of ill-posedness, derives explicit learning rates under standard eigenvalue-decay and source conditions, proves these rates are minimax-optimal over fixed smoothness classes via a matching lower bound, and shows that general spectral regularization of the first stage avoids saturation and yields improved rates relative to Tikhonov regularization.

Significance. If the upper and lower bounds match exactly, the work supplies the first minimax-optimal theory for KIV in the strong L2 norm together with an interpretable, operator-theoretic measure of ill-posedness. The explicit comparison to ordinary kernel ridge regression rates and the saturation-avoiding extension constitute concrete advances for the nonparametric IV literature.

major comments (2)

[§4.3 and §5.2] §4.3, Theorem 4.4 (upper bound) and Theorem 5.2 (lower bound): the link condition (Definition 3.2) must be shown to produce identical exponents in both bounds; the manuscript should verify that the lower-bound adversary satisfies the same source condition and spectral link as the upper-bound analysis, or else a logarithmic gap may remain.
[§3.1] §3.1, Assumption 3.3 (eigenvalue decay): the rate expressions depend on the interplay between the decay parameter α and the link exponent β; the paper should state the precise range of (α,β) for which the claimed minimax rate holds without additional assumptions on the joint spectrum.

minor comments (2)

[Introduction] Notation for the minimum-norm IV solution in the non-identified case should be introduced earlier and used consistently when stating the strong-L2 convergence result.
[§6] The statement that general spectral regularization 'avoids saturation' would benefit from an explicit comparison table of attainable rates for Tikhonov versus the new filter under the same source condition.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading, the positive assessment of the contributions, and the recommendation for minor revision. We address each major comment below.

Referee: [§4.3 and §5.2] §4.3, Theorem 4.4 (upper bound) and Theorem 5.2 (lower bound): the link condition (Definition 3.2) must be shown to produce identical exponents in both bounds; the manuscript should verify that the lower-bound adversary satisfies the same source condition and spectral link as the upper-bound analysis, or else a logarithmic gap may remain.

Authors: We thank the referee for this observation. The lower-bound construction in the proof of Theorem 5.2 explicitly selects the adversary function to obey the same source condition (with identical parameter) and the same spectral link condition (with identical exponent β) used in the upper-bound analysis of Theorem 4.4. As a result, the exponents match exactly and no logarithmic gap arises. We will add a short clarifying remark after the statement of Theorem 5.2 to make this verification explicit. Revision: yes.

Referee: [§3.1] §3.1, Assumption 3.3 (eigenvalue decay): the rate expressions depend on the interplay between the decay parameter α and the link exponent β; the paper should state the precise range of (α,β) for which the claimed minimax rate holds without additional assumptions on the joint spectrum.

Authors: We agree that an explicit range improves readability. Under the stated eigenvalue-decay and link conditions, the minimax rates hold for every α > 0 and β ≥ 0; the proofs rely solely on the separate decay rates of the two covariance operators and the link condition, without further joint-spectrum assumptions. We will insert a precise statement of this parameter range immediately after Assumption 3.3. Revision: yes.

Circularity Check

0 steps flagged

No circularity: rates derived from standard assumptions with independent lower bound

The provided abstract and description present a standard nonparametric analysis deriving upper bounds on strong L2 error for KIV under eigenvalue-decay, source, and link conditions, then establishing matching minimax lower bounds over fixed smoothness classes. No equations reduce the claimed rates to fitted parameters from the same data, no self-definitional loops appear, and no load-bearing step collapses to a self-citation whose content is unverified. The lower-bound construction is described as respecting the same source and link conditions, yielding an independent slowdown result relative to ordinary KRR. This is the expected non-circular outcome for a pure theoretical derivation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 3 axioms · 0 invented entities

The analysis rests on standard RKHS and operator-theoretic assumptions plus the link condition and source conditions; no new entities are introduced.

axioms (3)

domain assumption The structural function lies in a reproducing kernel Hilbert space with known kernel.
Invoked to define the minimum-norm IV solution and the estimator.
domain assumption Eigenvalue decay and source conditions hold for the relevant covariance operators.
Used to obtain explicit learning rates and minimax lower bounds.
domain assumption The link condition relating the endogenous regressor and instrument covariances is satisfied.
Central to quantifying ill-posedness and deriving rates.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: "We quantify statistical difficulty through a link condition that compares the covariance structure of the endogenous regressor with that induced by the instrument... Under standard eigenvalue-decay and source assumptions, we derive strong L2 learning rates for KIV and prove that they are minimax-optimal"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear
Paper passage: "LINK($\gamma_0,\gamma_1$) ... $P_F C_X^{\gamma_0} P_F \preceq C_F \preceq C_X^{\gamma_1}$"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Doubly Robust Proxy Causal Learning with Neural Mean Embeddings
cs.LG 2026-05 unverdicted novelty 6.0

A neural doubly robust proxy causal learning framework using mean embeddings for treatment bridges provides consistent estimators for causal dose-response functions under unobserved confounding for continuous and stru...

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 1 Pith paper

[1]

From the Tikhonov filter function $g_\xi(x) = (x+\xi)^{-1}$, we obtain the ridge regression algorithm introduced in Eq

Ridge regression. From the Tikhonov filter function $g_\xi(x) = (x+\xi)^{-1}$, we obtain the ridge regression algorithm introduced in Eq. (5). In this case, we have $E = \rho = \omega_\rho = 1$

work page
[2]

Gradient Descent. From the Landweber iteration filter function given by $g_k(x) \doteq \tau\sum_{i=0}^{k-1}(1-\tau x)^i$ for $k \doteq 1/\xi$, $k \in \mathbb{N}$, we obtain the gradient descent scheme with constant step size $\tau > 0$, which corresponds to the population gradient iteration given by $F_{k+1} \doteq F_k - \frac{\tau}{2}\nabla_F\big(\mathbb{E}_{X,Z}\|\phi_X(X) - F(Z)\|$....

work page 2019
[3]

The truncation filter function $g_\xi(x) = x^{-1}\mathbb{1}[x \ge \xi]$ yields kernel principal component regression, corresponding to a hard thresholding of eigenvalues at a truncation level $\xi$

Kernel principal component regression. The truncation filter function $g_\xi(x) = x^{-1}\mathbb{1}[x \ge \xi]$ yields kernel principal component regression, corresponding to a hard thresholding of eigenvalues at a truncation level $\xi$. In this case we have $E = \omega_\rho = 1$ for arbitrary qualification $\rho$

work page
[4]

Mixture between Landweber iteration and Tikhonov regularization

Iterated Tikhonov. Mixture between Landweber iteration and Tikhonov regularization. Unlike Tikhonov regularization, which has finite qualification and cannot exploit the regularity of the solution beyond a certain regularity level, iterated Tikhonov overcomes this problem by means of the following regularization: $g_{\xi,\nu}(x) = \frac{(x+\xi)^\nu - \xi^\nu}{x(x+\xi)^\nu}$ with ...

work page
[5]

If we fix the total distance in the Landweber iteration to $\xi^{-1} := \tau k$ and take $\tau \to 0^+$, we obtain the gradient flow filter function $g_\xi(x) = (1 - e^{-x/\xi})x^{-1}$

Gradient Flow. If we fix the total distance in the Landweber iteration to $\xi^{-1} := \tau k$ and take $\tau \to 0^+$, we obtain the gradient flow filter function $g_\xi(x) = (1 - e^{-x/\xi})x^{-1}$. In this case we have $E = 1$ and $\omega_\rho = (\rho/e)^\rho$ for arbitrary qualification $\rho$. A.4 Interpolation spaces. The interpolation spaces $[H_Z]^\beta$, $[H_X]^\beta$ and $[G]^\beta$ introduced previously corr...

work page 2012
[6]

standard

and using ˆFξ( ⋅) = ˆCX/divides.alt0 Z,ξ φ Z ( ⋅) , we obtain, F and Φ∗ ˆF in closed form: Φ∗ ˆF = 1 m Φ∗ ˜X gξ /par⟩nl⟩ft.alt3K ˜Z ˜Z m /par⟩nright.alt3K ˜ZZ F = 1 m2 KZ ˜Zgξ /par⟩nl⟩ft.alt3K ˜Z ˜Z m /par⟩nright.alt3K ˜X ˜X gξ /par⟩nl⟩ft.alt3K ˜Z ˜Z m /par⟩nright.alt3K ˜ZZ , where, K ˜ZZ = Φ ˜Z Φ∗ Z ∈Rm×n, [K ˜ZZ]ij = kZ ( ˜zi, z j) i ∈[m], j ∈[n] K ˜X ˜...

work page 2020
[7]

To verify this, we need to show that $H_F$ is indeed a RKHS

can be written as $\bar r_\lambda = \arg\min_{r \in H_F} \frac{1}{n}\sum_{i=1}^{n}(y_i - r(z_i))^2 + \lambda\|r\|^2_{H_F}$, (26) which is now a kernel ridge regression objective in standard form. To verify this, we need to show that $H_F$ is indeed a RKHS. Fortunately, this was studied by Steinwart and Christmann (2008) (see also Blanchard and Mücke (2018) where ...

work page 2008
[8]

Using (MOM) yields $\int_{\mathbb{R}}(y - \langle h_*, F_*(z)\rangle_{H_X})^m P(dy \mid z) \le \frac{1}{2}m!\,\sigma^2 L^{m-2}$. We therefore have $\mathbb{E}\|\theta(Z,Y)\|^m_{H_X} \le \frac{1}{2}m!\,\big(\sigma A_Z\|\hat F_\xi - F_*\|_{\alpha_Z}\big)^2\big(L A_Z\|\hat F_\xi - F_*\|_{\alpha_Z}\big)^{m-2}$. Using Theorem 16, we have wit...

work page
[9]

We start by verifying λ −1 n n−1 log λ −1 n = O( 1)

are satisﬁed. We start by verifying λ −1 n n−1 log λ −1 n = O( 1) . We have log λ −1 n nλ n = γ0 βX −1 + γ0 + γ 0 γ 1 pX log( n) n n γ 0 β X −1+ γ 0+ γ 0 γ 1 pX . As γ0/slash.l⟩ft(βX −1 + γ0 + γ 0 γ 1 pX ) < 1, we have log( λ −1 n )/slash.l⟩ft(nλ n) →0, as n →∞. Therefore, the ﬁrst constraint Eq. ( 18) is satisﬁed. We next check λ −1 n r1( 0, m ) ⪅1 ⇐ ⇒n ...

work page
[10]

We start by verifying λ −1 n n−1 log λ −1 n = O( 1)

are satisﬁed. We start by verifying λ −1 n n−1 log λ −1 n = O( 1) . We have log λ −1 n nλ n = a ⋅ βZ βZ + pZ ⋅ γ0 βX −1 + 2γ0 + ( 1 −γ) cF log( n) n na⋅ β Z β Z + pZ ⋅ γ 0 β X −1+ 2γ 0+( 1−γ ) cF . Note that a ⋅ β Z β Z + pZ ⋅ γ 0 β X −1+ 2γ 0+ ( 1−γ ) cF < 1 ⇐ ⇒ a < β Z + pZ β Z β X −1+ 2γ 0+ ( 1−γ ) cF γ 0 , which is satisﬁed under Eq. ( 36) since γ0 ≤β...

work page 2011
[11]

(2): $r_0 = T h_0$

and the assumption that $r_0 \in R(T)$, $h_0$ is identified as the unique solution to the integral equation given in Eq. (2): $r_0 = T h_0$. We define $\tilde F$ as the set of models (NPIV) with $(\pi_{X,Y,Z}, h_0)$ such that $r_0 \in R(T)$ and Assumption 11 hold. We saw in Section E.1.2 that when $T$ is known, (NPIV) can be reformulated as the NPIR model $Y = T h_0(Z) + \xi$, $\xi = h_0(\,$...

work page 2011
[12]

For $h, h' \in H_X$, we therefore have $\mathrm{KL}(P_h, P_{h'}) = \int_{E_Z}\mathrm{KL}(P_h(\cdot \mid z), P_{h'}(\cdot \mid z))\,d\pi_Z(z) = \frac{1}{2}\int_{E_Z}\frac{\langle h - h', F_*(z)\rangle^2_{H_X}}{\sigma^2(z)}\,d\pi_Z(z) \le \frac{1}{2\sigma_0^2}\int_{E_Z}\langle h - h', F_*(z)\rangle^2_{H_X}\,d\pi_Z(z) = \frac{1}{2\sigma_0^2}\|C_F^{1/2}(h - h')\|^2_{H_X}$. By Assumptions (LINK) and (EVDX), we ...

work page 2020
[13]

For $\pi_Z$-almost all $z \in E_Z$, $k(z,z) \le \kappa$

work page
[14]

There exist $\sigma, L > 0$ such that for all $m \ge 2$, $\mathbb{E}[(Y - f_*(Z))^m \mid Z] \le \frac{1}{2}m!\,\sigma^2 L^{m-2}$, $\pi_Z$-almost surely

work page
[15]

There exist $p \in (0,1]$ and a constant $D > 0$ such that $N_\Sigma(\lambda) \le D\lambda^{-p}$

work page
[16]

There exists $\beta \in [1,2]$ such that $\|\Sigma^{-\frac{\beta-1}{2}}f_*\|_H \le B$. Then for the abbreviations $g_\lambda \doteq \log\left(2e\,N_\Sigma(\lambda)\,\frac{\|\Sigma\|_{H\to H} + \lambda}{\|\Sigma\|_{H\to H}}\right)$, $A_{\lambda,\tau} \doteq 8\tau g_\lambda\kappa^2\lambda^{-1}$, (42) and $0 \le \theta \le 1$, $\tau \ge 1$, $0 < \lambda \le 1$, and $n \ge A_{\lambda,\tau}$, the following bound is satisfied with $P^n$-probability not less tha...

work page 2020
[17]

If there is a constant $c < +\infty$ such that $\|Ax\|_H \le c\|Bx\|_H$ for all $x \in H$, then $R(A) \subseteq R(B)$ and $\|B^\dagger A\|_{H\to H} \le c$

work page
[18]

For details on the pseudo-inverse $B^\dagger$, see Engl et al

If $R(A) \subseteq R(B)$, then $B^\dagger A$ is a well-defined bounded operator on $H$ and $\|Ax\|_H \le c\|Bx\|_H$ for all $x \in H$ with $c = \|B^\dagger A\|_{H\to H}$. For details on the pseudo-inverse $B^\dagger$, see Engl et al. (2000). Proof. 1. Consider the operator $S_0$ defined on $R(B)$ by $S_0(Bx) = Ax$. The operator $S_0$ is well-defined...

work page 2000
[19]

Under the assumption that $R(A) \subseteq R(B)$, $Q \doteq B^\dagger A$ is well-defined, bounded and such that $A = BQ$ (Theorem A.1 Klebanov et al., 2021). Therefore $A = Q^*B$, which implies that for all $x \in H$, $\|Ax\|_H \le \|Q^*\|_H\|Bx\|_H = \|Q\|_H\|Bx\|_H$. Lemma 15. L...

work page 2021
[20]

Theorem 16 (Theorem 26 Fischer and Steinwart (2020) - Bernstein's Inequality)

This implies that $f \in N(\hat C_n^{1/2}) \subseteq N(\hat C_n)$, which concludes the proof. Theorem 16 (Theorem 26 Fischer and Steinwart (2020) - Bernstein's Inequality). Let $(\Omega, \mathcal{B}, P)$ be a probability space and $\xi : \Omega \to H$ be a random variable with $\mathbb{E}_P\|\xi\|^m_H \le \frac{1}{2}m!\,\tilde\sigma^2\tilde L^{m-2}$ for all $m \ge 2$. Then, for $\tau \ge 1$ and $n \ge 1$, the following con...

work page 2020
