Estimating Continuous Treatment Effects with Two-Stage Kernel Ridge Regression

Kaizheng Wang; Seok-Jin Kim

arXiv: 2604.13410 · v1 · submitted 2026-04-15 · stat.ME · cs.LG · stat.ML

Pith reviewed 2026-05-10 13:20 UTC · model grok-4.3

classification stat.ME · cs.LG · stat.ML

keywords continuous treatment effects · kernel ridge regression · confounding adjustment · pseudo-outcomes · adaptivity · model selection · nonparametric causal inference · dose-response estimation

The pith

A two-stage kernel ridge regression estimates the continuous treatment effect function by first modeling the full response surface and then correcting for confounding with pseudo-outcomes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an estimator for the average effect of a continuous treatment on an outcome when treatment assignment depends on covariates and therefore creates selection bias. Direct regression of outcome on treatment alone fails under confounding, but the authors show that first fitting a joint model of outcome given treatment and covariates allows construction of adjusted pseudo-outcomes whose regression on treatment alone recovers the desired marginal effect. Because averaging over covariates usually produces a simpler target function than the original response surface, the second-stage fit can be more accurate and can converge at a faster rate. The method includes a fully automatic procedure that chooses regularization levels to adapt to both the unknown degree of overlap and the smoothness of the underlying functions without prior knowledge of either.

Core claim

We propose a two-stage kernel ridge regression method. In the first stage, we learn a model for the response as a function of both treatment and covariates; in the second stage, we use this model to construct pseudo-outcomes that correct for distribution shift, and then fit a second model to estimate the treatment effect. Although the response varies with both treatment and covariates, the induced effect function obtained by averaging over covariates is typically much simpler, and our estimator adapts to this structure. Furthermore, we introduce a fully data-driven model selection procedure that achieves provable adaptivity to both the unknown degree of overlap and the regularity (eigenvalue decay) of the underlying kernel.

What carries the argument

Two-stage kernel ridge regression that learns a joint conditional response model in stage one and then regresses pseudo-outcomes on treatment alone in stage two to recover the marginal effect function while adapting to its relative simplicity.
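
The two stages can be sketched in a few lines. This is a minimal illustration and not the authors' implementation: the Gaussian kernel, the shared bandwidth, the function names, and the choice to pass the pseudo-outcome treatment values in as `t_samp` are all assumptions made here for concreteness. Stage one fits a joint model on (covariates, treatment); pseudo-outcomes are that model's predictions averaged over all observed covariates at a fixed treatment value; stage two regresses the pseudo-outcomes on treatment alone.

```python
import numpy as np

def rbf_kernel(U, V, bandwidth=1.0):
    """Gaussian kernel matrix between the rows of U and the rows of V."""
    sq = (np.sum(U**2, axis=1)[:, None]
          + np.sum(V**2, axis=1)[None, :]
          - 2.0 * U @ V.T)
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def krr_fit(Z, y, lam, bandwidth=1.0):
    """Dual KRR coefficients: solve (K + n*lam*I) alpha = y."""
    n = Z.shape[0]
    K = rbf_kernel(Z, Z, bandwidth)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def krr_predict(Z_train, alpha, Z_new, bandwidth=1.0):
    """Evaluate the fitted KRR function at the rows of Z_new."""
    return rbf_kernel(Z_new, Z_train, bandwidth) @ alpha

def two_stage_krr(X, T, Y, t_samp, t_eval, lam1, lam2, bandwidth=1.0):
    """X: (n, d) covariates, T: (n,) treatments, Y: (n,) outcomes.
    t_samp: (m,) treatment values at which pseudo-outcomes are built.
    t_eval: (k,) treatment values at which the effect estimate is returned."""
    n = X.shape[0]
    # Stage 1: joint response model f_hat(x, a) fitted on (X, T).
    Z = np.column_stack([X, T])
    alpha1 = krr_fit(Z, Y, lam1, bandwidth)
    # Pseudo-outcomes: average f_hat(x_i, a) over all observed x_i, for each a.
    pseudo = np.empty(len(t_samp))
    for j, a in enumerate(t_samp):
        Z_a = np.column_stack([X, np.full(n, a)])
        pseudo[j] = krr_predict(Z, alpha1, Z_a, bandwidth).mean()
    # Stage 2: regress pseudo-outcomes on treatment alone.
    A = np.asarray(t_samp, dtype=float).reshape(-1, 1)
    alpha2 = krr_fit(A, pseudo, lam2, bandwidth)
    t_eval = np.asarray(t_eval, dtype=float).reshape(-1, 1)
    return krr_predict(A, alpha2, t_eval, bandwidth), pseudo
```

The two regularization levels `lam1` and `lam2` are kept separate because the second-stage target is a function of treatment only and may call for a different amount of smoothing than the joint surface.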

If this is right

Consistent estimation of the continuous dose-response curve becomes possible even when treatment assignment depends strongly on covariates.
Faster convergence rates are achieved whenever the averaged effect function has lower complexity than the joint response surface.
Fully automatic regularization selection works without knowing the degree of overlap or the eigenvalue decay rate in advance.
The procedure applies in general reproducing kernel Hilbert spaces, supporting flexible nonparametric modeling of both stages.
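
To make "automatic regularization selection" concrete, here is a generic hold-out sketch for choosing a ridge level in a one-dimensional kernel ridge regression. It is a stand-in only: this page does not describe the paper's selection rule, so the hold-out split, the Gaussian kernel, the candidate grid, and the function name `select_lambda_holdout` are all assumptions, and nothing here carries the paper's adaptivity guarantee.

```python
import numpy as np

def gauss_kernel(u, v, bandwidth=1.0):
    """Gaussian kernel matrix between two 1-D arrays of treatment values."""
    return np.exp(-(u[:, None] - v[None, :])**2 / (2.0 * bandwidth**2))

def select_lambda_holdout(t, y, lambdas, frac_train=0.5, seed=0):
    """Pick the ridge level with the smallest hold-out squared error.

    t, y: 1-D arrays (treatment values and responses or pseudo-outcomes).
    lambdas: candidate regularization levels (hypothetical grid).
    Returns the selected level and the validation error for each candidate."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(t))
    n_tr = int(frac_train * len(t))
    tr, va = idx[:n_tr], idx[n_tr:]
    K_tr = gauss_kernel(t[tr], t[tr])
    K_va = gauss_kernel(t[va], t[tr])
    errors = []
    for lam in lambdas:
        # Fit on the training half, score on the held-out half.
        alpha = np.linalg.solve(K_tr + n_tr * lam * np.eye(n_tr), y[tr])
        errors.append(np.mean((K_va @ alpha - y[va])**2))
    errors = np.array(errors)
    return lambdas[int(np.argmin(errors))], errors
```

A typical call would pass a geometric grid such as `np.logspace(-4, 0, 9)`; the grid itself is a tuning choice and is not taken from the paper.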

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same two-stage logic could be paired with non-kernel first-stage estimators such as neural nets when covariates are high-dimensional.
The approach suggests that explicitly marginalizing over covariates via an intermediate model is often more efficient than attempting direct adjustment for continuous treatments.
Empirical tests on observational datasets with measured overlap variation would directly check whether the claimed adaptivity holds in practice.

Load-bearing premise

An accurate enough first-stage model of the response given treatment and covariates can be learned so that the derived pseudo-outcomes remove the selection bias induced by confounding.

What would settle it

The claim would be undercut if, in a simulation with a known true effect function and controlled overlap, the two-stage estimator produced higher mean squared error than a single-stage direct regression of outcome on treatment when overlap is moderate and the marginal effect is no simpler than the full surface.
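
A simulation of this kind could be set up roughly as follows. The data-generating process, the Gaussian kernel, the fixed regularization level, the use of the observed treatments as pseudo-outcome locations, and the `confounding` knob are all hypothetical choices made for illustration; none come from the paper. In this design the true effect function is sin(t) because the covariate has mean zero, and the function returns the mean squared error of each estimator on a grid without asserting which one wins.

```python
import numpy as np

def rbf(U, V, bw=1.0):
    """Gaussian kernel matrix between the rows of U and the rows of V."""
    d2 = ((U[:, None, :] - V[None, :, :])**2).sum(axis=2)
    return np.exp(-d2 / (2.0 * bw**2))

def fit(Z, y, lam):
    """Dual KRR coefficients."""
    n = Z.shape[0]
    return np.linalg.solve(rbf(Z, Z) + n * lam * np.eye(n), y)

def predict(Z_train, alpha, Z_new):
    return rbf(Z_new, Z_train) @ alpha

def run_once(n=200, confounding=1.0, lam=1e-3, seed=0):
    """One replication; returns (MSE of direct regression, MSE of two-stage)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)                        # covariate, mean zero
    t = confounding * x + rng.normal(size=n)      # treatment depends on x
    y = np.sin(t) + x + 0.1 * rng.normal(size=n)  # outcome depends on both
    grid = np.linspace(-1.5, 1.5, 31)
    truth = np.sin(grid)                          # average of sin(t) + x over x

    # Single-stage: regress the outcome on treatment alone.
    direct = predict(t[:, None], fit(t[:, None], y, lam), grid[:, None])

    # Two-stage: joint fit, pseudo-outcomes averaged over x, then refit on t.
    Z = np.column_stack([x, t])
    alpha1 = fit(Z, y, lam)
    pseudo = np.array([
        predict(Z, alpha1, np.column_stack([x, np.full(n, a)])).mean()
        for a in t
    ])
    two_stage = predict(t[:, None], fit(t[:, None], pseudo, lam), grid[:, None])

    return (float(np.mean((direct - truth)**2)),
            float(np.mean((two_stage - truth)**2)))
```

Setting `confounding=0.0` removes the dependence of treatment on the covariate, which gives a simple way to vary overlap across runs.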

Original abstract

We study the problem of estimating the effect function for a continuous treatment, which maps each treatment value to a population-averaged outcome. A central challenge in this setting is confounding: treatment assignment often depends on covariates, creating selection bias that makes direct regression of the response on treatment unreliable. To address this issue, we propose a two-stage kernel ridge regression method. In the first stage, we learn a model for the response as a function of both treatment and covariates; in the second stage, we use this model to construct pseudo-outcomes that correct for distribution shift, and then fit a second model to estimate the treatment effect. Although the response varies with both treatment and covariates, the induced effect function obtained by averaging over covariates is typically much simpler, and our estimator adapts to this structure. Furthermore, we introduce a fully data-driven model selection procedure that achieves provable adaptivity to both the unknown degree of overlap and the regularity (eigenvalue decay) of the underlying kernel.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Two-stage KRR with data-driven adaptivity to overlap and smoothness for continuous treatment effects is a clean extension, but first-stage error under weak overlap needs explicit control in the proofs.

The main point is a two-stage kernel ridge regression for the marginal effect function under continuous treatment and confounding. The first stage fits the response on (T, X), then pseudo-outcomes feed a second KRR on T alone, with a fully data-driven selector that claims to adapt to both the unknown overlap and the kernel eigenvalue decay without manual tuning. The induced marginal effect after averaging over X is treated as simpler, which the method exploits.

Referee Report

2 major / 2 minor

Summary. The paper introduces a two-stage kernel ridge regression (KRR) method for estimating the continuous treatment effect function, which is the population-averaged outcome as a function of treatment. The first stage involves fitting a KRR model to the response using both treatment and covariates. This model is then used to generate pseudo-outcomes that adjust for confounding due to covariate-dependent treatment assignment. In the second stage, a KRR is applied to these pseudo-outcomes regressed on treatment alone to estimate the effect function. The authors claim that the estimator adapts to the simpler structure of the induced effect function obtained by averaging over covariates. Additionally, they propose a fully data-driven model selection procedure that achieves provable adaptivity to the unknown degree of overlap and the regularity of the kernel as measured by its eigenvalue decay.

Significance. This research tackles a challenging problem in causal inference involving continuous treatments and confounding. The proposed method's ability to adapt to unknown overlap and kernel regularity through data-driven selection is a key strength, potentially leading to more robust and efficient estimation in practice. If the theoretical guarantees are established rigorously, it could advance the field by providing a flexible nonparametric approach that does not require prior knowledge of smoothness or overlap parameters. The two-stage structure exploits the fact that the marginal effect function is typically less complex than the full conditional expectation, which is a clever insight.

major comments (2)

[§3.2, Theorem 3.1] The adaptivity result for the second-stage estimator assumes that the first-stage pseudo-outcomes have error rates that are negligible compared to the second-stage rates. However, in regions of poor overlap, the first-stage KRR may have slower convergence, and it is not clear from the proof how the averaging over covariates mitigates this without additional assumptions on the conditional density of T given X or explicit bounds on the propagation of first-stage variance into the second-stage objective.
[§4.2] Data-driven selection procedure. The cross-validation criterion for choosing regularization parameters in both stages is claimed to achieve oracle rates simultaneously for overlap and eigenvalue decay, but the analysis does not appear to include a term controlling the contribution of first-stage estimation error to the pseudo-outcome variability; this could invalidate the adaptivity when overlap is weak and unknown a priori.

minor comments (2)

[§2] The notation for the effect function τ(t) and the pseudo-outcomes could be introduced more explicitly in §2 to distinguish them from standard regression functions.
[Table 1] The simulation results for varying overlap levels would benefit from reporting standard errors across replications to assess variability.
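
For the Table 1 comment, the quantity being asked for is the standard error of the mean across replications: the sample standard deviation divided by the square root of the number of replications. A minimal sketch follows; the function name and the input layout (one error value per replication) are assumptions, and the numbers in the usage note are made up.

```python
import numpy as np

def summarize_replications(mse_by_rep):
    """Mean and standard error of a per-replication error metric.

    mse_by_rep: 1-D sequence with one value per simulation replication."""
    mse = np.asarray(mse_by_rep, dtype=float)
    mean = mse.mean()
    # Sample standard deviation (ddof=1) over sqrt of the replication count.
    se = mse.std(ddof=1) / np.sqrt(mse.size)
    return mean, se
```

For example, `summarize_replications([1.0, 2.0, 3.0])` gives a mean of 2.0 and a standard error of 1/sqrt(3), about 0.577.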

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive feedback on our manuscript. We address the major comments point by point below, outlining revisions to strengthen the theoretical analysis where needed.

Referee: [§3.2, Theorem 3.1] The adaptivity result for the second-stage estimator assumes that the first-stage pseudo-outcomes have error rates that are negligible compared to the second-stage rates. However, in regions of poor overlap, the first-stage KRR may have slower convergence, and it is not clear from the proof how the averaging over covariates mitigates this without additional assumptions on the conditional density of T given X or explicit bounds on the propagation of first-stage variance into the second-stage objective.

Authors: We agree that the propagation of first-stage errors merits more explicit treatment. Our proof of Theorem 3.1 relies on the marginal nature of the target effect function and uses the overlap condition together with eigenvalue decay to show that first-stage contributions are of lower order after averaging over covariates. To address the concern, we will add a dedicated lemma in the appendix that derives explicit high-probability bounds on the first-stage error term in the second-stage objective, explicitly incorporating dependence on the conditional density of T given X and the overlap parameter. This will make the negligible-error assumption fully rigorous and clarify the mitigation mechanism.
Revision: yes

Referee: [§4.2] The cross-validation criterion for choosing regularization parameters in both stages is claimed to achieve oracle rates simultaneously for overlap and eigenvalue decay, but the analysis does not appear to include a term controlling the contribution of first-stage estimation error to the pseudo-outcome variability; this could invalidate the adaptivity when overlap is weak and unknown a priori.

Authors: This observation correctly identifies a gap in the current CV analysis. The existing argument controls first-stage error under a uniform bound but does not explicitly fold the resulting pseudo-outcome variability into the concentration inequalities for the data-driven selector. We will revise Section 4.2 and the associated theorem to incorporate an additional error term that accounts for first-stage estimation in the pseudo-outcomes. We will then show that the cross-validation procedure still attains the claimed oracle rates for both overlap and eigenvalue decay, provided the overlap satisfies the paper's standing assumptions.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper presents a two-stage KRR procedure: first-stage regression of response on (T,X) to form pseudo-outcomes that debias for confounding, followed by second-stage regression on T alone to recover the marginal effect function. The claimed data-driven model selection for adaptivity to overlap and eigenvalue decay follows from standard kernel ridge analysis and cross-validation arguments without reducing any claimed rate or estimator to a fitted input by definition. No self-citations are invoked as load-bearing uniqueness theorems, no ansatz is smuggled, and no known empirical pattern is merely renamed. The derivation remains self-contained against external kernel theory benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The abstract implies standard causal-inference assumptions (unconfoundedness given covariates) and kernel-method regularity conditions that are not stated explicitly.

axioms (1)

Domain assumption (unconfoundedness): treatment assignment is independent of potential outcomes given the observed covariates.
Required for the pseudo-outcome construction to remove selection bias as described in the abstract.
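
The role of this assumption can be written out as a short identification argument. The potential-outcome notation Y(t), the consistency condition Y = Y(T), and the reading of f⋆(x, t) as the conditional mean of Y given X = x and T = t are standard conventions assumed here, not quotations from the paper; overlap is also assumed, so that the conditioning event is well defined at every (x, t) of interest.

```latex
\begin{align*}
\tau(t) &= \mathbb{E}[Y(t)] \\
        &= \mathbb{E}_X\big[\mathbb{E}[Y(t) \mid X]\big]
           && \text{(tower property)} \\
        &= \mathbb{E}_X\big[\mathbb{E}[Y(t) \mid T = t, X]\big]
           && \text{(unconfoundedness)} \\
        &= \mathbb{E}_X\big[\mathbb{E}[Y \mid T = t, X]\big]
           && \text{(consistency, } Y = Y(T)\text{)} \\
        &= \mathbb{E}_{x \sim P_X}\big[f^\star(x, t)\big].
\end{align*}
```

The last line is the population version of the pseudo-outcome: replacing f⋆ with a first-stage estimate and the expectation over covariates with a sample average gives the quantity that the second stage regresses on treatment.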

pith-pipeline@v0.9.0 · 5467 in / 1237 out tokens · 65678 ms · 2026-05-10T13:20:30.611368+00:00 · methodology

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages

[1] Sample $\{a'_j\}_{j=1}^n$ from $P_{\mathrm{samp}}$ and create pseudo-outcomes $\{m_j := m(a'_j)\}_{j=1}^n$. Explicitly, the pseudo-outcome for $a'_j$ is:

$$m(a'_j) := \frac{1}{n}\sum_{i=1}^{n} \psi(x_i, a'_j)^\top \hat\theta = \hat{\bar\psi}(a'_j)^\top \hat\theta$$

[2] Define the design operator of $\{\phi(a'_j)\}_{j=1}^n$ as $A$. Obtain the final estimator

$$\hat\eta_\lambda = (A^\top A + n\lambda I)^{-1}\sum_{j=1}^{n}\phi(a'_j)\,m(a'_j) = (A^\top A + n\lambda I)^{-1}A^\top W\hat\theta$$

for a regularizer $\lambda > 0$. For a generic $a$, we define $m(a) := \frac{1}{n}\sum_{i=1}^{n}\hat f(x_i, a)$.

A.2 Good Events for Proof. We now define the high-probability events used in the proof. Specifically, we define $E_1$, $E_2$, $E_3$, and finally E g...

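
The final-estimator formula in the excerpt above is written in "primal" form, with the inverse taken in feature space. When the feature map is finite-dimensional, the same coefficients can be computed from the n-by-n Gram matrix via the identity (AᵀA + cI)⁻¹Aᵀ = Aᵀ(AAᵀ + cI)⁻¹, which is how kernel ridge regression is usually implemented. The check below uses random stand-ins for the feature rows and pseudo-outcomes; the dimensions and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 30, 5, 0.1

A = rng.normal(size=(n, p))      # row j plays the role of phi(a'_j)
m = rng.normal(size=n)           # stand-in pseudo-outcomes m(a'_j)
phi_new = rng.normal(size=p)     # feature vector of a new treatment value

# Primal form: (A^T A + n*lam*I)^{-1} A^T m, a p-by-p solve.
eta_primal = np.linalg.solve(A.T @ A + n * lam * np.eye(p), A.T @ m)

# Dual form: A^T (A A^T + n*lam*I)^{-1} m, an n-by-n solve on the Gram matrix.
K = A @ A.T
dual_coef = np.linalg.solve(K + n * lam * np.eye(n), m)
eta_dual = A.T @ dual_coef

# Predictions agree as well: phi(a)^T eta = k(a)^T (K + n*lam*I)^{-1} m.
pred_primal = phi_new @ eta_primal
pred_dual = (A @ phi_new) @ dual_coef

print(np.allclose(eta_primal, eta_dual), np.isclose(pred_primal, pred_dual))
```

The dual form only needs inner products between feature vectors, so it carries over to kernels whose feature maps are infinite-dimensional.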
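[3] $= n\,\mathbb{E}\,\operatorname{Tr}(\Delta J \Delta J)$, (A.5) with generic $\Delta := Q(a') - Q$ for $a' \sim P_{\mathrm{samp}}$. Next, using Hilbert–Schmidt norms, $\operatorname{Tr}(\Delta J \Delta J) = \|J^{1/2}\Delta J^{1/2}\|_{\mathrm{HS}}^2$, and the inequality $\|U - V\|_{\mathrm{HS}}^2 \le 2\|U\|_{\mathrm{HS}}^2 + 2\|V\|_{\mathrm{HS}}^2$ with $U = J^{1/2}Q(a')J^{1/2}$ and $V = J^{1/2}QJ^{1/2}$, we obtain

$$\operatorname{Tr}(\Delta J \Delta J) \le 2\operatorname{Tr}\big(Q(a')\,J\,Q(a')\,J\big) + 2\operatorname{Tr}\big(Q\,J\,Q\,J\big). \quad \text{(A.6)}$$

We now bound $\operatorname{Tr}(AJAJ)$ for a generic PSD operator $A \succeq 0$: $\operatorname{Tr}(AJAJ) = \operatorname{Tr}\big((J^{1/2}AJ^{1/2})^2\big)$ ...
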
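[4] $\frac{1}{n}A^\top A\big(\frac{1}{n}A^\top A + \lambda I\big)^{-2}\big]$. A useful empirical-effective-dimension bound. Define $\widehat\Gamma(\lambda) := \operatorname{Tr}$

By the same logic as in Appendix A.2, the following bounds hold, each with probability at least $1 - n^{-11}$, for all $\{a'_{1j}\}_{j=1}^{n_1}$ and $\{a'_{2j}\}_{j=1}^{n_2}$:

$$\frac{1}{n_1}\sum_{i=1}^{n_1} f^\star(x_{1i}, a'_{1j}) - \mathbb{E}_{x\sim P_X}[f^\star(x, a'_{1j})] \lesssim \xi\,\|\theta^\star\|_F\,\frac{\sqrt{\log n}}{\sqrt{n}}$$

$$\frac{1}{n_2}\sum_{i=1}^{n_2} f^\star(x_{2i}, a'_{2j}) - \mathbb{E}_{x\sim P_X}[f^\star(x, a'_{2j})] \lesssim \xi\,\|\theta^\star\|_F\,\frac{\sqrt{\log n}}{\sqrt{n}} \quad \text{(B.1)}$$

Again, by the same logic and Lemma F.2, applied separately to the sa...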
