Estimating Continuous Treatment Effects with Two-Stage Kernel Ridge Regression
Pith reviewed 2026-05-10 13:20 UTC · model grok-4.3
The pith
A two-stage kernel ridge regression estimates the continuous treatment effect function by first modeling the full response surface and then correcting for confounding with pseudo-outcomes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We propose a two-stage kernel ridge regression method. In the first stage, we learn a model for the response as a function of both treatment and covariates; in the second stage, we use this model to construct pseudo-outcomes that correct for distribution shift, and then fit a second model to estimate the treatment effect. Although the response varies with both treatment and covariates, the induced effect function obtained by averaging over covariates is typically much simpler, and our estimator adapts to this structure. Furthermore, we introduce a fully data-driven model selection procedure that achieves provable adaptivity to both the unknown degree of overlap and the regularity (eigenvalue decay) of the underlying kernel.
What carries the argument
Two-stage kernel ridge regression that learns a joint conditional response model in stage one and then regresses pseudo-outcomes on treatment alone in stage two to recover the marginal effect function while adapting to its relative simplicity.
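The two stages can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: the Gaussian kernel, the bandwidths `bw1` and `bw2`, the ridge values `lam1` and `lam2`, and the choice to draw the second-stage treatment values uniformly over the observed range are all assumptions made here for the example.

```python
import numpy as np

def rbf_gram(A, B, bandwidth):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def krr_coef(K, y, lam):
    """Dual KRR coefficients: solve (K + n*lam*I) alpha = y."""
    n = K.shape[0]
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def two_stage_krr(t, x, y, t_eval, lam1=1e-3, lam2=1e-3, bw1=1.0, bw2=0.5, seed=0):
    """Estimate the effect function at t_eval (all tuning values are illustrative)."""
    rng = np.random.default_rng(seed)
    n = len(t)
    x = x.reshape(n, -1)
    Z = np.column_stack([t, x])

    # Stage 1: KRR of the response on (treatment, covariates) jointly.
    alpha1 = krr_coef(rbf_gram(Z, Z, bw1), y, lam1)

    # Pseudo-outcomes: for each sampled treatment value a_j, average the
    # stage-1 prediction over the empirical covariates x_1, ..., x_n.
    # Assumption: a_j is drawn uniformly over the observed treatment range.
    a = rng.uniform(t.min(), t.max(), size=n)
    m = np.empty(n)
    for j in range(n):
        Zj = np.column_stack([np.full(n, a[j]), x])
        m[j] = (rbf_gram(Zj, Z, bw1) @ alpha1).mean()

    # Stage 2: KRR of the pseudo-outcomes on treatment alone.
    alpha2 = krr_coef(rbf_gram(a[:, None], a[:, None], bw2), m, lam2)
    t_eval = np.asarray(t_eval, dtype=float)
    return rbf_gram(t_eval[:, None], a[:, None], bw2) @ alpha2
```

Calling `two_stage_krr(t, x, y, grid)` with one-dimensional arrays `t`, `y` and a covariate array `x` returns one estimate per grid point. Because the second stage sees only treatment values, its bandwidth and ridge parameter can be tuned to the smoothness of the averaged curve rather than to the joint surface.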
If this is right
- Consistent estimation of the continuous dose-response curve becomes possible even when treatment assignment depends strongly on covariates.
- Faster convergence rates are achieved whenever the averaged effect function has lower complexity than the joint response surface.
- Fully automatic regularization selection works without knowing the degree of overlap or the eigenvalue decay rate in advance.
- The procedure applies in general reproducing kernel Hilbert spaces, supporting flexible nonparametric modeling of both stages.
Where Pith is reading between the lines
- The same two-stage logic could be paired with non-kernel first-stage estimators such as neural nets when covariates are high-dimensional.
- The approach suggests that explicitly marginalizing over covariates via an intermediate model is often more efficient than attempting direct adjustment for continuous treatments.
- Empirical tests on observational datasets with measured overlap variation would directly check whether the claimed adaptivity holds in practice.
Load-bearing premise
An accurate enough first-stage model of the response given treatment and covariates can be learned so that the derived pseudo-outcomes remove the selection bias induced by confounding.
What would settle it
In a simulation with known true effect function and controlled overlap, the two-stage estimator would produce higher mean squared error than a single-stage direct regression of outcome on treatment when overlap is moderate and the marginal effect is no simpler than the full surface.
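A harness for this kind of check might look as follows. It is a sketch under stated assumptions: the data-generating process (true effect `sin(t)`, a linear covariate term, a `confounding` knob that weakens overlap as it grows), the Gaussian kernel, and every tuning value are hypothetical choices made here, not taken from the paper. This particular process has a marginal effect that is simpler than the joint surface, so probing the failure case described above would mean swapping in a process where that is not true.

```python
import numpy as np

def gram(A, B, bw):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bw**2))

def krr(X_tr, y_tr, X_te, bw, lam):
    """Fit KRR on (X_tr, y_tr) and predict at X_te."""
    n = len(y_tr)
    alpha = np.linalg.solve(gram(X_tr, X_tr, bw) + n * lam * np.eye(n), y_tr)
    return gram(X_te, X_tr, bw) @ alpha

def one_replication(n, confounding, rng, grid):
    # Hypothetical data-generating process; the true effect is sin(t) since E[X] = 0.
    x = rng.normal(size=n)
    t = confounding * x + rng.normal(size=n)  # larger value means weaker overlap
    y = np.sin(t) + 2.0 * x + 0.3 * rng.normal(size=n)
    truth = np.sin(grid)

    # Single-stage baseline: regress the outcome on treatment alone.
    direct = krr(t[:, None], y, grid[:, None], bw=0.5, lam=1e-3)

    # Two-stage: joint fit, then average predictions over covariates.
    a = rng.uniform(t.min(), t.max(), size=n)
    pairs = np.column_stack([np.repeat(a, n), np.tile(x, n)])  # row j*n+i = (a_j, x_i)
    f_hat = krr(np.column_stack([t, x]), y, pairs, bw=1.0, lam=1e-3)
    m = f_hat.reshape(n, n).mean(axis=1)  # pseudo-outcome for each a_j
    two_stage = krr(a[:, None], m, grid[:, None], bw=0.5, lam=1e-3)

    return np.mean((two_stage - truth)**2), np.mean((direct - truth)**2)

def compare(n=150, confounding=0.5, reps=5, seed=0):
    """Average MSE over replications: returns (two-stage MSE, direct MSE)."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(-1.0, 1.0, 21)
    results = np.array([one_replication(n, confounding, rng, grid) for _ in range(reps)])
    return results.mean(axis=0)
```

Running `compare` over a range of `confounding` values traces how each estimator degrades as overlap weakens. A favorable result on this one process says nothing about the general claim; the informative runs are the ones designed to break it.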
Original abstract
We study the problem of estimating the effect function for a continuous treatment, which maps each treatment value to a population-averaged outcome. A central challenge in this setting is confounding: treatment assignment often depends on covariates, creating selection bias that makes direct regression of the response on treatment unreliable. To address this issue, we propose a two-stage kernel ridge regression method. In the first stage, we learn a model for the response as a function of both treatment and covariates; in the second stage, we use this model to construct pseudo-outcomes that correct for distribution shift, and then fit a second model to estimate the treatment effect. Although the response varies with both treatment and covariates, the induced effect function obtained by averaging over covariates is typically much simpler, and our estimator adapts to this structure. Furthermore, we introduce a fully data-driven model selection procedure that achieves provable adaptivity to both the unknown degree of overlap and the regularity (eigenvalue decay) of the underlying kernel.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a two-stage kernel ridge regression (KRR) method for estimating the continuous treatment effect function, which is the population-averaged outcome as a function of treatment. The first stage involves fitting a KRR model to the response using both treatment and covariates. This model is then used to generate pseudo-outcomes that adjust for confounding due to covariate-dependent treatment assignment. In the second stage, a KRR is applied to these pseudo-outcomes regressed on treatment alone to estimate the effect function. The authors claim that the estimator adapts to the simpler structure of the induced effect function obtained by averaging over covariates. Additionally, they propose a fully data-driven model selection procedure that achieves provable adaptivity to the unknown degree of overlap and the regularity of the kernel as measured by its eigenvalue decay.
Significance. This research tackles a challenging problem in causal inference involving continuous treatments and confounding. The proposed method's ability to adapt to unknown overlap and kernel regularity through data-driven selection is a key strength, potentially leading to more robust and efficient estimation in practice. If the theoretical guarantees are established rigorously, it could advance the field by providing a flexible nonparametric approach that does not require prior knowledge of smoothness or overlap parameters. The two-stage structure exploits the fact that the marginal effect function is typically less complex than the full conditional expectation, which is a clever insight.
Major comments (2)
- [§3.2, Theorem 3.1] The adaptivity result for the second-stage estimator assumes that the first-stage pseudo-outcomes have error rates that are negligible compared to the second-stage rates. However, in regions of poor overlap, the first-stage KRR may have slower convergence, and it is not clear from the proof how the averaging over covariates mitigates this without additional assumptions on the conditional density of T given X or explicit bounds on the propagation of first-stage variance into the second-stage objective.
- [§4.2, data-driven selection procedure] The cross-validation criterion for choosing regularization parameters in both stages is claimed to achieve oracle rates simultaneously for overlap and eigenvalue decay, but the analysis does not appear to include a term controlling the contribution of first-stage estimation error to the pseudo-outcome variability; this could invalidate the adaptivity when overlap is weak and unknown a priori.
Minor comments (2)
- [§2] The notation for the effect function τ(t) and the pseudo-outcomes could be introduced more explicitly in §2 to distinguish them from standard regression functions.
- [Table 1] The simulation results for varying overlap levels would benefit from reporting standard errors across replications to assess variability.
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive feedback on our manuscript. We address the major comments point by point below, outlining revisions to strengthen the theoretical analysis where needed.
- Referee: [§3.2, Theorem 3.1] The adaptivity result for the second-stage estimator assumes that the first-stage pseudo-outcomes have error rates that are negligible compared to the second-stage rates. However, in regions of poor overlap, the first-stage KRR may have slower convergence, and it is not clear from the proof how the averaging over covariates mitigates this without additional assumptions on the conditional density of T given X or explicit bounds on the propagation of first-stage variance into the second-stage objective.
  Authors: We agree that the propagation of first-stage errors merits more explicit treatment. Our proof of Theorem 3.1 relies on the marginal nature of the target effect function and uses the overlap condition together with eigenvalue decay to show that first-stage contributions are of lower order after averaging over covariates. To address the concern, we will add a dedicated lemma in the appendix that derives explicit high-probability bounds on the first-stage error term in the second-stage objective, explicitly incorporating dependence on the conditional density of T given X and the overlap parameter. This will make the negligible-error assumption fully rigorous and clarify the mitigation mechanism.
  Revision: yes
- Referee: [§4.2] The cross-validation criterion for choosing regularization parameters in both stages is claimed to achieve oracle rates simultaneously for overlap and eigenvalue decay, but the analysis does not appear to include a term controlling the contribution of first-stage estimation error to the pseudo-outcome variability; this could invalidate the adaptivity when overlap is weak and unknown a priori.
  Authors: This observation correctly identifies a gap in the current CV analysis. The existing argument controls first-stage error under a uniform bound but does not explicitly fold the resulting pseudo-outcome variability into the concentration inequalities for the data-driven selector. We will revise Section 4.2 and the associated theorem to incorporate an additional error term that accounts for first-stage estimation in the pseudo-outcomes. We will then show that the cross-validation procedure still attains the claimed oracle rates for both overlap and eigenvalue decay, provided the overlap satisfies the paper's standing assumptions.
  Revision: yes
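To make the concern concrete, here is what a plain K-fold selector for the second-stage ridge parameter looks like. It is a generic stand-in, not the paper's selection procedure: the function name `select_lambda`, the Gaussian kernel, the bandwidth, and the fold count are assumptions made for this sketch. The criterion scores each candidate against the pseudo-outcomes `m` as if they were fixed targets, so any first-stage error in `m` enters the score unaccounted for, which is the gap the referee points to.

```python
import numpy as np

def gauss_gram(a, b, bw):
    """Gaussian kernel matrix between two 1-D arrays of treatment values."""
    return np.exp(-(a[:, None] - b[None, :])**2 / (2.0 * bw**2))

def select_lambda(a, m, lams, bw=0.5, n_folds=5, seed=0):
    """K-fold CV over candidate ridge values for KRR of pseudo-outcomes m on a."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(a)), n_folds)
    scores = []
    for lam in lams:
        err = 0.0
        for k in range(n_folds):
            te = folds[k]
            tr = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            K = gauss_gram(a[tr], a[tr], bw)
            alpha = np.linalg.solve(K + len(tr) * lam * np.eye(len(tr)), m[tr])
            pred = gauss_gram(a[te], a[tr], bw) @ alpha
            # Held-out squared error against the pseudo-outcomes, which are
            # themselves estimates; first-stage error is not modeled here.
            err += np.sum((pred - m[te])**2)
        scores.append(err / len(a))
    return lams[int(np.argmin(scores))], scores
```

The function returns the candidate with the smallest held-out error together with the full list of scores, so the sensitivity of the choice to the candidate grid can be inspected directly.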
Circularity Check
No significant circularity detected in derivation chain
The paper presents a two-stage KRR procedure: first-stage regression of response on (T,X) to form pseudo-outcomes that debias for confounding, followed by second-stage regression on T alone to recover the marginal effect function. The claimed data-driven model selection for adaptivity to overlap and eigenvalue decay follows from standard kernel ridge analysis and cross-validation arguments without reducing any claimed rate or estimator to a fitted input by definition. No self-citations are invoked as load-bearing uniqueness theorems, no ansatz is smuggled, and no known empirical pattern is merely renamed. The derivation remains self-contained against external kernel theory benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption (unconfoundedness): treatment assignment is independent of potential outcomes given the observed covariates.
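Unconfoundedness is what allows the effect function to be written as an average of the first-stage regression over covariates. The standard identification argument is sketched here; the potential-outcome notation $Y(t)$, the consistency condition $Y = Y(T)$, and the positivity requirement are assumptions added for illustration and are not notation taken from the paper.

```latex
\begin{align*}
\tau(t) &:= \mathbb{E}\big[Y(t)\big] \\
        &= \mathbb{E}_{X}\Big[\mathbb{E}\big[Y(t) \mid X\big]\Big]
           && \text{(tower property)} \\
        &= \mathbb{E}_{X}\Big[\mathbb{E}\big[Y(t) \mid T = t, X\big]\Big]
           && \text{(unconfoundedness: } Y(t) \perp T \mid X\text{)} \\
        &= \mathbb{E}_{X}\Big[\mathbb{E}\big[Y \mid T = t, X\big]\Big]
           && \text{(consistency: } Y = Y(T)\text{)}
\end{align*}
```

The inner conditional expectation is only defined where the conditional density of $T$ given $X$ is positive at $t$, which is where the overlap condition enters. The outer expectation is over the marginal distribution of $X$, not over its distribution given $T = t$, and this is why a direct regression of outcome on treatment targets a different quantity.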
Reference graph
Works this paper leans on
- [1] Sample $\{a'_j\}_{j=1}^{n}$ from $P_{\mathrm{samp}}$ and create pseudo-outcomes $\{m_j := m(a'_j)\}_{j=1}^{n}$. Explicitly, the pseudo-outcome for $a'_j$ is $m(a'_j) := \frac{1}{n}\sum_{i=1}^{n} \psi(x_i, a'_j)^\top \hat\theta = \hat{\bar\psi}(a'_j)^\top \hat\theta$.
- [2] Define the design operator of $\{\phi(a'_j)\}_{j=1}^{n}$ as $A$. Obtain the final estimator $\hat\eta_\lambda = (A^\top A + n\lambda I)^{-1}\sum_{j=1}^{n} \phi(a'_j)\, m(a'_j) = (A^\top A + n\lambda I)^{-1} A^\top W \hat\theta$ for a regularizer $\lambda > 0$. For a generic $a$, we define $m(a) := \frac{1}{n}\sum_{i=1}^{n} \hat f(x_i, a)$. A.2 Good Events for Proof. We now define the high-probability events used in the proof. Specifically, we define $E_1$, $E_2$, $E_3$, and finally $E_g$...
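- [3] ... $= n\,\mathbb{E}\,\mathrm{Tr}(\Delta J \Delta J)$, (A.5) with generic $\Delta := Q(a') - Q$ for $a' \sim P_{\mathrm{samp}}$. Next, using Hilbert–Schmidt norms, $\mathrm{Tr}(\Delta J \Delta J) = \|J^{1/2} \Delta J^{1/2}\|_{\mathrm{HS}}^2$, and the inequality $\|U - V\|_{\mathrm{HS}}^2 \le 2\|U\|_{\mathrm{HS}}^2 + 2\|V\|_{\mathrm{HS}}^2$ with $U = J^{1/2} Q(a') J^{1/2}$ and $V = J^{1/2} Q J^{1/2}$, we obtain $\mathrm{Tr}(\Delta J \Delta J) \le 2\,\mathrm{Tr}\big(Q(a') J Q(a') J\big) + 2\,\mathrm{Tr}(Q J Q J)$. (A.6) We now bound $\mathrm{Tr}(AJAJ)$ for a generic PSD operator $A \succeq 0$: $\mathrm{Tr}(AJAJ) = \mathrm{Tr}\big((J^{1/2} A J^{1/2})^2\big)$ ...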
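- [4] ... $\tfrac{1}{n}A^\top A\,\big(\tfrac{1}{n}A^\top A + \lambda I\big)^{-2}\big]$. A useful empirical-effective-dimension bound. Define $\widehat{\Gamma}(\lambda) := \mathrm{Tr}$ ...
  By the same logic as in Appendix A.2, the following bounds hold, each with probability at least $1 - n^{-11}$, for all $\{a'_{1j}\}_{j=1}^{n_1}$ and $\{a'_{2j}\}_{j=1}^{n_2}$: $\Big|\frac{1}{n_1}\sum_{i=1}^{n_1} f^\star(x_{1i}, a'_{1j}) - \mathbb{E}_{x \sim P_X}\big[f^\star(x, a'_{1j})\big]\Big| \lesssim \xi \|\theta^\star\|_F \frac{\sqrt{\log n}}{\sqrt{n}}$ and $\Big|\frac{1}{n_2}\sum_{i=1}^{n_2} f^\star(x_{2i}, a'_{2j}) - \mathbb{E}_{x \sim P_X}\big[f^\star(x, a'_{2j})\big]\Big| \lesssim \xi \|\theta^\star\|_F \frac{\sqrt{\log n}}{\sqrt{n}}$. (B.1) Again, by the same logic and Lemma F.2, applied separately to the sa...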