Propensity Patchwork Kriging for Scalable Inference on Heterogeneous Treatment Effects

Hajime Ogawa; Shonosuke Sugasawa

arxiv: 2512.23467 · v2 · submitted 2025-12-29 · 📊 stat.ME

Pith reviewed 2026-05-16 19:21 UTC · model grok-4.3

keywords: heterogeneous treatment effects · propensity score · patchwork kriging · Gaussian process · causal inference · scalable estimation · continuous surfaces

The pith

Partitioning data on the propensity score and enforcing continuity only along that dimension yields scalable, continuous estimates of heterogeneous treatment effects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper extends Patchwork Kriging to causal inference by dividing observations according to the estimated propensity score and applying continuity constraints solely along this one dimension. This produces Gaussian-process-based estimates of how treatment effects vary across individuals without the full computational cost of enforcing smoothness over the entire covariate space. A sympathetic reader would care because standard Gaussian process models for heterogeneous treatment effects become impractical on large datasets, while simple local approximations introduce unwanted jumps in the estimated surfaces.
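
The first step described here is estimating the propensity score e(x) = P(T=1 | x). The following is a minimal sketch of one common way to do that, logistic regression fitted by Newton's method. The choice of logistic regression, the function name `fit_logistic_propensity`, and the simulated data are all assumptions made for illustration; the paper may estimate the score differently.

```python
import numpy as np

def fit_logistic_propensity(X, t, n_iter=25, ridge=1e-8):
    """Estimate e(x) = P(T=1 | x) by logistic regression (Newton's method)."""
    X = np.asarray(X, dtype=float)
    t = np.asarray(t, dtype=float)
    design = np.column_stack([np.ones(len(X)), X])  # prepend an intercept column
    beta = np.zeros(design.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-design @ beta))
        grad = design.T @ (t - p)  # gradient of the log-likelihood
        # Hessian of the negative log-likelihood, plus a tiny ridge for numerical stability.
        hess = (design * (p * (1.0 - p))[:, None]).T @ design + ridge * np.eye(len(beta))
        beta = beta + np.linalg.solve(hess, grad)
    e_hat = 1.0 / (1.0 + np.exp(-design @ beta))
    return e_hat, beta

# Hypothetical simulated data: two covariates, known coefficients.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 2))
true_beta = np.array([0.2, 1.0, -0.5])  # assumed intercept and slopes
p_true = 1.0 / (1.0 + np.exp(-(true_beta[0] + X @ true_beta[1:])))
t = rng.binomial(1, p_true)
e_hat, beta_hat = fit_logistic_propensity(X, t)
```

The returned `e_hat` is the one-dimensional summary along which the data would then be partitioned.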

Core claim

By extending Patchwork Kriging into the causal framework, the method partitions the data on the estimated propensity score and imposes continuity constraints only along the propensity score dimension. The resulting estimator avoids the discontinuities of naive local approximations while remaining far cheaper than full-covariate continuity enforcement, and it can be interpreted as a smoothing extension of propensity-score stratification.

What carries the argument

Propensity Patchwork Kriging, which partitions observations by estimated propensity score and links adjacent regions through continuity constraints applied only along that single dimension.
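
To make the partitioning step concrete, here is a small sketch that assigns units to K regions along the propensity axis. Equal-frequency (quantile) cut points, the function name `partition_by_propensity`, and the uniform stand-in scores are assumptions for illustration only; the paper may place its boundaries differently.

```python
import numpy as np

def partition_by_propensity(e_hat, n_regions):
    """Assign each unit to one of n_regions bins along the propensity axis.

    Quantile cut points are an assumption of this sketch.
    """
    e_hat = np.asarray(e_hat, dtype=float)
    # n_regions - 1 interior cut points separate adjacent regions.
    probs = np.linspace(0.0, 1.0, n_regions + 1)[1:-1]
    boundaries = np.quantile(e_hat, probs)
    # Each score is labelled by the number of cut points at or below it (0 .. n_regions - 1).
    labels = np.searchsorted(boundaries, e_hat, side="right")
    return labels, boundaries

rng = np.random.default_rng(0)
e_hat = rng.uniform(0.05, 0.95, size=1000)  # stand-in for estimated propensity scores
labels, boundaries = partition_by_propensity(e_hat, n_regions=5)
counts = np.bincount(labels, minlength=5)   # number of units per region
```

With K regions there are only K - 1 boundaries, each a single threshold in the propensity score, which is where the continuity constraints would be attached.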

If this is right

Gaussian-process models for heterogeneous treatment effects become feasible on datasets that exceed the scale of standard implementations.
The estimates remain continuous across region boundaries without requiring continuity constraints in all covariate dimensions.
The procedure supplies a smoothed counterpart to ordinary propensity-score stratification.
Computational cost drops substantially relative to patchwork kriging applied over the full covariate space.
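
For reference, the ordinary propensity-score stratification that the method is said to smooth can be sketched as a difference in mean outcomes within each stratum. This is the classical estimator, not the paper's method; the data-generating process, the use of the true propensity in place of an estimate, and all names below are hypothetical.

```python
import numpy as np

def stratified_effects(y, t, e_hat, n_strata=5):
    """Treated-minus-control mean outcome within each propensity stratum."""
    y, t, e_hat = (np.asarray(a) for a in (y, t, e_hat))
    cuts = np.quantile(e_hat, np.linspace(0.0, 1.0, n_strata + 1)[1:-1])
    strata = np.searchsorted(cuts, e_hat, side="right")
    effects = np.full(n_strata, np.nan)  # stays NaN if a stratum lacks either arm
    for k in range(n_strata):
        in_k = strata == k
        treated, control = in_k & (t == 1), in_k & (t == 0)
        if treated.any() and control.any():
            effects[k] = y[treated].mean() - y[control].mean()
    return strata, effects

# Hypothetical simulated data with an effect that rises smoothly in e(x).
rng = np.random.default_rng(2)
n = 5000
x = rng.normal(size=n)
e = 1.0 / (1.0 + np.exp(-x))   # true propensity, used here in place of an estimate
t = rng.binomial(1, e)
tau = 1.0 + e                  # assumed true treatment effect
y = x + tau * t + rng.normal(scale=0.5, size=n)
strata, effects = stratified_effects(y, t, e)
```

The result is one number per stratum, so the implied effect curve is a step function of the propensity score with a jump at every cut point. That piecewise-constant shape is what a smoothing extension would replace with a continuous surface.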

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The single-dimension constraint may extend naturally to other low-dimensional summaries used in causal work, such as prognostic scores.
Hybrid constructions could combine this partitioning with inducing-point or sparse Gaussian-process approximations for even larger problems.
In policy applications the resulting surfaces could improve targeting when treatment effects vary smoothly with selection probability.
Empirical checks on real data would reveal whether the propensity-score axis alone captures enough dependence to keep estimates stable under moderate propensity misspecification.

Load-bearing premise

Partitioning on the estimated propensity score and enforcing continuity only along that single dimension is sufficient to produce accurate and continuous heterogeneous treatment effect surfaces across the full covariate space.

What would settle it

A held-out test set or simulation in which the true heterogeneous treatment effect surface exhibits substantial variation or discontinuities in directions orthogonal to the propensity score, causing the method to produce visibly discontinuous or biased estimates.
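
A stress test of this kind could start from a data-generating process like the sketch below, in which treatment assignment depends on one covariate and the treatment effect depends on another, independent one. Every design choice here (two Gaussian covariates, a logistic propensity, a step-shaped effect) is a hypothetical illustration and does not reproduce any simulation setting from the paper.

```python
import numpy as np

def simulate_orthogonal_hte(n, seed=0):
    """Propensity depends on x1 only; the treatment effect depends on x2 only."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=n)
    x2 = rng.normal(size=n)            # drawn independently of x1
    e = 1.0 / (1.0 + np.exp(-x1))      # propensity score e(x) = P(T=1 | x)
    t = rng.binomial(1, e)
    tau = np.where(x2 > 0, 2.0, 0.0)   # assumed effect with a jump at x2 = 0
    y = x1 + tau * t + rng.normal(scale=0.5, size=n)
    return np.column_stack([x1, x2]), t, y, e, tau

X, t, y, e, tau = simulate_orthogonal_hte(5000)
corr = np.corrcoef(e, tau)[0, 1]       # sample correlation between e(x) and the true effect
```

Because the true effect is unrelated to e(x) by construction, any estimator whose continuity is tied to the propensity axis alone can be compared against the known `tau` to see how it behaves in the orthogonal direction.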

Figures

Figures reproduced from arXiv: 2512.23467 by Hajime Ogawa, Shonosuke Sugasawa.

**Figure 1.** MSEs on settings A and C with K = 2. In the K = 2 case there is only one boundary between regions, so the effect of the patchwork scheme is expected to be weaker than in the K = 5 or 10 cases. Within each divided region, the local model can use only a smaller sample than the baseline model, but the negative effect appears to be relatively small. Our proposed model and local approximation can use hyperparameters different for e…

**Figure 2.** Computation time and MSEs on settings A and C.

**Figure 3.** Comparison of HTE estimation bias near partition boundaries on Setup A. The…

**Figure 4.** Spline-smoothed trends of estimated HTE versus propensity scores. Gray dots…

**Figure 5.** Scatter plots of estimated HTE versus propensity scores: (Left) Proposed…

**Figure 6.** Overlapped scatter plot of the estimates shown in Figure 5.

Original abstract

Gaussian process-based models are attractive for estimating heterogeneous treatment effects (HTE), but their computational cost limits scalability in causal inference settings. In this work, we address this challenge by extending Patchwork Kriging into the causal inference framework. Our proposed method partitions the data according to the estimated propensity score and applies Patchwork Kriging to enforce continuity of HTE estimates across adjacent regions. By imposing continuity constraints only along the propensity score dimension, rather than the full covariate space, the proposed approach substantially reduces computational cost while avoiding discontinuities inherent in simple local approximations. The resulting method can be interpreted as a smoothing extension of stratification and provides an efficient approach to HTE estimation. The proposed method is demonstrated through simulation studies and a real data application.
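
The continuity idea in the abstract can be sketched in symbols, following the general Patchwork Kriging construction of Park and Apley (2018), in which neighboring local models are tied together by conditioning on zero differences at boundary points. The notation below (cut points c_k, regions Ω_k, local effect functions τ_k, kernels κ_k) is introduced here for illustration. Whether the paper places the Gaussian process on the effect surface directly or on the outcome surfaces, and how it selects boundary points, is not stated in the abstract, so this should be read as an assumption-laden outline rather than the authors' formulation.

```latex
\begin{align*}
% Regions along the estimated propensity score, with cut points c_0 < c_1 < \dots < c_K
\Omega_k &= \{\, x : c_{k-1} \le \hat e(x) < c_k \,\}, \qquad k = 1, \dots, K, \\
% One local Gaussian process per region (assumed notation)
\tau_k &\sim \mathcal{GP}\bigl(0, \kappa_k(\cdot, \cdot)\bigr) \quad \text{on } \Omega_k, \\
% Pseudo-observations: adjacent local models agree at boundary points x_b with \hat e(x_b) = c_k
\delta_k(x_b) &= \tau_k(x_b) - \tau_{k+1}(x_b) = 0, \qquad k = 1, \dots, K - 1 .
\end{align*}
```

Since each boundary is a single threshold in the propensity score, only K - 1 interfaces need such constraints, compared with the many shared faces that arise when a high-dimensional covariate space is tiled directly.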

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Propensity Patchwork Kriging scales continuous HTE estimation by restricting continuity to one dimension, but that restriction is also its main vulnerability.

This paper adapts Patchwork Kriging to heterogeneous treatment effect estimation by first partitioning on the estimated propensity score and then enforcing continuity constraints only along that single axis. The result is positioned as a computationally lighter smoothing layer on top of standard stratification. That targeted restriction is the concrete new piece: it is not just another Gaussian process variant, but a deliberate reduction of the continuity problem to the propensity dimension to gain scalability. The abstract and framing make clear how this sits between crude binning and full nonparametric models, which is a reasonable practical niche.

The simulations and real-data example are mentioned as support, so the authors have at least attempted to show the method runs and produces plausible output. That counts as basic due diligence for a methods paper.

The central assumption, however, is that relevant treatment-effect variation is either captured by or constant conditional on the propensity score. When that does not hold, movement in orthogonal covariate directions can still produce jumps between patches. The stress-test concern lands because the paper does not appear to provide a general error bound or a diagnostic that would catch this failure mode. If the simulations only test cases where the assumption is satisfied, they do not rule out the problem. The method is therefore best viewed as a useful heuristic under strong conditions rather than a general solution.

Readers working on large observational causal problems who already use propensity scores and want something smoother than discrete strata will find the most immediate value. It is not a paradigm shift, but the computational angle is relevant enough that a serious editor should send it to referees who can examine the simulation design and any continuity diagnostics in the full manuscript.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Propensity Patchwork Kriging (PPK), an extension of Patchwork Kriging to the causal setting for scalable heterogeneous treatment effect (HTE) estimation. Data are partitioned according to the estimated propensity score, after which Patchwork Kriging is applied to enforce continuity of the HTE surface only along the one-dimensional propensity-score axis. The method is positioned as a smoothing extension of stratification that achieves substantial computational savings relative to full Gaussian-process models while avoiding the discontinuities of naive local approximations. Claims are supported by simulation studies and a real-data application.

Significance. If the central construction holds, PPK would supply a computationally attractive middle ground between stratification and full-dimensional GP models for HTE, with the one-dimensional continuity constraint offering a principled way to trade off cost against smoothness. The approach is novel in its targeted use of the propensity score as the sole continuity axis and could be useful in large observational studies where standard GPs are infeasible.
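
The scale of the computational saving can be illustrated with a back-of-the-envelope comparison. Exact Gaussian process regression costs on the order of n^3 operations for n observations; the comparison below assumes K equally sized regions and ignores the extra cost of boundary pseudo-observations, so it is a rough guide rather than a complexity result from the paper.

```latex
\[
\underbrace{O(n^{3})}_{\text{single GP on } n \text{ units}}
\quad\text{vs.}\quad
\underbrace{K \cdot O\bigl((n/K)^{3}\bigr) = O\bigl(n^{3}/K^{2}\bigr)}_{K \text{ local GPs, } n/K \text{ units each}}
\]
% Example with assumed values n = 10^5 and K = 10:
% n^3 = 10^{15}, while K (n/K)^3 = 10 \cdot 10^{12} = 10^{13}, a reduction by a factor of K^2 = 100.
```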

major comments (2)

[Abstract and §2 (Method)] The claim that 'imposing continuity constraints only along the propensity score dimension... avoids discontinuities inherent in simple local approximations' rests on the unexamined assumption that all relevant HTE heterogeneity is captured by (or constant conditional on) the propensity score. When heterogeneity is driven by covariates orthogonal to the propensity score, movement across patch boundaries in those directions can still produce jumps, undermining both continuity and unbiasedness. No theorem, bias bound, or targeted simulation addressing orthogonal heterogeneity is provided.
[§4 (Simulations)] The reported simulation designs do not include scenarios in which treatment-effect heterogeneity is deliberately placed in directions independent of the propensity score. Without such stress tests, the empirical results cannot confirm that the one-dimensional continuity enforcement preserves accuracy across the full covariate space.

minor comments (2)

[§2] Notation for the propensity-score partitioning and the Patchwork Kriging kernel should be introduced with explicit definitions before the continuity constraints are stated.
[§5] The real-data application would benefit from a table reporting both point estimates and uncertainty quantification for the HTE surface at representative covariate values.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments, which help clarify the scope and limitations of the continuity properties in Propensity Patchwork Kriging. We respond point by point below, indicating revisions where appropriate.

Referee: [Abstract and §2 (Method)] The claim that 'imposing continuity constraints only along the propensity score dimension... avoids discontinuities inherent in simple local approximations' rests on the unexamined assumption that all relevant HTE heterogeneity is captured by (or constant conditional on) the propensity score. When heterogeneity is driven by covariates orthogonal to the propensity score, movement across patch boundaries in those directions can still produce jumps, undermining both continuity and unbiasedness. No theorem, bias bound, or targeted simulation addressing orthogonal heterogeneity is provided.

Authors: We appreciate this clarification. The method enforces continuity only along the one-dimensional propensity-score axis to smooth transitions between adjacent propensity-based patches, addressing the discontinuities that arise in naive stratification or local approximations when crossing propensity thresholds. It does not assume or claim that all HTE heterogeneity is captured by the propensity score, nor does it enforce continuity in the full covariate space; within patches, HTE can depend on the complete covariate vector. We will revise the abstract and §2 to explicitly state that continuity is restricted to the propensity dimension and to remove any implication of global smoothness. We acknowledge that no formal theorem or bias bound is provided and that such analysis would require substantial additional theoretical development.

Revision: partial

Referee: [§4 (Simulations)] The reported simulation designs do not include scenarios in which treatment-effect heterogeneity is deliberately placed in directions independent of the propensity score. Without such stress tests, the empirical results cannot confirm that the one-dimensional continuity enforcement preserves accuracy across the full covariate space.

Authors: We agree that the current simulations do not stress-test orthogonal heterogeneity and that this limits the strength of the empirical claims. In the revised manuscript we will add new simulation scenarios in which the true HTE surface depends on covariates that are independent of the propensity score. These will be used to evaluate whether the one-dimensional continuity constraint preserves accuracy relative to baselines.

Revision: yes

standing simulated objections not resolved

A formal theorem or bias bound addressing continuity and unbiasedness under heterogeneity orthogonal to the propensity score.

Circularity Check

0 steps flagged

No circularity: derivation builds on external Patchwork Kriging and propensity concepts

The paper extends Patchwork Kriging by partitioning on the estimated propensity score and enforcing continuity constraints solely along that dimension. No quoted equations or steps reduce a claimed prediction, uniqueness result, or performance guarantee to a quantity defined by the same procedure or to a self-citation chain. The central construction is presented as a new methodological combination whose continuity and scalability properties follow from the imposed constraints rather than from any fitted input being renamed as output. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard Gaussian process assumptions for the underlying response surface and on the validity of propensity score partitioning for dimension reduction. No new free parameters or invented entities are introduced in the abstract description.

axioms (2)

Domain assumption: Gaussian process models can represent heterogeneous treatment effects with appropriate kernel choices. Invoked implicitly when extending Patchwork Kriging to the causal setting.
Domain assumption: Propensity score partitioning sufficiently captures the relevant heterogeneity for continuity enforcement. Central modeling choice stated in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/Cost/FunctionalEquation.lean washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "By imposing continuity constraints only along the propensity score dimension, rather than the full covariate space, the proposed approach substantially reduces computational cost while avoiding discontinuities inherent in simple local approximations."

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We propose estimating the propensity score e(x)=P(T=1|x) in advance and using it as the sole partitioning variable"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages

[1] Alaa, A. M. and M. van der Schaar (2017). Bayesian inference of individualized treatment effects using multi-task Gaussian processes. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Advances in Neural Information Processing Systems, Volume 30. Curran Associates, Inc.

[2] Datta, A., S. Banerjee, A. O. Finley, and A. E. Gelfand (2016). Hierarchical nearest-neighbor Gaussian process models for large geostatistical datasets. Journal of the American Statistical Association 111(514), 800–812. PMID: 29720777.

[3] Engle, R. F., C. W. J. Granger, J. Rice, and A. Weiss (1986). Semiparametric estimates of the relation between weather and electricity sales. Journal of the American Statistical Association 81(394), 310–320.

[4] Finley, A. O., A. Datta, B. C. Cook, D. C. Morton, H. E. Andersen, and S. Banerjee (2018). Efficient algorithms for Bayesian nearest neighbor Gaussian processes.

[5] Hahn, P. R., J. S. Murray, and C. Carvalho (2019). Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects.

[6] Horii, S. and Y. Chikahara (2023). Uncertainty quantification in heterogeneous treatment effect estimation with Gaussian-process-based partially linear model.

[7] Imai, K. and D. A. van Dyk (2004). Causal inference with general treatment regimes. Journal of the American Statistical Association 99(467), 854–866.

[8] Johnson, E., F. Dominici, M. Griswold, and S. L. Zeger (2003). Disease cases and their medical costs attributable to smoking: an analysis of the national medical expenditure survey. Journal of Econometrics 112(1), 135–151. Analysis of data on health: 2.

[9] Louizos, C., U. Shalit, J. Mooij, D. Sontag, R. Zemel, and M. Welling (2017). Causal effect inference with deep latent-variable models.

[10] McCandless, L. C., P. Gustafson, and P. C. Austin (2009). Bayesian propensity score analysis for observational data. Statistics in Medicine 28(1), 94–112.

[11] Nie, X. and S. Wager (2020). Quasi-oracle estimation of heterogeneous treatment effects.

[12] Orihara, S. and T. Momozaki (2024). Bayesian-based propensity score subclassification estimator.

[13] Park, C. and D. Apley (2018). Patchwork kriging for large-scale Gaussian process regression.

[14] Rasmussen, C. E. and C. K. I. Williams (2006). Gaussian Processes for Machine Learning. The MIT Press.

[15] Rosenbaum, P. R. and D. B. Rubin (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70(1), 41–55.

[16] Rubin, D. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66, 688–701.

[17] Snelson, E. and Z. Ghahramani (2005). Sparse Gaussian processes using pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. Platt (Eds.), Advances in Neural Information Processing Systems, Volume 18. MIT Press.

[18] Titsias, M. (2009, 16–18 Apr). Variational learning of inducing variables in sparse Gaussian processes. In D. van Dyk and M. Welling (Eds.), Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, Volume 5 of Proceedings of Machine Learning Research, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, p...

[19] Williams, C. and M. Seeger (2000). Using the Nyström method to speed up kernel machines. In T. Leen, T. Dietterich, and V. Tresp (Eds.), Advances in Neural Information Processing Systems, Volume 13. MIT Press.

[20] Wu, L., G. Pleiss, and J. P. Cunningham (2022). Variational nearest neighbor Gaussian processes. CoRR abs/2202.01694.

[21] Zhu, Y., N. Mitra, and J. Roy (2022). Addressing positivity violations in causal effect estimation using Gaussian process priors.
