Predicting Heterogeneous Treatment Effects Of Building Energy Saving Retrofits Using Causal Machine Learning

David Zapata Gonzalez; Kevin Zalipski; Oliver M\"uller

arxiv: 2607.01891 · v1 · pith:ZRWY6Z5Xnew · submitted 2026-07-02 · 💻 cs.CE

Predicting Heterogeneous Treatment Effects Of Building Energy Saving Retrofits Using Causal Machine Learning

Kevin Zalipski , David Zapata Gonzalez , Oliver M\"uller This is my paper

Pith reviewed 2026-07-03 03:17 UTC · model grok-4.3

classification 💻 cs.CE

keywords causal machine learningheterogeneous treatment effectsbuilding retrofitsDoubleMLmetalearnersenergy savingssimulation studyconfounding

0 comments

The pith

DoubleML produces the lowest-error estimates of energy savings from building retrofits by orthogonalizing treatment assignment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks several causal machine learning estimators on a simulation of building energy retrofits in which the true heterogeneous treatment effects and realistic adoption biases are known by construction. Predictive models trained only on observed outcomes produce biased estimates because household and building traits shape both retrofit uptake and energy use. DoubleML yields smaller errors than S-, T-, and X-learners, with the advantage clearest for complex envelope measures. The central mechanism is the explicit separation of the treatment-assignment model from the outcome model, which removes confounding that would otherwise leak into the effect estimates. These results supply a methodological template for evaluating large-scale retrofit programs where randomized trials are infeasible.

Core claim

When true treatment effects are known from a physically grounded simulation, DoubleML recovers heterogeneous energy-saving effects with lower error than metalearners; the performance gap widens for complex envelope retrofits because orthogonalization removes the correlation between treatment probability and outcome residuals.

What carries the argument

DoubleML, which fits separate models for treatment assignment and outcome then combines their residuals to estimate conditional average treatment effects.

If this is right

Policy evaluations of retrofit subsidies can obtain less biased heterogeneous-effect estimates without randomized assignment.
Orthogonalization is most valuable when treatment uptake correlates strongly with baseline energy use.
Simpler metalearners suffice for uniform interventions but degrade on heterogeneous envelope measures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same simulation-plus-benchmark design could test causal estimators in other domains where adoption is driven by the same variables that determine outcomes, such as health interventions or educational programs.
If real retrofit data exhibit stronger unmeasured confounding than the simulation allows, the reported ranking of methods may reverse.
Embedding the best-performing estimator inside an optimization loop could directly target subsidy allocations that maximize total energy reduction under budget constraints.

Load-bearing premise

The simulation reproduces the actual selection biases and physical energy responses that occur in real retrofit decisions.

What would settle it

Re-running the same estimators on a new simulation in which treatment assignment follows a different functional form from the one used to train the nuisance models, or on matched real-world utility data with measured post-retrofit consumption, would show whether DoubleML retains its error advantage.

read the original abstract

Information Systems research increasingly relies on machine learning (ML) to predict outcomes in complex sociotechnical systems, yet predictive models are not designed to identify causal effects. This limitation is particularly critical in building retrofits, where unbiased estimates of energy savings are essential for climate policy and investment decisions. Because retrofit adoption is shaped by household and building characteristics that also affect energy consumption, predictive ML can yield biased effect estimates. This paper systematically benchmarks leading causal ML estimators, including metalearners (S-, T- and X-Learners) and DoubleML across multiple retrofit interventions. To enable this comparison, we construct a physically grounded simulation in which true treatment effects and realistic adoption biases are known. Results show that DoubleML achieves the lowest estimation errors, particularly for complex envelope retrofits. These findings demonstrate that orthogonalising the treatment assignment improves causal effect estimation and provides a methodological foundation for large-scale energy retrofit and policy evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Simulation benchmark finds DoubleML lowest error on retrofit effects, but the ranking rests on an unvalidated generative model with no real-data check.

read the letter

The main takeaway is that this is a domain application of existing causal ML tools rather than a methodological advance. They build a simulation with known ground truth for treatment effects and selection bias, then compare S-, T-, X-learners against DoubleML across several retrofit types. DoubleML comes out ahead, especially on envelope measures.

What the paper does is straightforward: it shows how orthogonalization helps when treatment assignment correlates with the outcome in a building context. The simulation is physically grounded, which is better than purely synthetic data, and the benchmark is run cleanly enough to produce a clear ranking.

The soft spot is the lack of any external validation. The treatment propensity and potential outcome surfaces come from the authors' own model. If those surfaces miss real-world selection patterns or heterogeneity that appear in actual retrofit programs, the ranking could flip under a different but plausible DGP. No comparison to observed uptake or measured savings is reported, so the claim that DoubleML is reliably superior stays conditional on the simulation matching reality.

The work is useful for researchers who already work on energy policy evaluation and want a starting point for causal estimation in this setting. It is not required reading for causal ML methodologists. A serious referee should see it because the application matters for climate decisions and the simulation approach is replicable; the main questions will be about how sensitive the results are to the generative assumptions.

Referee Report

2 major / 1 minor

Summary. The paper claims that causal ML estimators can be systematically benchmarked for heterogeneous treatment effect estimation in building energy retrofits by constructing a physically grounded simulation with known ground-truth effects and realistic adoption biases; results indicate DoubleML yields the lowest estimation errors (especially for complex envelope retrofits) compared to S-, T-, and X-Learners, offering a methodological foundation for unbiased policy evaluation.

Significance. If the simulation's generative assumptions hold, the work would provide actionable guidance for estimator selection in energy policy and IS research on sociotechnical systems, highlighting the value of orthogonalization for reducing bias in observational settings with selection on observables.

major comments (2)

[Abstract] Abstract: the central claim that DoubleML achieves the lowest estimation errors rests on a simulation whose construction, data generation process, treatment propensity model, potential-outcome surfaces, and error metrics are not described; without these details it is impossible to determine whether the reported ranking is robust or an artifact of the authors' specific DGP.
[Abstract] The paper reports no external validation or calibration of the simulated treatment assignment and outcome models against real retrofit uptake data or measured energy savings; because the estimator comparison occurs entirely inside this single unvalidated DGP, any mismatch in propensity scores or covariate–outcome relationships directly determines whether DoubleML remains superior.

minor comments (1)

[Abstract] The abstract refers to 'multiple retrofit interventions' without stating how many or which ones were simulated, which would help readers assess the scope of the benchmark.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed comments on transparency and validation of the simulation. We respond to each point below and indicate planned revisions.

read point-by-point responses

Referee: [Abstract] Abstract: the central claim that DoubleML achieves the lowest estimation errors rests on a simulation whose construction, data generation process, treatment propensity model, potential-outcome surfaces, and error metrics are not described; without these details it is impossible to determine whether the reported ranking is robust or an artifact of the authors' specific DGP.

Authors: The abstract is intentionally concise, but the manuscript fully specifies the simulation in Sections 3 and 4: the physically grounded building energy model (EnergyPlus-based), the DGP for covariates and outcomes, the logistic propensity model incorporating realistic adoption biases from the literature, the heterogeneous treatment effect surfaces for different retrofit types, and the error metrics (RMSE, MAE, and bias). The reported ranking is therefore not an artifact but the result of this controlled design. We will revise the abstract to include one sentence referencing the physically grounded DGP and known ground truth. revision: partial
Referee: [Abstract] The paper reports no external validation or calibration of the simulated treatment assignment and outcome models against real retrofit uptake data or measured energy savings; because the estimator comparison occurs entirely inside this single unvalidated DGP, any mismatch in propensity scores or covariate–outcome relationships directly determines whether DoubleML remains superior.

Authors: We acknowledge that the paper does not include external calibration against real retrofit uptake or savings data. The simulation is instead constructed from first-principles physical models and empirically motivated selection mechanisms to provide known ground-truth effects, which is the standard approach for isolating estimator performance when real-world ground truth is unavailable. We disagree that this invalidates the comparative ranking within the stated scope; however, we will add an explicit limitations paragraph discussing generalizability and the desirability of future real-data checks. revision: partial

Circularity Check

0 steps flagged

No significant circularity; simulation-based benchmarking is self-contained against external benchmarks.

full rationale

The paper constructs a physically grounded simulation to generate known ground-truth treatment effects and biases, then evaluates metalearners and DoubleML on that data. This is a standard methodological approach for causal estimator comparison when real-world ground truth is unavailable. No step reduces a claimed result to a self-definition, a fitted parameter renamed as prediction, or a self-citation chain; the ranking of estimators is measured against the simulation's independent generative process rather than being forced by the paper's own equations. The derivation chain therefore remains non-circular.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are described.

pith-pipeline@v0.9.1-grok · 5688 in / 1043 out tokens · 38765 ms · 2026-07-03T03:17:25.470133+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

20 extracted references · 2 canonical work pages · 1 internal anchor

[1]

This limitation is particularly critical in building retrofits, where unbiased estimates of energy savings are essential for climate policy and investment decisions

Thirty-Fourth European Conference on Information Systems (ECIS 2026), Milan, Italy 1 PREDICTING HETEROGENEOUS TREATMENT EFFECTS OF BUILDING ENERGY SAVING RETROFITS USING CAUSAL MACHINE LEARNING Completed Research Paper Kevin Zalipski, Paderborn University, Paderborn, Germany, kevin.zalipski@upb.de David Zapata Gonzalez, Paderborn University, Paderborn, Ge...

2026
[2]

This tool is straightforward to use and can provide reasonable results when assessing retrofit measures (Dermentzis et al., 2019; Moran et al., 2014)

or simplified algorithms such as the Passive House Planning Package (PHPP) (Passive House Institute, 2015), which is based on Microsoft Excel. This tool is straightforward to use and can provide reasonable results when assessing retrofit measures (Dermentzis et al., 2019; Moran et al., 2014). The second category includes studies that employ detailed dynam...

2015
[3]

and its graphical front-ends (Beagon et al., 2020; Chuah et al., 2013; Crawley et al., 2001; Garg et al., 2020). These provide finer physical detail and modelling of buildings, potentially resulting in more accurate representations of building performance, but also require a higher level of expertise, greater effort and thus significantly more time settin...

2020
[4]

fundamental problem of causal inference

and often fail to capture the underlying causal structure of the data-generating process. As a result, they are vulnerable to overfitting and confounding bias, particularly in cases where variables simultaneously influence both the treatment (e.g., retrofit measures) and the outcome (e.g., energy consumption). For example, houses located in colder regions...

1974
[5]

true CATE

House characteristics per class; for the U-values, the first value represents the pre-retrofitting status, while the second value indicates the post-retrofitting value. Causal Effects of Energy Retrofits Thirty-Fourth European Conference on Information Systems (ECIS 2026), Milan, Italy 6 Table 1 summarises the parameters used per building period class. Fo...

2026
[6]

Treatment # Houses Rel

DAG showing the confounding present in the data. Treatment # Houses Rel. # Houses Effect in kWh Location (% in cold climate) Income in USD Construction year No retrofit 11,865 0.237 0.0 (0.0) 44.2 40,880 (10,887) 1977.4 (14.8) Windows only 6,256 0.125 -2,020.4 (1,306.7) 50.9 53,448 (6,904) 1966.5 (19.4) Windows + roof 9,300 0.186 -5,975.5 (4,196.9) 49.2 5...

1977
[7]

Statistics for the different treatments; the columns regarding the effect, location, income and construction year show the mean and the standard deviation in parentheses. As visible in Table 2, more thorough retrofits (exterior wall only, and windows, roof and exterior wall) tend to occur in older buildings, in colder locations and in households with high...

2026
[8]

Figure 2 displays the one-vs-rest propensity score distributions (histograms with density overlays) for each treatment

Overlap assumption per treatment. Figure 2 displays the one-vs-rest propensity score distributions (histograms with density overlays) for each treatment. Each plot compares the treated group with the corresponding control group (gray). Across all four panels, the distributions show substantial overlap across a broad range of propensity scores, with no tre...

2015
[9]

At prediction time, the CATE is obtained by subtracting the predicted control outcome from the predicted treated outcome for the same feature vector (Künzel et al., 2019)

A separate model is then trained on each subset to learn the outcome function for that group. At prediction time, the CATE is obtained by subtracting the predicted control outcome from the predicted treated outcome for the same feature vector (Künzel et al., 2019). This approach ensures that treatment-specific outcome patterns are captured, even when the ...

2019
[10]

The previously mentioned GitHub repository also contains the implementation code for the training and evaluation of the models. The X-Learner and the DoubleML models are implemented via the Python EconML Causal Effects of Energy Retrofits Thirty-Fourth European Conference on Information Systems (ECIS 2026), Milan, Italy 10 library (Battocchi et al.,

2026
[11]

Parameter Values # Estimators 50, 100, 150 Max

and all models use a Random Forest as the base learner, implemented via the Python sklearn library (Pedregosa et al., 2011). Parameter Values # Estimators 50, 100, 150 Max. depth 5, 10, - Min. samples leaf 5, 10, 15 Max. features -, sqrt, log2 Table

2011
[12]

However, the ranking is not uniform across individual treatments

MAE for the different model and treatment combinations; best model based on the MAE in bold; normalized MAE in parentheses. However, the ranking is not uniform across individual treatments. For the smallest retrofit (windows only), the X-Learner is most accurate (376.60; nMAE: 18.3%), followed closely by the T-Learner with an MAE of 395.58 (nMAE: 19.2%). ...

2026
[13]

At the smallest sample size, all models exhibit high errors and DoubleML in particular requires more data to outperform the metalearners

shows that model performance improves substantially as sample sizes increase from 1,000 to 10,000 total observations, with diminishing returns beyond 30,000. At the smallest sample size, all models exhibit high errors and DoubleML in particular requires more data to outperform the metalearners. From approximately 10,000 observations onward (corresponding ...

2033
[14]

Average MAE across all treatments for different sample sizes; best model in bold. A SHAP analysis of the individual model predictions reveals that confounding variables, especially construction year, remain influential in the metalearners' CATE estimates, whereas DoubleML Causal Effects of Energy Retrofits Thirty-Fourth European Conference on Information ...

2026
[15]

SHAP beeswarm plots for the comprehensive treatment across all models. 5 Discussion The results demonstrate that causal ML estimators are not only theoretically sound but also effective within our simulation-based evaluation for estimating retrofit-related energy savings under realistic treatment bias. The large gap between OLS and the more advanced causa...

2025
[16]

and trim observations in regions with poor overlap to improve validity. SUTVA is likely to hold in retrofit settings, as a building's energy consumption is generally unaffected by whether neighbouring buildings are retrofitted and treatments are typically well-defined through specific material and component specifications. The importance of comprehensive ...

2024
[17]

Chen, S., Zhang, G., Xia, X., Chen, Y., Setunge, S., & Shi, L. (2021). The impacts of occupant behavior on building energy consumption: A review. Sustainable Energy Technologies and Assessments, 45, 101212. Chen, X., Abualdenien, J., Singh, M. M., Borrmann, A., & Geyer, P. (2022). Introducing causal inference in the energy-efficient building design proces...

work page internal anchor Pith review Pith/arXiv arXiv 2021
[18]

Ma, Z., Cooper, P., Daly, D., & Ledo, L. (2012). Existing building retrofits: Methodology and state-of-the-art. Energy and Buildings, 55, 889–902. Molak, A. (2023). Causal Inference and Discovery in Python: Unlock the secrets of modern causal machine learning with DoWhy, EconML, PyTorch and more. Packt Publishing Ltd. Moran, F., Blight, T., Natarajan, S.,...

2012
[19]

Rubin, D. B. (1980). Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75(371), 591–593. Causal Effects of Energy Retrofits Thirty-Fourth European Conference on Information Systems (ECIS 2026), Milan, Italy 17 Saffari, M., & Beagon, P. (2022). Home energy retrofit: Reviewin...

work page doi:10.1007/s12599-026-00999-x 1980
[20]

building class periods spanning 1850-1994, with eight distinct classes (EFH A through EFH H) • Air density * specific thermal air capacity: 0.34 • Full load hours: 1130 • Norm temperatures: Outside: -12.0 for cold and -5.0 for moderate climates; Inside: 20

1994

[1] [1]

This limitation is particularly critical in building retrofits, where unbiased estimates of energy savings are essential for climate policy and investment decisions

Thirty-Fourth European Conference on Information Systems (ECIS 2026), Milan, Italy 1 PREDICTING HETEROGENEOUS TREATMENT EFFECTS OF BUILDING ENERGY SAVING RETROFITS USING CAUSAL MACHINE LEARNING Completed Research Paper Kevin Zalipski, Paderborn University, Paderborn, Germany, kevin.zalipski@upb.de David Zapata Gonzalez, Paderborn University, Paderborn, Ge...

2026

[2] [2]

This tool is straightforward to use and can provide reasonable results when assessing retrofit measures (Dermentzis et al., 2019; Moran et al., 2014)

or simplified algorithms such as the Passive House Planning Package (PHPP) (Passive House Institute, 2015), which is based on Microsoft Excel. This tool is straightforward to use and can provide reasonable results when assessing retrofit measures (Dermentzis et al., 2019; Moran et al., 2014). The second category includes studies that employ detailed dynam...

2015

[3] [3]

and its graphical front-ends (Beagon et al., 2020; Chuah et al., 2013; Crawley et al., 2001; Garg et al., 2020). These provide finer physical detail and modelling of buildings, potentially resulting in more accurate representations of building performance, but also require a higher level of expertise, greater effort and thus significantly more time settin...

2020

[4] [4]

fundamental problem of causal inference

and often fail to capture the underlying causal structure of the data-generating process. As a result, they are vulnerable to overfitting and confounding bias, particularly in cases where variables simultaneously influence both the treatment (e.g., retrofit measures) and the outcome (e.g., energy consumption). For example, houses located in colder regions...

1974

[5] [5]

true CATE

House characteristics per class; for the U-values, the first value represents the pre-retrofitting status, while the second value indicates the post-retrofitting value. Causal Effects of Energy Retrofits Thirty-Fourth European Conference on Information Systems (ECIS 2026), Milan, Italy 6 Table 1 summarises the parameters used per building period class. Fo...

2026

[6] [6]

Treatment # Houses Rel

DAG showing the confounding present in the data. Treatment # Houses Rel. # Houses Effect in kWh Location (% in cold climate) Income in USD Construction year No retrofit 11,865 0.237 0.0 (0.0) 44.2 40,880 (10,887) 1977.4 (14.8) Windows only 6,256 0.125 -2,020.4 (1,306.7) 50.9 53,448 (6,904) 1966.5 (19.4) Windows + roof 9,300 0.186 -5,975.5 (4,196.9) 49.2 5...

1977

[7] [7]

Statistics for the different treatments; the columns regarding the effect, location, income and construction year show the mean and the standard deviation in parentheses. As visible in Table 2, more thorough retrofits (exterior wall only, and windows, roof and exterior wall) tend to occur in older buildings, in colder locations and in households with high...

2026

[8] [8]

Figure 2 displays the one-vs-rest propensity score distributions (histograms with density overlays) for each treatment

Overlap assumption per treatment. Figure 2 displays the one-vs-rest propensity score distributions (histograms with density overlays) for each treatment. Each plot compares the treated group with the corresponding control group (gray). Across all four panels, the distributions show substantial overlap across a broad range of propensity scores, with no tre...

2015

[9] [9]

At prediction time, the CATE is obtained by subtracting the predicted control outcome from the predicted treated outcome for the same feature vector (Künzel et al., 2019)

A separate model is then trained on each subset to learn the outcome function for that group. At prediction time, the CATE is obtained by subtracting the predicted control outcome from the predicted treated outcome for the same feature vector (Künzel et al., 2019). This approach ensures that treatment-specific outcome patterns are captured, even when the ...

2019

[10] [10]

The previously mentioned GitHub repository also contains the implementation code for the training and evaluation of the models. The X-Learner and the DoubleML models are implemented via the Python EconML Causal Effects of Energy Retrofits Thirty-Fourth European Conference on Information Systems (ECIS 2026), Milan, Italy 10 library (Battocchi et al.,

2026

[11] [11]

Parameter Values # Estimators 50, 100, 150 Max

and all models use a Random Forest as the base learner, implemented via the Python sklearn library (Pedregosa et al., 2011). Parameter Values # Estimators 50, 100, 150 Max. depth 5, 10, - Min. samples leaf 5, 10, 15 Max. features -, sqrt, log2 Table

2011

[12] [12]

However, the ranking is not uniform across individual treatments

MAE for the different model and treatment combinations; best model based on the MAE in bold; normalized MAE in parentheses. However, the ranking is not uniform across individual treatments. For the smallest retrofit (windows only), the X-Learner is most accurate (376.60; nMAE: 18.3%), followed closely by the T-Learner with an MAE of 395.58 (nMAE: 19.2%). ...

2026

[13] [13]

At the smallest sample size, all models exhibit high errors and DoubleML in particular requires more data to outperform the metalearners

shows that model performance improves substantially as sample sizes increase from 1,000 to 10,000 total observations, with diminishing returns beyond 30,000. At the smallest sample size, all models exhibit high errors and DoubleML in particular requires more data to outperform the metalearners. From approximately 10,000 observations onward (corresponding ...

2033

[14] [14]

Average MAE across all treatments for different sample sizes; best model in bold. A SHAP analysis of the individual model predictions reveals that confounding variables, especially construction year, remain influential in the metalearners' CATE estimates, whereas DoubleML Causal Effects of Energy Retrofits Thirty-Fourth European Conference on Information ...

2026

[15] [15]

SHAP beeswarm plots for the comprehensive treatment across all models. 5 Discussion The results demonstrate that causal ML estimators are not only theoretically sound but also effective within our simulation-based evaluation for estimating retrofit-related energy savings under realistic treatment bias. The large gap between OLS and the more advanced causa...

2025

[16] [16]

and trim observations in regions with poor overlap to improve validity. SUTVA is likely to hold in retrofit settings, as a building's energy consumption is generally unaffected by whether neighbouring buildings are retrofitted and treatments are typically well-defined through specific material and component specifications. The importance of comprehensive ...

2024

[17] [17]

Chen, S., Zhang, G., Xia, X., Chen, Y., Setunge, S., & Shi, L. (2021). The impacts of occupant behavior on building energy consumption: A review. Sustainable Energy Technologies and Assessments, 45, 101212. Chen, X., Abualdenien, J., Singh, M. M., Borrmann, A., & Geyer, P. (2022). Introducing causal inference in the energy-efficient building design proces...

work page internal anchor Pith review Pith/arXiv arXiv 2021

[18] [18]

Ma, Z., Cooper, P., Daly, D., & Ledo, L. (2012). Existing building retrofits: Methodology and state-of-the-art. Energy and Buildings, 55, 889–902. Molak, A. (2023). Causal Inference and Discovery in Python: Unlock the secrets of modern causal machine learning with DoWhy, EconML, PyTorch and more. Packt Publishing Ltd. Moran, F., Blight, T., Natarajan, S.,...

2012

[19] [19]

Rubin, D. B. (1980). Randomization analysis of experimental data: The Fisher randomization test comment. Journal of the American Statistical Association, 75(371), 591–593. Causal Effects of Energy Retrofits Thirty-Fourth European Conference on Information Systems (ECIS 2026), Milan, Italy 17 Saffari, M., & Beagon, P. (2022). Home energy retrofit: Reviewin...

work page doi:10.1007/s12599-026-00999-x 1980

[20] [20]

building class periods spanning 1850-1994, with eight distinct classes (EFH A through EFH H) • Air density * specific thermal air capacity: 0.34 • Full load hours: 1130 • Norm temperatures: Outside: -12.0 for cold and -5.0 for moderate climates; Inside: 20

1994