
arxiv: 1906.12179 · v1 · pith:DKGDZ46Gnew · submitted 2019-06-28 · 📊 stat.ML · cs.LG

Causal Regularization

Pith reviewed 2026-05-25 13:46 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords causal · confounding · model · regression · bound · data · first · generalization

The pith

Regularization in regression reduces confounding effects in causal models even in the population limit, and a causal generalization bound limits the error of treating non-linear regressions as causal under a symmetric confounder model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard regression adds a penalty term to stop models from fitting noise in small datasets. This paper claims the same penalty can also make the fitted function closer to the true causal effect when an unobserved variable influences both the inputs and the target. In the linear case the penalty shrinks the influence of the confounder. Choosing the penalty strength is done by first estimating how strong the confounding is, using an earlier technique from the same author. For non-linear functions the paper proves an upper bound on how far the observational fit can be from the interventional causal effect, provided the function class is restricted. The bound exists only because the assumed confounder model has certain symmetry properties that link observational and interventional distributions.
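The linear-case claim can be checked by hand in a small population-level sketch. The toy model below (Z standard normal, X = Z M, Y = X a + Z c) and the names `population_ridge` and `causal_error` are illustrative assumptions, not necessarily the paper's exact setup. No data are sampled: the covariances are written down directly, so the only effect left is confounding.

```python
import numpy as np

def population_ridge(sigma_xx, sigma_xy, lam):
    """Minimizer of E[(Y - X a')^2] + lam * ||a'||^2 in the population limit."""
    d = sigma_xx.shape[0]
    return np.linalg.solve(sigma_xx + lam * np.eye(d), sigma_xy)

def causal_error(M, a, c, lam):
    """Squared distance between the population ridge vector and the causal vector a.

    Assumed toy model (hypothetical, for illustration only):
    Z ~ N(0, I), X = Z @ M, Y = X @ a + Z @ c.
    """
    sigma_xx = M.T @ M                  # Cov(X)
    sigma_xy = sigma_xx @ a + M.T @ c   # Cov(X, Y): causal part plus confounding part
    a_lam = population_ridge(sigma_xx, sigma_xy, lam)
    return float(np.sum((a_lam - a) ** 2))

# Hand-checkable case: X = Z, causal vector and confounding vector orthogonal.
M = np.eye(2)
a = np.array([1.0, 0.0])
c = np.array([0.0, 1.0])

err_ols = causal_error(M, a, c, lam=0.0)    # unregularized regression returns a + c
err_ridge = causal_error(M, a, c, lam=1.0)  # ridge returns (a + c) / 2
print(err_ols, err_ridge)
```

In this particular case the unregularized solution is a + c, so its squared causal error is 1, while the penalty λ = 1 halves the regression vector and brings the error down to 0.5. Other choices of M, a and c need not show an improvement, which matches the paper's wording that regularization "sometimes" helps.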

Core claim

Regularizing terms in standard regression methods not only help against overfitting finite data, but sometimes also yield better causal models in the infinite sample regime. The error made by interpreting any non-linear regression as causal model can be bounded from above whenever functions are taken from a not too rich class.
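The shape of the bound, as far as it can be read off the Figure 5 caption on this page, can be written as follows. The risk symbols $R_{\mathrm{do}}$ and $R_{\mathrm{obs}}$ are notation assumed here for the error of $f$ under the interventional and the observational distribution; only $C(\mathcal{F})$ appears on the page, and its explicit form is not given here.

```latex
\[
  R_{\mathrm{do}}(f) \;\le\; R_{\mathrm{obs}}(f) + C(\mathcal{F})
  \qquad \text{for all } f \in \mathcal{F}.
\]
```

Read this way, the statement parallels a standard generalization bound, with the interventional risk in place of the expected risk and a term depending on the function class $\mathcal{F}$ in place of a capacity term.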

Load-bearing premise

The benefit of regularization for causal models and the causal generalization bound both rest on a particular model of confounding whose symmetries allow generalization from observational to interventional distributions.

Figures

Figures reproduced from arXiv: 1906.12179 by Dominik Janzing.

Figure 1
Figure 1: Left: In scenario 1, the empirical correlations between X and E are only finite sample effects. Right: In scenario 2, X and E are correlated due to their common cause Z. We sample the structural parameters M and c from distributions in a way that entails a simple analogy between scenario 1 and 2. Adjacent body text: … of (2) reads $\tilde{a} = \operatorname{argmin}_{a'} \{\|Y - Xa'\|^2\} = X^{-1}Y = a + X^{-1}E$ (7), where the square length is induced by the inner …
Figure 2
Figure 2: Results for Ridge (top) and Lasso (bottom) regression with ConCorr (left) versus cross-validated version (right) for the unconfounded case where artifacts are only due to overfitting. The results are roughly the same.
Figure 3
Figure 3: Results for Ridge (top) and Lasso (bottom) regression with ConCorr (left) versus cross-validated version (right) for the confounded case with large sample size where artifacts are almost only due to confounding. The results are roughly comparable, if we abstain from over-interpretations. In the regime where the unregularized relative squared error is around 1/2, all 4 methods yield errors that are most of …
Figure 4
Figure 4: Results for Ridge (left) and Lasso (right) regression for the data from the optical device in Janzing & Schölkopf (2018). The y-axis is the relative squared error achieved by ConCorr, while the x-axis is the cross-validated baseline. Adjacent body text: … the confounded large sample regime are pointless since our theory states the equivalence of scenario 1 and 2. We show the experiments nevertheless for two reasons. First, it…
Figure 5
Figure 5: Our confounding scenario: the high-dimensional common cause Z influences Y in a linear additive way, while the influence on X is arbitrary. Adjacent body text: error of f w.r.t. interventional distribution ≤ error of f w.r.t. observational distribution + C(F)
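The fragment of equation (7) captured with the Figure 1 caption says that the unregularized regression vector equals the true vector a plus a noise-dependent term. A minimal numerical check of that decomposition is sketched below. Reading X⁻¹ as the Moore-Penrose pseudo-inverse is an assumption, as are the sample size, the dimension and the Gaussian draws.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5                      # assumed sizes, chosen only for illustration

X = rng.normal(size=(n, d))       # design matrix (full column rank almost surely)
a = rng.normal(size=d)            # true structural vector
E = rng.normal(size=n)            # noise, independent of X (scenario 1)
Y = X @ a + E

# Unregularized least-squares solution, i.e. argmin over a' of ||Y - X a'||^2.
a_tilde, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Decomposition from equation (7), with X^{-1} taken as the pseudo-inverse.
a_decomposed = a + np.linalg.pinv(X) @ E

print(np.max(np.abs(a_tilde - a_decomposed)))
```

The second term, `np.linalg.pinv(X) @ E`, is the whole estimation error. In scenario 1 it arises from finite-sample correlations between X and E; in scenario 2 the analogous term comes from the common cause Z.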
Original abstract

I argue that regularizing terms in standard regression methods not only help against overfitting finite data, but sometimes also yield better causal models in the infinite sample regime. I first consider a multi-dimensional variable linearly influencing a target variable with some multi-dimensional unobserved common cause, where the confounding effect can be decreased by keeping the penalizing term in Ridge and Lasso regression even in the population limit. Choosing the size of the penalizing term is, however, challenging, because cross validation is pointless. Here it is done by first estimating the strength of confounding via a method proposed earlier, which yielded some reasonable results for simulated and real data. Further, I prove a 'causal generalization bound' which states (subject to a particular model of confounding) that the error made by interpreting any non-linear regression as causal model can be bounded from above whenever functions are taken from a not too rich class. In other words, the bound guarantees "generalization" from observational to interventional distributions, which is usually not subject of statistical learning theory (and is only possible due to the underlying symmetries of the confounder model).
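Why cross validation is pointless in the population limit can be seen in a small sketch: held-out data come from the same observational distribution, so the penalty that predicts best is the one that fits the confounded relation best. The toy model (X = Z, Y = X a + Z c), the penalty grid and all variable names are assumptions made for illustration.

```python
import numpy as np

# Assumed toy population model: Z ~ N(0, I_2), X = Z, Y = X @ a + Z @ c.
a = np.array([1.0, 0.0])            # causal vector
c = np.array([0.0, 1.0])            # confounding vector
sigma_xx = np.eye(2)                # Cov(X)
sigma_xy = sigma_xx @ a + c         # Cov(X, Y)
var_y = float((a + c) @ (a + c))    # Var(Y), since Y = Z @ (a + c)

lams = np.array([0.0, 0.5, 1.0, 2.0, 4.0])
obs_risk, causal_err = [], []
for lam in lams:
    a_lam = np.linalg.solve(sigma_xx + lam * np.eye(2), sigma_xy)
    # Observational risk E[(Y - X a_lam)^2]: what cross validation estimates.
    obs_risk.append(var_y - 2 * a_lam @ sigma_xy + a_lam @ sigma_xx @ a_lam)
    # Squared distance to the causal vector: what we actually care about.
    causal_err.append(np.sum((a_lam - a) ** 2))

best_for_prediction = lams[int(np.argmin(obs_risk))]
best_for_causality = lams[int(np.argmin(causal_err))]
print(best_for_prediction, best_for_causality)
```

On this grid the observational risk is smallest at λ = 0, while the causal error is smallest at λ = 1, so a criterion based on held-out prediction error would discard exactly the penalty that helps causally.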

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that regularization terms in regression (Ridge and Lasso) can reduce confounding bias and yield better causal models even in the infinite-sample limit under a specific multi-dimensional linear model with unobserved common causes. Penalty size is chosen by estimating confounding strength via a prior method by the same author. It further proves a causal generalization bound showing that the error of treating non-linear regressions as causal can be upper-bounded for sufficiently restricted function classes, due to symmetries in the confounder model that map observational to interventional distributions.

Significance. If the derivations hold, the results would be notable for demonstrating a population-level causal benefit from regularization and for providing an explicit bound on causal generalization error under the assumed confounding structure. The work highlights how model symmetries enable observational-to-interventional transfer, which is outside standard statistical learning theory. Strengths include the attempt to address infinite-sample behavior and the explicit dependence on a concrete confounder model.

major comments (3)
  1. [Linear case (population limit)] The linear population-limit analysis (following the multi-dimensional confounding setup in the abstract): the bias reduction from retaining the penalty term in Ridge/Lasso is derived under one specific model of unobserved common cause whose symmetries enable the observational-to-interventional mapping; the paper provides no argument that this model is representative or that results degrade gracefully outside it, which is load-bearing for the central claim of causal improvement at n=∞.
  2. [Penalty selection method] Penalty-size selection (described after the linear case): the size is chosen by estimating confounding strength via the author's earlier method, which itself presupposes the same confounding structure; this creates a dependence whose independence is not established, undermining the claim that the procedure yields reasonable results on simulated and real data.
  3. [Causal generalization bound] Causal generalization bound (non-linear section): the bound is stated to hold subject to the particular confounder model; no robustness analysis or sensitivity to misspecification is provided, and the 'not too rich class' of functions is not given an explicit characterization (e.g., via capacity measure) that would allow verification of the bound's tightness.
minor comments (2)
  1. [Abstract] Abstract: the phrase 'yielded some reasonable results for simulated and real data' is vague and should reference specific quantitative metrics, figures, or tables.
  2. [Abstract] Notation for multi-dimensional variables and the unobserved common cause is introduced without explicit symbols in the abstract, making the setup harder to follow before the main text.
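To make the penalty-selection step discussed in major comment 2 concrete, here is a hedged sketch of how an externally supplied confounding-strength estimate could be turned into a penalty. The rule used (shrink the squared length of the regression vector by the factor 1 − β) is an assumption made for illustration and is not confirmed by this page; `beta`, `ridge_vector` and `pick_lambda` are hypothetical names.

```python
import numpy as np

def ridge_vector(sigma_xx, sigma_xy, lam):
    """Ridge solution for given covariance matrices and penalty lam."""
    d = sigma_xx.shape[0]
    return np.linalg.solve(sigma_xx + lam * np.eye(d), sigma_xy)

def pick_lambda(sigma_xx, sigma_xy, beta, lam_max=1e6, iters=200):
    """Bisection for lam with ||a_lam||^2 = (1 - beta) * ||a_0||^2.

    Assumed rule (hypothetical): beta in [0, 1] is an estimate of confounding
    strength obtained elsewhere; sigma_xx must be positive definite so that
    the squared length of a_lam decreases monotonically in lam.
    """
    target = (1.0 - beta) * np.sum(ridge_vector(sigma_xx, sigma_xy, 0.0) ** 2)
    lo, hi = 0.0, lam_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if np.sum(ridge_vector(sigma_xx, sigma_xy, mid) ** 2) > target:
            lo = mid   # vector still too long: increase the penalty
        else:
            hi = mid   # vector short enough: decrease the penalty
    return 0.5 * (lo + hi)

# Toy inputs: identity covariance, Cov(X, Y) = (1, 1), assumed beta = 0.5.
lam = pick_lambda(np.eye(2), np.array([1.0, 1.0]), beta=0.5)
print(lam)
```

For these toy inputs the squared length is 2 / (1 + λ)², so the target length 1 is reached at λ = √2 − 1. The sketch also makes the referee's concern visible: the chosen penalty is only as good as the estimate passed in as `beta`.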

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the thorough review and valuable suggestions. Below we respond to each major comment, indicating where we will make revisions to the manuscript.

Point-by-point responses
  1. Referee: [Linear case (population limit)] The linear population-limit analysis (following the multi-dimensional confounding setup in the abstract): the bias reduction from retaining the penalty term in Ridge/Lasso is derived under one specific model of unobserved common cause whose symmetries enable the observational-to-interventional mapping; the paper provides no argument that this model is representative or that results degrade gracefully outside it, which is load-bearing for the central claim of causal improvement at n=∞.

    Authors: The results are derived specifically for the multi-dimensional linear model with unobserved common causes that has the required symmetries for the observational-interventional mapping. This model is not claimed to be representative of all possible confounding structures; the contribution lies in showing that regularization can have a causal benefit even at infinite samples under this setup. We will revise the manuscript to explicitly state that the analysis is model-specific and that generalization to other models is an open question. revision: partial

  2. Referee: [Penalty selection method] Penalty-size selection (described after the linear case): the size is chosen by estimating confounding strength via the author's earlier method, which itself presupposes the same confounding structure; this creates a dependence whose independence is not established, undermining the claim that the procedure yields reasonable results on simulated and real data.

    Authors: We agree that the penalty selection method depends on the confounding estimation procedure from our prior work, which assumes the same structure. This is a feature of the approach rather than a flaw, as the estimation is used to tune the regularization for the assumed model. The simulated and real data experiments are conducted under this framework and show reasonable performance. We will update the manuscript to clarify this dependence. revision: partial

  3. Referee: [Causal generalization bound] Causal generalization bound (non-linear section): the bound is stated to hold subject to the particular confounder model; no robustness analysis or sensitivity to misspecification is provided, and the 'not too rich class' of functions is not given an explicit characterization (e.g., via capacity measure) that would allow verification of the bound's tightness.

    Authors: The causal generalization bound is indeed derived under the particular confounder model, consistent with the manuscript's statements. We do not provide a robustness analysis to misspecification, which represents a limitation of the current work. For the function class, we describe it as 'not too rich' to ensure the bound holds due to the model symmetries; an explicit capacity measure characterization is not included. We will revise to add a more detailed discussion of the function class assumptions and note the absence of robustness checks. revision: partial

Circularity Check

1 step flagged

Penalty size selection relies on self-cited prior method for confounding estimation

specific steps
  1. self citation load bearing [Abstract]
    "Here it is done by first estimating the strength of confounding via a method proposed earlier, which yielded some reasonable results for simulated and real data."

    The size of the penalizing term (central to showing that regularization yields better causal models even at n=∞) is selected using an estimation procedure from the author's prior work. This makes the concrete application of the regularization benefit dependent on the independence and correctness of that self-referenced method, which is not re-derived or externally validated in the present paper.

full rationale

The paper explicitly derives the population-level benefit of retaining the penalty term (Ridge/Lasso) under a stated linear confounding model with unobserved common causes, and proves the causal generalization bound subject to a particular confounder model whose symmetries map observational to interventional distributions. These steps are self-contained given the model assumptions and do not reduce to self-citation. However, the practical choice of penalty size is performed by estimating confounding strength via a method proposed earlier by the same author. This introduces a load-bearing self-citation in the applied demonstration, though it is not required for the theoretical claims themselves. No other patterns (self-definitional, fitted inputs called predictions, uniqueness theorems, ansatz smuggling, or renaming) are present in the abstract or described derivations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on an assumed confounder model and on an earlier estimation procedure for confounding strength.

free parameters (1)
  • size of penalizing term
    Selected by estimating confounding strength via a prior method rather than cross-validation.
axioms (1)
  • domain assumption particular model of confounding with symmetries allowing observational-to-interventional generalization
    Invoked to support both the regularization benefit and the causal generalization bound.



Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 2 internal anchors

  1. Beutler, F. The operator theory of the pseudo-inverse I. Bounded operators. Journal of Mathematical Analysis and Applications, 10(3): 451–470, 1965.
  2. Chakrabarti, A. and Ghosh, J. AIC, BIC and recent advances in model selection. In Bandyopadhyay, P. and Forster, M. (eds.), Philosophy of Statistics, volume 7 of Handbook of the Philosophy of Science, pp. 583–605. North-Holland, Amsterdam, 2011.
  3. Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1): C1–C68, 2018.
  4. Dasgupta, S. and Gupta, A. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1): 60–65, 2003.
  5. Hastie, T., Tibshirani, R., and Friedman, J. The elements of statistical learning: Data mining, inference, and prediction. Springer-Verlag, New York, NY, 2001.
  6. Heinze-Deml, C. and Meinshausen, N. Conditional variance penalties and domain shift robustness. arXiv:1710.11469, 2017.
  7. Heinze-Deml, C., Peters, J., and Meinshausen, N. Invariant causal prediction for nonlinear models. Journal of Causal Inference, 6: 20170016, 2017.
  8. Hoerl, A. and Kennard, R. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 42(1): 80–86, 2000.
  9. Hoyer, P., Shimizu, S., Kerminen, A., and Palviainen, M. Estimation of causal effects using linear non-gaussian causal models with hidden variables. International Journal of Approximate Reasoning, 49(2): 362–378, 2008.
  10. Imbens, G. and Angrist, J. Identification and estimation of local average treatment effects. Econometrica, 62(2): 467–475, 1994.
  11. Janzing, D. and Schölkopf, B. Detecting confounding in multivariate linear models via spectral analysis. Journal of Causal Inference, 6(1), 2017.
  12. Janzing, D. and Schölkopf, B. Detecting non-causal artifacts in multivariate linear regression models. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018.
  13. Janzing, D., Peters, J., Mooij, J., and Schölkopf, B. Identifying latent confounders using additive noise models. In Ng, A. and Bilmes, J. (eds.), Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), pp. 249–257. AUAI Press, Corvallis, OR, USA, 2009.
  14. Lopez-Paz, D., Muandet, K., Schölkopf, B., and Tolstikhin, I. Towards a learning theory of cause-effect inference. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of JMLR Workshop and Conference Proceedings, pp. 1452–1461. JMLR, 2015.
  15. Newman, D. J., Hettich, S., Blake, C. L., and Merz, C. J. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
  16. Pearl, J. Causality. Cambridge University Press, 2000.
  17. Peters, J., Bühlmann, P., and Meinshausen, N. Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 78(5): 947–1012, 2016.
  18. Raskutti, G., Wainwright, M., and Yu, B. Early stopping for non-parametric regression: An optimal data-dependent stopping rule. In 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1318–1325, Sep. 2011.
  19. Rubin, D. Direct and indirect causal effects via potential outcomes. Scandinavian Journal of Statistics, 31: 161–170, 2004.
  20. Schölkopf, B. and Smola, A. Learning with kernels. MIT Press, Cambridge, MA, 2002.
  21. Shen, Z., Cui, P., Kuang, K., and Li, B. On image classification: Correlation v.s. causality. CoRR, abs/1708.06656, 2017. Listed by Pith under the title "Causally Regularized Learning with Agnostic Data Selection Bias".
  22. Tibshirani, R. and Wasserman, L. Course on Statistical Machine Learning, chapter: "Sparsity and the Lasso", 2015. http://www.stat.cmu.edu/~ryantibs/statml/
  23. Vapnik, V. Statistical learning theory. John Wiley & Sons, New York, 1998.
  24. Zhang, K., Huang, B., Zhang, J., Glymour, C., and Schölkopf, B. Causal discovery from nonstationary/heterogeneous data: Skeleton estimation and orientation determination. IJCAI: Proceedings of the Conference, pp. 1347–1353, 2017.