Causal Regularization

Dominik Janzing

arxiv: 1906.12179 · v1 · pith:DKGDZ46Gnew · submitted 2019-06-28 · 📊 stat.ML · cs.LG


Pith reviewed 2026-05-25 13:46 UTC · model grok-4.3


keywords: causal · confounding · model · regression · bound · data · first · generalization


The pith

Regularization in regression reduces confounding effects in causal models even in the population limit, and a causal generalization bound limits the error of treating non-linear regressions as causal under a symmetric confounder model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard regression adds a penalty term to stop models from fitting noise in small datasets. This paper claims the same penalty can also make the fitted function closer to the true causal effect when an unobserved variable influences both the inputs and the target. In the linear case the penalty shrinks the influence of the confounder. Choosing the penalty strength is done by first estimating how strong the confounding is, using an earlier technique from the same author. For non-linear functions the paper proves an upper bound on how far the observational fit can be from the interventional causal effect, provided the function class is restricted. The bound exists only because the assumed confounder model has certain symmetry properties that link observational and interventional distributions.
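The linear-case claim can be checked on a hand-sized example. The sketch below is not taken from the paper: it assumes a toy structural model x = Mᵀz, y = aᵀx + cᵀz with an unobserved z ~ N(0, I), and the values of `M`, `a`, `c` and the penalty `lam=1.0` are hand-picked for illustration. It works directly with population covariances, so any gap between the fitted and the causal coefficients comes from confounding, not from finite-sample noise.

```python
import numpy as np

def population_ridge(sigma_xx, sigma_xy, lam):
    """Ridge solution computed from population covariances (no sampling noise)."""
    d = sigma_xx.shape[0]
    return np.linalg.solve(sigma_xx + lam * np.eye(d), sigma_xy)

# Hypothetical toy model: x = M^T z, y = a^T x + c^T z, with z ~ N(0, I) unobserved.
M = np.diag([2.0, 0.5])     # how the confounder z enters x (illustrative values)
a = np.array([1.0, 0.0])    # true causal coefficients of x on y
c = np.array([1.0, 1.0])    # direct effect of the confounder z on y

sigma_xx = M.T @ M                  # Cov(x)
sigma_xy = sigma_xx @ a + M.T @ c   # Cov(x, y), including the confounding term

a_ols = population_ridge(sigma_xx, sigma_xy, lam=0.0)    # unpenalized population fit
a_ridge = population_ridge(sigma_xx, sigma_xy, lam=1.0)  # penalty kept at n = infinity

err_ols = np.sum((a_ols - a) ** 2)      # squared distance to the causal vector
err_ridge = np.sum((a_ridge - a) ** 2)
print(a_ols, err_ols)
print(a_ridge, err_ridge)
```

For these particular numbers the unpenalized solution is (1.5, 2.0) and the penalized one is (1.2, 0.4), against a causal vector of (1, 0). Other choices of `a` and `c` can make the penalty hurt, which is consistent with the benefit being tied to assumptions on how the structural parameters are generated.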

Core claim

Regularizing terms in standard regression methods not only help against overfitting finite data, but sometimes also yield better causal models in the infinite sample regime. The error made by interpreting any non-linear regression as a causal model can be bounded from above whenever functions are taken from a not too rich class.
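In symbols, the second claim has the shape quoted next to Figure 5. The notation below is assumed here for illustration and does not come from the paper: $R_{\mathrm{obs}}$ and $R_{\mathrm{do}}$ denote the risk of $f$ under the observational and the interventional distribution, squared loss is an assumption, and the summary does not say which quantifiers or probability statements accompany the inequality.

```latex
\[
  \underbrace{\mathbb{E}_{\mathrm{do}(X)}\!\left[(Y - f(X))^2\right]}_{R_{\mathrm{do}}(f)}
  \;\le\;
  \underbrace{\mathbb{E}\!\left[(Y - f(X))^2\right]}_{R_{\mathrm{obs}}(f)}
  \;+\; C(\mathcal{F}),
  \qquad f \in \mathcal{F}.
\]
```

The term $C(\mathcal{F})$ depends on the function class $\mathcal{F}$, which is why the bound is only informative when $\mathcal{F}$ is not too rich.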

Load-bearing premise

The benefit of regularization for causal models and the causal generalization bound both rest on a particular model of confounding whose symmetries allow generalization from observational to interventional distributions.

Figures

Figures reproduced from arXiv: 1906.12179 by Dominik Janzing.

**Figure 1.** Left: In scenario 1, the empirical correlations between X and E are only finite-sample effects. Right: In scenario 2, X and E are correlated due to their common cause Z. We sample the structural parameters M and c from distributions in a way that entails a simple analogy between scenarios 1 and 2.

Adjacent text: … of (2) reads $\tilde{a} = \operatorname{argmin}_{a'} \{\|Y - Xa'\|^2\} = X^{-1}Y = a + X^{-1}E$ (7), where the square length is induced by the inner …

**Figure 2.** Results for Ridge (top) and Lasso (bottom) regression with ConCorr (left) versus cross-validated version (right) for the unconfounded case where artifacts are only due to overfitting. The results are roughly the same.

**Figure 3.** Results for Ridge (top) and Lasso (bottom) regression with ConCorr (left) versus cross-validated version (right) for the confounded case with large sample size where artifacts are almost only due to confounding. The results are roughly comparable, if we abstain from over-interpretations. In the regime where the unregularized relative squared error is around 1/2, all 4 methods yield errors that are most of …

**Figure 4.** Results for Ridge (left) and Lasso (right) regression for the data from the optical device in Janzing & Schölkopf (2018). The y-axis is the relative squared error achieved by ConCorr, while the x-axis is the cross-validated baseline.

Adjacent text: … the confounded large sample regime are pointless since our theory states the equivalence of scenario 1 and 2. We show the experiments nevertheless for two reasons. First, it…

**Figure 5.** Our confounding scenario: the high-dimensional common cause Z influences Y in a linear additive way, while the influence on X is arbitrary.

Adjacent text: error of f w.r.t. interventional distribution ≤ error of f w.r.t. observational distribution + C(F)

Original abstract

I argue that regularizing terms in standard regression methods not only help against overfitting finite data, but sometimes also yield better causal models in the infinite sample regime. I first consider a multi-dimensional variable linearly influencing a target variable with some multi-dimensional unobserved common cause, where the confounding effect can be decreased by keeping the penalizing term in Ridge and Lasso regression even in the population limit. Choosing the size of the penalizing term is, however, challenging, because cross-validation is pointless. Here it is done by first estimating the strength of confounding via a method proposed earlier, which yielded some reasonable results for simulated and real data. Further, I prove a 'causal generalization bound' which states (subject to a particular model of confounding) that the error made by interpreting any non-linear regression as a causal model can be bounded from above whenever functions are taken from a not too rich class. In other words, the bound guarantees "generalization" from observational to interventional distributions, which is usually not a subject of statistical learning theory (and is only possible due to the underlying symmetries of the confounder model).
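The remark that cross-validation is pointless can be illustrated at the population level. Cross-validation estimates predictive risk under the observational distribution, and with unlimited data that risk is minimized by the unpenalized fit, so it would always select a vanishing penalty. The sketch below uses a hypothetical two-dimensional model (x = Mᵀz, y = aᵀx + cᵀz, z unobserved) with hand-picked numbers. The interventional risk is computed under the additional assumption that x is set independently of z with the same covariance as in the observational data.

```python
import numpy as np

# Hypothetical toy model: x = M^T z, y = a^T x + c^T z, z ~ N(0, I) unobserved.
M = np.diag([2.0, 0.5])
a = np.array([1.0, 0.0])    # causal coefficients
c = np.array([1.0, 1.0])    # confounder's direct effect on y

sigma_xx = M.T @ M
sigma_xy = sigma_xx @ a + M.T @ c
var_y = a @ sigma_xx @ a + 2 * a @ M.T @ c + c @ c

def ridge(lam):
    """Population ridge coefficients for penalty lam."""
    return np.linalg.solve(sigma_xx + lam * np.eye(2), sigma_xy)

def observational_risk(b):
    # E[(y - b^T x)^2] under the confounded joint distribution of (x, y)
    return var_y - 2 * b @ sigma_xy + b @ sigma_xx @ b

def interventional_risk(b):
    # E[(y - b^T x)^2] when x is set independently of z (same Cov(x) assumed)
    return (a - b) @ sigma_xx @ (a - b) + c @ c

lams = np.array([0.0, 0.25, 0.5, 1.0, 2.0, 4.0])
obs = np.array([observational_risk(ridge(l)) for l in lams])
itv = np.array([interventional_risk(ridge(l)) for l in lams])

best_obs = lams[np.argmin(obs)]   # what an ideal cross-validation would pick
best_itv = lams[np.argmin(itv)]   # what a causal criterion would pick
print(best_obs, best_itv)
```

On this grid the observational risk is smallest at zero penalty while the interventional risk is smallest at a strictly positive one, so a criterion based on held-out observational data cannot find the causally useful penalty.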

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows regularization can cut confounding bias even at infinite samples under one symmetric confounder model, plus a bound on observational-to-interventional error for restricted function classes.


The paper's main point is that keeping a penalty term in Ridge or Lasso regression can reduce bias from unobserved confounders even when you have infinite data, and that a bound exists on how far a non-linear regression can stray when treated as a causal model. Both results are derived under a specific multi-dimensional confounding setup whose symmetries map observational to interventional distributions. The penalty size is set by estimating confounding strength with an earlier method from the same author, and some simulations on synthetic and real data are reported as reasonable.

The causal generalization bound is the clearest novelty here; standard statistical learning theory does not address this kind of transfer, and the derivation ties the bound directly to the richness of the function class and the model's structure. That is a clean formal step. The population-level regularization argument is also distinct from the usual finite-sample overfitting story.

The central limitation is that both the bias reduction and the bound rely on the particular symmetries of the chosen confounder model. No argument is given that the model is representative or that the results degrade gracefully when those symmetries are absent. The penalty choice also depends on prior self-referenced work, so the validation chain is not independent.

This is for readers already working in causal machine learning who are open to model-specific adjustments to standard regressors. It is not a general-purpose fix. The idea is coherent on its own terms and formally grounded enough to merit referee time, even if the assumptions are narrow.
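The Lasso side of the claim can be sketched in the same spirit. The code below is an illustration, not the paper's procedure: it solves a population-level Lasso problem with plain iterative soft-thresholding (ISTA) on a hypothetical model x = Mᵀz, y = aᵀx + cᵀz with a sparse causal vector. The matrices, the penalty `lam=0.5`, and the scaling of the objective are all assumptions made for the example.

```python
import numpy as np

def soft_threshold(v, t):
    """Elementwise soft-thresholding operator."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def population_lasso(sigma_xx, sigma_xy, lam, n_iter=500):
    """ISTA for min_b 0.5 * b^T Sigma_xx b - Sigma_xy^T b + lam * ||b||_1."""
    step = 1.0 / np.linalg.eigvalsh(sigma_xx).max()  # 1 / Lipschitz constant
    b = np.zeros_like(sigma_xy)
    for _ in range(n_iter):
        grad = sigma_xx @ b - sigma_xy
        b = soft_threshold(b - step * grad, step * lam)
    return b

# Hypothetical toy model: x = M^T z, y = a^T x + c^T z, z ~ N(0, I) unobserved.
M = np.diag([2.0, 0.5])
a = np.array([1.0, 0.0])    # sparse causal vector
c = np.array([1.0, 1.0])

sigma_xx = M.T @ M
sigma_xy = sigma_xx @ a + M.T @ c

b_ols = np.linalg.solve(sigma_xx, sigma_xy)            # unpenalized population fit
b_lasso = population_lasso(sigma_xx, sigma_xy, lam=0.5)

err_ols = np.sum((b_ols - a) ** 2)
err_lasso = np.sum((b_lasso - a) ** 2)
print(b_ols, b_lasso)
```

Here the penalty sets the second coefficient, which is purely a confounding artifact in this example, to zero. This is one favorable instance and says nothing about how the penalty behaves when the causal vector is not sparse.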

Referee Report

3 major / 2 minor

Summary. The paper claims that regularization terms in regression (Ridge and Lasso) can reduce confounding bias and yield better causal models even in the infinite-sample limit under a specific multi-dimensional linear model with unobserved common causes. Penalty size is chosen by estimating confounding strength via a prior method by the same author. It further proves a causal generalization bound showing that the error of treating non-linear regressions as causal can be upper-bounded for sufficiently restricted function classes, due to symmetries in the confounder model that map observational to interventional distributions.

Significance. If the derivations hold, the results would be notable for demonstrating a population-level causal benefit from regularization and for providing an explicit bound on causal generalization error under the assumed confounding structure. The work highlights how model symmetries enable observational-to-interventional transfer, which is outside standard statistical learning theory. Strengths include the attempt to address infinite-sample behavior and the explicit dependence on a concrete confounder model.

major comments (3)

[Linear case (population limit)] The linear population-limit analysis (following the multi-dimensional confounding setup in the abstract): the bias reduction from retaining the penalty term in Ridge/Lasso is derived under one specific model of unobserved common cause whose symmetries enable the observational-to-interventional mapping; the paper provides no argument that this model is representative or that results degrade gracefully outside it, which is load-bearing for the central claim of causal improvement at n=∞.
[Penalty selection method] Penalty-size selection (described after the linear case): the size is chosen by estimating confounding strength via the author's earlier method, which itself presupposes the same confounding structure; this creates a dependence whose independence is not established, undermining the claim that the procedure yields reasonable results on simulated and real data.
[Causal generalization bound] Causal generalization bound (non-linear section): the bound is stated to hold subject to the particular confounder model; no robustness analysis or sensitivity to misspecification is provided, and the 'not too rich class' of functions is not given an explicit characterization (e.g., via capacity measure) that would allow verification of the bound's tightness.

minor comments (2)

[Abstract] Abstract: the phrase 'yielded some reasonable results for simulated and real data' is vague and should reference specific quantitative metrics, figures, or tables.
[Abstract] Notation for multi-dimensional variables and the unobserved common cause is introduced without explicit symbols in the abstract, making the setup harder to follow before the main text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for the thorough review and valuable suggestions. Below we respond to each major comment, indicating where we will make revisions to the manuscript.


Referee: [Linear case (population limit)] The linear population-limit analysis (following the multi-dimensional confounding setup in the abstract): the bias reduction from retaining the penalty term in Ridge/Lasso is derived under one specific model of unobserved common cause whose symmetries enable the observational-to-interventional mapping; the paper provides no argument that this model is representative or that results degrade gracefully outside it, which is load-bearing for the central claim of causal improvement at n=∞.

Authors: The results are derived specifically for the multi-dimensional linear model with unobserved common causes that has the required symmetries for the observational-interventional mapping. This model is not claimed to be representative of all possible confounding structures; the contribution lies in showing that regularization can have a causal benefit even at infinite samples under this setup. We will revise the manuscript to explicitly state that the analysis is model-specific and that generalization to other models is an open question.
Revision: partial
Referee: [Penalty selection method] Penalty-size selection (described after the linear case): the size is chosen by estimating confounding strength via the author's earlier method, which itself presupposes the same confounding structure; this creates a dependence whose independence is not established, undermining the claim that the procedure yields reasonable results on simulated and real data.

Authors: We agree that the penalty selection method depends on the confounding estimation procedure from our prior work, which assumes the same structure. This is a feature of the approach rather than a flaw, as the estimation is used to tune the regularization for the assumed model. The simulated and real data experiments are conducted under this framework and show reasonable performance. We will update the manuscript to clarify this dependence.
Revision: partial
Referee: [Causal generalization bound] Causal generalization bound (non-linear section): the bound is stated to hold subject to the particular confounder model; no robustness analysis or sensitivity to misspecification is provided, and the 'not too rich class' of functions is not given an explicit characterization (e.g., via capacity measure) that would allow verification of the bound's tightness.

Authors: The causal generalization bound is indeed derived under the particular confounder model, consistent with the manuscript's statements. We do not provide a robustness analysis to misspecification, which represents a limitation of the current work. For the function class, we describe it as 'not too rich' to ensure the bound holds due to the model symmetries; an explicit capacity measure characterization is not included. We will revise to add a more detailed discussion of the function class assumptions and note the absence of robustness checks.
Revision: partial

Circularity Check

1 step flagged

Penalty size selection relies on self-cited prior method for confounding estimation

specific steps

self-citation, load-bearing [Abstract]
"Here it is done by first estimating the strength of confounding via a method proposed earlier, which yielded some reasonable results for simulated and real data."

The size of the penalizing term (central to showing that regularization yields better causal models even at n=∞) is selected using an estimation procedure from the author's prior work. This makes the concrete application of the regularization benefit dependent on the independence and correctness of that self-referenced method, which is not re-derived or externally validated in the present paper.
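To make the dependence concrete, here is one plausible way an estimate of confounding strength could be turned into a penalty. This is a sketch under stated assumptions, not the paper's confirmed procedure: it assumes the rule is to choose the penalty so that the squared norm of the Ridge solution equals (1 − β̂) times that of the unpenalized solution. The input `beta_hat` is a hypothetical stand-in for the output of the earlier estimation method, which is not implemented here, and `lambda_for_shrinkage` is an illustrative helper name.

```python
import numpy as np

def ridge_coef(X, y, lam):
    """Closed-form ridge coefficients (no intercept)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

def lambda_for_shrinkage(X, y, beta_hat, tol=1e-10, max_iter=200):
    """Bisection for lam such that ||a_lam||^2 = (1 - beta_hat) * ||a_0||^2.

    beta_hat is assumed to lie in [0, 1) and to come from a separate
    confounding-strength estimator (hypothetical input, not computed here).
    """
    if not 0.0 <= beta_hat < 1.0:
        raise ValueError("beta_hat must be in [0, 1)")
    target = (1.0 - beta_hat) * np.sum(ridge_coef(X, y, 0.0) ** 2)
    lo, hi = 0.0, 1.0
    # The ridge norm decreases monotonically in lam, so grow hi until it brackets.
    while np.sum(ridge_coef(X, y, hi) ** 2) > target:
        hi *= 2.0
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if np.sum(ridge_coef(X, y, mid) ** 2) > target:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

# Usage on synthetic data; beta_hat = 0.3 is an arbitrary illustrative value.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ np.ones(5) + rng.normal(size=200)
lam = lambda_for_shrinkage(X, y, beta_hat=0.3)
ratio = np.sum(ridge_coef(X, y, lam) ** 2) / np.sum(ridge_coef(X, y, 0.0) ** 2)
print(lam, ratio)
```

Whatever the exact rule, the chosen penalty is only as reliable as `beta_hat`, which is the point the circularity check raises.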

full rationale

The paper explicitly derives the population-level benefit of retaining the penalty term (Ridge/Lasso) under a stated linear confounding model with unobserved common causes, and proves the causal generalization bound subject to a particular confounder model whose symmetries map observational to interventional distributions. These steps are self-contained given the model assumptions and do not reduce to self-citation. However, the practical choice of penalty size is performed by estimating confounding strength via a method proposed earlier by the same author. This introduces a load-bearing self-citation for the applied demonstration, though it is not required for the theoretical claims themselves. No other patterns (self-definitional, fitted inputs called predictions, uniqueness theorems, ansatz smuggling, or renaming) are present in the abstract or described derivations.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on an assumed confounder model and on an earlier estimation procedure for confounding strength.

free parameters (1)

size of penalizing term
Selected by estimating confounding strength via a prior method rather than cross-validation.

axioms (1)

domain assumption: particular model of confounding with symmetries allowing observational-to-interventional generalization
Invoked to support both the regularization benefit and the causal generalization bound.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each entry lists the cited Recognition theorem, its relation tag, and the paper passage it is linked to.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "the bound guarantees 'generalization' from observational to interventional distributions, which is only possible due to the underlying symmetries of the confounder model"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "Theorem 2 (justification of population Ridge and Lasso)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 2 internal anchors

[1] Beutler, F. The operator theory of the pseudo-inverse I. Bounded operators. Journal of Mathematical Analysis and Applications, 10(3):451–470, 1965.
[2] Chakrabarti, A. and Ghosh, J. AIC, BIC and recent advances in model selection. In Bandyopadhyay, P. and Forster, M. (eds.), Philosophy of Statistics, volume 7 of Handbook of the Philosophy of Science, pp. 583–605. North-Holland, Amsterdam, 2011.
[3] Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68, 2018.
[4] Dasgupta, S. and Gupta, A. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures and Algorithms, 22(1):60–65, 2003.
[5] Hastie, T., Tibshirani, R., and Friedman, J. The elements of statistical learning: Data mining, inference, and prediction. Springer-Verlag, New York, NY, 2001.
[6] Heinze-Deml, C. and Meinshausen, N. Conditional variance penalties and domain shift robustness. arXiv:1710.11469, 2017.
[7] Heinze-Deml, C., Peters, J., and Meinshausen, N. Invariant causal prediction for nonlinear models. Journal of Causal Inference, 6:20170016, 2017.
[8] Hoerl, A. and Kennard, R. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 42(1):80–86, 2000.
[9] Hoyer, P., Shimizu, S., Kerminen, A., and Palviainen, M. Estimation of causal effects using linear non-gaussian causal models with hidden variables. International Journal of Approximate Reasoning, 49(2):362–378, 2008.
[10] Imbens, G. and Angrist, J. Identification and estimation of local average treatment effects. Econometrica, 62(2):467–475, 1994.
[11] Janzing, D. and Schölkopf, B. Detecting confounding in multivariate linear models via spectral analysis. Journal of Causal Inference, 6(1), 2017.
[12] Janzing, D. and Schölkopf, B. Detecting non-causal artifacts in multivariate linear regression models. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 2018.
[13] Janzing, D., Peters, J., Mooij, J., and Schölkopf, B. Identifying latent confounders using additive noise models. In Ng, A. and Bilmes, J. (eds.), Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI 2009), pp. 249–257. AUAI Press, Corvallis, OR, USA, 2009.
[14] Lopez-Paz, D., Muandet, K., Schölkopf, B., and Tolstikhin, I. Towards a learning theory of cause-effect inference. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of JMLR Workshop and Conference Proceedings, pp. 1452–1461. JMLR, 2015.
[15] Newman, D. J., Hettich, S., Blake, C. L., and Merz, C. J. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
[16] Pearl, J. Causality. Cambridge University Press, 2000.
[17] Peters, J., Bühlmann, P., and Meinshausen, N. Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 78(5):947–1012, 2016.
[18] Raskutti, G., Wainwright, M., and Yu, B. Early stopping for non-parametric regression: An optimal data-dependent stopping rule. In 2011 49th Annual Allerton Conference on Communication, Control, and Computing (Allerton), pp. 1318–1325, Sep. 2011.
[19] Rubin, D. Direct and indirect causal effects via potential outcomes. Scandinavian Journal of Statistics, 31:161–170, 2004.
[20] Schölkopf, B. and Smola, A. Learning with kernels. MIT Press, Cambridge, MA, 2002.
[21] Shen, Z., Cui, P., Kuang, K., and Li, B. On image classification: Correlation v.s. causality. CoRR, abs/1708.06656, 2017. (Indexed on this page under the title "Causally Regularized Learning with Agnostic Data Selection Bias".)
[22] Tibshirani, R. and L., W. Course on Statistical Machine Learning, chapter: "Sparsity and the Lasso", 2015. http://www.stat.cmu.edu/~ryantibs/statml/
[23] Vapnik, V. Statistical learning theory. John Wiley & Sons, New York, 1998.
[24] Zhang, K., Huang, B., Zhang, J., Glymour, C., and Schölkopf, B. Causal discovery from nonstationary/heterogeneous data: Skeleton estimation and orientation determination. IJCAI: proceedings of the conference, pp. 1347–1353, 2017.
