pith. machine review for the scientific record.

arxiv: 1907.02893 · v3 · submitted 2019-07-05 · 📊 stat.ML · cs.AI · cs.LG

Invariant Risk Minimization

Pith reviewed 2026-05-12 06:25 UTC · model grok-4.3

classification: 📊 stat.ML · cs.AI · cs.LG
keywords: invariant risk minimization · out-of-distribution generalization · causal inference · domain generalization · representation learning · machine learning

The pith

Invariant Risk Minimization finds a data representation where the same classifier is optimal for every training distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a training objective that searches for features whose relationship to the target stays fixed even as the surrounding data distribution changes. It shows that this objective recovers features tied to the stable causal mechanisms in the data rather than environment-specific correlations. A reader should care because standard empirical risk minimization often latches onto spurious patterns that break when the test distribution differs from training. By enforcing invariance of the optimal predictor, the method aims to produce models that continue to work under distribution shift.

Core claim

Invariant Risk Minimization (IRM) learns a representation such that the optimal linear classifier on top of that representation is identical across all training environments. In its practical form, the objective minimizes the risk summed over training environments plus a penalty that drives the gradient of each environment's risk, taken with respect to the classifier parameters, to zero at the shared classifier. Under the paper's assumptions, the resulting invariant features correspond to the causal factors that govern the label in the underlying data-generating process, which is what enables generalization to environments not seen during training.
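
For reference, the objective described above can be written out as follows. This is a sketch in commonly used IRM notation rather than text taken from this page: \mathcal{E}_{tr} (the set of training environments), R^e (the risk in environment e), \Phi (the representation), w (the classifier), and \lambda (the penalty weight) are symbols assumed here.

```latex
% Bi-level form: a single classifier w must be optimal in every training environment
\min_{\Phi,\, w} \sum_{e \in \mathcal{E}_{tr}} R^e(w \circ \Phi)
\quad \text{subject to} \quad
w \in \arg\min_{\bar{w}} R^e(\bar{w} \circ \Phi) \quad \text{for all } e \in \mathcal{E}_{tr}.

% Penalized relaxation: fix a scalar classifier w = 1 and penalize the gradient norm
\min_{\Phi} \sum_{e \in \mathcal{E}_{tr}}
\Big[ R^e(\Phi) + \lambda \, \big\| \nabla_{w \mid w = 1.0}\, R^e(w \cdot \Phi) \big\|^2 \Big].
```

The second form is the one the referee report below refers to as the practical objective with the gradient penalty at w=1.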

What carries the argument

The IRM penalty term that requires the gradient of the risk with respect to a fixed classifier to be zero in every environment, thereby enforcing that the same predictor is optimal everywhere.
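
To make the penalty concrete, here is a minimal NumPy sketch for the special case of squared-error loss and a scalar classifier fixed at w = 1, where the gradient has a closed form. The function names `irm_penalty` and `penalized_objective`, the argument `lam`, and the choice of squared error are all assumptions made for illustration; the paper's own implementation may differ.

```python
import numpy as np

def irm_penalty(outputs, y):
    """Squared gradient of one environment's squared-error risk w.r.t. a scalar w, at w = 1."""
    # R(w) = mean((w * outputs - y)^2)  =>  dR/dw at w = 1 is mean(2 * (outputs - y) * outputs)
    grad = np.mean(2.0 * (outputs - y) * outputs)
    return grad ** 2

def penalized_objective(envs, lam):
    """Sum over environments of (risk + lam * penalty); envs is a list of (outputs, y) pairs."""
    total = 0.0
    for outputs, y in envs:
        total += np.mean((outputs - y) ** 2) + lam * irm_penalty(outputs, y)
    return total

# Hypothetical toy values, chosen only to exercise the functions
y = np.array([1.0, 2.0, 3.0])
perfect = irm_penalty(y, y)        # outputs equal targets: gradient at w = 1 is zero
scaled = irm_penalty(2.0 * y, y)   # outputs too large: shrinking w would lower the risk
```

A zero penalty says only that rescaling the outputs cannot reduce the risk in that environment; it does not by itself say the risk is small, which is why the objective keeps the risk term alongside the penalty.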

Load-bearing premise

The observed environments must share the same causal mechanisms that determine the label while differing only in the distributions of non-causal variables.

What would settle it

A controlled experiment on synthetic data with known causal graph where IRM is shown to recover exactly the causal features (or fails to do so) when the environments are generated by intervening only on non-causal variables.
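
A toy version of such an experiment can be sketched in a few lines. The setup below is hypothetical and is not the paper's experiment: it assumes a linear chain x1 -> y -> x2, builds two environments that differ only in the noise added to the non-causal variable x2, and compares the per-environment least-squares classifier on the causal feature alone against the one on both features. It does not run IRM; it only checks the invariance property that IRM searches for.

```python
import numpy as np

def make_env(x2_noise, n, rng):
    # Hypothetical toy SCM: x1 -> y -> x2. Only the noise on x2 (a non-causal
    # variable) changes between environments.
    x1 = rng.normal(0.0, 1.0, n)
    y = x1 + rng.normal(0.0, 1.0, n)
    x2 = y + rng.normal(0.0, x2_noise, n)
    return np.column_stack([x1, x2]), y

def least_squares(X, y):
    # Optimal linear classifier for squared error within a single environment
    return np.linalg.lstsq(X, y, rcond=None)[0]

rng = np.random.default_rng(0)
envs = [make_env(s, 50_000, rng) for s in (0.5, 2.0)]

# Per-environment optimal weights using only the causal feature x1
w_causal = [least_squares(X[:, :1], y) for X, y in envs]
# Per-environment optimal weights using both x1 and the non-causal x2
w_full = [least_squares(X, y) for X, y in envs]

gap_causal = abs(w_causal[0][0] - w_causal[1][0])  # disagreement on the x1 weight
gap_full = abs(w_full[0][1] - w_full[1][1])        # disagreement on the x2 weight
print("causal-only weights:", w_causal, "gap:", gap_causal)
print("full-feature weights:", w_full, "gap:", gap_full)
```

In this setup the optimal weight on x1 alone is close to 1 in both environments, while the optimal weight on x2 shifts with the x2 noise level, so only the causal representation admits a shared optimal classifier.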

Original abstract

We introduce Invariant Risk Minimization (IRM), a learning paradigm to estimate invariant correlations across multiple training distributions. To achieve this goal, IRM learns a data representation such that the optimal classifier, on top of that data representation, matches for all training distributions. Through theory and experiments, we show how the invariances learned by IRM relate to the causal structures governing the data and enable out-of-distribution generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Invariant Risk Minimization (IRM), a learning paradigm that estimates a data representation such that the optimal classifier on top of this representation is the same across multiple training distributions. It claims through theory and experiments that the learned invariances correspond to causal structures governing the data and enable out-of-distribution generalization.

Significance. If the central claims hold, this work offers a principled objective for learning predictors that exploit invariance across environments to achieve robust OOD performance, with a direct link to identifying causal features. This is significant for bridging empirical risk minimization with causal inference in non-i.i.d. settings, and the reproducible experimental protocols and parameter-free aspects of the formulation (where applicable) strengthen its potential impact.

major comments (2)
  1. [§4] The theoretical equivalence showing that IRM recovers causal parents is derived only for linear structural causal models with additive noise and a fixed number of environments; the proof relies on linearity of the representation and identifiability of the shared optimal w. No uniqueness result is given for non-linear feature maps or general non-linear SCMs, so the broader claim that invariances learned by IRM relate to causal structures does not follow in full generality.
  2. [Eq. (3)] The practical IRM objective (with the gradient penalty at w=1) enforces only a first-order stationarity condition under the linear classifier assumption. The manuscript does not show that this approximation identifies causal features or guarantees OOD generalization when the representation or SCM is non-linear, which is load-bearing for the central claim.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly qualify the scope of the theoretical results to linear cases to avoid overstatement of the causal connection.
  2. Experimental sections would benefit from additional details on environment construction and sensitivity to the penalty hyperparameter to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and insightful comments on our manuscript. We address each major comment below and will incorporate clarifications to better delineate the scope of our theoretical and practical results.

Point-by-point responses
  1. Referee: [§4] The theoretical equivalence showing that IRM recovers causal parents is derived only for linear structural causal models with additive noise and a fixed number of environments; the proof relies on linearity of the representation and identifiability of the shared optimal w. No uniqueness result is given for non-linear feature maps or general non-linear SCMs, so the broader claim that invariances learned by IRM relate to causal structures does not follow in full generality.

    Authors: We agree that the equivalence result in Section 4 is derived under the specific assumptions of linear structural causal models with additive noise and a fixed number of environments, relying on the linearity of the representation and the identifiability of the shared optimal classifier weights. The manuscript does not provide a uniqueness result for non-linear feature maps or general non-linear SCMs. The broader statements linking invariances to causal structures are presented as holding under these assumptions, with supporting experimental evidence in more general settings. We will revise Section 4, the abstract, and related discussion to explicitly state the assumptions and note that extensions to non-linear cases remain an open direction.

    revision: partial

  2. Referee: [Eq. (3)] The practical IRM objective (with the gradient penalty at w=1) enforces only a first-order stationarity condition under the linear classifier assumption. The manuscript does not show that this approximation identifies causal features or guarantees OOD generalization when the representation or SCM is non-linear, which is load-bearing for the central claim.

    Authors: The practical objective in Equation (3) uses a gradient penalty (evaluated at w=1) to enforce the invariance condition, which is exact under the linear classifier assumption but reduces to a first-order stationarity condition more generally. We do not provide a proof that this approximation identifies causal features or guarantees OOD generalization for non-linear representations or SCMs. The formulation is motivated by the linear theory, and our experiments demonstrate improved OOD performance in non-linear regimes. We will add a clarifying discussion of the approximation's nature and limitations in the revised manuscript.

    revision: partial
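
The distinction drawn in the second response (exact for a linear classifier, only first-order otherwise) can be spelled out in a short derivation. The sketch below adds an assumption that this page does not state, namely that the loss \ell is convex and differentiable in its first argument; the symbols \ell, X^e, and Y^e are likewise introduced here for illustration.

```latex
% Assumption (not stated on this page): \ell is convex and differentiable in its first argument.
R^e(w \circ \Phi) = \mathbb{E}_{(X^e, Y^e)}\big[ \ell\big( w^\top \Phi(X^e),\, Y^e \big) \big]

% For fixed \Phi and a classifier linear in w, the map w \mapsto R^e(w \circ \Phi) is convex, so
\nabla_w R^e(w \circ \Phi) = 0
\;\Longleftrightarrow\;
w \in \arg\min_{\bar{w}} R^e(\bar{w} \circ \Phi).

% If the classifier is non-linear in its parameters, only the implication
% w \in \arg\min \;\Rightarrow\; \nabla_w R^e = 0 survives; the converse can fail.
```

Under these assumptions a vanishing gradient certifies optimality of the shared classifier in each environment, which is the sense in which the penalty is exact in the linear case.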

Circularity Check

0 steps flagged

No significant circularity in IRM derivation chain

full rationale

The core IRM definition (a representation Φ such that argmin_w R^e(w ∘ Φ) is identical across environments e) is stated directly from the multi-environment setup and does not reduce to any fitted target quantity or self-referential loop. Section 4 derives the link to causal parents only under explicit linear SCM + additive noise assumptions; this is a one-directional implication proved from the SCM, not a tautology or renaming of the input risks. The practical objective (Eq. 3 with gradient penalty) is an explicit relaxation of the definition, not a statistical fit called a prediction. No load-bearing self-citation or ansatz smuggling is present; the derivation remains self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests primarily on the domain assumption that multiple training distributions share invariant causal structures while varying in spurious correlations.

axioms (1)
  • domain assumption: Multiple training distributions share the same causal mechanisms but differ in non-causal aspects.
    This premise is required for the shared optimal classifier to isolate causal invariants.

pith-pipeline@v0.9.0 · 5361 in / 1150 out tokens · 43767 ms · 2026-05-12T06:25:21.245199+00:00 · methodology

Forward citations

Cited by 47 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. The Statistical Cost of Adaptation in Multi-Source Transfer Learning

    math.ST 2026-05 unverdicted novelty 8.0

    Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.

  2. TILT: Target-induced loss tilting under covariate shift

    cs.LG 2026-05 conditional novelty 7.0

    TILT adds a target-data penalty on an auxiliary predictor component to induce effective importance weighting for unsupervised domain adaptation under covariate shift.

  3. Separating Shortcut Transition from Cross-Family OOD Failure in a Minimal Model

    cs.LG 2026-05 conditional novelty 7.0

    A minimal model analytically separates shortcut attraction during training from the switch to a shortcut rule and from cross-family out-of-distribution failure.

  4. Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection

    cs.CV 2026-05 unverdicted novelty 7.0

    A new orthogonal projection module for video anomaly detection suppresses facial attributes via weak face-presence signals and cosine alignment while preserving anomaly-relevant features like pose and motion.

  5. Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    Excess risk decomposes into independent alignment (trace of inverse average Hessian times gradient covariance) and curvature terms, so both flatness and gradient alignment are required; SAGE achieves this and sets new...

  6. Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study

    cs.CV 2026-05 unverdicted novelty 7.0

    A large-scale benchmark finds that recent multimodal domain generalization methods give only marginal gains over a plain ERM baseline, with no method winning consistently and all degrading sharply under corruption or ...

  7. eXplaining to Learn (eX2L): Regularization Using Contrastive Visual Explanation Pairs for Distribution Shifts

    cs.CV 2026-05 unverdicted novelty 7.0

    eX2L improves robustness to distribution shifts by penalizing similarity between Grad-CAM maps of a label classifier and a confounder classifier, reaching new SOTA average and worst-group accuracy on the Spawrious benchmark.

  8. Domain Generalization through Spatial Relation Induction over Visual Primitives

    cs.CV 2026-05 unverdicted novelty 7.0

    PARSE improves domain generalization accuracy by factoring recognition into visual primitives and their spatial relational compositions learned end-to-end with differentiable predicates.

  9. ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection

    cs.CV 2026-05 unverdicted novelty 7.0

    ScriptHOI decomposes HOI phrases into state slots and uses script coverage, conflict, interval partial-label learning, and counterfactual contrast to improve rare and unseen interaction detection while cutting afforda...

  10. ISAAC: Auditing Causal Reasoning in Deep Models for Drug-Target Interaction

    cs.LG 2026-05 unverdicted novelty 7.0

    ISAAC auditing applied to three DTI models on the Davis benchmark finds 25% relative differences in causal reasoning scores despite nearly identical AUROC values.

  11. Robust and Clinically Reliable EEG Biomarkers: A Cross Population Framework for Generalizable Parkinson's Disease Detection

    cs.LG 2026-04 conditional novelty 7.0

    A cross-population framework for EEG Parkinson's detection using exhaustive 75 directional evaluations and nested validation shows asymmetric transfer and accuracy up to 94.1% when training diversity increases, suppor...

  12. Synthetic Designed Experiments for Diagnosing Vision Model Failure

    cs.CV 2026-03 unverdicted novelty 7.0

    SDRS uses designed experiments and ANOVA decomposition on synthetic data to identify Type I coverage gaps and Type II spurious dependencies in vision models, then generates targeted data to improve performance.

  13. Rethinking Molecular OOD Generalization via Target-Aware Source Selection

    cs.LG 2026-05 unverdicted novelty 6.0

    SCOPE-BENCH shows state-of-the-art molecular models suffer up to 8x higher errors under extreme OOD, while POMA reduces mean absolute error by up to 11.2% via target-aware source selection and dual-scale adaptation.

  14. Understanding Generalization through Decision Pattern Shift

    cs.LG 2026-05 unverdicted novelty 6.0

    DPS quantifies deviation of per-sample decision patterns from class averages and shows linear correlation with generalization gaps while unifying degradation scenarios into a continuous trajectory.

  15. DeconDTN-Toolkit: A Library for Evaluation and Enhancement of Robustness to Provenance Shift

    cs.LG 2026-05 unverdicted novelty 6.0

    DeconDTN-Toolkit simulates provenance shifts to expose ERM vulnerabilities and provides tools plus a robust OOD indicator for mitigating confounding by data provenance.

  16. Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training

    cs.LG 2026-05 unverdicted novelty 6.0

    Standard preference learning induces spurious feature reliance via mean bias and correlation leakage, creating irreducible distribution shift vulnerabilities that tie training mitigates without degrading causal learning.

  17. Intervention-Based Time Series Causal Discovery via Simulator-Generated Interventional Distributions

    cs.LG 2026-05 unverdicted novelty 6.0

    SVAR-FM uses simulator clamping to produce interventional distributions and flow matching to identify time series causal structures, with an error bound that predicts sign reversal of causal effects below a simulator ...

  18. The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory

    cs.LG 2026-05 unverdicted novelty 6.0

    Agentic memory improves clean reasoning but worsens performance when spurious patterns are present in stored trajectories; CAMEL calibration reduces this reliance while preserving clean performance.

  19. CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators

    cs.AI 2026-05 unverdicted novelty 6.0

    CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.

  20. TopoGeoScore: A Self-Supervised Source-Only Geometric Framework for OOD Checkpoint Selection

    cs.LG 2026-05 unverdicted novelty 6.0

    TopoGeoScore combines a torsion-inspired Laplacian log-determinant, Ollivier-Ricci curvature, and higher-order topological summaries from source embeddings, with weights learned via self-supervised invariance to geome...

  21. Same Brain, Different Prediction: How Preprocessing Choices Undermine EEG Decoding Reliability

    cs.LG 2026-05 unverdicted novelty 6.0

    EEG model predictions on the same brain signals flip for up to 42% of trials under different preprocessing choices, with new tools introduced to measure and mitigate the resulting instability.

  22. ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection

    cs.CV 2026-05 unverdicted novelty 6.0

    ScriptHOI improves rare and unseen HOI recognition by decomposing phrases into state slots, using visual tokenization and slot-wise matching for script coverage and conflict to calibrate predictions and constrain trai...

  23. Anatomy of a failure: When, how, and why deep vision fails in scientific domains

    cs.CV 2026-05 unverdicted novelty 6.0

    Deep learning on information-rich scientific images collapses to one-dimensional predictions due to a mismatch between data priors and the model's simplicity bias, even after robustification techniques.

  24. Learning to Theorize the World from Observation

    cs.LG 2026-05 unverdicted novelty 6.0

    NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.

  25. Attribution-Guided Masking for Robust Cross-Domain Sentiment Classification

    cs.LG 2026-05 unverdicted novelty 6.0

    AGM adds a gradient-based masking loss during fine-tuning to suppress reliance on spurious tokens, achieving competitive zero-shot transfer on sentiment tasks while providing token-level interpretability.

  26. Deciphering Shortcut Learning from an Evolutionary Game Theory Perspective

    cs.AI 2026-05 unverdicted novelty 6.0

    Evolutionary game theory shows gradient descent and stochastic gradient descent drive neural networks to distinct stable states favoring shortcut or core subnetworks, with data and optimization noise shaping shortcut ...

  27. Cheeger--Hodge Contrastive Learning for Structurally Robust Graph Representation Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    CHCL aligns a Cheeger-Hodge joint signature across graph augmentations to produce embeddings that remain stable under local structural changes.

  28. Robust Representation Learning through Explicit Environment Modeling

    stat.ML 2026-04 unverdicted novelty 6.0

    Explicitly modeling and marginalizing environment variation via generalized random-intercept models produces representations that support robust average prediction across unseen environments and outperform invariant-l...

  29. Bayesian Environment Invariant Regression

    stat.ME 2026-04 unverdicted novelty 6.0

    A Bayesian spike-and-slab model separates invariant regression mechanisms from environment-specific associations, with proven selection consistency and posterior contraction under a working model.

  30. Deep sprite-based image models: An analysis

    cs.CV 2026-04 unverdicted novelty 6.0

    A deep sprite-based image decomposition method matches SOTA unsupervised class-aware segmentation on CLEVR, scales linearly with objects, explicitly identifies categories, and fully models images interpretably.

  31. Adversarial Label Invariant Graph Data Augmentations for Out-of-Distribution Generalization

    cs.LG 2026-04 unverdicted novelty 6.0

    RIA uses adversarial exploration of counterfactual graph environments via label-invariant augmentations to improve OoD generalization in graph classification tasks.

  32. Learning Stable Predictors from Weak Supervision under Distribution Shift

    cs.LG 2026-04 unverdicted novelty 6.0

    Weak supervision supports in-domain learning for CRISPR transcriptomic perturbations but temporal shifts cause negative R-squared and near-zero correlation across linear and tree models, unlike partial cell-line transfer.

  33. SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

    cs.LG 2023-10 accept novelty 6.0

    SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.

  34. On the Opportunities and Risks of Foundation Models

    cs.LG 2021-08 accept novelty 6.0

    Foundation models are large adaptable AI systems with emergent capabilities that offer broad opportunities but carry risks from homogenization, opacity, and inherited defects across downstream applications.

  35. Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization

    cs.LG 2019-11 conditional novelty 6.0

    Increased regularization is required for group DRO to achieve good worst-group generalization in overparameterized neural networks.

  36. Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging

    cs.CV 2026-05 unverdicted novelty 5.0

    A self-supervised approach uses consistent spatial relationships of anatomical structures across patients to improve 3D multi-modal medical image representations, yielding modest gains on segmentation and classificati...

  37. Causal Parametric Drift Simulation: A Digital Twin Framework for Classifier Robustness Evaluation

    cs.LG 2026-05 unverdicted novelty 5.0

    A framework using structural causal models simulates parametric drifts to evaluate classifier robustness more realistically than static tests or noise perturbations.

  38. Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models

    cs.LG 2026-05 unverdicted novelty 5.0

    Agentic AI systems are required to overcome the parameter coverage ceiling that prevents foundation models from handling certain out-of-distribution cases.

  39. When Brain Networks Travel: Learning Beyond Site

    cs.LG 2026-05 unverdicted novelty 5.0

    CORE decouples site confounders in fMRI networks, profiles transient dynamics on a population scaffold using line graphs, and applies subject-adaptive gating to achieve up to 6.7% better cross-site generalization on A...

  40. MER-DG: Modality-Entropy Regularization for Multimodal Domain Generalization

    cs.LG 2026-05 unverdicted novelty 5.0

    MER-DG applies modality-entropy regularization to reduce fusion overfitting in multimodal domain generalization, reporting average gains of 5% over standard fusion and 2% over prior methods on EPIC-Kitchens and HAC be...

  41. Dreaming Across Towns: Semantic Rollout and Town-Adversarial Regularization for Zero-Shot Held-Out-Town Fixed-Route Driving in CARLA

    cs.RO 2026-04 unverdicted novelty 5.0

    Semantic rollout prediction plus town-adversarial regularization on a Dreamer agent raises mean zero-shot success rate for fixed-route driving across held-out CARLA towns under fixed weather and no traffic.

  42. Asynchronous Federated Unlearning with Invariance Calibration for Medical Imaging

    cs.LG 2026-04 unverdicted novelty 5.0

    AFU-IC decouples client unlearning from global federated training in medical imaging and adds server-side invariance calibration to prevent relearning of erased data.

  43. Sensitivity Uncertainty Alignment in Large Language Models

    cs.CR 2026-04 unverdicted novelty 5.0

    SUA measures the gap between how much an LLM's output changes under perturbations and how uncertain the model claims to be, with a training procedure to reduce that gap.

  44. Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities

    cs.CV 2026-04 unverdicted novelty 5.0

    Introduces MAF framework and DeepModal-Bench to capture universal cross-modal forgery traces for better generalization in multimodal deepfake detection.

  45. Why Invariance is Not Enough for Biomedical Domain Generalization and How to Fix It

    eess.IV 2026-04 unverdicted novelty 5.0

    MaskGen improves domain generalization for biomedical image segmentation by using source intensities plus domain-stable foundation model representations with minimal added complexity.

  46. Investigating Data Interventions for Subgroup Fairness: An ICU Case Study

    cs.LG 2026-04 unverdicted novelty 4.0

    Data addition from different sources does not reliably boost subgroup fairness in ICU models and often requires post-hoc calibration to work.

  47. Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

    cs.LG 2020-05 unverdicted novelty 2.0

    Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · cited by 46 Pith papers

  1. [1]

    Autonomy

    John Aldrich. Autonomy. Oxford Economic Papers, 1989

  2. [2]

    Robust supervised learning

    James Andrew Bagnell. Robust supervised learning. In AAAI, 2005

  3. [3]

    Bartlett, Philip M

    Peter L. Bartlett, Philip M. Long, G´ abor Lugosi, and Alexander Tsigler. Benign Overfitting in Linear Regression. arXiv, 2019

  4. [4]

    Recognition in terra incognita

    Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. In ECCV, 2018

  5. [5]

    Analysis of representations for domain adaptation

    Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In NIPS. 2007

  6. [6]

    Robust optimiza- tion

    Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust optimiza- tion. Princeton University Press, 2009

  7. [7]

    A meta- transfer objective for learning to disentangle causal mechanisms

    Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Rosemary Ke, S´ ebastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta- transfer objective for learning to disentangle causal mechanisms. arXiv, 2019

  8. [8]

    Denker, Harris Drucker, Isabelle Guyon, Lawrence D

    L´ eon Bottou, Corinna Cortes, John S. Denker, Harris Drucker, Isabelle Guyon, Lawrence D. Jackel, Yann Le Cun, Urs A. Muller, Eduard S¨ ackinger, Patrice Simard, and Vladimir Vapnik. Comparison of classifier methods: a case study in handwritten digit recognition. In ICPR, 1994

  9. [9]

    Approximating CNNs with bag-of-local- features models works surprisingly well on imagenet

    Wieland Brendel and Matthias Bethge. Approximating CNNs with bag-of-local- features models works surprisingly well on imagenet. In ICLR, 2019

  10. [10]

    Invariant scattering convolution networks

    Joan Bruna and Stephane Mallat. Invariant scattering convolution networks. TPAMI, 2013

  11. [11]

    In- termittent process analysis with scattering moments

    Joan Bruna, Stephane Mallat, Emmanuel Bacry, and Jean-Franois Muzy. In- termittent process analysis with scattering moments. The Annals of Statistics , 2015

  12. [12]

    Two theorems on invariance and causality

    Nancy Cartwright. Two theorems on invariance and causality. Philosophy of Science, 2003. 23

  13. [13]

    Cheng and Hongjing Lu

    Patricia W. Cheng and Hongjing Lu. Causal invariance as an essential constraint for creating a causal representation of the world. The Oxford handbook of causal reasoning, 2017

  14. [14]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. NAACL, 2019

  15. [15]

    Statistics of robust optimization: A generalized empirical likelihood approach

    John Duchi, Peter Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv, 2016

  16. [16]

    Domain- adversarial training of neural networks

    Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Fran¸ cois Laviolette, Mario March, and Victor Lempitsky. Domain- adversarial training of neural networks. JMLR, 2016

  17. [17]

    Wichmann, and Wieland Brendel

    Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. ICLR, 2019

  18. [18]

    Learning causal structures using regression invariance

    AmirEmad Ghassami, Saber Salehkaleybar, Negar Kiyavash, and Kun Zhang. Learning causal structures using regression invariance. In NIPS, 2017

  19. [19]

    Patrick J. Grother. NIST Special Database 19: Handprinted forms and char- acters database. https://www.nist.gov/srd/nist-special-database-19 ,

  20. [20]

    File doc/doc.ps in the 1995 NIST CD ROM NIST Special Database 19

  21. [21]

    The probability approach in econometrics

    Trygve Haavelmo. The probability approach in econometrics. Econometrica: Journal of the Econometric Society , 1944

  22. [22]

    Conditional variance penalties and domain shift robustness

    Christina Heinze-Deml and Nicolai Meinshausen. Conditional variance penalties and domain shift robustness. arXiv, 2017

  23. [23]

    Invariant causal prediction for nonlinear models

    Christina Heinze-Deml, Jonas Peters, and Nicolai Meinshausen. Invariant causal prediction for nonlinear models. Journal of Causal Inference , 2018

  24. [24]

    Revisiting visual question answering baselines

    Allan Jabri, Armand Joulin, and Laurens Van Der Maaten. Revisiting visual question answering baselines. In ECCV, 2016

  25. [25]

    Johansson, David A

    Fredrik D. Johansson, David A. Sontag, and Rajesh Ranganath. Support and invertibility in domain-invariant representations. AISTATS, 2019

  26. [26]

    General- ization in anti-causal learning

    Niki Kilbertus, Giambattista Parascandolo, and Bernhard Sch¨ olkopf. General- ization in anti-causal learning. arXiv, 2018

  27. [27]

    Stable prediction across unknown environments

    Kun Kuang, Peng Cui, Susan Athey, Ruoxuan Xiong, and Bo Li. Stable prediction across unknown environments. In SIGKDD, 2018

  28. [28]

    Lake, Tomer D

    Brenden M. Lake, Tomer D. Ullman, Joshua B Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and brain sciences, 2017. 24

  29. [29]

    James M. Lee. Introduction to Smooth Manifolds . Springer, 2003

  30. [30]

    Counterfactuals

    David Lewis. Counterfactuals. John Wiley & Sons, 2013

  31. [31]

    Deep domain generalization via conditional invariant adver- sarial networks

    Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adver- sarial networks. In ECCV, 2018

  32. [32]

    From dependence to causation

    David Lopez-Paz. From dependence to causation. PhD thesis, University of Cambridge, 2016

  33. [33]

    Discovering causal signals in images

    David Lopez-Paz, Robert Nishihara, Soumith Chintala, Bernhard Scholkopf, and L´ eon Bottou. Discovering causal signals in images. In CVPR, 2017

  34. [34]

    Learning to pivot with adversarial networks

    Gilles Louppe, Michael Kagan, and Kyle Cranmer. Learning to pivot with adversarial networks. In Advances in neural information processing systems , pages 981–990, 2017

  35. [35]

    Domain adaptation by using causal inference to predict invariant conditional distributions

    Sara Magliacane, Thijs van Ommen, Tom Claassen, Stephan Bongers, Philip Versteeg, and Joris M Mooij. Domain adaptation by using causal inference to predict invariant conditional distributions. In NIPS, 2018

  36. [36]

    Deep learning: A critical appraisal

    Gary Marcus. Deep learning: A critical appraisal. arXiv, 2018

  37. [37]

    Causality from a distributional robustness point of view

    Nicolai Meinshausen. Causality from a distributional robustness point of view. In Data Science Workshop (DSW), 2018

  38. [38]

    Maximin effects in inhomogeneous large-scale data

    Nicolai Meinshausen and Peter Bühlmann. Maximin effects in inhomogeneous large-scale data. The Annals of Statistics, 2015

  39. [39]

    Dimensions of scientific law

    Sandra D. Mitchell. Dimensions of scientific law. Philosophy of Science, 2000

  40. [40]

    Causality: Models, Reasoning, and Inference

    Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009

  41. [41]

    Causal inference using invariant prediction: identification and confidence intervals

    Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference using invariant prediction: identification and confidence intervals. JRSS B, 2016

  42. [42]

    Elements of causal inference: foundations and learning algorithms

    Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. MIT Press, 2017

  43. [43]

    Incompleteness, non locality and realism

    Michael Redhead. Incompleteness, non locality and realism: a prolegomenon to the philosophy of quantum mechanics. 1987

  44. [44]

    Invariant models for causal transfer learning

    Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. JMLR, 2018

  45. [45]

    Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 1974

  46. [46]

    On causal and anticausal learning

    Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In ICML, 2012

  47. [47]

    Certifying some distributional robustness with principled adversarial training

    Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. ICLR, 2018

  48. [48]

    Causal necessity: a pragmatic investigation of the necessity of laws

    Brian Skyrms. Causal necessity: a pragmatic investigation of the necessity of laws. Yale University Press, 1980

  49. [49]

    Bob L. Sturm. A simple method to determine if a music information retrieval system is a “horse”. IEEE Transactions on Multimedia, 2014

  50. [50]

    Unbiased look at dataset bias

    Antonio Torralba and Alexei Efros. Unbiased look at dataset bias. In CVPR, 2011

  51. [51]

    Principles of risk minimization for learning theory

    Vladimir Vapnik. Principles of risk minimization for learning theory. In NIPS, 1992

  52. [52]

    Vladimir N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998

  53. [53]

    Do we still need models or just more data and compute?

    Max Welling. Do we still need models or just more data and compute?, 2019

  54. [54]

    The marginal value of adaptive gradient methods in machine learning

    Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In NIPS, 2017

  55. [55]

    Making things happen: A theory of causal explanation

    James Woodward. Making things happen: A theory of causal explanation. Oxford University Press, 2005

  56. [56]

    Correlation and causation

    Sewall Wright. Correlation and causation. Journal of Agricultural Research, 1921

  57. [57]

    Understanding deep learning requires rethinking generalization

    Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. ICLR, 2016

    A Additional theorems

    Theorem 10. Let $\Sigma^e_{X,X} := \mathbb{E}_{X^e}[X^e X^{e\top}] \in \mathbb{S}^{d \times d}_+$, with $\mathbb{S}^{d \times d}_+$ the space of symmetric positive semi-definite matrices, and $\Sigma^e_{X,\epsilon} := \mathbb{E}_{X^e}[X^e \epsilon^e] \in \mathbb{R}^d$. Then, for any arbitrary tuple ( ...

    Using these data and the domain adaptation recipe outlined above, we build a classifier $w \circ \Phi$. Since domain adaptation enforces $P(\Phi(X^{e_s})) = P(\Phi(X^{e_t}))$, it consequently enforces $P(\hat{Y}^{e_s}) = P(\hat{Y}^{e_t})$, where $\hat{Y}^e = w(\Phi(X^e))$, for all $e \in \{e_s, e_t\}$. Then, the classification accuracy will be at most 20%. This is worse than random guessing, in a problem where simply trai...
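The argument that matched prediction marginals cap the target accuracy can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function `max_accuracy` and the label proportions 0.9 (source) and 0.1 (target) are assumptions chosen so that the bound comes out at 20%, and they may differ from the setup elided in the text above.

```python
def max_accuracy(label_pos_rate: float, pred_pos_rate: float) -> float:
    """Best achievable binary accuracy when a fraction `pred_pos_rate` of the
    inputs must be predicted positive and a fraction `label_pos_rate` of them
    is truly positive."""
    # Correct positives cannot exceed either the true or the predicted positives.
    true_pos = min(label_pos_rate, pred_pos_rate)
    # The same holds for the negatives.
    true_neg = min(1.0 - label_pos_rate, 1.0 - pred_pos_rate)
    return true_pos + true_neg


# Hypothetical label marginals (assumption, not taken from the source).
source_pos, target_pos = 0.9, 0.1

# A classifier that is perfect on the source predicts 90% positives there.
# Matching P(Y_hat) across domains forces the same rate on the target.
pred_pos = source_pos

source_bound = max_accuracy(source_pos, pred_pos)
target_bound = max_accuracy(target_pos, pred_pos)
print(f"source bound: {source_bound:.2f}, target bound: {target_bound:.2f}")
```

Under these assumed proportions the target bound equals 1 - |0.9 - 0.1| = 0.2, regardless of how informative the features are, because the constraint acts on the prediction marginal alone.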