arxiv: 1907.02893 · v3 · submitted 2019-07-05 · stat.ML · cs.AI · cs.LG

Invariant Risk Minimization

Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, David Lopez-Paz

Pith reviewed 2026-05-12 06:25 UTC · model grok-4.3

keywords: invariant risk minimization · out-of-distribution generalization · causal inference · domain generalization · representation learning · machine learning

The pith

Invariant Risk Minimization finds a data representation where the same classifier is optimal for every training distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a training objective that searches for features whose relationship to the target stays fixed even as the surrounding data distribution changes. It shows that this objective recovers features tied to the stable causal mechanisms in the data rather than environment-specific correlations. A reader should care because standard empirical risk minimization often latches onto spurious patterns that break when the test distribution differs from training. By enforcing invariance of the optimal predictor, the method aims to produce models that continue to work under distribution shift.

Core claim

Invariant Risk Minimization (IRM) learns a representation such that the optimal linear classifier on top of that representation is identical across all training environments. This is achieved by jointly minimizing the average risk while adding a penalty that forces the gradient of each environment's risk with respect to the classifier parameters to vanish at the shared optimum. The resulting invariant features correspond to the causal factors that govern the label in the underlying data-generating process, enabling generalization to environments not seen during training.
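The claim above can be written out as two optimization problems. The following LaTeX is a sketch of the constrained objective and of its penalized relaxation, using $R^e$ for the risk in environment $e$, $\Phi$ for the representation, and $w$ for the classifier. The symbols $\mathcal{E}_{tr}$ (the set of training environments) and $\lambda$ (the penalty weight) are notational assumptions introduced here, and the sum over environments differs from the average risk only by a constant factor.

```latex
% Constrained form: a single classifier w must be optimal on top of \Phi
% in every training environment.
\[
\min_{\Phi,\, w} \; \sum_{e \in \mathcal{E}_{tr}} R^e(w \circ \Phi)
\quad \text{subject to} \quad
w \in \arg\min_{\bar{w}} R^e(\bar{w} \circ \Phi)
\;\; \text{for all } e \in \mathcal{E}_{tr}.
\]

% Penalized relaxation: fix the classifier to the scalar w = 1 and
% penalize the squared norm of each environment's risk gradient there.
\[
\min_{\Phi} \; \sum_{e \in \mathcal{E}_{tr}}
\Big[ R^e(\Phi) + \lambda \,
\big\| \nabla_{w \mid w = 1.0} \, R^e(w \cdot \Phi) \big\|^2 \Big].
\]
```

The second form trades the hard optimality constraint for a differentiable penalty, which is what makes the objective trainable by gradient descent.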

What carries the argument

The argument rests on the IRM penalty term, which requires the gradient of each environment's risk with respect to a fixed classifier to be zero, so that the same predictor is optimal in every environment.
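To make the penalty concrete, here is a minimal NumPy sketch for squared loss with a scalar classifier fixed at w = 1. The toy data-generating process (x1 causes y, y causes x2, with an environment-specific noise scale sigma) and the names `sample_environment` and `gradient_penalty` are assumptions made for illustration, not code from the paper.

```python
import numpy as np

def sample_environment(sigma, n=200_000, seed=0):
    """Toy SCM (illustrative assumption): x1 -> y -> x2."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(0.0, sigma, n)      # cause of y; its scale varies by environment
    y = x1 + rng.normal(0.0, sigma, n)  # label mechanism shared by all environments
    x2 = y + rng.normal(0.0, 1.0, n)    # effect of y; predictive but not causal
    return x1, x2, y

def gradient_penalty(features, y):
    """Squared derivative of mean((w * features - y)**2) w.r.t. w, at w = 1."""
    grad_at_one = np.mean(2.0 * (features - y) * features)
    return grad_at_one ** 2

penalties = {}
for env_id, sigma in enumerate((0.5, 2.0)):
    x1, x2, y = sample_environment(sigma, seed=env_id)
    penalties[sigma] = {
        "x1": gradient_penalty(x1, y),  # representation Phi(x) = x1
        "x2": gradient_penalty(x2, y),  # representation Phi(x) = x2
    }
    print(f"sigma={sigma}: penalty(x1)={penalties[sigma]['x1']:.4f}, "
          f"penalty(x2)={penalties[sigma]['x2']:.4f}")
```

In this toy setup the penalty for x1 is close to zero in both environments, because w = 1 is the least-squares weight for x1 whatever sigma is. For x2 the gradient at w = 1 has population value 2, so the penalty stays near 4, and no single rescaling of x2 removes it in both environments, since its best scalar weight, 2·sigma²/(2·sigma² + 1), depends on sigma.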

Load-bearing premise

The observed environments must share the same causal mechanisms that determine the label while differing only in the distributions of non-causal variables.

What would settle it

A controlled experiment on synthetic data with a known causal graph, in which the environments are generated by intervening only on non-causal variables, showing whether IRM recovers exactly the causal features or fails to do so.
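A hedged sketch of the ground truth such an experiment would compare against: the code below does not run IRM. It only fits least squares separately in each environment for every feature subset and reports how far the coefficients move between environments. The toy graph (x1 causes y, y causes x2), the two noise scales, and all function names are assumptions chosen for illustration.

```python
import itertools
import numpy as np

def sample_environment(sigma, n=50_000, seed=0):
    """Toy SCM (illustrative assumption): x1 -> y -> x2, noise scale sigma."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(0.0, sigma, n)
    y = x1 + rng.normal(0.0, sigma, n)
    x2 = y + rng.normal(0.0, 1.0, n)
    return np.column_stack([x1, x2]), y

def per_environment_coefficients(envs, columns):
    """Least-squares coefficients for the chosen columns, one row per environment."""
    coefs = []
    for X, y in envs:
        beta, *_ = np.linalg.lstsq(X[:, columns], y, rcond=None)
        coefs.append(beta)
    return np.array(coefs)

envs = [sample_environment(sigma, seed=i) for i, sigma in enumerate((0.5, 2.0))]
names = ("x1", "x2")
coefficient_gap = {}
for k in (1, 2):
    for columns in itertools.combinations(range(2), k):
        coefs = per_environment_coefficients(envs, list(columns))
        gap = float(np.max(np.abs(coefs[0] - coefs[1])))  # largest shift across environments
        label = "+".join(names[c] for c in columns)
        coefficient_gap[label] = gap
        print(f"{label}: coefficients per environment {coefs.round(3).tolist()}, gap {gap:.3f}")
```

Only the subset containing x1 alone should show a gap near zero. In the population, regressing on both features gives weights 1/(sigma² + 1) on x1 and sigma²/(sigma² + 1) on x2, so the fitted predictor changes with the environment. An IRM run on the same data would pass the test if its learned representation depended on x1 only.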

Original abstract

We introduce Invariant Risk Minimization (IRM), a learning paradigm to estimate invariant correlations across multiple training distributions. To achieve this goal, IRM learns a data representation such that the optimal classifier, on top of that data representation, matches for all training distributions. Through theory and experiments, we show how the invariances learned by IRM relate to the causal structures governing the data and enable out-of-distribution generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Invariant Risk Minimization (IRM), a learning paradigm that estimates a data representation such that the optimal classifier on top of this representation is the same across multiple training distributions. It claims through theory and experiments that the learned invariances correspond to causal structures governing the data and enable out-of-distribution generalization.

Significance. If the central claims hold, this work offers a principled objective for learning predictors that exploit invariance across environments to achieve robust OOD performance, with a direct link to identifying causal features. This is significant for bridging empirical risk minimization with causal inference in non-i.i.d. settings, and the reproducible experimental protocols and parameter-free aspects of the formulation (where applicable) strengthen its potential impact.

major comments (2)

[§4] The theoretical equivalence showing that IRM recovers causal parents is derived only for linear structural causal models with additive noise and a fixed number of environments; the proof relies on linearity of the representation and identifiability of the shared optimal w. No uniqueness result is given for non-linear feature maps or general non-linear SCMs, so the broader claim that invariances learned by IRM relate to causal structures does not follow in full generality.
[Eq. (3)] The practical IRM objective (with the gradient penalty at w=1) enforces only a first-order stationarity condition under the linear classifier assumption. The manuscript does not show that this approximation identifies causal features or guarantees OOD generalization when the representation or SCM is non-linear, which is load-bearing for the central claim.

minor comments (2)

[Abstract] The abstract and introduction could more explicitly qualify the scope of the theoretical results to linear cases to avoid overstatement of the causal connection.
Experimental sections would benefit from additional details on environment construction and sensitivity to the penalty hyperparameter to aid reproducibility.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and insightful comments on our manuscript. We address each major comment below and will incorporate clarifications to better delineate the scope of our theoretical and practical results.

Referee: [§4] The theoretical equivalence showing that IRM recovers causal parents is derived only for linear structural causal models with additive noise and a fixed number of environments; the proof relies on linearity of the representation and identifiability of the shared optimal w. No uniqueness result is given for non-linear feature maps or general non-linear SCMs, so the broader claim that invariances learned by IRM relate to causal structures does not follow in full generality.

Authors: We agree that the equivalence result in Section 4 is derived under the specific assumptions of linear structural causal models with additive noise and a fixed number of environments, relying on the linearity of the representation and the identifiability of the shared optimal classifier weights. The manuscript does not provide a uniqueness result for non-linear feature maps or general non-linear SCMs. The broader statements linking invariances to causal structures are presented as holding under these assumptions, with supporting experimental evidence in more general settings. We will revise Section 4, the abstract, and related discussion to explicitly state the assumptions and note that extensions to non-linear cases remain an open direction.

Revision: partial

Referee: [Eq. (3)] The practical IRM objective (with the gradient penalty at w=1) enforces only a first-order stationarity condition under the linear classifier assumption. The manuscript does not show that this approximation identifies causal features or guarantees OOD generalization when the representation or SCM is non-linear, which is load-bearing for the central claim.

Authors: The practical objective in Equation (3) uses a gradient penalty (evaluated at w=1) to enforce the invariance condition, which is exact under the linear classifier assumption but reduces to a first-order stationarity condition more generally. We do not provide a proof that this approximation identifies causal features or guarantees OOD generalization for non-linear representations or SCMs. The formulation is motivated by the linear theory, and our experiments demonstrate improved OOD performance in non-linear regimes. We will add a clarifying discussion of the approximation's nature and limitations in the revised manuscript.

Revision: partial

Circularity Check

0 steps flagged

No significant circularity in IRM derivation chain

The core IRM definition (a representation Φ such that argmin_w R^e(w ∘ Φ) is identical across environments e) is stated directly from the multi-environment setup and does not reduce to any fitted target quantity or self-referential loop. Section 4 derives the link to causal parents only under explicit linear SCM + additive noise assumptions; this is a one-directional implication proved from the SCM, not a tautology or renaming of the input risks. The practical objective (Eq. 3 with gradient penalty) is an explicit relaxation of the definition, not a statistical fit called a prediction. No load-bearing self-citation or ansatz smuggling is present; the derivation remains self-contained against the stated assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that multiple training distributions share invariant causal structures while varying in spurious correlations.

axioms (1)

Domain assumption: Multiple training distributions share the same causal mechanisms but differ in non-causal aspects.
This premise is required for the shared optimal classifier to isolate causal invariants.

Forward citations

Cited by 47 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Statistical Cost of Adaptation in Multi-Source Transfer Learning
math.ST 2026-05 unverdicted novelty 8.0

Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.
TILT: Target-induced loss tilting under covariate shift
cs.LG 2026-05 conditional novelty 7.0

TILT adds a target-data penalty on an auxiliary predictor component to induce effective importance weighting for unsupervised domain adaptation under covariate shift.
Separating Shortcut Transition from Cross-Family OOD Failure in a Minimal Model
cs.LG 2026-05 conditional novelty 7.0

A minimal model analytically separates shortcut attraction during training from the switch to a shortcut rule and from cross-family out-of-distribution failure.
Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection
cs.CV 2026-05 unverdicted novelty 7.0

A new orthogonal projection module for video anomaly detection suppresses facial attributes via weak face-presence signals and cosine alignment while preserving anomaly-relevant features like pose and motion.
Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning
cs.LG 2026-05 unverdicted novelty 7.0

Excess risk decomposes into independent alignment (trace of inverse average Hessian times gradient covariance) and curvature terms, so both flatness and gradient alignment are required; SAGE achieves this and sets new...
Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study
cs.CV 2026-05 unverdicted novelty 7.0

A large-scale benchmark finds that recent multimodal domain generalization methods give only marginal gains over a plain ERM baseline, with no method winning consistently and all degrading sharply under corruption or ...
eXplaining to Learn (eX2L): Regularization Using Contrastive Visual Explanation Pairs for Distribution Shifts
cs.CV 2026-05 unverdicted novelty 7.0

eX2L improves robustness to distribution shifts by penalizing similarity between Grad-CAM maps of a label classifier and a confounder classifier, reaching new SOTA average and worst-group accuracy on the Spawrious benchmark.
Domain Generalization through Spatial Relation Induction over Visual Primitives
cs.CV 2026-05 unverdicted novelty 7.0

PARSE improves domain generalization accuracy by factoring recognition into visual primitives and their spatial relational compositions learned end-to-end with differentiable predicates.
ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection
cs.CV 2026-05 unverdicted novelty 7.0

ScriptHOI decomposes HOI phrases into state slots and uses script coverage, conflict, interval partial-label learning, and counterfactual contrast to improve rare and unseen interaction detection while cutting afforda...
ISAAC: Auditing Causal Reasoning in Deep Models for Drug-Target Interaction
cs.LG 2026-05 unverdicted novelty 7.0

ISAAC auditing applied to three DTI models on the Davis benchmark finds 25% relative differences in causal reasoning scores despite nearly identical AUROC values.
Robust and Clinically Reliable EEG Biomarkers: A Cross Population Framework for Generalizable Parkinson's Disease Detection
cs.LG 2026-04 conditional novelty 7.0

A cross-population framework for EEG Parkinson's detection using exhaustive 75 directional evaluations and nested validation shows asymmetric transfer and accuracy up to 94.1% when training diversity increases, suppor...
Synthetic Designed Experiments for Diagnosing Vision Model Failure
cs.CV 2026-03 unverdicted novelty 7.0

SDRS uses designed experiments and ANOVA decomposition on synthetic data to identify Type I coverage gaps and Type II spurious dependencies in vision models, then generates targeted data to improve performance.
Rethinking Molecular OOD Generalization via Target-Aware Source Selection
cs.LG 2026-05 unverdicted novelty 6.0

SCOPE-BENCH shows state-of-the-art molecular models suffer up to 8x higher errors under extreme OOD, while POMA reduces mean absolute error by up to 11.2% via target-aware source selection and dual-scale adaptation.
Understanding Generalization through Decision Pattern Shift
cs.LG 2026-05 unverdicted novelty 6.0

DPS quantifies deviation of per-sample decision patterns from class averages and shows linear correlation with generalization gaps while unifying degradation scenarios into a continuous trajectory.
DeconDTN-Toolkit: A Library for Evaluation and Enhancement of Robustness to Provenance Shift
cs.LG 2026-05 unverdicted novelty 6.0

DeconDTN-Toolkit simulates provenance shifts to expose ERM vulnerabilities and provides tools plus a robust OOD indicator for mitigating confounding by data provenance.
Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training
cs.LG 2026-05 unverdicted novelty 6.0

Standard preference learning induces spurious feature reliance via mean bias and correlation leakage, creating irreducible distribution shift vulnerabilities that tie training mitigates without degrading causal learning.
Intervention-Based Time Series Causal Discovery via Simulator-Generated Interventional Distributions
cs.LG 2026-05 unverdicted novelty 6.0

SVAR-FM uses simulator clamping to produce interventional distributions and flow matching to identify time series causal structures, with an error bound that predicts sign reversal of causal effects below a simulator ...
The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory
cs.LG 2026-05 unverdicted novelty 6.0

Agentic memory improves clean reasoning but worsens performance when spurious patterns are present in stored trajectories; CAMEL calibration reduces this reliance while preserving clean performance.
CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators
cs.AI 2026-05 unverdicted novelty 6.0

CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.
TopoGeoScore: A Self-Supervised Source-Only Geometric Framework for OOD Checkpoint Selection
cs.LG 2026-05 unverdicted novelty 6.0

TopoGeoScore combines a torsion-inspired Laplacian log-determinant, Ollivier-Ricci curvature, and higher-order topological summaries from source embeddings, with weights learned via self-supervised invariance to geome...
Same Brain, Different Prediction: How Preprocessing Choices Undermine EEG Decoding Reliability
cs.LG 2026-05 unverdicted novelty 6.0

EEG model predictions on the same brain signals flip for up to 42% of trials under different preprocessing choices, with new tools introduced to measure and mitigate the resulting instability.
ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection
cs.CV 2026-05 unverdicted novelty 6.0

ScriptHOI improves rare and unseen HOI recognition by decomposing phrases into state slots, using visual tokenization and slot-wise matching for script coverage and conflict to calibrate predictions and constrain trai...
Anatomy of a failure: When, how, and why deep vision fails in scientific domains
cs.CV 2026-05 unverdicted novelty 6.0

Deep learning on information-rich scientific images collapses to one-dimensional predictions due to a mismatch between data priors and the model's simplicity bias, even after robustification techniques.
Learning to Theorize the World from Observation
cs.LG 2026-05 unverdicted novelty 6.0

NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.
Attribution-Guided Masking for Robust Cross-Domain Sentiment Classification
cs.LG 2026-05 unverdicted novelty 6.0

AGM adds a gradient-based masking loss during fine-tuning to suppress reliance on spurious tokens, achieving competitive zero-shot transfer on sentiment tasks while providing token-level interpretability.
Deciphering Shortcut Learning from an Evolutionary Game Theory Perspective
cs.AI 2026-05 unverdicted novelty 6.0

Evolutionary game theory shows gradient descent and stochastic gradient descent drive neural networks to distinct stable states favoring shortcut or core subnetworks, with data and optimization noise shaping shortcut ...
Cheeger--Hodge Contrastive Learning for Structurally Robust Graph Representation Learning
cs.LG 2026-04 unverdicted novelty 6.0

CHCL aligns a Cheeger-Hodge joint signature across graph augmentations to produce embeddings that remain stable under local structural changes.
Robust Representation Learning through Explicit Environment Modeling
stat.ML 2026-04 unverdicted novelty 6.0

Explicitly modeling and marginalizing environment variation via generalized random-intercept models produces representations that support robust average prediction across unseen environments and outperform invariant-l...
Bayesian Environment Invariant Regression
stat.ME 2026-04 unverdicted novelty 6.0

A Bayesian spike-and-slab model separates invariant regression mechanisms from environment-specific associations, with proven selection consistency and posterior contraction under a working model.
Deep sprite-based image models: An analysis
cs.CV 2026-04 unverdicted novelty 6.0

A deep sprite-based image decomposition method matches SOTA unsupervised class-aware segmentation on CLEVR, scales linearly with objects, explicitly identifies categories, and fully models images interpretably.
Adversarial Label Invariant Graph Data Augmentations for Out-of-Distribution Generalization
cs.LG 2026-04 unverdicted novelty 6.0

RIA uses adversarial exploration of counterfactual graph environments via label-invariant augmentations to improve OoD generalization in graph classification tasks.
Learning Stable Predictors from Weak Supervision under Distribution Shift
cs.LG 2026-04 unverdicted novelty 6.0

Weak supervision supports in-domain learning for CRISPR transcriptomic perturbations but temporal shifts cause negative R-squared and near-zero correlation across linear and tree models, unlike partial cell-line transfer.
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
cs.LG 2023-10 accept novelty 6.0

SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
On the Opportunities and Risks of Foundation Models
cs.LG 2021-08 accept novelty 6.0

Foundation models are large adaptable AI systems with emergent capabilities that offer broad opportunities but carry risks from homogenization, opacity, and inherited defects across downstream applications.
Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization
cs.LG 2019-11 conditional novelty 6.0

Increased regularization is required for group DRO to achieve good worst-group generalization in overparameterized neural networks.
Beyond Instance-Level Self-Supervision in 3D Multi-Modal Medical Imaging
cs.CV 2026-05 unverdicted novelty 5.0

A self-supervised approach uses consistent spatial relationships of anatomical structures across patients to improve 3D multi-modal medical image representations, yielding modest gains on segmentation and classificati...
Causal Parametric Drift Simulation: A Digital Twin Framework for Classifier Robustness Evaluation
cs.LG 2026-05 unverdicted novelty 5.0

A framework using structural causal models simulates parametric drifts to evaluate classifier robustness more realistically than static tests or noise perturbations.
Agentic AIs Are the Missing Paradigm for Out-of-Distribution Generalization in Foundation Models
cs.LG 2026-05 unverdicted novelty 5.0

Agentic AI systems are required to overcome the parameter coverage ceiling that prevents foundation models from handling certain out-of-distribution cases.
When Brain Networks Travel: Learning Beyond Site
cs.LG 2026-05 unverdicted novelty 5.0

CORE decouples site confounders in fMRI networks, profiles transient dynamics on a population scaffold using line graphs, and applies subject-adaptive gating to achieve up to 6.7% better cross-site generalization on A...
MER-DG: Modality-Entropy Regularization for Multimodal Domain Generalization
cs.LG 2026-05 unverdicted novelty 5.0

MER-DG applies modality-entropy regularization to reduce fusion overfitting in multimodal domain generalization, reporting average gains of 5% over standard fusion and 2% over prior methods on EPIC-Kitchens and HAC be...
Dreaming Across Towns: Semantic Rollout and Town-Adversarial Regularization for Zero-Shot Held-Out-Town Fixed-Route Driving in CARLA
cs.RO 2026-04 unverdicted novelty 5.0

Semantic rollout prediction plus town-adversarial regularization on a Dreamer agent raises mean zero-shot success rate for fixed-route driving across held-out CARLA towns under fixed weather and no traffic.
Asynchronous Federated Unlearning with Invariance Calibration for Medical Imaging
cs.LG 2026-04 unverdicted novelty 5.0

AFU-IC decouples client unlearning from global federated training in medical imaging and adds server-side invariance calibration to prevent relearning of erased data.
Sensitivity Uncertainty Alignment in Large Language Models
cs.CR 2026-04 unverdicted novelty 5.0

SUA measures the gap between how much an LLM's output changes under perturbations and how uncertain the model claims to be, with a training procedure to reduce that gap.
Beyond Surface Artifacts: Capturing Shared Latent Forgery Knowledge Across Modalities
cs.CV 2026-04 unverdicted novelty 5.0

Introduces MAF framework and DeepModal-Bench to capture universal cross-modal forgery traces for better generalization in multimodal deepfake detection.
Why Invariance is Not Enough for Biomedical Domain Generalization and How to Fix It
eess.IV 2026-04 unverdicted novelty 5.0

MaskGen improves domain generalization for biomedical image segmentation by using source intensities plus domain-stable foundation model representations with minimal added complexity.
Investigating Data Interventions for Subgroup Fairness: An ICU Case Study
cs.LG 2026-04 unverdicted novelty 4.0

Data addition from different sources does not reliably boost subgroup fairness in ICU models and often requires post-hoc calibration to work.
Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems
cs.LG 2020-05 unverdicted novelty 2.0

Offline RL promises to extract high-utility policies from static datasets but faces fundamental challenges that current methods only partially address.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · cited by 46 Pith papers

[1]

Autonomy

John Aldrich. Autonomy. Oxford Economic Papers, 1989

work page 1989
[2]

Robust supervised learning

James Andrew Bagnell. Robust supervised learning. In AAAI, 2005

work page 2005
[3]

Bartlett, Philip M

Peter L. Bartlett, Philip M. Long, G´ abor Lugosi, and Alexander Tsigler. Benign Overﬁtting in Linear Regression. arXiv, 2019

work page 2019
[4]

Recognition in terra incognita

Sara Beery, Grant Van Horn, and Pietro Perona. Recognition in terra incognita. In ECCV, 2018

work page 2018
[5]

Analysis of representations for domain adaptation

Shai Ben-David, John Blitzer, Koby Crammer, and Fernando Pereira. Analysis of representations for domain adaptation. In NIPS. 2007

work page 2007
[6]

Robust optimiza- tion

Aharon Ben-Tal, Laurent El Ghaoui, and Arkadi Nemirovski. Robust optimiza- tion. Princeton University Press, 2009

work page 2009
[7]

A meta- transfer objective for learning to disentangle causal mechanisms

Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Rosemary Ke, S´ ebastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta- transfer objective for learning to disentangle causal mechanisms. arXiv, 2019

work page 2019
[8]

Denker, Harris Drucker, Isabelle Guyon, Lawrence D

L´ eon Bottou, Corinna Cortes, John S. Denker, Harris Drucker, Isabelle Guyon, Lawrence D. Jackel, Yann Le Cun, Urs A. Muller, Eduard S¨ ackinger, Patrice Simard, and Vladimir Vapnik. Comparison of classiﬁer methods: a case study in handwritten digit recognition. In ICPR, 1994

work page 1994
[9]

Approximating CNNs with bag-of-local- features models works surprisingly well on imagenet

Wieland Brendel and Matthias Bethge. Approximating CNNs with bag-of-local- features models works surprisingly well on imagenet. In ICLR, 2019

work page 2019
[10]

Invariant scattering convolution networks

Joan Bruna and Stephane Mallat. Invariant scattering convolution networks. TPAMI, 2013

work page 2013
[11]

In- termittent process analysis with scattering moments

Joan Bruna, Stephane Mallat, Emmanuel Bacry, and Jean-Franois Muzy. In- termittent process analysis with scattering moments. The Annals of Statistics , 2015

work page 2015
[12]

Two theorems on invariance and causality

Nancy Cartwright. Two theorems on invariance and causality. Philosophy of Science, 2003. 23

work page 2003
[13]

Cheng and Hongjing Lu

Patricia W. Cheng and Hongjing Lu. Causal invariance as an essential constraint for creating a causal representation of the world. The Oxford handbook of causal reasoning, 2017

work page 2017
[14]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. NAACL, 2019

work page 2019
[15]

Statistics of robust optimization: A generalized empirical likelihood approach

John Duchi, Peter Glynn, and Hongseok Namkoong. Statistics of robust optimization: A generalized empirical likelihood approach. arXiv, 2016

work page 2016
[16]

Domain- adversarial training of neural networks

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Fran¸ cois Laviolette, Mario March, and Victor Lempitsky. Domain- adversarial training of neural networks. JMLR, 2016

work page 2016
[17]

Wichmann, and Wieland Brendel

Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. ICLR, 2019

work page 2019
[18]

Learning causal structures using regression invariance

AmirEmad Ghassami, Saber Salehkaleybar, Negar Kiyavash, and Kun Zhang. Learning causal structures using regression invariance. In NIPS, 2017

work page 2017
[19]

Patrick J. Grother. NIST Special Database 19: Handprinted forms and char- acters database. https://www.nist.gov/srd/nist-special-database-19 ,

work page
[20]

File doc/doc.ps in the 1995 NIST CD ROM NIST Special Database 19

work page 1995
[21]

The probability approach in econometrics

Trygve Haavelmo. The probability approach in econometrics. Econometrica: Journal of the Econometric Society , 1944

work page 1944
[22]

Conditional variance penalties and domain shift robustness

Christina Heinze-Deml and Nicolai Meinshausen. Conditional variance penalties and domain shift robustness. arXiv, 2017

work page 2017
[23]

Invariant causal prediction for nonlinear models

Christina Heinze-Deml, Jonas Peters, and Nicolai Meinshausen. Invariant causal prediction for nonlinear models. Journal of Causal Inference , 2018

work page 2018
[24]

Revisiting visual question answering baselines

Allan Jabri, Armand Joulin, and Laurens Van Der Maaten. Revisiting visual question answering baselines. In ECCV, 2016

work page 2016
[25]

Johansson, David A

Fredrik D. Johansson, David A. Sontag, and Rajesh Ranganath. Support and invertibility in domain-invariant representations. AISTATS, 2019

work page 2019
[26]

General- ization in anti-causal learning

Niki Kilbertus, Giambattista Parascandolo, and Bernhard Sch¨ olkopf. General- ization in anti-causal learning. arXiv, 2018

work page 2018
[27]

Stable prediction across unknown environments

Kun Kuang, Peng Cui, Susan Athey, Ruoxuan Xiong, and Bo Li. Stable prediction across unknown environments. In SIGKDD, 2018

work page 2018
[28]

Lake, Tomer D

Brenden M. Lake, Tomer D. Ullman, Joshua B Tenenbaum, and Samuel J. Gershman. Building machines that learn and think like people. Behavioral and brain sciences, 2017. 24

work page 2017
[29]

James M. Lee. Introduction to Smooth Manifolds . Springer, 2003

work page 2003
[30]

Counterfactuals

David Lewis. Counterfactuals. John Wiley & Sons, 2013

work page 2013
[31]

Deep domain generalization via conditional invariant adver- sarial networks

Ya Li, Xinmei Tian, Mingming Gong, Yajing Liu, Tongliang Liu, Kun Zhang, and Dacheng Tao. Deep domain generalization via conditional invariant adver- sarial networks. In ECCV, 2018

work page 2018
[32]

From dependence to causation

David Lopez-Paz. From dependence to causation. PhD thesis, University of Cambridge, 2016

work page 2016
[33]

Discovering causal signals in images

David Lopez-Paz, Robert Nishihara, Soumith Chintala, Bernhard Scholkopf, and L´ eon Bottou. Discovering causal signals in images. In CVPR, 2017

work page 2017
[34]

Learning to pivot with adversarial networks

Gilles Louppe, Michael Kagan, and Kyle Cranmer. Learning to pivot with adversarial networks. In Advances in neural information processing systems , pages 981–990, 2017

work page 2017
[35]

Domain adaptation by using causal inference to predict invariant conditional distributions

Sara Magliacane, Thijs van Ommen, Tom Claassen, Stephan Bongers, Philip Versteeg, and Joris M Mooij. Domain adaptation by using causal inference to predict invariant conditional distributions. In NIPS, 2018

work page 2018
[36]

Deep learning: A critical appraisal

Gary Marcus. Deep learning: A critical appraisal. arXiv, 2018

work page 2018
[37]

Causality from a distributional robustness point of view

Nicolai Meinshausen. Causality from a distributional robustness point of view. In Data Science Workshop (DSW) , 2018

work page 2018
[38]

Maximin effects in inhomogeneous large-scale data

Nicolai Meinshausen and Peter Bühlmann. Maximin effects in inhomogeneous large-scale data. The Annals of Statistics, 2015

work page 2015
[39]

Dimensions of scientific law

Sandra D. Mitchell. Dimensions of scientific law. Philosophy of Science, 2000

work page 2000
[40]

Causality: Models, Reasoning, and Inference

Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2nd edition, 2009

work page 2009
[41]

Causal inference using invariant prediction: identification and confidence intervals

Jonas Peters, Peter Bühlmann, and Nicolai Meinshausen. Causal inference using invariant prediction: identification and confidence intervals. JRSS B, 2016

work page 2016
[42]

Elements of causal inference: foundations and learning algorithms

Jonas Peters, Dominik Janzing, and Bernhard Schölkopf. Elements of causal inference: foundations and learning algorithms. MIT Press, 2017

work page 2017
[43]

Incompleteness, non locality and realism

Michael Redhead. Incompleteness, non locality and realism: a prolegomenon to the philosophy of quantum mechanics. 1987

work page 1987
[44]

Invariant models for causal transfer learning

Mateo Rojas-Carulla, Bernhard Schölkopf, Richard Turner, and Jonas Peters. Invariant models for causal transfer learning. JMLR, 2018

work page 2018
[45]

Donald B. Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 1974

work page 1974
[46]

On causal and anticausal learning

Bernhard Schölkopf, Dominik Janzing, Jonas Peters, Eleni Sgouritsa, Kun Zhang, and Joris Mooij. On causal and anticausal learning. In ICML, 2012

work page 2012
[47]

Certifying some distributional robustness with principled adversarial training

Aman Sinha, Hongseok Namkoong, and John Duchi. Certifying some distributional robustness with principled adversarial training. ICLR, 2018

work page 2018
[48]

Causal necessity: a pragmatic investigation of the necessity of laws

Brian Skyrms. Causal necessity: a pragmatic investigation of the necessity of laws. Yale University Press, 1980

work page 1980
[49]

Bob L. Sturm. A simple method to determine if a music information retrieval system is a “horse”. IEEE Transactions on Multimedia, 2014

work page 2014
[50]

Unbiased look at dataset bias

Antonio Torralba and Alexei Efros. Unbiased look at dataset bias. In CVPR, 2011

work page 2011
[51]

Principles of risk minimization for learning theory

Vladimir Vapnik. Principles of risk minimization for learning theory. In NIPS, 1992

work page 1992
[52]

Vladimir N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998

work page 1998
[53]

Do we still need models or just more data and compute?

Max Welling. Do we still need models or just more data and compute?, 2019

work page 2019
[54]

The marginal value of adaptive gradient methods in machine learning

Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nati Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In NIPS, 2017

work page 2017
[55]

Making things happen: A theory of causal explanation

James Woodward. Making things happen: A theory of causal explanation. Oxford University Press, 2005

work page 2005
[56]

Correlation and causation

Sewall Wright. Correlation and causation. Journal of Agricultural Research, 1921

work page 1921
[57]

Understanding deep learning requires rethinking generalization

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. ICLR, 2016

A Additional theorems

Theorem 10. Let $\Sigma^e_{X,X} := \mathbb{E}_{X^e}[X^e X^{e\top}] \in \mathbb{S}^{d \times d}_+$, with $\mathbb{S}^{d \times d}_+$ the space of symmetric positive semi-definite matrices, and $\Sigma^e_{X,\epsilon} := \mathbb{E}_{X^e}[X^e \epsilon^e] \in \mathbb{R}^d$. Then, for any arbitrary tuple ( ...

work page 2016
[58]

Since domain adaptation enforces $P(\Phi(X^{e_s})) = P(\Phi(X^{e_t}))$, it consequently enforces $P(\hat{Y}^{e_s}) = P(\hat{Y}^{e_t})$, where $\hat{Y}^e = w(\Phi(X^e))$, for all $e \in \{e_s, e_t\}$

Using these data and the domain adaptation recipe outlined above, we build a classifier $w \circ \Phi$. Since domain adaptation enforces $P(\Phi(X^{e_s})) = P(\Phi(X^{e_t}))$, it consequently enforces $P(\hat{Y}^{e_s}) = P(\hat{Y}^{e_t})$, where $\hat{Y}^e = w(\Phi(X^e))$, for all $e \in \{e_s, e_t\}$. Then, the classification accuracy will be at most 20%. This is worse than random guessing, in a problem where simply trai...

work page