Invariant Risk Minimization
Pith reviewed 2026-05-12 06:25 UTC · model grok-4.3
The pith
Invariant Risk Minimization finds a data representation where the same classifier is optimal for every training distribution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Invariant Risk Minimization (IRM) learns a representation such that the optimal linear classifier on top of that representation is identical across all training environments. In its practical form, this is achieved by minimizing the average risk plus a penalty on the gradient of each environment's risk with respect to the classifier parameters, evaluated at a fixed classifier (w=1), so that the gradient is pushed toward zero at the shared optimum. Under the linear assumptions of the paper's theory, the resulting invariant features correspond to the causal factors that govern the label in the underlying data-generating process, which is what enables generalization to environments not seen during training.
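As a sketch of the objective described above, the idealized constraint and its penalized relaxation can be written as follows. The notation is partly assumed here: $\mathcal{E}_{tr}$ for the set of training environments, $R^e$ for the risk in environment $e$, $\Phi$ for the representation, $w$ for the classifier, and $\lambda \ge 0$ for a penalty weight. The sum over environments equals the average risk up to a constant factor. The relaxed form is the one usually referred to as IRMv1; the manuscript's own notation and numbering may differ.

```latex
% Idealized IRM: a single classifier w must be optimal in every training environment.
\min_{\Phi,\, w} \; \sum_{e \in \mathcal{E}_{tr}} R^e(w \circ \Phi)
\quad \text{subject to} \quad
w \in \arg\min_{\bar{w}} R^e(\bar{w} \circ \Phi) \;\; \text{for all } e \in \mathcal{E}_{tr}

% Penalized relaxation (commonly called IRMv1): fix w = 1 and penalize the gradient.
\min_{\Phi} \; \sum_{e \in \mathcal{E}_{tr}}
\Big[ R^e(\Phi) + \lambda \, \big\| \nabla_{w \mid w = 1.0} \, R^e(w \cdot \Phi) \big\|^2 \Big]
```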
What carries the argument
The IRM penalty term, which penalizes any nonzero gradient of the risk with respect to a fixed classifier in each environment and thereby pushes the same predictor toward being optimal everywhere.
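The penalty can be illustrated with a minimal sketch, assuming squared-error loss, a scalar classifier fixed at w = 1, and feature values that have already been computed by some representation. The function names `risk_and_penalty` and `irmv1_objective` and the weight `lam` are hypothetical, chosen for this illustration rather than taken from the paper.

```python
from typing import List, Sequence, Tuple


def risk_and_penalty(features: Sequence[float], labels: Sequence[float]) -> Tuple[float, float]:
    """Return (risk, penalty) for one environment.

    Assumption: the predictor is w * phi(x) with squared-error loss, and the
    gradient with respect to the scalar w is evaluated at w = 1.
    """
    n = len(features)
    # Risk at w = 1: mean squared error of the raw feature as a prediction.
    risk = sum((f - y) ** 2 for f, y in zip(features, labels)) / n
    # d/dw mean((w*f - y)^2) at w = 1 equals mean(2 * (f - y) * f).
    grad_w = sum(2.0 * (f - y) * f for f, y in zip(features, labels)) / n
    return risk, grad_w ** 2


def irmv1_objective(envs: List[Tuple[Sequence[float], Sequence[float]]], lam: float) -> float:
    """Sum of per-environment risk plus lam times the squared gradient penalty."""
    total = 0.0
    for features, labels in envs:
        risk, penalty = risk_and_penalty(features, labels)
        total += risk + lam * penalty
    return total


if __name__ == "__main__":
    # Environment A: the feature already equals the label, so w = 1 is optimal.
    env_a = ([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
    # Environment B: the label is twice the feature, so w = 1 is not optimal.
    env_b = ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
    print(risk_and_penalty(*env_a))
    print(risk_and_penalty(*env_b))
    print(irmv1_objective([env_a, env_b], lam=1.0))
```

In environment A the gradient at w = 1 is zero, so the penalty vanishes. In environment B the best scalar classifier would be w = 2, so the gradient at w = 1 is nonzero and the penalty is positive. A representation whose optimal classifier differs between environments therefore pays a cost that grows with `lam`.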
Load-bearing premise
The observed environments must share the same causal mechanisms that determine the label while differing only in the distributions of non-causal variables.
What would settle it
A controlled experiment on synthetic data with a known causal graph, in which IRM either recovers exactly the causal features or fails to do so when the environments are generated by intervening only on non-causal variables.
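The data side of such an experiment can be sketched as follows. This is a hypothetical toy SCM made up for illustration, not the paper's protocol: x1 causes y, y causes x2, and environments differ only in the noise scale of the non-causal variable x2. The code does not run IRM; it only checks which regression coefficient stays stable across environments, which is the property an invariant predictor is meant to exploit.

```python
import random
from typing import Dict, List, Sequence, Tuple


def simulate_environment(noise_scale: float, n: int, rng: random.Random) -> List[Tuple[float, float, float]]:
    """Draw n samples (x1, x2, y) from the toy SCM for one environment."""
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0.0, 1.0)              # causal parent of y, same in every environment
        y = x1 + rng.gauss(0.0, 1.0)          # label mechanism, same in every environment
        x2 = y + rng.gauss(0.0, noise_scale)  # effect of y; only this mechanism is intervened on
        rows.append((x1, x2, y))
    return rows


def slope_through_origin(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Least-squares slope of ys on xs without intercept (all variables are zero-mean)."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def slopes_per_environment(noise_scales: Sequence[float], n: int = 20000,
                           seed: int = 0) -> Dict[float, Tuple[float, float]]:
    """Map each environment's noise scale to (slope of y on x1, slope of y on x2)."""
    rng = random.Random(seed)
    result = {}
    for scale in noise_scales:
        rows = simulate_environment(scale, n, rng)
        x1 = [r[0] for r in rows]
        x2 = [r[1] for r in rows]
        y = [r[2] for r in rows]
        result[scale] = (slope_through_origin(x1, y), slope_through_origin(x2, y))
    return result


if __name__ == "__main__":
    for scale, (s1, s2) in slopes_per_environment([0.5, 2.0]).items():
        print(f"noise_scale={scale}: slope on x1={s1:.3f}, slope on x2={s2:.3f}")
```

In this toy model the population slope of y on x1 is 1 in every environment, while the population slope of y on x2 is 2 / (2 + s^2) for noise scale s, so it shifts whenever the intervention changes s. A test of IRM along the lines described above would then ask whether the learned representation relies on x1 alone.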
Original abstract
We introduce Invariant Risk Minimization (IRM), a learning paradigm to estimate invariant correlations across multiple training distributions. To achieve this goal, IRM learns a data representation such that the optimal classifier, on top of that data representation, matches for all training distributions. Through theory and experiments, we show how the invariances learned by IRM relate to the causal structures governing the data and enable out-of-distribution generalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Invariant Risk Minimization (IRM), a learning paradigm that estimates a data representation such that the optimal classifier on top of this representation is the same across multiple training distributions. It claims through theory and experiments that the learned invariances correspond to causal structures governing the data and enable out-of-distribution generalization.
Significance. If the central claims hold, this work offers a principled objective for learning predictors that exploit invariance across environments to achieve robust OOD performance, with a direct link to identifying causal features. This is significant for bridging empirical risk minimization with causal inference in non-i.i.d. settings, and the reproducible experimental protocols and parameter-free aspects of the formulation (where applicable) strengthen its potential impact.
major comments (2)
- [§4] The theoretical equivalence showing that IRM recovers causal parents is derived only for linear structural causal models with additive noise and a fixed number of environments; the proof relies on linearity of the representation and identifiability of the shared optimal w. No uniqueness result is given for non-linear feature maps or general non-linear SCMs, so the broader claim that invariances learned by IRM relate to causal structures does not follow in full generality.
- [Eq. (3)] The practical IRM objective (with the gradient penalty at w=1) enforces only a first-order stationarity condition under the linear classifier assumption. The manuscript does not show that this approximation identifies causal features or guarantees OOD generalization when the representation or SCM is non-linear, which is load-bearing for the central claim.
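To make the first-order condition concrete, here is a worked case under assumptions that are not stated in the report: a scalar classifier $w$ and squared-error loss. In that case the risk is convex in $w$, so a vanishing gradient at $w = 1$ certifies that $w = 1$ is optimal for that environment. By itself this says nothing about whether $\Phi$ has isolated the causal features, which is the gap the comment points to.

```latex
R^e(w \cdot \Phi) = \mathbb{E}\big[ (w\,\Phi(X^e) - Y^e)^2 \big]

\frac{\partial R^e}{\partial w}\Big|_{w=1}
  = 2\,\mathbb{E}\big[ (\Phi(X^e) - Y^e)\,\Phi(X^e) \big],
\qquad
\frac{\partial^2 R^e}{\partial w^2}
  = 2\,\mathbb{E}\big[ \Phi(X^e)^2 \big] \;\ge\; 0

% Convex in w, hence: gradient = 0 at w = 1  <=>  w = 1 minimizes R^e(w \cdot \Phi).
```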
minor comments (2)
- [Abstract] The abstract and introduction could more explicitly qualify the scope of the theoretical results to linear cases to avoid overstatement of the causal connection.
- Experimental sections would benefit from additional details on environment construction and sensitivity to the penalty hyperparameter to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the careful reading and insightful comments on our manuscript. We address each major comment below and will incorporate clarifications to better delineate the scope of our theoretical and practical results.
-
Referee: [§4] The theoretical equivalence showing that IRM recovers causal parents is derived only for linear structural causal models with additive noise and a fixed number of environments; the proof relies on linearity of the representation and identifiability of the shared optimal w. No uniqueness result is given for non-linear feature maps or general non-linear SCMs, so the broader claim that invariances learned by IRM relate to causal structures does not follow in full generality.
Authors: We agree that the equivalence result in Section 4 is derived under the specific assumptions of linear structural causal models with additive noise and a fixed number of environments, relying on the linearity of the representation and the identifiability of the shared optimal classifier weights. The manuscript does not provide a uniqueness result for non-linear feature maps or general non-linear SCMs. The broader statements linking invariances to causal structures are presented as holding under these assumptions, with supporting experimental evidence in more general settings. We will revise Section 4, the abstract, and related discussion to explicitly state the assumptions and note that extensions to non-linear cases remain an open direction.
Revision: partial
-
Referee: [Eq. (3)] The practical IRM objective (with the gradient penalty at w=1) enforces only a first-order stationarity condition under the linear classifier assumption. The manuscript does not show that this approximation identifies causal features or guarantees OOD generalization when the representation or SCM is non-linear, which is load-bearing for the central claim.
Authors: The practical objective in Equation (3) uses a gradient penalty (evaluated at w=1) to enforce the invariance condition, which is exact under the linear classifier assumption but reduces to a first-order stationarity condition more generally. We do not provide a proof that this approximation identifies causal features or guarantees OOD generalization for non-linear representations or SCMs. The formulation is motivated by the linear theory, and our experiments demonstrate improved OOD performance in non-linear regimes. We will add a clarifying discussion of the approximation's nature and limitations in the revised manuscript.
Revision: partial
Circularity Check
No significant circularity in IRM derivation chain
The core IRM definition (a representation Φ such that argmin_w R^e(w ∘ Φ) is identical across environments e) is stated directly from the multi-environment setup and does not reduce to any fitted target quantity or self-referential loop. Section 4 derives the link to causal parents only under explicit linear SCM + additive noise assumptions; this is a one-directional implication proved from the SCM, not a tautology or renaming of the input risks. The practical objective (Eq. 3 with gradient penalty) is an explicit relaxation of the definition, not a statistical fit called a prediction. No load-bearing self-citation or ansatz smuggling is present; the derivation remains self-contained against the stated assumptions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Multiple training distributions share the same causal mechanisms but differ in non-causal aspects.
Forward citations
Cited by 60 Pith papers
-
The Statistical Cost of Adaptation in Multi-Source Transfer Learning
Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.
-
I-SAFE: Wasserstein Coherence Metrics for Structural Auditing of Scientific AI Models
I-SAFE uses Wasserstein Coherence Metrics to audit distributional coherence of scientific AI models under structurally guided perturbations, revealing differences among DTI predictors that accuracy metrics miss.
-
Cumulative Meta-Learning from Active Learning Queries for Robustness to Spurious Correlations
CAML meta-learns a progressively refined inductive bias from active-learning queries to improve robustness to spurious correlations, reporting accuracy gains on minority groups across several benchmarks.
-
Identifiable Multimodal Causal Representation Learning under Partial Latent Sharing
Establishes component-wise identifiability guarantees for partially shared causal latents in multimodal nonlinear mixing and introduces a differentiable Wasserstein-based module for recovery.
-
Prediction-Intervention Games and Invariant Sets
In prediction-intervention games, stable-blanket predictors are at least as good as causal-parent predictors for two classes of follower objectives and can be worst-case optimal under additional conditions.
-
Continual Learning of Domain-Invariant Representations
Introduces replay-based continual learning with sequential invariance alignment to learn domain-invariant representations, outperforming baselines on generalization to unseen domains across six datasets in vision, med...
-
TILT: Target-induced loss tilting under covariate shift
TILT adds a target-data penalty on an auxiliary predictor component to induce effective importance weighting for unsupervised domain adaptation under covariate shift.
-
Separating Shortcut Transition from Cross-Family OOD Failure in a Minimal Model
A minimal model analytically separates shortcut attraction during training from the switch to a shortcut rule and from cross-family out-of-distribution failure.
-
Spectral Gradient Surgery for Domain-Generalizable Dataset Distillation
Spectral Gradient Surgery disentangles class-discriminative and domain-specific signals in distribution-matching distilled datasets by analyzing gradient agreement in the spectral domain, yielding better out-of-distri...
-
Privacy-Aware Video Anomaly Detection through Orthogonal Subspace Projection
A new orthogonal projection module for video anomaly detection suppresses facial attributes via weak face-presence signals and cosine alignment while preserving anomaly-relevant features like pose and motion.
-
Flatness and Gradient Alignment Are Both Necessary: Spectral-Aware Gradient-Aligned Exploration for Multi-Distribution Learning
Excess risk decomposes into independent alignment (trace of inverse average Hessian times gradient covariance) and curvature terms, so both flatness and gradient alignment are required; SAGE achieves this and sets new...
-
Are We Making Progress in Multimodal Domain Generalization? A Comprehensive Benchmark Study
A large-scale benchmark finds that recent multimodal domain generalization methods give only marginal gains over a plain ERM baseline, with no method winning consistently and all degrading sharply under corruption or ...
-
eXplaining to Learn (eX2L): Regularization Using Contrastive Visual Explanation Pairs for Distribution Shifts
eX2L improves robustness to distribution shifts by penalizing similarity between Grad-CAM maps of a label classifier and a confounder classifier, reaching new SOTA average and worst-group accuracy on the Spawrious benchmark.
-
Domain Generalization through Spatial Relation Induction over Visual Primitives
PARSE improves domain generalization accuracy by factoring recognition into visual primitives and their spatial relational compositions learned end-to-end with differentiable predicates.
-
ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection
ScriptHOI decomposes HOI phrases into state slots and uses script coverage, conflict, interval partial-label learning, and counterfactual contrast to improve rare and unseen interaction detection while cutting afforda...
-
ISAAC: Auditing Causal Reasoning in Deep Models for Drug-Target Interaction
ISAAC auditing applied to three DTI models on the Davis benchmark finds 25% relative differences in causal reasoning scores despite nearly identical AUROC values.
-
Robust and Clinically Reliable EEG Biomarkers: A Cross Population Framework for Generalizable Parkinson's Disease Detection
A cross-population framework for EEG Parkinson's detection using exhaustive 75 directional evaluations and nested validation shows asymmetric transfer and accuracy up to 94.1% when training diversity increases, suppor...
-
Synthetic Designed Experiments for Diagnosing Vision Model Failure
SDRS uses designed experiments and ANOVA decomposition on synthetic data to identify Type I coverage gaps and Type II spurious dependencies in vision models, then generates targeted data to improve performance.
-
The Pragmatic Frames of Spurious Correlations in Machine Learning: Interpreting How and Why They Matter
ML researchers assess spurious correlations via four pragmatic frames (relevance, generalizability, human-likeness, harmfulness) rather than a fixed statistical definition.
-
The Matching Principle: A Geometric Theory of Loss Functions for Nuisance-Robust Representation Learning
The matching principle unifies nuisance-robust representation learning by requiring Jacobian regularization whose range covers the covariance of label-preserving deployment nuisances, with closed-form optimality proof...
-
Towards Context-Invariant Safety Alignment for Large Language Models
Introduces AIR, an asymmetric regularization that anchors open-ended safety prompts to verifiable ones via stop-gradient, improving invariance and accuracy when combined with group preference optimization.
-
S2Aligner: Pair-Efficient and Transferable Pre-Training for Sparse Text-Attributed Graphs
S2Aligner decouples semantic and structural components in LLM-as-Aligner pre-training for sparse TAGs and uses structure-oriented reconstruction plus domain risk balancing to improve transferability and reduce general...
-
FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics
FML-Bench shows that a simple greedy hill-climber performs nearly as well as complex tree-search agents on ML research tasks, with an adaptive strategy that switches exploration modes outperforming all tested agents.
-
When Molecular Similarity Works: Property Cliffs Reveal Hidden Errors
CliffSplit exposes at least 15% higher errors in cliff-heavy regions of QM9 while CliffLoss narrows the cliff-to-smooth error gap by up to 30% and improves overall MAE by 9.7% across several molecular tasks and backbones.
-
Avoiding Structural Failure Modes in Tabular Fair SSL: Online Primal-Dual Allocation under Confidence Gating
OPDA is an online controller that uses violation, risk, and pseudo-label health signals to avoid Masking Collapse and Trivial Saturation in tabular fair SSL under confidence gating.
-
Rethinking Molecular OOD Generalization via Target-Aware Source Selection
SCOPE-BENCH shows state-of-the-art molecular models suffer up to 8x higher errors under extreme OOD, while POMA reduces mean absolute error by up to 11.2% via target-aware source selection and dual-scale adaptation.
-
Understanding Generalization through Decision Pattern Shift
DPS quantifies deviation of per-sample decision patterns from class averages and shows linear correlation with generalization gaps while unifying degradation scenarios into a continuous trajectory.
-
DeconDTN-Toolkit: A Library for Evaluation and Enhancement of Robustness to Provenance Shift
DeconDTN-Toolkit simulates provenance shifts to expose ERM vulnerabilities and provides tools plus a robust OOD indicator for mitigating confounding by data provenance.
-
Spurious Correlation Learning in Preference Optimization: Mechanisms, Consequences, and Mitigation via Tie Training
Standard preference learning induces spurious feature reliance via mean bias and correlation leakage, creating irreducible distribution shift vulnerabilities that tie training mitigates without degrading causal learning.
-
Intervention-Based Time Series Causal Discovery via Simulator-Generated Interventional Distributions
SVAR-FM uses simulator clamping to produce interventional distributions and flow matching to identify time series causal structures, with an error bound that predicts sign reversal of causal effects below a simulator ...
-
The Trap of Trajectory: Towards Understanding and Mitigating Spurious Correlations in Agentic Memory
Agentic memory improves clean reasoning but worsens performance when spurious patterns are present in stored trajectories; CAMEL calibration reduces this reliance while preserving clean performance.
-
CauSim: Scaling Causal Reasoning with Increasingly Complex Causal Simulators
CauSim turns scarce causal reasoning labels into scalable supervised data by having LLMs incrementally construct complex executable structural causal models.
-
TopoGeoScore: A Self-Supervised Source-Only Geometric Framework for OOD Checkpoint Selection
TopoGeoScore combines a torsion-inspired Laplacian log-determinant, Ollivier-Ricci curvature, and higher-order topological summaries from source embeddings, with weights learned via self-supervised invariance to geome...
-
Same Brain, Different Prediction: How Preprocessing Choices Undermine EEG Decoding Reliability
EEG model predictions on the same brain signals flip for up to 42% of trials under different preprocessing choices, with new tools introduced to measure and mitigate the resulting instability.
-
ScriptHOI: Learning Scripted State Transitions for Open-Vocabulary Human-Object Interaction Detection
ScriptHOI improves rare and unseen HOI recognition by decomposing phrases into state slots, using visual tokenization and slot-wise matching for script coverage and conflict to calibrate predictions and constrain trai...
-
Anatomy of a failure: When, how, and why deep vision fails in scientific domains
Deep learning on information-rich scientific images collapses to one-dimensional predictions due to a mismatch between data priors and the model's simplicity bias, even after robustification techniques.
-
Learning to Theorize the World from Observation
NEO induces compositional latent programs as world theories from observations and executes them to enable explanation-driven generalization.
-
Attribution-Guided Masking for Robust Cross-Domain Sentiment Classification
AGM adds a gradient-based masking loss during fine-tuning to suppress reliance on spurious tokens, achieving competitive zero-shot transfer on sentiment tasks while providing token-level interpretability.
-
Deciphering Shortcut Learning from an Evolutionary Game Theory Perspective
Evolutionary game theory shows gradient descent and stochastic gradient descent drive neural networks to distinct stable states favoring shortcut or core subnetworks, with data and optimization noise shaping shortcut ...
-
Cheeger--Hodge Contrastive Learning for Structurally Robust Graph Representation Learning
CHCL aligns a Cheeger-Hodge joint signature across graph augmentations to produce embeddings that remain stable under local structural changes.
-
Robust Representation Learning through Explicit Environment Modeling
Explicitly modeling and marginalizing environment variation via generalized random-intercept models produces representations that support robust average prediction across unseen environments and outperform invariant-l...
-
Bayesian Environment Invariant Regression
A Bayesian spike-and-slab model separates invariant regression mechanisms from environment-specific associations, with proven selection consistency and posterior contraction under a working model.
-
Deep sprite-based image models: An analysis
A deep sprite-based image decomposition method matches SOTA unsupervised class-aware segmentation on CLEVR, scales linearly with objects, explicitly identifies categories, and fully models images interpretably.
-
Adversarial Label Invariant Graph Data Augmentations for Out-of-Distribution Generalization
RIA uses adversarial exploration of counterfactual graph environments via label-invariant augmentations to improve OoD generalization in graph classification tasks.
-
Learning Stable Predictors from Weak Supervision under Distribution Shift
Weak supervision supports in-domain prediction of guide efficacy in CRISPR-Cas13d data but collapses under temporal shifts due to changing feature-label associations, while cross-cell-line transfer remains partial.
-
Learning Stable Predictors from Weak Supervision under Distribution Shift
Weak supervision supports in-domain learning for CRISPR transcriptomic perturbations but temporal shifts cause negative R-squared and near-zero correlation across linear and tree models, unlike partial cell-line transfer.
-
Mitigating Shortcut Learning via Feature Disentanglement in Medical Imaging: A Benchmark Study
Benchmark shows that combining data rebalancing with feature disentanglement mitigates shortcut learning more effectively than rebalancing alone in medical imaging models.
-
Tracing Moral Foundations in Large Language Models
LLMs encode moral foundations in human-aligned, layered representations that arise from pretraining and can be steered via dense vectors or sparse SAE features.
-
Tracing Moral Foundations in Large Language Models
Moral foundations in LLMs form distributed, layered representations that align with human perceptions, emerge from pretraining, and causally influence outputs when steered via dense vectors or sparse features.
-
Invariant Feature Extraction Through Conditional Independence and the Optimal Transport Barycenter Problem: the Gaussian case
In the Gaussian case, invariant features predicting Y independent of confounders Z are given by the top d eigenvectors of a matrix derived from the optimal transport barycenter of Z given Y.
-
The Impact of Off-Policy Training Data on Probe Generalisation
Off-policy training data for LLM behavior probes causes significant generalization failures especially for intent-based behaviors like deception, and performance on coerced incentivised data correlates with real on-po...
-
Downgrade to Upgrade: Optimizer Simplification Enhances Robustness in LLM Unlearning
Downgrading optimizers to lower-information variants during LLM unlearning yields more robust forgetting on MUSE and WMDP benchmarks by converging to harder-to-perturb loss basins.
-
Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
Franca introduces nested Matryoshka clustering and positional disentanglement in a transparent SSL pipeline to deliver open-source vision models competitive with closed proprietary systems.
-
Doubly robust identification of treatment effects from multiple environments
RAMEN identifies treatment effects from multiple environments in a doubly robust manner by leveraging data heterogeneity without requiring the causal graph.
-
TabICL: A Tabular Foundation Model for In-Context Learning on Large Data
TabICL scales in-context learning to large tabular data via column-then-row attention for row embeddings followed by a transformer, matching TabPFNv2 speed and performance while outperforming it and CatBoost on datase...
-
Discrepancies are Virtue: Weak-to-Strong Generalization through Lens of Intrinsic Dimension
In ridgeless regression with low intrinsic dimension, discrepancy between weak and strong models reduces W2S generalization variance by dim(V_s)/N in the discrepant subspace while inheriting it in the overlap.
-
SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks
SmoothLLM mitigates jailbreaking attacks on LLMs by randomly perturbing multiple copies of a prompt at the character level and aggregating the outputs to detect adversarial inputs.
-
On the Opportunities and Risks of Foundation Models
Foundation models are large adaptable AI systems with emergent capabilities that offer broad opportunities but carry risks from homogenization, opacity, and inherited defects across downstream applications.
-
Distributionally Robust Neural Networks for Group Shifts: On the Importance of Regularization for Worst-Case Generalization
Increased regularization is required for group DRO to achieve good worst-group generalization in overparameterized neural networks.
-
Understanding Model Behavior in Monocular Polyp Sizing
Monocular polyp sizing models achieve moderate performance by exploiting examination behavior cues rather than true metric scales, with scale information and segmentation robustness acting as independent bottlenecks.