Learning from a Biased Sample

Lihua Lei; Roshni Sahoo; Stefan Wager

arxiv: 2209.01754 · v5 · pith:U7WAR3LYnew · submitted 2022-09-05 · stat.ME · cs.LG · stat.ML

keywords: learning, biased, risk, sample, sampling, training, decision, method

The empirical risk minimization approach to data-driven decision making requires access to training data drawn under the same conditions as those that will be faced when the decision rule is deployed. However, in a number of settings, we may be concerned that our training sample is biased in the sense that some groups (characterized by either observable or unobservable attributes) may be under- or over-represented relative to the general population; and in this setting empirical risk minimization over the training set may fail to yield rules that perform well at deployment. We propose a model of sampling bias called conditional $\Gamma$-biased sampling, where observed covariates can affect the probability of sample selection arbitrarily much but the amount of unexplained variation in the probability of sample selection is bounded by a constant factor. Applying the distributionally robust optimization framework, we propose a method for learning a decision rule that minimizes the worst-case risk incurred under a family of test distributions that can generate the training distribution under $\Gamma$-biased sampling. We apply a result of Rockafellar and Uryasev to show that this problem is equivalent to an augmented convex risk minimization problem. We give statistical guarantees for learning a model that is robust to sampling bias via the method of sieves, and propose a deep learning algorithm whose loss function captures our robust learning target. We empirically validate our proposed method in a case study on prediction of mental health scores from health survey data and a case study on ICU length of stay prediction.
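The equivalence between a worst-case risk and an augmented convex minimization can be illustrated with a small numerical sketch. The setup below is a simplifying assumption, not the paper's exact formulation: it drops the conditioning on covariates and assumes the density-ratio weights between test and training distributions lie in [1/Γ, Γ] and average to one. Under that assumption, the Rockafellar-Uryasev variational form of conditional value-at-risk implies that the worst-case reweighted mean loss equals the minimum over a scalar `a` of an augmented objective. The function names `ru_objective`, `worst_case_risk_ru`, and `worst_case_risk_direct` are hypothetical and chosen for illustration only.

```python
import numpy as np


def ru_objective(a, losses, gamma):
    """Augmented objective in the auxiliary scalar a (illustrative assumption):
    (1/G) * mean(l) + (1 - 1/G) * a + (G - 1/G) * mean(max(l - a, 0))."""
    inv = 1.0 / gamma
    excess = np.maximum(losses - a, 0.0)
    return inv * losses.mean() + (1.0 - inv) * a + (gamma - inv) * excess.mean()


def worst_case_risk_ru(losses, gamma):
    """Minimize the augmented objective over a.

    The objective is convex and piecewise linear in a, with kinks only at
    the observed losses, so a minimizer can be found among those values.
    """
    return min(ru_objective(a, losses, gamma) for a in losses)


def worst_case_risk_direct(losses, gamma):
    """Brute-force check: put weight gamma on the largest losses and
    1/gamma on the rest, so that the weights average to one.

    For simplicity this assumes n / (gamma + 1) is a whole number.
    """
    n = len(losses)
    k = int(round(n / (gamma + 1.0)))
    if abs(k * (gamma + 1.0) - n) > 1e-9:
        raise ValueError("n must be divisible by gamma + 1 in this sketch")
    ordered = np.sort(losses)
    weights = np.full(n, 1.0 / gamma)
    weights[n - k:] = gamma  # up-weight the k largest losses
    return float(np.mean(weights * ordered))


# Toy per-example losses; in practice these would come from a fitted model.
losses = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
robust_ru = worst_case_risk_ru(losses, gamma=2.0)
robust_direct = worst_case_risk_direct(losses, gamma=2.0)
print(robust_ru, robust_direct)
```

With Γ = 2 on this toy sample, both routes give 4.5, compared with an unweighted mean loss of 3.5, and with Γ = 1 both reduce to the ordinary mean. In a learning setting the same augmented objective would be minimized jointly over the model parameters and the auxiliary variable; in the conditional model described in the abstract, that auxiliary variable would presumably depend on the covariates rather than being a single scalar.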

Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Practical Upper Bound on Selection Bias Effects in Medical Prediction Models
cs.LG 2026-05 unverdicted novelty 7.0

A new upper bound is derived for the worst-case effect of selection bias on medical prediction model performance under partial observation of the selection process and target data.

In-Context Positive-Unlabeled Learning
stat.ML 2026-05 unverdicted novelty 7.0

PUICL is a transformer pretrained on synthetic PU data from structural causal models that solves positive-unlabeled classification via in-context learning without gradient updates or fitting.

Estimation beyond Missing (Completely) at Random
math.ST 2024-10 unverdicted novelty 7.0

Realisable epsilon-contamination models for MNAR data yield minimax mean estimation rates that decompose into MCAR plus robust terms and remain consistent for Gaussian bases even as missingness and epsilon both tend to 1.

Adversarially Robust Control of Conditional Value-at-Risk via Rockafellar-Uryasev Conformal Inference
cs.LG 2026-05 unverdicted novelty 6.0

Online conformal framework for adversarial CVaR control with asymptotic guarantees and regret bounds, demonstrated on portfolio management and LLM toxicity mitigation.

Assessing Estimate of CATE from Observational Data via an RCT Study
stat.ME 2026-05 unverdicted novelty 5.0

CAFE assesses the fit of observational CATE estimates by partitioning RCT data via propensity scores and comparing to experimental group averages, with theory and extensions for confounders.