Statistical matching of non-Gaussian data

Daniel Ahfock; Geoffrey J. McLachlan; Saumyadipta Pyne


The statistical matching problem is a data integration problem with structured missing data. The general form involves the analysis of multiple datasets in which only a strict subset of the variables is jointly observed across all datasets. The simplest version involves two datasets, labelled A and B, with three variables of interest $X$, $Y$ and $Z$. Variables $X$ and $Y$ are observed in dataset A and variables $X$ and $Z$ are observed in dataset B. Statistical inference is complicated by the absence of joint $(Y, Z)$ observations. Parametric modelling can be challenging due to identifiability issues and the difficulty of parameter estimation. We develop computationally feasible procedures for the statistical matching of non-Gaussian data using suitable data augmentation schemes and identifiability constraints. Nearest-neighbour imputation is a common alternative technique due to its ease of use and generality. Nearest-neighbour matching is based on a conditional independence assumption that may be inappropriate for non-Gaussian data. The violation of the conditional independence assumption can lead to improper imputations. We compare model-based approaches to nearest-neighbour imputation on a number of flow cytometry datasets and find that the model-based approach can address some of the weaknesses of the nonparametric nearest-neighbour technique.
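To make the two-dataset setting concrete, the following is a minimal sketch of nearest-neighbour imputation, not the authors' implementation. It assumes Euclidean distance on $X$ and a single donor per recipient; the function name `nn_impute` and the toy values are hypothetical and chosen only for illustration. Each record in dataset B receives the $Y$ value of the dataset A record whose $X$ is closest, which is where the conditional independence of $Y$ and $Z$ given $X$ enters implicitly.

```python
import numpy as np


def nn_impute(x_recipient, x_donor, y_donor):
    """Copy y from the donor record whose x is nearest (Euclidean distance).

    Hypothetical helper for illustration: the recipient file lacks y,
    the donor file observes both x and y.
    """
    # Treat 1-D inputs as a single covariate column.
    x_recipient = np.asarray(x_recipient, dtype=float).reshape(len(x_recipient), -1)
    x_donor = np.asarray(x_donor, dtype=float).reshape(len(x_donor), -1)
    y_donor = np.asarray(y_donor)

    # Pairwise squared distances, shape (n_recipient, n_donor).
    diff = x_recipient[:, None, :] - x_donor[None, :, :]
    d2 = (diff ** 2).sum(axis=2)

    # Index of the closest donor for every recipient record.
    nearest = d2.argmin(axis=1)
    return y_donor[nearest]


# Toy data (made up): dataset A observes (X, Y), dataset B observes (X, Z).
x_a = np.array([0.0, 1.0, 2.0])
y_a = np.array([10.0, 20.0, 30.0])
x_b = np.array([0.1, 1.9, 0.9])
z_b = np.array([5.0, 6.0, 7.0])

# Impute the missing Y in dataset B, then assemble fused (X, Y, Z) records.
y_imputed = nn_impute(x_b, x_a, y_a)
fused_b = np.column_stack([x_b, y_imputed, z_b])
```

Because the donor is chosen using $X$ alone, the imputed $Y$ carries no information about $Z$ beyond what $X$ explains, so any residual dependence between $Y$ and $Z$ is lost in the fused records.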

