CopulaSMOTE: A Copula-Based Oversampling Approach for Imbalanced Classification in Diabetes Prediction

Agnideep Aich; Bruce Wade; Md Monzur Murshed; Sameera Hewage

arXiv: 2506.17326 · v3 · pith:XP3HLANFnew · submitted 2025-06-18 · cs.LG · stat.AP · stat.ML

keywords: diabetes, dataset, approach, dependence, imbalance, synthetic, address, brfss

Class imbalance remains a practical obstacle in the development of clinical prediction models for conditions such as diabetes mellitus, where the number of confirmed cases is often much smaller than the number of controls. The Synthetic Minority Over-sampling Technique (SMOTE) and its variants are widely used to address this imbalance, but they generate synthetic observations through local interpolation in feature space and do not explicitly model the joint dependence structure of the minority class. To address this limitation, we introduce a copula-based data augmentation approach that estimates the minority-class dependence structure when generating synthetic samples and integrates with standard machine learning techniques. Specifically, we employ truncated vine copulas to represent multivariate dependence through a sequence of bivariate building blocks. We evaluate the proposed approach on three public diabetes datasets: the Pima Indians Diabetes dataset, the Iraqi Diabetes dataset, and the CDC BRFSS 2015 Diabetes Health Indicators dataset, which together cover a range of sample sizes, dimensionalities, and imbalance regimes. For each dataset, five resampling strategies are compared across five classifiers using a 5×2 cross-validation protocol with Dietterich's paired t-test. Our findings suggest that CopulaSMOTE can improve minority-class recovery in larger tabular diabetes datasets, particularly the CDC BRFSS dataset, but its advantages depend on the classifier and evaluation metric.
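
The abstract names Dietterich's 5×2 cross-validation paired t-test as the comparison protocol. As a minimal sketch of how that statistic is computed (not the authors' code), the Python below takes the per-fold score differences between two methods from five replications of 2-fold cross-validation and returns the t statistic, which is compared against a t distribution with 5 degrees of freedom. The function name, the constant name, and the example differences are hypothetical and chosen only for illustration.

```python
import math


def dietterich_5x2cv_t(diffs):
    """Dietterich's 5x2cv paired t statistic.

    diffs: five (p1, p2) pairs, where p1 and p2 are the score differences
    between two methods on fold 1 and fold 2 of one 2-fold replication.
    """
    if len(diffs) != 5 or any(len(pair) != 2 for pair in diffs):
        raise ValueError("expected 5 replications of 2 folds each")

    variance_sum = 0.0
    for p1, p2 in diffs:
        mean = (p1 + p2) / 2.0
        # Per-replication variance estimate s_i^2.
        variance_sum += (p1 - mean) ** 2 + (p2 - mean) ** 2

    if variance_sum == 0.0:
        raise ValueError("all replications have zero variance; t is undefined")

    # Numerator is the difference from fold 1 of the first replication.
    return diffs[0][0] / math.sqrt(variance_sum / 5.0)


# Two-sided critical value of the t distribution, 5 degrees of freedom, alpha = 0.05.
T_CRIT_5DF_TWO_SIDED_05 = 2.571

# Hypothetical score differences (method A minus method B), for illustration only.
example_diffs = [(0.02, 0.01), (0.03, 0.01), (0.01, 0.02), (0.02, 0.02), (0.03, 0.02)]
t_stat = dietterich_5x2cv_t(example_diffs)
significant = abs(t_stat) > T_CRIT_5DF_TWO_SIDED_05
```

With these made-up differences the statistic falls below the critical value, so the two hypothetical methods would not be declared different at the 5% level. The numerator uses only one of the ten differences, which is part of Dietterich's original definition of the test.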

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Comparing Two Categorical Gini Correlations with Applications to Classification Problems
stat.ME 2026-05 unverdicted novelty 6.0

Proposes an inferential framework to test differences in categorical Gini correlations for predictor importance in classification, establishing asymptotic normality and consistency while accommodating unequal dimensio...