arxiv: 2604.04199 · v2 · pith:O3TLDHFJnew · submitted 2026-04-05 · 💻 cs.LG

Which Leakage Types Matter? A Quantitative Landscape Across 2,047 Benchmark Datasets

classification 💻 cs.LG
keywords: class, leakage, datasets, across, boundary, data, matters, selection

Twenty-eight within-subject counterfactual experiments across 2,047 iid tabular datasets, plus a boundary experiment on 129 temporal datasets, measure the severity of four data leakage classes in machine learning. Class I (estimation: fitting scalers on full data) is negligible: all nine conditions produce $|{\Delta}AUC| \leq 0.005$. Class II (selection: peeking, seed cherry-picking) is substantial: the measured effect is consistent with about 90% noise exploitation inflating reported scores. Class III (memorization) scales with model capacity: $d_z$ = 0.37 (Naive Bayes) to 1.11 (Decision Tree) at 10% duplication. Class IV (boundary) is invisible under random cross-validation. Within this iid tabular regime, the textbook emphasis is inverted: normalization leakage matters least; selection leakage at practical dataset sizes matters most.
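
The abstract reports effects both as a raw difference ($|\Delta AUC|$) and as $d_z$. As a minimal sketch, assuming $d_z$ denotes the standard paired-samples effect size (the mean of the per-dataset differences divided by their sample standard deviation), the two quantities could be computed as below. The function name `cohens_dz` and the AUC values are hypothetical and made up for illustration; they are not taken from the paper.

```python
from statistics import mean, stdev


def cohens_dz(leaky_scores, clean_scores):
    """Paired-samples effect size: mean(diff) / sample SD(diff).

    Assumption: this is the usual definition of d_z; the paper's exact
    estimator is not stated in the abstract.
    """
    if len(leaky_scores) != len(clean_scores):
        raise ValueError("paired inputs must have equal length")
    diffs = [leaky - clean for leaky, clean in zip(leaky_scores, clean_scores)]
    if len(diffs) < 2:
        raise ValueError("need at least two pairs")
    sd = stdev(diffs)  # sample standard deviation (n - 1 in the denominator)
    if sd == 0:
        raise ValueError("differences have zero variance; d_z is undefined")
    return mean(diffs) / sd


# Hypothetical per-dataset AUCs: same model and dataset, with and without leakage.
leaky = [0.81, 0.77, 0.90, 0.68, 0.85]
clean = [0.79, 0.76, 0.86, 0.67, 0.80]

# Raw inflation (mean difference in AUC) and its standardized counterpart.
delta_auc = mean(l - c for l, c in zip(leaky, clean))
d_z = cohens_dz(leaky, clean)
print(f"mean delta AUC = {delta_auc:.3f}, d_z = {d_z:.2f}")
```

Because $d_z$ divides by the spread of the differences, a small but very consistent inflation can yield a large $d_z$, which is one reason to read it alongside the raw $\Delta AUC$.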
