OverNaN: NaN-Aware Oversampling for Imbalanced Learning with Meaningful Missingness
Pith reviewed 2026-05-13 01:21 UTC · model grok-4.3
The pith
Oversampling can retain informative missing values rather than imputing them in imbalanced datasets
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OverNaN extends common synthetic oversampling methods to operate directly on incomplete feature vectors, allowing missing values to be preserved, propagated, or selectively interpolated according to explicitly defined strategies. Rather than repairing missing data, OverNaN treats missingness as part of the feature space over which synthetic samples are generated. This paper situates OverNaN within the broader landscape of imbalanced learning, missing-data handling, and NaN-tolerant algorithms.
What carries the argument
The OverNaN framework, a NaN-aware extension of synthetic oversampling that treats missing values as part of the feature space for generating new samples.
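To make the idea concrete, the following is a minimal sketch of a SMOTE-style interpolation step that tolerates NaNs. It is not OverNaN's actual API: the function name `nan_aware_interpolate` and the policy names `"propagate"`, `"preserve_base"`, and `"interpolate"` are hypothetical, chosen only to illustrate one plausible reading of "preserved, propagated, or selectively interpolated".

```python
import math
import random


def nan_aware_interpolate(base, neighbour, policy="propagate", rng=None):
    """Hypothetical sketch: build one synthetic sample between two parents.

    `base` and `neighbour` are equal-length lists of floats that may
    contain NaN. The policy names are assumptions made for illustration.
    """
    rng = rng or random.Random()
    gap = rng.random()  # SMOTE-style interpolation weight in [0, 1)
    synthetic = []
    for b, n in zip(base, neighbour):
        b_missing, n_missing = math.isnan(b), math.isnan(n)
        if not b_missing and not n_missing:
            # Both parents observed: ordinary SMOTE interpolation.
            synthetic.append(b + gap * (n - b))
        elif policy == "propagate":
            # Any missing parent makes the synthetic feature missing.
            synthetic.append(math.nan)
        elif policy == "preserve_base":
            # Copy the base sample's entry, whether a value or a NaN.
            synthetic.append(b)
        elif policy == "interpolate":
            # Use the single observed value; NaN only if both are missing.
            synthetic.append(n if b_missing else b)
        else:
            raise ValueError(f"unknown policy: {policy!r}")
    return synthetic


base = [1.0, math.nan, 3.0]
neighbour = [2.0, 5.0, math.nan]
for name in ("propagate", "preserve_base", "interpolate"):
    print(name, nan_aware_interpolate(base, neighbour, name, random.Random(0)))
```

Under these assumed policies, no value is ever invented for a feature that both parents lack, which is one way to read the paper's aim of avoiding "artificial certainty".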
If this is right
- Synthetic samples maintain the missingness structure of the original data.
- Class boundaries are not distorted by the introduction of artificial certainty from imputation.
- Model generalisability improves on imbalanced data with systematic missingness.
- No separate imputation is required before addressing class imbalance.
Where Pith is reading between the lines
- This could encourage collection of metadata about why data is missing in future datasets.
- Similar NaN-aware adaptations might be developed for other machine learning preprocessing steps.
- The effectiveness likely depends on the specific strategies chosen for handling missingness in the oversampling process.
Load-bearing premise
Missingness carries systematic information tied to the data-generating process, and preserving it during synthetic sample generation improves class boundaries and generalisability without introducing new distortions.
What would settle it
Running a controlled experiment on a dataset where missingness is known to be informative, comparing the performance of classifiers trained on OverNaN-generated samples versus samples from standard oversampling after imputation.
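A starting point for such an experiment could be a synthetic dataset in which the missingness mechanism is known by construction. The sketch below is an assumption-laden illustration, not taken from the paper: the function names, the single-feature design, and the missingness probabilities are all hypothetical. The feature value has the same distribution in both classes, so the only class signal is whether the value is missing.

```python
import math
import random


def make_informative_missingness(n=2000, minority_frac=0.1,
                                 p_miss_minority=0.8, p_miss_majority=0.1,
                                 seed=0):
    """Hypothetical generator: one feature, class-dependent missingness."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = 1 if rng.random() < minority_frac else 0
        value = rng.gauss(0.0, 1.0)  # identical distribution in both classes
        p_miss = p_miss_minority if label == 1 else p_miss_majority
        if rng.random() < p_miss:
            value = math.nan  # missingness depends on the class label
        X.append([value])
        y.append(label)
    return X, y


def missing_rate(X, y, label):
    """Fraction of rows with the given label whose feature is NaN."""
    rows = [x for x, t in zip(X, y) if t == label]
    return sum(math.isnan(r[0]) for r in rows) / len(rows)


X, y = make_informative_missingness()
print("minority missing rate:", missing_rate(X, y, 1))
print("majority missing rate:", missing_rate(X, y, 0))
```

On data like this, any pipeline that imputes before oversampling discards the only informative signal unless it also records a missingness indicator, so the two pipelines named above can be compared on equal footing.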
Original abstract
Missing values are routinely treated as defects to be eliminated through deletion or imputation prior to machine learning. In many applied domains, however, missingness itself carries information, reflecting experimental constraints, measurement choices, or systematic mechanisms tied to the data-generating process. Eliminating or masking this structure can distort class boundaries, introduce bias, and reduce generalisability, particularly in imbalanced datasets where minority classes are already under-represented. OverNaN is a lightweight, NaN-aware oversampling framework designed to address class imbalance without erasing missingness structure. It extends common synthetic oversampling methods to operate directly on incomplete feature vectors, allowing missing values to be preserved, propagated, or selectively interpolated according to explicitly defined strategies. Rather than repairing missing data, OverNaN treats missingness as part of the feature space over which synthetic samples are generated. This paper situates OverNaN within the broader landscape of imbalanced learning, missing-data handling, and NaN-tolerant algorithms. Using representative examples included with the software, we demonstrate that meaningful missingness can be retained during oversampling without introducing artificial certainty. OverNaN is intended for practitioners working with small, incomplete, and imbalanced datasets in scientific and engineering domains where missingness is unavoidable and often informative.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces OverNaN, a lightweight extensible framework that adapts standard synthetic oversampling techniques (e.g., SMOTE-style interpolation) to operate directly on incomplete feature vectors containing NaNs. It allows missing values to be preserved, propagated, or selectively interpolated according to explicit user-defined policies rather than requiring prior imputation or deletion, with the goal of retaining informative missingness structure in imbalanced datasets. The work situates the approach in the imbalanced-learning and missing-data literature and illustrates it via software examples that show NaN retention during synthetic sample generation.
Significance. If the design functions as described, OverNaN would provide a practical, domain-specific tool for scientific and engineering applications where missingness is systematic and carries information (e.g., experimental constraints or measurement limits). By avoiding forced completeness, the framework could reduce distortion of minority-class boundaries and improve downstream generalisability on small, incomplete, imbalanced data without introducing artificial certainty. The emphasis on extensibility and accompanying software strengthens its immediate utility for practitioners.
major comments (1)
- [Demonstration / examples section] The paper asserts that preserving missingness avoids distorting class boundaries and reducing generalisability, yet the provided demonstrations consist only of qualitative retention checks in software examples. No quantitative comparison (e.g., classifier performance metrics, decision-boundary fidelity, or bias measures) against standard pipelines that impute before oversampling is reported, leaving the practical benefit of the central design claim unverified.
minor comments (3)
- [Abstract and Introduction] The abstract and introduction repeatedly use the phrase 'without introducing artificial certainty,' but the manuscript does not define or operationalise this term; a brief clarification of what constitutes 'artificial certainty' in the oversampling context would improve precision.
- [Method description] Although an algorithmic description is supplied, the paper would benefit from explicit pseudocode or a step-by-step enumeration of how user-specified NaN policies interact with distance computations and interpolation rules in the base oversampler.
- [Software examples] The software examples are referenced but not reproduced or described in sufficient detail within the main text; including one or two small illustrative tables or figures directly in the paper would aid readers who do not immediately run the accompanying code.
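As an illustration of the interaction raised in the method-description comment, the sketch below shows one way a neighbour search could handle NaNs before any interpolation rule is applied. It is an assumption, not the manuscript's algorithm: the distance uses only jointly observed features and rescales by the fraction observed, the same convention as scikit-learn's `nan_euclidean_distances`, except that pairs with no shared observed feature are given infinite distance here (a choice made for this sketch so they sort last). The function names are hypothetical.

```python
import math


def nan_euclidean(a, b):
    """Euclidean distance over jointly observed features, rescaled.

    Returns math.inf when the two samples share no observed feature
    (an assumption of this sketch, so such pairs are never neighbours).
    """
    pairs = [(x, y) for x, y in zip(a, b)
             if not (math.isnan(x) or math.isnan(y))]
    if not pairs:
        return math.inf
    squared = sum((x - y) ** 2 for x, y in pairs)
    # Scale up to compensate for the features that could not be compared.
    return math.sqrt(len(a) / len(pairs) * squared)


def k_nearest(index, samples, k):
    """Indices of the k samples closest to samples[index]."""
    dists = [(nan_euclidean(samples[index], s), j)
             for j, s in enumerate(samples) if j != index]
    dists.sort()
    return [j for _, j in dists[:k]]


samples = [
    [0.0, 1.0],
    [1.0, math.nan],
    [0.0, 1.5],
    [math.nan, math.nan],
]
print(k_nearest(0, samples, 2))
```

A NaN policy for interpolation would then be applied to each (sample, neighbour) pair returned by the search, so the distance rule and the interpolation rule remain separate, independently configurable steps.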
Simulated Author's Rebuttal
We thank the referee for their constructive review and recognition of OverNaN's utility for domains with informative missingness. We address the major comment on the demonstration section below.
- Referee: The paper asserts that preserving missingness avoids distorting class boundaries and reducing generalisability, yet the provided demonstrations consist only of qualitative retention checks in software examples. No quantitative comparison (e.g., classifier performance metrics, decision-boundary fidelity, or bias measures) against standard pipelines that impute before oversampling is reported, leaving the practical benefit of the central design claim unverified.
Authors: We agree that the current demonstrations are limited to qualitative verification of NaN retention during synthetic sample generation, as stated in the abstract and examples section. While the introduction motivates the approach by noting that eliminating missingness can distort boundaries and reduce generalisability, no quantitative comparisons to imputation-based baselines are included. To address this, we will revise the manuscript by adding a dedicated evaluation subsection with quantitative experiments. These will compare downstream classifier performance (F1-score, AUC) on imbalanced datasets with informative missingness when using OverNaN versus standard SMOTE after mean imputation or complete-case analysis, thereby verifying the practical benefit of the design.
Revision: yes
Circularity Check
No significant circularity
The manuscript describes OverNaN as an algorithmic extension of existing oversamplers (SMOTE-style) that accepts and propagates NaN entries per user policy. No equations, fitted parameters, or quantitative predictions appear. The central claim is a design choice (preserve missingness structure during synthetic generation) presented directly via algorithmic description and software examples, without reduction to self-definition, post-hoc fits, or load-bearing self-citations. The informativeness of missingness is treated as a domain premise rather than a derived result. The derivation chain is therefore self-contained and non-circular.