Class-Dependent Hybrid Data Augmentation for Multiclass Migraine Classification under Severe Class Imbalance
Pith reviewed 2026-05-25 04:37 UTC · model grok-4.3
The pith
Class-dependent hybrid augmentation with proportional growth improves average macro-F1 to 0.862 across classifiers for seven migraine subtypes after leakage corrections.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
After correcting methodological flaws in previous work, the paper proposes a class-dependent hybrid data augmentation framework. The framework assigns a synthetic data generation method to each class according to its sample size and constrains growth proportionally, a choice motivated by fidelity asymmetry. In macro-F1 averaged across eight classifiers, it consistently outperforms both the no-augmentation baseline and the single-augmenter baselines, reaching 0.862 on average and a maximum of 0.914 with FT-Transformer. At the per-classifier level, however, clinically motivated subtype aggregation accounts for most of the absolute gains.
What carries the argument
The class-dependent hybrid augmentation strategy that assigns generation methods based on per-class sample size, together with the fidelity asymmetry concept that motivates proportionally constrained growth as an alternative to full class balance.
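The assignment rule can be pictured as a lookup from class size to generator. The sketch below is only an illustration under stated assumptions: the function name `assign_generators`, the thresholds `small_max` and `medium_max`, and the pairing of size bands with particular generators are all hypothetical. The source names Gaussian Copula and CTGAN as individual augmenters but does not say which class sizes each was assigned to.

```python
from typing import Dict


def assign_generators(class_counts: Dict[str, int],
                      small_max: int = 20,
                      medium_max: int = 60) -> Dict[str, str]:
    """Map each class label to a generator name based on its sample count.

    The thresholds and the generator chosen for each band are hypothetical
    placeholders, not values taken from the paper.
    """
    plan = {}
    for label, n in class_counts.items():
        if n <= 0:
            raise ValueError(f"class {label!r} has no samples")
        if n <= small_max:
            plan[label] = "interpolation"    # assumed choice for very small classes
        elif n <= medium_max:
            plan[label] = "gaussian_copula"  # assumed choice for mid-sized classes
        else:
            plan[label] = "ctgan"            # assumed choice for larger classes
    return plan


# Illustrative counts only; this is not the study's class distribution.
example_counts = {"A": 12, "B": 45, "C": 200}
example_plan = assign_generators(example_counts)
```

Keeping the rule as a pure function of the per-class counts makes it easy to recompute the plan inside each training fold, where the counts differ slightly from the full dataset.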
If this is right
- The proposed framework provides higher average robustness across multiple classifiers than any individual augmentation method.
- Clinically motivated aggregation of two hemiplegic subtypes following ICHD-3 accounts for most of the absolute performance improvement when using the best single classifier.
- Proportional augmentation under fidelity asymmetry yields better results than aiming for full class balance in this imbalanced setting.
- Correcting for data leakage and metric bias substantially lowers the performance estimates reported in earlier migraine classification studies.
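The contrast between full class balance and proportionally constrained growth can be made concrete by comparing target counts. The source gives no formula for proportional growth, so the rule used here (each class grows by a fixed fraction `growth` of its own size, capped at the majority count) is an assumption, as are the function names and the example counts.

```python
import math
from typing import Dict


def full_balance_targets(counts: Dict[str, int]) -> Dict[str, int]:
    """Every class is raised to the size of the largest class."""
    top = max(counts.values())
    return {label: top for label in counts}


def proportional_targets(counts: Dict[str, int],
                         growth: float = 0.5) -> Dict[str, int]:
    """Each class grows by a fraction of its own size, capped at the majority count.

    The growth rule and its default value are hypothetical.
    """
    if growth < 0:
        raise ValueError("growth must be non-negative")
    top = max(counts.values())
    return {label: min(top, n + math.ceil(n * growth))
            for label, n in counts.items()}


def synthetic_share(counts: Dict[str, int],
                    targets: Dict[str, int]) -> Dict[str, float]:
    """Fraction of each class that would be synthetic after augmentation."""
    return {label: (targets[label] - counts[label]) / targets[label]
            for label in counts}


# Illustrative counts only; this is not the study's class distribution.
counts = {"rare": 10, "mid": 40, "common": 200}
share_full = synthetic_share(counts, full_balance_targets(counts))
share_prop = synthetic_share(counts, proportional_targets(counts, growth=0.5))
```

With these illustrative counts, full balance would make the rare class 95% synthetic, while the proportional rule keeps it at one third. Under this reading, the smallest classes, where a generator has the least real data to learn from, are the ones shielded from being dominated by synthetic rows.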
Where Pith is reading between the lines
- Similar class-dependent assignment of augmentation methods could improve robustness in other medical domains with severe class imbalance, such as rare disease diagnosis.
- The focus on average performance across classifiers suggests the method may help avoid model-specific overfitting in clinical machine learning applications.
- Testing the framework on datasets with varying numbers of classes or different imbalance ratios would reveal how general the per-class assignment rule is.
Load-bearing premise
That the 400-patient dataset, after aggregation of the hemiplegic subtypes, is representative of the seven migraine subtypes, and that the applied corrections have fully removed data leakage and metric bias with no remaining confounding.
What would settle it
Re-running the exact same framework and evaluation protocol on a new, larger independent collection of migraine patient records would determine if the reported macro-F1 improvements hold or if they were specific to the original dataset's characteristics.
Original abstract
We conducted a reproducibility-oriented re-evaluation of prior migraine classification studies, correcting for data leakage and metric bias. We then introduced (i) a clinically motivated aggregation of two hemiplegic subtypes following ICHD-3 {\S}1.2.3, (ii) a class-dependent hybrid augmentation strategy that assigns generation methods based on per-class sample size, and (iii) the concept of fidelity asymmetry, motivating proportionally constrained growth as an alternative to full class balance. Experiments were performed on a dataset of 400 patients across seven migraine subtypes under a two-stage protocol, including the six-class configuration described above. Models were evaluated using stratified 5-fold cross-validation with macro-averaged F1 as the primary metric. Correcting methodological flaws reduces previously inflated performance estimates, with the corrected macro-F1 baseline standing at 0.71. The proposed framework consistently outperformed individual augmenters in macro-F1 averaged across the eight evaluated classifiers (0.862 vs. 0.836 for Gaussian Copula, 0.815 for CTGAN, and 0.801 for the no-augmentation baseline), and achieved its peak result of 0.914 with FT-Transformer under proportional augmentation. The no-augmentation FT-Transformer baseline (0.896) shows that, at the per-classifier ceiling, clinically motivated class aggregation accounts for most of the absolute improvement; the framework's principal measurable contribution is the gain in average robustness across classifiers, highlighting the dominant role of problem formulation.
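Why macro-averaged F1 is a reasonable primary metric under severe imbalance can be seen in a small worked example. The sketch below is a plain-Python illustration with made-up labels, not the paper's evaluation code; the function names are hypothetical.

```python
from typing import Hashable, Sequence


def accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Unweighted mean of per-class F1, so every class counts equally."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted correctly contributes F1 = 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up example: a classifier that always predicts the majority class.
y_true = ["common"] * 9 + ["rare"]
y_pred = ["common"] * 10

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
```

Here accuracy is 0.9 even though the rare class is never detected, while macro-F1 falls below 0.5 because the rare class contributes an F1 of zero to the average.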
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a reproducibility-oriented re-evaluation of prior migraine classification studies, correcting for data leakage and metric bias. It introduces (i) clinically motivated aggregation of two hemiplegic subtypes per ICHD-3 §1.2.3, (ii) a class-dependent hybrid augmentation strategy that assigns generation methods based on per-class sample size, and (iii) the concept of fidelity asymmetry motivating proportionally constrained growth. Experiments use a 400-patient dataset across seven migraine subtypes under a two-stage protocol with stratified 5-fold cross-validation and macro-F1 as primary metric. The corrected baseline is 0.71; the framework reports average macro-F1 of 0.862 across eight classifiers (vs. 0.836 Gaussian Copula, 0.815 CTGAN, 0.801 no-augmentation), with peak 0.914 for FT-Transformer under proportional augmentation (no-augmentation FT-Transformer baseline 0.896).
Significance. If the leakage corrections and post-aggregation dataset are free of residual confounding, the work shows that clinically motivated class aggregation accounts for most absolute gains while the hybrid strategy improves average robustness across classifiers. This highlights the value of problem formulation over augmentation alone in severe imbalance settings and supplies concrete, reproducible numbers from a two-stage protocol on a medical dataset.
major comments (2)
- [Abstract and Methods] Abstract and Methods (two-stage protocol and leakage corrections): the central claim that the hybrid framework delivers measurable robustness gains (0.862 vs. 0.801 baseline) beyond aggregation rests on the corrected 400-patient dataset being free of residual patient-selection or feature-definition confounding. No patient-level split audit, external cohort, or explicit validation of the corrections is described, which is load-bearing for attributing the delta to the augmentation strategy.
- [Results] Results (classifier-averaged macro-F1 and per-classifier baselines): the no-augmentation FT-Transformer result of 0.896 vs. 0.914 peak shows aggregation drives most improvement, yet the average robustness claim (0.862) lacks reported per-fold variance, statistical significance tests, or an ablation isolating each hybrid component from the aggregation step.
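A patient-level split audit of the kind the first comment asks for can be quite small. The sketch below assumes each record carries a patient identifier (the hypothetical `patient_ids` list) and that folds are given as index pairs; neither detail is described in the source.

```python
from typing import Hashable, List, Sequence, Tuple

Fold = Tuple[Sequence[int], Sequence[int]]  # (train indices, test indices)


def audit_folds(folds: Sequence[Fold],
                patient_ids: Sequence[Hashable]) -> List[Tuple[int, list]]:
    """Return (fold number, shared patient ids) for every fold in which a
    patient appears on both the training and the test side."""
    problems = []
    for i, (train_idx, test_idx) in enumerate(folds):
        train_patients = {patient_ids[j] for j in train_idx}
        test_patients = {patient_ids[j] for j in test_idx}
        shared = train_patients & test_patients
        if shared:
            problems.append((i, sorted(shared)))
    return problems


# Made-up records: patient "p1" contributes two rows.
patient_ids = ["p1", "p1", "p2", "p3"]
leaky_folds = [([0, 2], [1, 3])]   # p1 ends up on both sides
clean_folds = [([0, 1], [2, 3])]   # p1 stays entirely in training

leaky_report = audit_folds(leaky_folds, patient_ids)
clean_report = audit_folds(clean_folds, patient_ids)
```

An empty report is necessary but not sufficient. With augmentation in the loop, the generators also have to be fitted on training-fold rows only, so that no information from test rows reaches the synthetic data.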
minor comments (2)
- [Abstract] The LaTeX fragment {\S}1.2.3 in the abstract should be rendered as §1.2.3 for readability.
- [Abstract] The term 'fidelity asymmetry' is introduced without a concise formal definition or equation in the provided abstract text.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on attribution of gains and validation of the corrected dataset. We address each major comment below, with revisions where feasible to improve transparency and rigor while remaining faithful to the conducted experiments.
- Referee: [Abstract and Methods] Abstract and Methods (two-stage protocol and leakage corrections): the central claim that the hybrid framework delivers measurable robustness gains (0.862 vs. 0.801 baseline) beyond aggregation rests on the corrected 400-patient dataset being free of residual patient-selection or feature-definition confounding. No patient-level split audit, external cohort, or explicit validation of the corrections is described, which is load-bearing for attributing the delta to the augmentation strategy.
Authors: The manuscript describes the two-stage protocol using stratified 5-fold cross-validation on the 400-patient dataset and specifies the leakage corrections applied to prior studies (data leakage and metric bias). These steps mitigate patient-selection and feature-definition issues within the available data. We agree that an external cohort would provide stronger evidence against residual confounding; no such cohort is available. We will revise the Methods section to expand the explicit description of the patient-level split procedure and the precise correction steps performed, improving transparency without overstating the evidence. revision: partial
- Referee: [Results] Results (classifier-averaged macro-F1 and per-classifier baselines): the no-augmentation FT-Transformer result of 0.896 vs. 0.914 peak shows aggregation drives most improvement, yet the average robustness claim (0.862) lacks reported per-fold variance, statistical significance tests, or an ablation isolating each hybrid component from the aggregation step.
Authors: The manuscript already states that aggregation accounts for most absolute gains (explicitly citing the 0.896 no-augmentation FT-Transformer baseline versus the 0.914 peak) while the hybrid framework's main contribution is improved average robustness across classifiers. To strengthen the robustness claim, we will add per-fold variance, paired statistical significance tests, and an ablation that isolates the hybrid augmentation components from the aggregation step in the revised Results section. revision: yes
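The per-fold variance and paired significance tests mentioned in this response could take roughly the following shape. This is a sketch under assumptions: an exact sign-flip permutation test is one possible paired test, not necessarily the one the authors will use, and the per-fold scores are made-up placeholders rather than results from the study.

```python
from itertools import product
from statistics import mean, stdev
from typing import Sequence, Tuple


def fold_summary(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of per-fold scores."""
    return mean(scores), stdev(scores)


def paired_sign_flip_pvalue(scores_a: Sequence[float],
                            scores_b: Sequence[float]) -> float:
    """Exact two-sided sign-flip permutation test on paired fold differences."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both methods need one score per fold")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    n_extreme = 0
    n_total = 0
    # Enumerate every way of flipping the sign of each fold difference.
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:
            n_extreme += 1
        n_total += 1
    return n_extreme / n_total


# Placeholder per-fold macro-F1 values, NOT taken from the paper.
hybrid = [0.80, 0.82, 0.85, 0.81, 0.84]
baseline = [0.78, 0.80, 0.80, 0.79, 0.83]

p_value = paired_sign_flip_pvalue(hybrid, baseline)
```

With five folds there are only 32 sign patterns, so even when one method wins on every fold the smallest possible two-sided p-value is 2/32 = 0.0625. A 5-fold protocol alone therefore cannot reach the conventional 0.05 threshold with this test, which is an argument for repeated cross-validation or for reporting effect sizes with fold-level spread.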
Circularity Check
No significant circularity; empirical CV results independent of augmentation inputs
The paper reports an empirical ML study: prior-work corrections, ICHD-3-based subtype aggregation, class-dependent hybrid augmentation, and stratified 5-fold CV evaluation on a 400-patient dataset. Macro-F1 values (0.862 average, 0.914 peak) are computed on held-out folds and do not reduce to any fitted parameter or self-defined quantity by construction. No equations, uniqueness theorems, or self-citations appear as load-bearing premises for the central performance claims. The derivation chain consists of standard data-preprocessing and augmentation steps followed by independent cross-validation; the reported deltas are falsifiable against external cohorts and do not collapse to the augmentation strategy itself.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: ICHD-3 §1.2.3 provides a clinically valid basis for aggregating the two hemiplegic subtypes.
invented entities (1)
- fidelity asymmetry: no independent evidence