
arxiv: 2605.23453 · v1 · pith:4WOM5EPWnew · submitted 2026-05-22 · 💻 cs.LG

Class-Dependent Hybrid Data Augmentation for Multiclass Migraine Classification under Severe Class Imbalance

Pith reviewed 2026-05-25 04:37 UTC · model grok-4.3

classification 💻 cs.LG
keywords data augmentation · class imbalance · migraine classification · hybrid methods · multiclass · medical machine learning · imbalanced learning

The pith

Class-dependent hybrid augmentation with proportional growth improves average macro-F1 to 0.862 across classifiers for seven migraine subtypes after leakage corrections.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper re-evaluates prior migraine classification studies by correcting for data leakage and metric bias, which brings the baseline macro-F1 down to 0.71. It proposes a clinically motivated aggregation of hemiplegic subtypes and a class-dependent hybrid augmentation strategy that selects generation methods according to per-class sample sizes, along with the idea of fidelity asymmetry that favors proportionally constrained growth over full class balancing. On a dataset of 400 patients, the framework raises the average macro-F1 across eight classifiers to 0.862, beating individual augmenters and the no-augmentation baseline of 0.801, while the peak of 0.914 occurs with the FT-Transformer under proportional augmentation. A reader would care because the work shows that tailoring augmentation to class size and fixing problem formulation can yield more reliable performance in severely imbalanced medical multiclass tasks.

Core claim

After methodological flaws in previous work are corrected, the class-dependent hybrid data augmentation framework consistently outperforms both the no-augmentation and single-augmenter baselines in macro-F1 averaged across eight classifiers, reaching 0.862 on average and a maximum of 0.914 with FT-Transformer. The framework assigns synthetic data generation methods according to per-class sample size and uses proportionally constrained growth, motivated by fidelity asymmetry. At the per-classifier level, however, clinically motivated subtype aggregation accounts for most of the absolute gains.

What carries the argument

The class-dependent hybrid augmentation strategy that assigns generation methods based on per-class sample size, together with the fidelity asymmetry concept that motivates proportionally constrained growth as an alternative to full class balance.
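
As a rough illustration of what "assigning generation methods based on per-class sample size" could look like, the sketch below maps each class to an augmenter family by its training count. The thresholds (`SMALL_MAX`, `MEDIUM_MAX`) and the method labels are hypothetical placeholders chosen for this example; the paper's actual cut-offs and per-class assignments are not stated in the text above.

```python
from collections import Counter

# Hypothetical thresholds and labels, for illustration only.
SMALL_MAX = 20    # classes at or below this size count as "small"
MEDIUM_MAX = 60   # classes at or below this size count as "medium"


def choose_augmenter(n_samples: int) -> str:
    """Pick an augmenter family from the class's training-set size."""
    if n_samples <= SMALL_MAX:
        return "interpolation"   # placeholder, e.g. a SMOTE-style method
    if n_samples <= MEDIUM_MAX:
        return "statistical"     # placeholder, e.g. a copula-style model
    return "none"                # large classes are left untouched


def assign_augmenters(labels):
    """Return {class: augmenter family} for the given training labels."""
    counts = Counter(labels)
    return {cls: choose_augmenter(n) for cls, n in counts.items()}


labels = ["A"] * 100 + ["B"] * 40 + ["C"] * 10
plan = assign_augmenters(labels)
print(plan)
```

The point of the sketch is only the dispatch structure: the rule is a function of per-class counts computed on training data, so it can be re-evaluated inside each cross-validation fold.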

If this is right

  • The proposed framework provides higher average robustness across multiple classifiers than any individual augmentation method.
  • Clinically motivated aggregation of two hemiplegic subtypes following ICHD-3 accounts for most of the absolute performance improvement when using the best single classifier.
  • Proportional augmentation under fidelity asymmetry yields better results than aiming for full class balance in this imbalanced setting.
  • Correcting for data leakage and metric bias substantially lowers the performance estimates reported in earlier migraine classification studies.
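
The contrast between full class balance and proportionally constrained growth can be made concrete with target counts. The sketch below is one plausible reading, not the paper's definition: it assumes proportional growth multiplies each class by a fixed factor (`growth`, a hypothetical parameter) capped at the majority size, and it reports the share of each class that would be synthetic under either regime.

```python
def full_balance_targets(counts):
    """Grow every class to the size of the largest class."""
    top = max(counts.values())
    return {cls: top for cls in counts}


def proportional_targets(counts, growth=2.0):
    """Grow each class by a fixed factor, never beyond the largest class.

    `growth` is a hypothetical knob used only for this illustration.
    """
    top = max(counts.values())
    return {cls: min(top, int(round(n * growth))) for cls, n in counts.items()}


def synthetic_share(n_real, n_target):
    """Fraction of the final class that would be synthetic rows."""
    return (n_target - n_real) / n_target


counts = {"A": 100, "B": 40, "C": 10}
full = full_balance_targets(counts)
prop = proportional_targets(counts, growth=2.0)

for cls, n in counts.items():
    print(cls, synthetic_share(n, full[cls]), synthetic_share(n, prop[cls]))
```

In this toy setting, full balance makes the smallest class 90% synthetic, while the capped factor keeps it at 50%. That is the kind of trade-off the fidelity asymmetry argument appears to target, since the smallest classes give a generator the least data to learn from.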

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar class-dependent assignment of augmentation methods could improve robustness in other medical domains with severe class imbalance, such as rare disease diagnosis.
  • The focus on average performance across classifiers suggests the method may help avoid model-specific overfitting in clinical machine learning applications.
  • Testing the framework on datasets with varying numbers of classes or different imbalance ratios would reveal how general the per-class assignment rule is.

Load-bearing premise

That the 400-patient dataset, after aggregation of the hemiplegic subtypes, is representative of the seven migraine subtypes, and that the applied corrections have fully removed data leakage and metric bias without any remaining confounding.

What would settle it

Re-running the exact same framework and evaluation protocol on a new, larger independent collection of migraine patient records would determine if the reported macro-F1 improvements hold or if they were specific to the original dataset's characteristics.

Figures

Figures reproduced from arXiv: 2605.23453 by Elvin Somón, Miguel A. Gutiérrez-Naranjo.

Figure 1: Class-dependent hybrid augmentation framework. The training data are parti…

Original abstract

We conducted a reproducibility-oriented re-evaluation of prior migraine classification studies, correcting for data leakage and metric bias. We then introduced (i) a clinically motivated aggregation of two hemiplegic subtypes following ICHD-3 {\S}1.2.3, (ii) a class-dependent hybrid augmentation strategy that assigns generation methods based on per-class sample size, and (iii) the concept of fidelity asymmetry, motivating proportionally constrained growth as an alternative to full class balance. Experiments were performed on a dataset of 400 patients across seven migraine subtypes under a two-stage protocol, including the six-class configuration described above. Models were evaluated using stratified 5-fold cross-validation with macro-averaged F1 as the primary metric. Correcting methodological flaws reduces previously inflated performance estimates, with the corrected macro-F1 baseline standing at 0.71. The proposed framework consistently outperformed individual augmenters in macro-F1 averaged across the eight evaluated classifiers (0.862 vs. 0.836 for Gaussian Copula, 0.815 for CTGAN, and 0.801 for the no-augmentation baseline), and achieved its peak result of 0.914 with FT-Transformer under proportional augmentation. The no-augmentation FT-Transformer baseline (0.896) shows that, at the per-classifier ceiling, clinically motivated class aggregation accounts for most of the absolute improvement; the framework's principal measurable contribution is the gain in average robustness across classifiers, highlighting the dominant role of problem formulation.
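
To see why macro-averaged F1 is a stricter primary metric than accuracy under imbalance, the following sketch computes both from scratch on a toy prediction that ignores the minority class. The labels are invented for the example, and the helper functions are not taken from the paper's code. Classes are enumerated from `y_true` only, which is a simplification.

```python
def per_class_f1(y_true, y_pred, cls):
    """F1 for one class treated as the positive label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    classes = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, c) for c in classes) / len(classes)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data: a classifier that always predicts the majority class.
y_true = ["maj"] * 9 + ["min"]
y_pred = ["maj"] * 10

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(acc, mf1)
```

Accuracy is 0.9 here even though the minority class is never detected, while macro-F1 drops to 9/19 (about 0.47) because the minority class contributes an F1 of zero with equal weight.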

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a reproducibility-oriented re-evaluation of prior migraine classification studies, correcting for data leakage and metric bias. It introduces (i) clinically motivated aggregation of two hemiplegic subtypes per ICHD-3 §1.2.3, (ii) a class-dependent hybrid augmentation strategy that assigns generation methods based on per-class sample size, and (iii) the concept of fidelity asymmetry motivating proportionally constrained growth. Experiments use a 400-patient dataset across seven migraine subtypes under a two-stage protocol with stratified 5-fold cross-validation and macro-F1 as primary metric. The corrected baseline is 0.71; the framework reports average macro-F1 of 0.862 across eight classifiers (vs. 0.836 Gaussian Copula, 0.815 CTGAN, 0.801 no-augmentation), with peak 0.914 for FT-Transformer under proportional augmentation (no-augmentation FT-Transformer baseline 0.896).

Significance. If the leakage corrections and post-aggregation dataset are free of residual confounding, the work shows that clinically motivated class aggregation accounts for most absolute gains while the hybrid strategy improves average robustness across classifiers. This highlights the value of problem formulation over augmentation alone in severe imbalance settings and supplies concrete, reproducible numbers from a two-stage protocol on a medical dataset.

major comments (2)
  1. [Abstract and Methods] Abstract and Methods (two-stage protocol and leakage corrections): the central claim that the hybrid framework delivers measurable robustness gains (0.862 vs. 0.801 baseline) beyond aggregation rests on the corrected 400-patient dataset being free of residual patient-selection or feature-definition confounding. No patient-level split audit, external cohort, or explicit validation of the corrections is described, which is load-bearing for attributing the delta to the augmentation strategy.
  2. [Results] Results (classifier-averaged macro-F1 and per-classifier baselines): the no-augmentation FT-Transformer result of 0.896 vs. 0.914 peak shows aggregation drives most improvement, yet the average robustness claim (0.862) lacks reported per-fold variance, statistical significance tests, or an ablation isolating each hybrid component from the aggregation step.
minor comments (2)
  1. [Abstract] The LaTeX fragment {§}1.2.3 in the abstract should be rendered as §1.2.3 for readability.
  2. [Abstract] The term 'fidelity asymmetry' is introduced without a concise formal definition or equation in the provided abstract text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on attribution of gains and validation of the corrected dataset. We address each major comment below, with revisions where feasible to improve transparency and rigor while remaining faithful to the conducted experiments.

read point-by-point responses
  1. Referee: [Abstract and Methods] Abstract and Methods (two-stage protocol and leakage corrections): the central claim that the hybrid framework delivers measurable robustness gains (0.862 vs. 0.801 baseline) beyond aggregation rests on the corrected 400-patient dataset being free of residual patient-selection or feature-definition confounding. No patient-level split audit, external cohort, or explicit validation of the corrections is described, which is load-bearing for attributing the delta to the augmentation strategy.

    Authors: The manuscript describes the two-stage protocol using stratified 5-fold cross-validation on the 400-patient dataset and specifies the leakage corrections applied to prior studies (data leakage and metric bias). These steps mitigate patient-selection and feature-definition issues within the available data. We agree that an external cohort would provide stronger evidence against residual confounding; no such cohort is available. We will revise the Methods section to expand the explicit description of the patient-level split procedure and the precise correction steps performed, improving transparency without overstating the evidence. revision: partial

  2. Referee: [Results] Results (classifier-averaged macro-F1 and per-classifier baselines): the no-augmentation FT-Transformer result of 0.896 vs. 0.914 peak shows aggregation drives most improvement, yet the average robustness claim (0.862) lacks reported per-fold variance, statistical significance tests, or an ablation isolating each hybrid component from the aggregation step.

    Authors: The manuscript already states that aggregation accounts for most absolute gains (explicitly citing the 0.896 no-augmentation FT-Transformer baseline versus the 0.914 peak) while the hybrid framework's main contribution is improved average robustness across classifiers. To strengthen the robustness claim, we will add per-fold variance, paired statistical significance tests, and an ablation that isolates the hybrid augmentation components from the aggregation step in the revised Results section. revision: yes
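
The per-fold variance and paired testing promised in the second response could take many forms. As a minimal sketch, assuming one macro-F1 score per fold for two configurations evaluated on the same folds, the code below reports mean and standard deviation and runs an exact paired sign-flip permutation test. The scores are placeholder values invented for illustration, not results from the paper, and the choice of test is an assumption rather than what the authors say they will use.

```python
from itertools import product
from statistics import mean, stdev


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(mean([s * d for s, d in zip(signs, diffs)]))
        if flipped >= observed - 1e-12:  # tolerance for float round-off
            hits += 1
    return hits / total


# Placeholder per-fold scores (NOT from the paper), same folds for both.
config_a = [0.50, 0.60, 0.55, 0.65, 0.70]
config_b = [0.45, 0.50, 0.50, 0.60, 0.60]

print(mean(config_a), stdev(config_a))
print(mean(config_b), stdev(config_b))
p_value = paired_sign_flip_pvalue(config_a, config_b)
print(p_value)
```

With five folds there are only 32 sign patterns, so the smallest attainable two-sided p-value is 2/32 = 0.0625, even when one configuration wins on every fold. A single 5-fold run therefore cannot reach the conventional 0.05 level with this test. Repeated cross-validation or a different design would be needed, and fold scores are not fully independent in any case.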

Circularity Check

0 steps flagged

No significant circularity; empirical CV results independent of augmentation inputs

full rationale

The paper reports an empirical ML study: prior-work corrections, ICHD-3-based subtype aggregation, class-dependent hybrid augmentation, and stratified 5-fold CV evaluation on a 400-patient dataset. Macro-F1 values (0.862 average, 0.914 peak) are computed on held-out folds and do not reduce to any fitted parameter or self-defined quantity by construction. No equations, uniqueness theorems, or self-citations appear as load-bearing premises for the central performance claims. The derivation chain consists of standard data-preprocessing and augmentation steps followed by independent cross-validation; the reported deltas are falsifiable against external cohorts and do not collapse to the augmentation strategy itself.
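
The independence claim depends on augmentation never seeing held-out rows. The sketch below shows that property in isolation, assuming a simple round-robin stratified split and a stand-in augmenter (`augment_from`, hypothetical) that only records which real rows each synthetic row is derived from. It is not the paper's pipeline; a real generator would fit a model on the same training-only pool.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Deal each class's shuffled indices round-robin into k folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, idx in enumerate(idxs):
            folds[pos % k].append(idx)
    return folds


def augment_from(train_idx, labels, target_class, n_new, rng):
    """Stand-in augmenter: return source indices of n_new synthetic rows.

    Sources are drawn from training rows of the target class only.
    """
    pool = [i for i in train_idx if labels[i] == target_class]
    return [rng.choice(pool) for _ in range(n_new)]


labels = ["maj"] * 40 + ["min"] * 10
folds = stratified_folds(labels, k=5)
rng = random.Random(1)

for test_idx in folds:
    held_out = set(test_idx)
    train_idx = [i for i in range(len(labels)) if i not in held_out]
    sources = augment_from(train_idx, labels, "min", n_new=8, rng=rng)
    # Leakage check: no synthetic row may derive from a held-out row.
    assert held_out.isdisjoint(sources)
```

Augmenting before the split would let synthetic neighbours of test rows enter training, which is one common way leakage inflates scores in oversampling studies.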

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Based on abstract only; no explicit free parameters, axioms, or invented entities are quantified beyond the clinical aggregation rule and the new fidelity-asymmetry framing.

axioms (1)
  • domain assumption: ICHD-3 §1.2.3 provides a clinically valid basis for aggregating the two hemiplegic subtypes.
    Invoked to reduce the seven-class problem to six classes.
invented entities (1)
  • fidelity asymmetry (no independent evidence)
    purpose: Motivates proportionally constrained growth instead of full class balance.
    Introduced to justify the proportional augmentation regime.

pith-pipeline@v0.9.0 · 5811 in / 1554 out tokens · 26898 ms · 2026-05-25T04:37:12.229492+00:00 · methodology

