
arXiv: 2605.23094 · v1 · pith:CQJNO377new · submitted 2026-05-21 · eess.IV · cs.AI · cs.CV

Do Synthetic Brain MRIs Reliably Improve Tumour Classification? A StyleGAN2-ADA Class-Plane Augmentation Study on BRISC 2025

Pith reviewed 2026-05-25 04:50 UTC · model grok-4.3

classification: eess.IV · cs.AI · cs.CV
keywords: synthetic data augmentation · StyleGAN2-ADA · brain MRI · tumour classification · BRISC 2025 · generative adversarial networks · medical image classification

The pith

Synthetic brain MRIs from StyleGAN2-ADA significantly improve tumour classification only for MobileViTV2, and only with filtered 1:1 augmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether adding images from twelve class-plane StyleGAN2-ADA generators to real BRISC 2025 training data raises held-out tumour classification performance. It evaluates the effect for a random forest on InceptionV3 features, a compact CNN, and MobileViTV2, at 1:1 and 1:2 real-to-synthetic ratios, with and without feature-space filtering. Only MobileViTV2 shows a statistically significant lift after Holm correction: filtered 1:1 augmentation raises its accuracy by 1.02 percentage points. A separate GPT-5.5 blind test distinguishes real from synthetic images only modestly better than chance (57.73%), yet this realism does not predict usefulness for every classifier. The result matters because generative augmentation is widely proposed for small medical datasets, and the work shows that its value depends on both model type and mixing ratio.
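
The mixing ratios can be read as one (1:1) or two (1:2) synthetic images added per real training image. The sketch below illustrates only that bookkeeping. The function name `mix_real_synthetic`, the per-class sampling without replacement, and the placeholder labels `A` and `B` are assumptions made for illustration; they are not taken from the paper's pipeline.

```python
import random
from collections import defaultdict


def mix_real_synthetic(real, synthetic, synth_per_real, seed=0):
    """Return the real items plus sampled synthetic items.

    real, synthetic: lists of (item_id, class_label) tuples.
    synth_per_real: 1 for a 1:1 ratio, 2 for a 1:2 ratio.
    Sampling is per class and without replacement (an assumption).
    """
    rng = random.Random(seed)

    # Group the synthetic pool by class label.
    pool = defaultdict(list)
    for item in synthetic:
        pool[item[1]].append(item)

    # Count real items per class to size the synthetic draw.
    real_counts = defaultdict(int)
    for _, label in real:
        real_counts[label] += 1

    mixed = list(real)
    for label, n_real in real_counts.items():
        wanted = n_real * synth_per_real
        available = pool[label]
        if wanted > len(available):
            raise ValueError(f"not enough synthetic samples for class {label!r}")
        mixed.extend(rng.sample(available, wanted))
    return mixed


# Hypothetical toy data: three real items, six synthetic candidates.
real = [("r0", "A"), ("r1", "A"), ("r2", "B")]
synthetic = [(f"s{i}", "A") for i in range(4)] + [(f"t{i}", "B") for i in range(2)]
print(len(mix_real_synthetic(real, synthetic, synth_per_real=1)))
print(len(mix_real_synthetic(real, synthetic, synth_per_real=2)))
```

Real images are never dropped in this sketch, so the two ratios differ only in how much synthetic material is appended.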

Core claim

Class-plane StyleGAN2-ADA augmentation yields architecture- and ratio-dependent gains in tumour classification on BRISC 2025. Filtered 1:1 supplementation raises accuracy by 1.02 percentage points (95% CI 0.54 to 1.54; Holm-corrected p = 0.0104) for MobileViTV2 only. The random forest receives no benefit, and the compact CNN shows mean gains that do not survive correction. Downstream utility is therefore not assured by visual fidelity alone.
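
For readers unfamiliar with the Holm correction behind the reported p-value, the following is a minimal sketch of the standard step-down procedure. The input p-values are hypothetical, and the size of the paper's comparison family is not stated in this summary, so the example only shows the mechanics.

```python
def holm_adjust(p_values):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is multiplied by (m - k + 1), capped at 1.
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: adjusted values never decrease with rank.
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for three comparisons.
print(holm_adjust([0.01, 0.04, 0.03]))
```

A comparison counts as significant at level 0.05 when its adjusted value is below 0.05, which is the sense in which the CNN gains "do not survive correction".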

What carries the argument

Twelve class-plane StyleGAN2-ADA generators produce synthetic MRIs, which are optionally filtered in InceptionV3 feature space and then added to the real training sets of the downstream tumour classifiers.
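
This summary does not state the exact filtering criterion, so the sketch below shows only one plausible form of feature-space filtering: keep a synthetic sample if its feature vector lies within a distance threshold of some real feature vector. The nearest-neighbour rule, the Euclidean metric, the threshold, and the function names are all assumptions for illustration, and the two-dimensional vectors stand in for real InceptionV3 features.

```python
import math


def nearest_real_distance(vec, real_vecs):
    """Euclidean distance from vec to its closest real feature vector."""
    return min(math.dist(vec, r) for r in real_vecs)


def filter_synthetic(real_vecs, synth_vecs, max_distance):
    """Keep synthetic vectors that lie within max_distance of a real vector."""
    return [
        s for s in synth_vecs
        if nearest_real_distance(s, real_vecs) <= max_distance
    ]


# Hypothetical toy features (real InceptionV3 features are high-dimensional).
real_vecs = [(0.0, 0.0), (1.0, 0.0)]
synth_vecs = [(0.1, 0.0), (5.0, 5.0)]
print(filter_synthetic(real_vecs, synth_vecs, max_distance=0.5))
```

In practice such a filter would be applied per class (and here, per plane), so that a synthetic image is compared only with real images it is meant to resemble.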

If this is right

  • Random forest classifiers on InceptionV3 features receive no accuracy benefit from the added synthetic samples at either ratio.
  • Compact CNN classifiers record mean accuracy increases that disappear after Holm correction for multiple comparisons.
  • MobileViTV2 reaches its largest improvement under filtered 1:1 augmentation.
  • Augmented CNN runs select their checkpoints 42–64% earlier than baseline, and compute-matched MobileViTV2 runs reach selection after 50–67% fewer real-data epochs.
  • Augmentation success cannot be inferred from the difficulty of distinguishing real from synthetic images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The architecture dependence suggests that hybrid convolutional-transformer models may exploit distributional cues in the synthetic images differently from pure convolutional or feature-based models.
  • Small absolute gains imply that StyleGAN2-ADA outputs may need to be paired with other regularisation techniques to produce larger practical improvements.
  • The earlier checkpoint selection under augmentation indicates the synthetic samples may function partly as a training regulariser rather than solely as additional data.
  • Varying the strictness of the InceptionV3 filter could be tested to determine whether more selective inclusion of synthetic samples amplifies the observed benefit.

Load-bearing premise

Observed accuracy differences arise from the synthetic samples themselves rather than from random training variation or dataset-specific effects.

What would settle it

Re-running the MobileViTV2 trials with fresh random seeds and checking whether the 1.02 percent gain and its Holm-corrected significance remain stable across independent trainings.
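
One way to run such a stability check is to pair each seed's augmented accuracy with its baseline accuracy and bootstrap the mean difference. The sketch below assumes a percentile bootstrap over seed-level differences; the summary does not say how the paper's intervals were computed, and the accuracy values are hypothetical.

```python
import random
import statistics


def paired_bootstrap_ci(baseline, augmented, n_boot=10000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (augmented - baseline)."""
    if len(baseline) != len(augmented):
        raise ValueError("paired inputs must have equal length")
    diffs = [a - b for a, b in zip(augmented, baseline)]
    rng = random.Random(seed)
    # Resample the seed-level differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return statistics.fmean(diffs), lo, hi


# Hypothetical per-seed test accuracies for ten seeds.
baseline = [0.951, 0.948, 0.953, 0.950, 0.949, 0.952, 0.947, 0.951, 0.950, 0.949]
augmented = [0.962, 0.957, 0.961, 0.963, 0.958, 0.960, 0.959, 0.961, 0.962, 0.957]
mean_diff, lo, hi = paired_bootstrap_ci(baseline, augmented)
print(f"mean gain {mean_diff:.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
```

If a fresh set of seeds gives an interval that still excludes zero, the gain is less likely to be an artefact of training variation.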

Figures

Figures reproduced from arXiv: 2605.23094 by José Rafael Noriega Cedeño.

Figure 1: StyleGAN2-ADA training loop used for class-plane brain MRI synthesis. The label-free schematic is read from top to bottom: a latent vector 𝑧 is normalized to 𝑧̂, passed through the eight-layer mapping network 𝑓, and converted to the intermediate latent 𝑤. The synthesis network 𝑔 starts from a learned 4 × 4 constant, then passes through six progressively larger synthesis blocks; each block receives a style…
Figure 2: Raw-to-preprocessed BRISC examples documenting the harmonization applied before GAN training and downstream classification. Panel (a) shows raw MRI examples arranged as tumour class by anatomical plane, exposing the original variation in field of view, skull visibility, background intensity, crop position, and contrast. Panel (b) shows the same images after grayscale conversion, brain-region masking, perc…
Figure 3: Training dynamics for the twelve class-plane StyleGAN2-ADA generators over 1,000 kimg. In all panels, colour encodes tumour class and line style encodes anatomical plane, so each visible trajectory corresponds to one independent generator. Panel (a) overlays generator and discriminator loss trajectories, summarizing adversarial stability rather than supervised convergence. Panel (b) reports the discrimina…
Figure 4: Training-time image-quality diagnostics for the twelve class-plane StyleGAN2-ADA generators. Each curve represents one tumour-class and anatomical-plane generator; colour indicates tumour class and line style indicates axial, coronal, or sagittal plane. Panel (a) shows FID across training progress; lower values indicate closer feature-distribution agreement with the corresponding real subset. Panel (b) sh…
Figure 6: UMAP projection of InceptionV3 pool3 features for all real training images and all synthetic pool images used in the filtering stage. Filled circles are real images, cross marks are synthetic candidates, and colour indicates tumour class. The projection was fitted jointly to real and synthetic features using two output dimensions, Euclidean distance, 15 nearest neighbours, minimum embedding distance 0.1, a…
Figure 7: Compact two-headed CNN used as the end-to-end convolutional classifier. A 128 × 128 preprocessed MRI slice enters a shared five-block convolutional backbone; each block applies two 3 × 3 convolutions, batch normalization, ReLU activation, and 2 × 2 max pooling. Global adaptive average pooling collapses the final feature maps into a 512-dimensional representation before a shared fully connected layer maps 5…
Figure 8: MobileViTV2 control architecture used as the pretrained hybrid convolutional-transformer classifier. A 128 × 128 preprocessed MRI slice enters an ImageNet-initialised MobileViTV2-100 CVNets backbone: the stacked modules depict the convolutional stem, mobile convolution blocks, separable self-attention over token grids, and the final feature-map stack. Adaptive average pooling reduces the shared feature ma…
Figure 10: Seed-averaged tumour confusion matrix for the random forest under the real-only baseline condition. Rows are true tumour classes and columns are predicted tumour classes; each cell is the row-normalized percentage of held-out test images assigned to that predicted class, reported as 𝑥̄ ± half 95% CI across ten seeds. The diagonal gives class recall. The RF was highly reliable for no tumour (99.8% recall) …
Figure 11: Paired random-forest changes relative to the real-only baseline. Each row is a tumour-class or aggregate metric, and each horizontal violin shows the seed-level change for one augmentation condition: unfiltered 1:1, unfiltered 1:2, filtered 1:1, or filtered 1:2. The vertical dashed zero line marks no change; values to the right favour augmentation and values to the left favour the real-only baseline. Nega…
Figure 12: Seed-averaged confusion matrices for the compact two-headed CNN under the real-only baseline condition. Panel (a) is the tumour-class head: rows are true tumour classes and columns are predicted tumour classes. Panel (b) is the anatomical-plane head: rows are true planes and columns are predicted planes. Cell values are row-normalized percentages, reported as 𝑥̄ ± half 95% CI across ten seeds, so diagonal…
Figure 13: Validation behaviour of the compact two-headed CNN across the five training conditions. Panel (a) shows the multitask validation loss used for checkpoint selection; panel (b) shows tumour-head validation accuracy over the same epochs. Solid curves are seed medians and shaded bands are interquartile ranges (25th–75th percentile) across ten seeds, covering full training histories rather than just selected …
Figure 15: MobileViTV2 validation behaviour under the compute-matched training regime. Panel (a) shows validation loss as a function of optimizer step; panel (b) shows tumour-head validation accuracy over the same step axis. Solid curves are cubic-polynomial smoothed seed means, and shaded bands are estimated 95% CI across ten seeds. The late-training bands are narrow because validation loss and tumour accuracy were…
Figure 14: Seed-averaged confusion matrices for MobileViTV2 under the real-only baseline condition. Panel (a) reports the tumour-class head, with true tumour classes in rows and predicted tumour classes in columns. Panel (b) reports the anatomical-plane head, with true planes in rows and predicted planes in columns. Cell values are row-normalized percentages, reported as 𝑥̄ ± half 95% CI across ten seeds; diagonal c…
Figure 16: Paired downstream-performance changes for the compact CNN and MobileViTV2 relative to their real-only baselines. The four panels separate classifier and task: panel (a-1) shows CNN tumour metrics, panel (a-2) shows MobileViTV2 tumour metrics, panel (b-1) shows CNN anatomical-plane metrics, and panel (b-2) shows MobileViTV2 anatomical-plane metrics. Within each panel, rows are individual metrics and the ho…
Original abstract

Generative augmentation is often proposed as a remedy for small medical-image datasets, but synthetic images are only useful when they improve downstream task performance. "Augmentation" here means synthetic supplementation: GAN-generated samples added to the real training pool, not geometric or photometric transforms of existing images. Twelve class-plane StyleGAN2-ADA generators were trained on constrained BRISC 2025 partitions to test whether their output, with or without InceptionV3 feature-space filtering, improves held-out tumour classification across three classifier families: a random forest (RF) on InceptionV3 features, a compact two-headed convolutional neural network (CNN), and MobileViTV2, a mobile hybrid convolutional-transformer. Each was evaluated at 1:1 and 1:2 real-to-synthetic ratios. An independent GPT-5.5 blind test placed gated real-versus-synthetic discrimination at 57.73% (95% CI: 54.48–60.92%) on the model-legible subset, modestly above chance. The RF classifier did not benefit from the synthetic MRIs. The CNN showed consistent mean gains that did not survive Holm correction. MobileViTV2 showed the clearest benefit: filtered 1:1 augmentation improved tumour classification accuracy by 1.02% absolute (95% CI: 0.54–1.54%; Holm-corrected p = 0.0104). A secondary efficiency analysis found that every augmented CNN condition selected its checkpoint 42–64% earlier than baseline, while compute-matched MobileViTV2 runs reached selection after 50–67% fewer real-data epochs. Overall, augmentation utility was found to be architecture- and ratio-dependent, not guaranteed by visual fidelity alone.
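
The blind-test figure is a proportion with a binomial confidence interval. The abstract gives neither the number of judged images nor the interval method, so the sketch below simply shows how a Wilson score interval, one common choice, is computed. The counts in the example are hypothetical and are not an attempt to reproduce the reported interval.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z=1.959964 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical counts: 50 correct real-vs-synthetic calls out of 100.
low, high = wilson_interval(50, 100)
print(f"{low:.4f} to {high:.4f}")
```

An interval whose lower bound sits above 0.5 indicates discrimination better than chance, which is how the "modestly above chance" reading follows from the reported bounds.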

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript evaluates whether synthetic brain MRIs from twelve class-plane StyleGAN2-ADA generators, with or without InceptionV3 feature-space filtering, improve held-out tumour classification when added to real BRISC 2025 training data at 1:1 and 1:2 ratios. Three classifier families are tested: a random forest on InceptionV3 features, a compact two-headed CNN, and MobileViTV2. Results indicate architecture-dependent effects, with no benefit for RF, non-significant gains for CNN after Holm correction, and a 1.02% absolute accuracy improvement for MobileViTV2 under filtered 1:1 augmentation (95% CI 0.54–1.54%, Holm-corrected p=0.0104). A secondary GPT-5.5 blind test reports 57.73% real-vs-synthetic discrimination, supporting that visual fidelity alone does not predict utility. Efficiency gains in checkpoint selection are also noted for augmented conditions.

Significance. If the reported gains and statistical controls hold under full methodological scrutiny, the work supplies concrete empirical evidence that GAN augmentation benefits in medical imaging are classifier-specific rather than universal, cautioning against reliance on generative fidelity metrics alone. The use of Holm correction, explicit CIs, and an independent discrimination test strengthens the falsifiability of the architecture-dependence claim.

major comments (2)
  1. [Abstract and Methods] Abstract/Methods: The central claim of a statistically significant 1.02% gain for MobileViTV2 rests on held-out accuracy differences, yet the manuscript provides no explicit description of the train/validation/test partitioning of BRISC 2025, the number of independent random seeds or runs used to derive the 95% CIs, or the precise training hyperparameters; these omissions directly affect whether the reported p=0.0104 can be interpreted as evidence against chance or unaccounted variance.
  2. [Results] Results: The differential outcome across RF (no benefit), CNN (gains fail correction), and MobileViTV2 is presented as supporting architecture dependence, but without tabulated per-run accuracies, variance estimates, or confirmation that all three families used identical data splits and augmentation pipelines, it remains unclear whether the pattern is robust or could arise from implementation differences.
minor comments (1)
  1. [Abstract] The abstract states that 'twelve class-plane StyleGAN2-ADA generators were trained on constrained BRISC 2025 partitions' but does not specify whether this means one generator per tumour class or another partitioning; a brief clarification in the methods would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on methodological clarity and reproducibility. We address each point below and will revise the manuscript to incorporate the requested details.

  1. Referee: [Abstract and Methods] Abstract/Methods: The central claim of a statistically significant 1.02% gain for MobileViTV2 rests on held-out accuracy differences, yet the manuscript provides no explicit description of the train/validation/test partitioning of BRISC 2025, the number of independent random seeds or runs used to derive the 95% CIs, or the precise training hyperparameters; these omissions directly affect whether the reported p=0.0104 can be interpreted as evidence against chance or unaccounted variance.

    Authors: We agree that these details are necessary for full interpretation of the results. The revised manuscript will expand the Methods section to explicitly describe the train/validation/test partitioning of BRISC 2025, the number of independent random seeds or runs used to compute the 95% CIs and Holm-corrected p-values, and the complete training hyperparameters for each classifier family.
    Revision: yes

  2. Referee: [Results] Results: The differential outcome across RF (no benefit), CNN (gains fail correction), and MobileViTV2 is presented as supporting architecture dependence, but without tabulated per-run accuracies, variance estimates, or confirmation that all three families used identical data splits and augmentation pipelines, it remains unclear whether the pattern is robust or could arise from implementation differences.

    Authors: We confirm that the RF, CNN, and MobileViTV2 classifiers were evaluated using identical data splits and the same augmentation pipelines. The revised manuscript will add a supplementary table with per-run accuracies and variance estimates across conditions to demonstrate that the architecture-dependent pattern is consistent and not attributable to implementation differences.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified


The paper is a purely empirical study: it trains StyleGAN2-ADA generators on real partitions, produces synthetic samples, augments training sets at explicit 1:1 and 1:2 ratios, trains three classifier families, and reports held-out accuracy with 95% CIs and Holm-corrected p-values. No equations, fitted parameters, or derivations appear; the central claims (architecture-dependent gains, fidelity not guaranteeing utility) are direct experimental outcomes on an independent test set. No self-citation chains or uniqueness theorems are invoked to justify the results. The design measures external performance metrics and is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is an empirical machine-learning evaluation study. No mathematical derivations or fitted parameters underpin the central claim; the work relies on standard training and statistical practices already established in the field.

axioms (1)
  • [standard math] Standard assumptions underlying confidence intervals and Holm multiple-testing correction hold for the reported p-values and CIs.
    Invoked when stating Holm-corrected p = 0.0104 and the 95% CIs.

pith-pipeline@v0.9.0 · 5866 in / 1406 out tokens · 33289 ms · 2026-05-25T04:50:57.259097+00:00 · methodology

