Extending ZACH-ViT to Robust Medical Imaging: Corruption and Adversarial Stress Testing in Low-Data Regimes
Pith reviewed 2026-05-10 19:34 UTC · model grok-4.3
The pith
ZACH-ViT keeps the best mean rank on clean medical images and under common corruptions, and ranks first under FGSM and second under PGD adversarial attacks in low-data tests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ZACH-ViT, the zero-token adaptive compact hierarchical vision transformer, reaches the best overall mean rank of 1.57 on clean data and the same rank under common corruptions across seven MedMNIST datasets in a 50-sample-per-class regime. It also ranks first under FGSM attacks and second under PGD attacks, although every model drops sharply when facing adversarial perturbations. These outcomes extend the original design rationale by showing that permutation-invariant structure without rigid spatial assumptions supports both baseline accuracy and resistance to realistic degradations, while leaving adversarial defense as an unresolved issue for compact transformers in this setting.
What carries the argument
ZACH-ViT, the compact permutation-invariant Vision Transformer that avoids class tokens and fixed positional embeddings to adapt to variable or weakly structured spatial patterns in biomedical images.
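To make the permutation-invariance idea concrete, here is a minimal sketch in plain Python. It assumes patch representations are aggregated by a global average, which is how the paper describes replacing the class token. The function name `global_average_pool` and the toy values are illustrative and are not taken from the paper's code.

```python
import random

def global_average_pool(tokens):
    """Average a list of equal-length patch vectors into one vector."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / n for d in range(dim)]

# Toy patch embeddings (hypothetical values): 4 patches, 3 features each.
patches = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]

# Present the same patches in a different order.
shuffled = patches[:]
random.Random(0).shuffle(shuffled)

pooled = global_average_pool(patches)
pooled_shuffled = global_average_pool(shuffled)

# The pooled vector does not depend on patch order.
print(pooled == pooled_shuffled)
```

Self-attention without positional embeddings is permutation-equivariant, so averaging its outputs in this way yields an order-independent representation. A class token or fixed positional embedding would break that property.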
If this is right
- The benefits of ZACH-ViT's permutation-invariant design extend from clean accuracy to robustness against common image corruptions in low-data medical settings.
- All tested compact models, including ZACH-ViT, remain vulnerable to adversarial perturbations, so further defenses are still required.
- Mean rank across seven datasets offers a practical way to judge the trade-off between baseline performance and corruption resistance.
- Fixed low-data protocols with multiple seeds produce stable ranking patterns that highlight ZACH-ViT's balanced profile.
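The mean-rank metric used throughout can be sketched as follows. This is an illustration under stated assumptions: models are ranked per dataset by a single accuracy value (for example the mean over seeds), rank 1 is best, and ties share the average rank. The page does not say how the paper handles ties or seeds, and the accuracies below are hypothetical, not results from the paper.

```python
def rank_row(scores):
    """Rank models on one dataset: 1 = highest score, ties share the average rank."""
    order = sorted(scores, key=scores.get, reverse=True)
    ranks = {}
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of models tied with order[i].
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + 1 + j + 1) / 2  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mean_ranks(table):
    """Average each model's per-dataset rank over all datasets."""
    models = next(iter(table.values())).keys()
    per_dataset = [rank_row(row) for row in table.values()]
    return {m: sum(r[m] for r in per_dataset) / len(per_dataset) for m in models}

# Hypothetical accuracies, NOT results from the paper.
toy = {
    "dataset_a": {"model_w": 0.80, "model_x": 0.75, "model_y": 0.70, "model_z": 0.65},
    "dataset_b": {"model_w": 0.60, "model_x": 0.62, "model_y": 0.55, "model_z": 0.50},
    "dataset_c": {"model_w": 0.90, "model_x": 0.85, "model_y": 0.85, "model_z": 0.70},
}
result = mean_ranks(toy)
print(result)
```

With seven datasets, a mean rank of 1.57 is consistent with a rank sum of 11 (11/7 ≈ 1.57), and 2.29 with a rank sum of 16. A mean rank discards the size of accuracy gaps, which is one reason the referee asks for per-dataset accuracies as well.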
Where Pith is reading between the lines
- The same architecture could be checked on real hospital scans that contain natural acquisition artifacts rather than synthetic corruptions.
- Integrating ZACH-ViT with targeted adversarial training might close the remaining gap against FGSM and PGD without losing its corruption advantages.
- The findings point toward using flexible spatial designs in other low-data domains where image organization varies, such as pathology slides or remote sensing.
Load-bearing premise
The specific setup of 50 samples per class, fixed hyperparameters, five random seeds, and the chosen set of corruptions plus FGSM and PGD attacks on MedMNIST datasets is enough to represent real-world low-data medical imaging robustness.
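A minimal sketch of what a seeded "50 samples per class" draw might look like is given below. The helper `sample_per_class` and the toy label vector are assumptions for illustration; the page does not describe the paper's actual sampling code or whether classes with fewer than 50 examples occur.

```python
import random
from collections import defaultdict

def sample_per_class(labels, k, seed):
    """Return sorted indices holding at most k examples of each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)  # a separate generator per seed keeps runs reproducible
    chosen = []
    for label in sorted(by_class):
        pool = by_class[label]
        chosen.extend(rng.sample(pool, min(k, len(pool))))
    return sorted(chosen)

# Hypothetical label vector: 3 classes, 200 examples each.
labels = [c for c in range(3) for _ in range(200)]

# One training subset per seed, mirroring a five-seed protocol.
subsets = {seed: sample_per_class(labels, k=50, seed=seed) for seed in range(5)}
print({seed: len(idx) for seed, idx in subsets.items()})
```

Each seed then changes both the training subset and, typically, the weight initialization, so the spread across seeds mixes two sources of variance.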
What would settle it
Running the same models on a fresh medical imaging dataset that uses different corruption types or natural clinical noise and finding that ZACH-ViT loses its top mean rank on corruptions would show the current results do not generalize.
Original abstract
The recently introduced ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer) formalized a compact permutation-invariant Vision Transformer for medical imaging and argued that architectural alignment with spatial structure can matter more than universal benchmark dominance. Its design was motivated by the observation that positional embeddings and a dedicated class token encode fixed spatial assumptions that may be suboptimal when spatial organization is weakly informative, locally distributed, or variable across biomedical images. The foundational study established a regime-dependent clean performance profile across MedMNIST, but did not examine robustness in detail. In this work, we present the first robustness-focused extension of ZACH-ViT by evaluating its behavior under common image corruptions and adversarial perturbations in the same low-data setting. We compare ZACH-ViT with three scratch-trained compact baselines, ABMIL, Minimal-ViT, and TransMIL, on seven MedMNIST datasets using 50 samples per class, fixed hyperparameters, and five random seeds. Across the benchmark, ZACH-ViT achieves the best overall mean rank on clean data (1.57) and under common corruptions (1.57), indicating a favorable balance between baseline predictive performance and robustness to realistic image degradation. Under adversarial stress, all models deteriorate substantially; nevertheless, ZACH-ViT remains competitive, ranking first under FGSM (2.00) and second under PGD (2.29), where ABMIL performs best overall. These results extend the original ZACH-ViT narrative: the advantages of compact permutation-invariant transformers are not limited to clean evaluation, but can persist under realistic perturbation stress in low-data medical imaging, while adversarial robustness remains an open challenge for all evaluated models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper extends ZACH-ViT to robustness evaluation by testing it against common image corruptions and adversarial attacks (FGSM, PGD) on seven MedMNIST datasets in a low-data regime (50 samples per class, fixed hyperparameters, five random seeds). It reports that ZACH-ViT obtains the best mean rank of 1.57 on both clean data and under corruptions, and competitive adversarial ranks (first under FGSM at 2.00, second under PGD at 2.29) relative to ABMIL, Minimal-ViT, and TransMIL, concluding that its permutation-invariant design advantages extend beyond clean performance.
Significance. If the rankings are statistically supported, the work would credibly show that compact permutation-invariant ViT designs can maintain favorable robustness to realistic degradations in low-data medical imaging, extending the original ZACH-ViT findings. The reproducible setup with fixed hyperparameters and multiple seeds is a positive contribution for baseline comparisons in this domain.
major comments (1)
- [Abstract] The reported mean ranks (1.57 clean/corruptions, 2.00 FGSM, 2.29 PGD) are given without per-seed standard deviations, confidence intervals, or any statistical tests (e.g., Wilcoxon or paired t-tests on seed-level accuracies). In low-data regimes, where run-to-run variability is typically large, this makes it impossible to determine whether the claimed superiority and 'favorable balance' exceed noise, which directly undermines the central empirical claims.
minor comments (2)
- [Abstract, Methods] Exact corruption severity levels, specific parameter values for the common corruptions, and attack strengths (e.g., epsilon for FGSM/PGD) are not stated, hindering reproducibility and interpretation of the stress-test results.
- The manuscript would benefit from a table or appendix listing per-dataset accuracies (with means and stds) rather than relying solely on aggregate mean ranks.
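To show why the unstated epsilon matters, here is a hedged sketch of FGSM and a basic PGD loop on a toy logistic-regression model in plain Python. The model, weights, input, epsilon of 8/255, step size, and iteration count are all assumptions chosen for illustration; none of them come from the paper. This PGD variant starts from the clean input, whereas the formulation by Madry et al. commonly uses a random start.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad_x(x, y, w, b):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    grad = [(p - y) * wi for wi in w]
    return loss, grad

def sign(v):
    return (v > 0) - (v < 0)

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def fgsm(x, y, w, b, eps):
    """Single step of size eps along the sign of the input gradient."""
    _, g = loss_and_grad_x(x, y, w, b)
    return [clip(xi + eps * sign(gi), 0.0, 1.0) for xi, gi in zip(x, g)]

def pgd(x, y, w, b, eps, step, iters):
    """Repeated signed-gradient steps, projected back into the L-infinity eps-ball."""
    x_adv = list(x)
    for _ in range(iters):
        _, g = loss_and_grad_x(x_adv, y, w, b)
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        # Project onto the eps-ball around x, then onto the valid pixel range [0, 1].
        x_adv = [clip(clip(xa, xi - eps, xi + eps), 0.0, 1.0)
                 for xa, xi in zip(x_adv, x)]
    return x_adv

# Hypothetical toy model and input (pixel values in [0, 1]).
w, b = [2.0, -1.0, 0.5], 0.1
x, y = [0.5, 0.5, 0.5], 1
eps = 8 / 255

loss_clean, _ = loss_and_grad_x(x, y, w, b)
x_fgsm = fgsm(x, y, w, b, eps)
x_pgd = pgd(x, y, w, b, eps, step=eps / 4, iters=10)
loss_fgsm, _ = loss_and_grad_x(x_fgsm, y, w, b)
loss_pgd, _ = loss_and_grad_x(x_pgd, y, w, b)
print(loss_clean, loss_fgsm, loss_pgd)
```

Both attacks are bounded by the same epsilon, so reported adversarial accuracies are only comparable across papers when epsilon, step size, and iteration count are stated.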
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the single major comment below.
- Referee: [Abstract] The reported mean ranks (1.57 clean/corruptions, 2.00 FGSM, 2.29 PGD) are given without per-seed standard deviations, confidence intervals, or any statistical tests (e.g., Wilcoxon or paired t-tests on seed-level accuracies). In low-data regimes, where run-to-run variability is typically large, this makes it impossible to determine whether the claimed superiority and 'favorable balance' exceed noise, which directly undermines the central empirical claims.
Authors: We agree that the reported mean ranks require accompanying measures of variability and statistical tests to substantiate the claims, especially given the low-data regime and the use of only five seeds. In the revised version we will add per-seed standard deviations and 95% confidence intervals to all mean-rank figures in the abstract, results section, and tables. We will also report the results of paired Wilcoxon signed-rank tests performed on the seed-level accuracies for each dataset and perturbation type, allowing readers to evaluate whether observed rank differences exceed run-to-run noise.
Revision: yes
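The paired Wilcoxon signed-rank test mentioned in the rebuttal can be sketched in plain Python by enumerating every sign pattern, which is feasible for five seeds. This is an illustrative exact two-sided version written for this page, not the authors' analysis code, and the seed-level accuracies are hypothetical.

```python
from itertools import product

def wilcoxon_exact(a, b):
    """Two-sided exact Wilcoxon signed-rank p-value for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        return 1.0
    # Rank the absolute differences; ties share the average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    centre = sum(ranks) / 2  # expected value of W+ under the null hypothesis
    observed = abs(w_plus - centre)
    # Under the null, each difference is positive or negative with equal probability.
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed - 1e-12:
            hits += 1
    return hits / 2 ** n

# Hypothetical seed-level accuracies for two models on one dataset (5 seeds).
model_a = [0.71, 0.74, 0.69, 0.73, 0.72]
model_b = [0.66, 0.70, 0.68, 0.67, 0.65]
p_value = wilcoxon_exact(model_a, model_b)
print(p_value)
```

In this toy case model_a wins on every seed, yet the p-value is 2/32 = 0.0625. That is the smallest two-sided value the exact test can return with five pairs, so a per-dataset test on five seeds cannot reach the 0.05 level on its own. Pooling across datasets or adding seeds would be needed for the test to have any power.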
Circularity Check
No circularity: purely empirical benchmark rankings with no derivations or self-referential reductions
The paper reports direct experimental results: mean ranks of ZACH-ViT versus baselines on seven MedMNIST datasets under clean, corrupted, and adversarial settings, using fixed hyperparameters, 50 samples per class, and five seeds. No equations, predictions, ansatzes, or uniqueness theorems are derived; the central claims are computed rankings from external benchmarks. The original ZACH-ViT is referenced only as motivation, not as a load-bearing self-citation that substitutes for new evidence. This is a standard empirical extension study whose claims stand or fall on the reported runs rather than any internal reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- training samples per class
axioms (1)
- domain assumption: MedMNIST datasets with 50 samples per class and the selected corruptions plus FGSM/PGD attacks adequately proxy real medical imaging conditions
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "ZACH-ViT ... removes both positional embeddings and the dedicated [CLS] token, replacing token-based aggregation with global average pooling over patch representations."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction (unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "mean rank on clean data (1.57) and under common corruptions (1.57)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.