
arxiv: 2604.06099 · v1 · submitted 2026-04-07 · 💻 cs.CV

Extending ZACH-ViT to Robust Medical Imaging: Corruption and Adversarial Stress Testing in Low-Data Regimes

Pith reviewed 2026-05-10 19:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords ZACH-ViT, Vision Transformer, medical imaging, robustness, image corruptions, adversarial attacks, low-data regimes, MedMNIST

The pith

ZACH-ViT keeps the best mean rank for clean medical images and under common corruptions while ranking first or second against adversarial attacks in low-data tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the compact ZACH-ViT model, designed without fixed positional embeddings or class tokens, continues to perform well when medical images are degraded by common corruptions or adversarial perturbations. It compares models on seven MedMNIST datasets using only 50 samples per class, fixed hyperparameters, and five random seeds. ZACH-ViT secures the best mean rank on both clean images and common corruptions, and it stays competitive under gradient-based attacks. A reader would care because medical imaging systems often encounter noisy scans or security threats, and models that work with scarce labeled data are practical for many clinical sites. The results suggest that the architecture's tolerance for variable spatial patterns helps it maintain accuracy under common image degradations.

Core claim

ZACH-ViT, the zero-token adaptive compact hierarchical vision transformer, reaches the best overall mean rank of 1.57 on clean data and the same rank under common corruptions across seven MedMNIST datasets in a 50-sample-per-class regime. It also ranks first under FGSM attacks and second under PGD attacks, although every model drops sharply when facing adversarial perturbations. These outcomes extend the original design rationale by showing that permutation-invariant structure without rigid spatial assumptions supports both baseline accuracy and resistance to realistic degradations, while leaving adversarial defense as an unresolved issue for compact transformers in this setting.
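For readers unfamiliar with the two attacks named above, the following is a minimal sketch of FGSM (one signed-gradient step) and PGD (repeated signed-gradient steps projected back into an L-infinity ball). It uses a toy logistic-regression classifier so that the input gradient has a closed form. The weights, the input, and the values of `eps`, `alpha`, and `steps` are hypothetical placeholders; the attack strengths used in the paper are not stated in the material summarized here.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(x, y, w, b):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. the input x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
    grad = [(p - y) * wi for wi in w]
    return loss, grad

def sign(v):
    return (v > 0) - (v < 0)

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def fgsm(x, y, w, b, eps):
    """One step of size eps along the sign of the input gradient, kept in [0, 1]."""
    _, g = loss_and_grad(x, y, w, b)
    return [clip(xi + eps * sign(gi), 0.0, 1.0) for xi, gi in zip(x, g)]

def pgd(x, y, w, b, eps, alpha, steps):
    """Iterated signed-gradient steps of size alpha, projected into the eps-ball around x."""
    x_adv = list(x)
    for _ in range(steps):
        _, g = loss_and_grad(x_adv, y, w, b)
        x_adv = [xa + alpha * sign(gi) for xa, gi in zip(x_adv, g)]
        x_adv = [clip(xa, x0 - eps, x0 + eps) for xa, x0 in zip(x_adv, x)]
        x_adv = [clip(xa, 0.0, 1.0) for xa in x_adv]
    return x_adv

# Hypothetical toy model and input (not taken from the paper).
w, b = [2.0, -1.0, 0.5], -0.3
x, y = [0.6, 0.2, 0.5], 1
eps = 0.1

x_fgsm = fgsm(x, y, w, b, eps)
x_pgd = pgd(x, y, w, b, eps, alpha=0.025, steps=10)

loss_clean, _ = loss_and_grad(x, y, w, b)
loss_fgsm, _ = loss_and_grad(x_fgsm, y, w, b)
loss_pgd, _ = loss_and_grad(x_pgd, y, w, b)
print(loss_clean, loss_fgsm, loss_pgd)
```

For a linear model the gradient sign never changes, so PGD ends at the same corner of the eps-ball that FGSM reaches in one step. On a deep network the two generally differ, and PGD is usually the stronger attack.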

What carries the argument

The argument rests on ZACH-ViT, a compact permutation-invariant Vision Transformer that omits class tokens and fixed positional embeddings so that it can adapt to variable or weakly structured spatial patterns in biomedical images.
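The permutation-invariance property can be illustrated with a generic sketch. It is not the ZACH-ViT architecture itself: it uses a single attention head with identity projections, followed by mean pooling in place of a class token. Self-attention without positional embeddings is permutation-equivariant, and mean pooling over tokens then makes the pooled output independent of patch order. The `pos` table is a hypothetical positional embedding, added only to show how a fixed spatial assumption breaks that property.

```python
import math
import random

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    # Single-head attention with identity Q/K/V projections (an illustrative simplification).
    dim = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        weights = softmax(scores)
        out.append([sum(wt * v[d] for wt, v in zip(weights, tokens)) for d in range(dim)])
    return out

def encode(tokens, pos=None):
    # Optionally add a fixed positional embedding to each slot before attention.
    if pos is not None:
        tokens = [[t + p for t, p in zip(tok, pe)] for tok, pe in zip(tokens, pos)]
    attended = self_attention(tokens)
    n = len(attended)
    # Mean pooling stands in for a class token and discards token order.
    return [sum(tok[d] for tok in attended) / n for d in range(len(attended[0]))]

def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

rng = random.Random(0)
patches = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(9)]  # 9 toy patch tokens
reordered = patches[::-1]                                              # same patches, reversed order
pos = [[0.1 * i] * 4 for i in range(9)]                                # hypothetical positional embedding

diff_without_pos = max_abs_diff(encode(patches), encode(reordered))
diff_with_pos = max_abs_diff(encode(patches, pos), encode(reordered, pos))
print(diff_without_pos, diff_with_pos)
```

Without the positional table, the two pooled outputs agree up to floating-point rounding. With it, the same set of patches produces a different representation once the order changes.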

If this is right

  • The benefits of ZACH-ViT's permutation-invariant design extend from clean accuracy to robustness against common image corruptions in low-data medical settings.
  • All tested compact models, including ZACH-ViT, remain vulnerable to adversarial perturbations, so further defenses are still required.
  • Mean rank across seven datasets offers a practical way to judge the trade-off between baseline performance and corruption resistance.
  • Fixed low-data protocols with multiple seeds produce stable ranking patterns that highlight ZACH-ViT's balanced profile.
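Since the headline numbers are mean ranks, a small sketch of how such a figure could be computed may help. It ranks models within each dataset (1 = best score, ties share the average rank) and averages the ranks across datasets. Whether the paper handles ties this way is an assumption, and the model names and accuracies below are made-up placeholders, not results from the paper.

```python
def rank_descending(scores):
    """Rank models on one dataset: 1 = highest score; tied models share the average rank."""
    ordered = sorted(scores.items(), key=lambda kv: -kv[1])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        average_rank = ((i + 1) + (j + 1)) / 2
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = average_rank
        i = j + 1
    return ranks

def mean_rank(per_dataset_scores):
    """Average each model's per-dataset rank over all datasets."""
    models = list(next(iter(per_dataset_scores.values())))
    totals = {m: 0.0 for m in models}
    for scores in per_dataset_scores.values():
        for model, rank in rank_descending(scores).items():
            totals[model] += rank
    n = len(per_dataset_scores)
    return {m: totals[m] / n for m in models}

# Placeholder accuracies for four hypothetical models on three hypothetical datasets.
scores = {
    "dataset_a": {"model_a": 0.80, "model_b": 0.75, "model_c": 0.70, "model_d": 0.60},
    "dataset_b": {"model_a": 0.65, "model_b": 0.70, "model_c": 0.60, "model_d": 0.55},
    "dataset_c": {"model_a": 0.90, "model_b": 0.85, "model_c": 0.85, "model_d": 0.80},
}
result = mean_rank(scores)
print(result)
```

As a sanity check on the reported values, a mean rank of 1.57 over seven datasets is consistent with a rank sum of 11 (11/7 ≈ 1.57), 2.00 with a sum of 14, and 2.29 with a sum of 16.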

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same architecture could be checked on real hospital scans that contain natural acquisition artifacts rather than synthetic corruptions.
  • Integrating ZACH-ViT with targeted adversarial training might close the remaining gap against FGSM and PGD without losing its corruption advantages.
  • The findings point toward using flexible spatial designs in other low-data domains where image organization varies, such as pathology slides or remote sensing.

Load-bearing premise

The specific setup of 50 samples per class, fixed hyperparameters, five random seeds, and the chosen set of corruptions plus FGSM and PGD attacks on MedMNIST datasets is enough to represent real-world low-data medical imaging robustness.
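To make the low-data protocol concrete, here is a hedged sketch of drawing a fixed number of training samples per class under a given seed. The function name, the toy label distribution, and the choice to redraw the subset for every seed are assumptions; the summary above does not say whether the five seeds change the subset, the weight initialization, or both.

```python
import random
from collections import Counter, defaultdict

def sample_per_class(labels, n_per_class, seed):
    """Return sorted indices holding at most n_per_class examples of each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    chosen = []
    for label in sorted(by_class):
        pool = by_class[label]
        # Classes with fewer than n_per_class examples contribute everything they have.
        chosen.extend(rng.sample(pool, min(n_per_class, len(pool))))
    return sorted(chosen)

# Toy label vector: 200 examples of class 0, 120 of class 1, 30 of class 2.
labels = [0] * 200 + [1] * 120 + [2] * 30

subsets = {seed: sample_per_class(labels, n_per_class=50, seed=seed) for seed in range(5)}
counts = {seed: Counter(labels[i] for i in idx) for seed, idx in subsets.items()}
print(counts[0])
```

Using a dedicated `random.Random(seed)` instance keeps the draw reproducible and independent of any other random state in a training script.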

What would settle it

Running the same models on a fresh medical imaging dataset that uses different corruption types or natural clinical noise and finding that ZACH-ViT loses its top mean rank on corruptions would show the current results do not generalize.

Figures

Figures reproduced from arXiv: 2604.06099 by Athanasios Angelakis, Marta Gomez-Barrero.

Figure 1: Clean performance and corruption-averaged robustness across the seven MedMNIST datasets. The clean performance pattern …
Figure 2: Adversarial stress testing under FGSM and PGD. All models degrade substantially relative to the clean setting. ZACH-ViT …
Figure 3: Representative corruption-specific severity plots from the benchmark outputs. The selected panels illustrate that the favorable …
Figure 4: Dataset-agnostic summary of robustness behavior.

Original abstract

The recently introduced ZACH-ViT (Zero-token Adaptive Compact Hierarchical Vision Transformer) formalized a compact permutation-invariant Vision Transformer for medical imaging and argued that architectural alignment with spatial structure can matter more than universal benchmark dominance. Its design was motivated by the observation that positional embeddings and a dedicated class token encode fixed spatial assumptions that may be suboptimal when spatial organization is weakly informative, locally distributed, or variable across biomedical images. The foundational study established a regime-dependent clean performance profile across MedMNIST, but did not examine robustness in detail. In this work, we present the first robustness-focused extension of ZACH-ViT by evaluating its behavior under common image corruptions and adversarial perturbations in the same low-data setting. We compare ZACH-ViT with three scratch-trained compact baselines, ABMIL, Minimal-ViT, and TransMIL, on seven MedMNIST datasets using 50 samples per class, fixed hyperparameters, and five random seeds. Across the benchmark, ZACH-ViT achieves the best overall mean rank on clean data (1.57) and under common corruptions (1.57), indicating a favorable balance between baseline predictive performance and robustness to realistic image degradation. Under adversarial stress, all models deteriorate substantially; nevertheless, ZACH-ViT remains competitive, ranking first under FGSM (2.00) and second under PGD (2.29), where ABMIL performs best overall. These results extend the original ZACH-ViT narrative: the advantages of compact permutation-invariant transformers are not limited to clean evaluation, but can persist under realistic perturbation stress in low-data medical imaging, while adversarial robustness remains an open challenge for all evaluated models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper extends ZACH-ViT to robustness evaluation by testing it against common image corruptions and adversarial attacks (FGSM, PGD) on seven MedMNIST datasets in a low-data regime (50 samples per class, fixed hyperparameters, five random seeds). It reports that ZACH-ViT obtains the best mean rank of 1.57 on both clean data and under corruptions, and competitive adversarial ranks (first under FGSM at 2.00, second under PGD at 2.29) relative to ABMIL, Minimal-ViT, and TransMIL, concluding that its permutation-invariant design advantages extend beyond clean performance.

Significance. If the rankings are statistically supported, the work would credibly show that compact permutation-invariant ViT designs can maintain favorable robustness to realistic degradations in low-data medical imaging, extending the original ZACH-ViT findings. The reproducible setup with fixed hyperparameters and multiple seeds is a positive contribution for baseline comparisons in this domain.

major comments (1)
  1. [Abstract] Abstract: the reported mean ranks (1.57 clean/corruptions, 2.00 FGSM, 2.29 PGD) are given without per-seed standard deviations, confidence intervals, or any statistical tests (e.g., Wilcoxon or paired t-tests on seed-level accuracies). In low-data regimes where run-to-run variability is typically large, this prevents determining whether the claimed superiority and 'favorable balance' exceed noise, directly undermining the central empirical claims.
minor comments (2)
  1. [Abstract] Abstract and methods: exact corruption severity levels, specific parameter values for the common corruptions, and attack strengths (e.g., epsilon for FGSM/PGD) are not stated, hindering reproducibility and interpretation of the stress-test results.
  2. The manuscript would benefit from a table or appendix listing per-dataset accuracies (with means and stds) rather than relying solely on aggregate mean ranks.
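The first minor comment asks for explicit severity parameters. As an illustration of what such a specification could look like, the sketch below applies Gaussian noise at five severity levels to a 28×28 toy image (the default MedMNIST resolution). The five-level scale and every sigma value are hypothetical; the actual settings are exactly what the referee notes is missing.

```python
import random

# Hypothetical noise standard deviation per severity level (not the paper's values).
SEVERITY_SIGMA = {1: 0.04, 2: 0.08, 3: 0.12, 4: 0.16, 5: 0.20}

def gaussian_noise(image, severity, seed=0):
    """Add zero-mean Gaussian noise to an image with pixel values in [0, 1], then clip."""
    rng = random.Random(seed)
    sigma = SEVERITY_SIGMA[severity]
    return [[min(1.0, max(0.0, px + rng.gauss(0.0, sigma))) for px in row] for row in image]

def mean_abs_change(original, corrupted):
    """Average absolute per-pixel difference between two images of equal size."""
    total = sum(abs(a - c) for ra, rc in zip(original, corrupted) for a, c in zip(ra, rc))
    return total / (len(original) * len(original[0]))

image = [[0.5] * 28 for _ in range(28)]  # flat grey toy image
corrupted = {s: gaussian_noise(image, s) for s in sorted(SEVERITY_SIGMA)}
changes = {s: mean_abs_change(image, corrupted[s]) for s in corrupted}
print(changes)
```

Publishing a table like `SEVERITY_SIGMA` for each corruption type, together with the attack budgets, would let others reproduce the stress tests.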

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the single major comment below.

  1. Referee: [Abstract] Abstract: the reported mean ranks (1.57 clean/corruptions, 2.00 FGSM, 2.29 PGD) are given without per-seed standard deviations, confidence intervals, or any statistical tests (e.g., Wilcoxon or paired t-tests on seed-level accuracies). In low-data regimes where run-to-run variability is typically large, this prevents determining whether the claimed superiority and 'favorable balance' exceed noise, directly undermining the central empirical claims.

    Authors: We agree that the reported mean ranks require accompanying measures of variability and statistical tests to substantiate the claims, especially given the low-data regime and the use of only five seeds. In the revised version we will add per-seed standard deviations and 95% confidence intervals to all mean-rank figures in the abstract, results section, and tables. We will also report the results of paired Wilcoxon signed-rank tests performed on the seed-level accuracies for each dataset and perturbation type, allowing readers to evaluate whether observed rank differences exceed run-to-run noise. revision: yes
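The proposed test can be sketched as follows, under stated assumptions. This is a pure-Python exact two-sided Wilcoxon signed-rank test that enumerates all sign assignments, which is feasible only for small samples. The accuracies are hypothetical placeholders in integer percent, not values from the paper, and the function name is illustrative.

```python
from itertools import product

def wilcoxon_signed_rank_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples (zero differences dropped)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # tied magnitudes share the average rank
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    center = sum(ranks) / 2
    observed = abs(w_plus - center)
    # Under the null hypothesis every sign pattern is equally likely: enumerate all 2**n.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - center) >= observed - 1e-12:
            extreme += 1
    return w_plus, extreme / 2 ** n

# Hypothetical accuracies (percent) for two models over five seeds on one dataset.
model_a = [81, 79, 83, 80, 82]
model_b = [78, 77, 79, 75, 80]
w_plus, p_value = wilcoxon_signed_rank_exact(model_a, model_b)
print(w_plus, p_value)  # 15.0 0.0625
```

Even when one model wins on every seed, five paired values give an exact two-sided p-value no lower than 2/32 = 0.0625. A per-dataset test on five seeds therefore cannot reach the 0.05 level by itself, so pooling across datasets or adding seeds may be needed for the test to be informative.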

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark rankings with no derivations or self-referential reductions


The paper reports direct experimental results: mean ranks of ZACH-ViT versus baselines on seven MedMNIST datasets under clean, corrupted, and adversarial settings, using fixed hyperparameters, 50 samples per class, and five seeds. No equations, predictions, ansatzes, or uniqueness theorems are derived; the central claims are computed rankings from external benchmarks. The original ZACH-ViT is referenced only as motivation, not as a load-bearing self-citation that substitutes for new evidence. This is a standard empirical extension study whose claims stand or fall on the reported runs rather than any internal reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

This is a pure empirical benchmarking study. The central claims rest on the assumption that the chosen experimental protocol (50 samples per class, MedMNIST, specific corruptions and attacks) is representative, with no new theoretical entities or derivations introduced.

free parameters (1)
  • training samples per class
    Fixed at 50 to define the low-data regime; this choice affects all compared models but is an experimental design parameter.
axioms (1)
  • domain assumption: MedMNIST datasets with 50 samples per class and the selected corruptions plus FGSM/PGD attacks adequately proxy real medical imaging conditions
    Invoked implicitly by using these as the evaluation protocol without further justification in the abstract.



