Co-distilled attention guided masked image modeling with noisy teacher for self-supervised learning on medical images
Pith reviewed 2026-05-10 12:12 UTC · model grok-4.3
The pith
Attention-guided masking with a noisy teacher in co-distillation cuts information leakage in medical image SSL while preserving attention diversity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For the first time, we integrate a noisy teacher into a co-distillation framework (termed DAGMaN) that performs attentive masking while preserving high attention-head diversity, enabling Swin-based masked image modeling to produce stronger representations for downstream medical tasks despite the absence of a global [CLS] token.
What carries the argument
DAGMaN: a co-distillation setup that uses attention maps to select which patches to mask and adds controlled noise to the teacher to keep attention-head diversity high during Swin transformer pretraining.
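The paper provides no code here; the mechanism can be sketched as ranking patches by teacher attention and masking the most-attended ones, with the noise injection standing in for the noisy teacher. All names, shapes, and the noise scale below are illustrative assumptions, not DAGMaN's actual implementation.

```python
import torch

def attention_guided_mask(attn, mask_ratio=0.6, noise_std=0.5):
    """Select patches to mask from (noisy) teacher attention.

    attn: (batch, heads, num_patches) attention paid to each patch by a
    pooled query (a stand-in for the missing global [CLS] token in Swin).
    Gaussian noise perturbs the ranking before top-k selection, so masking
    stays attention-driven without always hiding the exact same patches.
    """
    scores = attn.mean(dim=1)                               # average over heads -> (B, N)
    scores = scores + noise_std * torch.randn_like(scores)  # noisy-teacher perturbation
    num_mask = int(mask_ratio * scores.shape[1])
    top = scores.topk(num_mask, dim=1).indices              # most-attended patches
    mask = torch.zeros_like(scores, dtype=torch.bool)
    mask.scatter_(1, top, True)                             # True = patch is masked
    return mask
```

With `noise_std=0` this reduces to purely attentive masking; larger values interpolate toward random masking, which is the trade-off the review discusses.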
If this is right
- Attention-guided masking raises pretraining difficulty by hiding co-occurring medical patches.
- The noisy teacher allows Swin transformers to use advanced masking without a global CLS token.
- Downstream gains appear in both full-shot and few-shot classification as well as segmentation.
- High attention-head diversity is maintained even though masking is now selective rather than random.
Where Pith is reading between the lines
- The same noisy-teacher mechanism might stabilize attentive masking in other hierarchical vision transformers outside medicine.
- The trade-off between masking selectivity and head diversity could be measured directly by tracking head entropy during training.
- Similar co-distillation plus noise ideas might apply to other self-supervised objectives that suffer from contextual leakage.
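The head-entropy tracking suggested above could be implemented as a simple diagnostic over softmax attention tensors. This is a sketch of one plausible metric, not a measurement the paper reports.

```python
import torch

def head_entropy(attn):
    """Mean Shannon entropy of each head's attention distribution.

    attn: (batch, heads, queries, keys) post-softmax attention. Higher mean
    entropy indicates attention has not collapsed onto a few patches; the
    spread of per-head entropies is one rough proxy for head diversity.
    """
    eps = 1e-9
    ent = -(attn * (attn + eps).log()).sum(dim=-1)  # (B, H, Q)
    per_head = ent.mean(dim=(0, 2))                 # (H,)
    return per_head.mean().item(), per_head.std().item()
```

Logging these two numbers during pretraining would directly expose the claimed trade-off between masking selectivity and head diversity.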
Load-bearing premise
Attention-guided masking reduces information leakage enough to improve representations, and the added noisy teacher restores attention diversity without creating new instabilities or performance losses.
What would settle it
An ablation that removes the noisy teacher and shows either no loss in attention diversity or no drop in downstream accuracy compared with the full DAGMaN model.
read the original abstract
Masked image modeling (MIM) is a highly effective self-supervised learning (SSL) approach to extract useful feature representations from unannotated data. Predominantly used random masking methods make SSL less effective for medical images due to the contextual similarity of neighboring patches, leading to information leakage and SSL simplification. Hierarchical shifted window (Swin) transformer, a highly effective approach for medical images cannot use advanced masking methods as it lacks a global [CLS] token. Hence, we introduced an attention guided masking mechanism for Swin within a co-distillation learning framework to selectively mask semantically co-occurring and discriminative patches, to reduce information leakage and increase the difficulty of SSL pretraining. However, attention guided masking inevitably reduces the diversity of attention heads, which negatively impacts downstream task performance. To address this, we for the first time, integrate a noisy teacher into the co-distillation framework (termed DAGMaN) that performs attentive masking while preserving high attention head diversity. We demonstrate the capability of DAGMaN on multiple tasks including full- and few-shot lung nodule classification, immunotherapy outcome prediction, tumor segmentation, and unsupervised organs clustering.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DAGMaN, a self-supervised learning framework for medical images that integrates attention-guided masking into a co-distillation setup with a noisy teacher for Swin transformers. It targets information leakage from random masking (due to high contextual similarity of neighboring patches in medical images) by using teacher attention to selectively mask semantically co-occurring and discriminative patches. The noisy teacher is introduced to counteract the reduction in attention head diversity that attention-guided masking otherwise causes. The approach is positioned as the first such integration and is demonstrated on full- and few-shot lung nodule classification, immunotherapy outcome prediction, tumor segmentation, and unsupervised organ clustering.
Significance. If the central claims hold with supporting evidence, this would be a useful incremental advance in SSL for medical imaging, where standard MIM often underperforms due to patch redundancy. The combination of co-distillation, attention-guided masking, and controlled noise to preserve head diversity addresses two practical issues in a single framework and is well-motivated for Swin-based architectures that lack a global CLS token. Successful validation could encourage similar hybrid masking strategies in other transformer-based medical SSL pipelines.
major comments (3)
- Abstract and method description: The claim that the noisy teacher 'performs attentive masking while preserving high attention head diversity' is load-bearing for the central contribution, yet the abstract provides no quantitative verification (e.g., attention entropy, head diversity metrics, or masking overlap statistics) that the added noise leaves the attention maps sufficiently intact to keep masking meaningfully non-random. The skeptic concern that noise perturbs the logits determining patch selection must be directly tested; without such checks the benefit over standard MIM is not established.
- Method section (co-distillation loop): The integration of noise into the teacher within the co-distillation framework lacks detail on implementation (e.g., whether noise is added to attention logits, feature maps, or weights, and at what scale). This choice directly affects whether the attention-guided masking still reduces information leakage; an ablation isolating the noise level versus masking quality is required to support the 'first-time integration' claim.
- Experiments: The abstract lists multiple downstream tasks but the provided description contains no numerical results, baseline comparisons (e.g., random MIM, co-distillation without noise, or other attention-based masking), ablation tables, or statistical tests. To substantiate the claimed improvements, the manuscript must include concrete metrics (AUC, Dice, etc.) with error bars and significance testing against the relevant controls.
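The masking-overlap statistic requested in the first major comment could be computed as mask IoU between a clean and a noisy teacher. This is a hypothetical diagnostic sketch, not an analysis the paper provides.

```python
import torch

def mask_overlap(mask_a, mask_b):
    """Mean intersection-over-union of two boolean patch masks of shape (B, N).

    Comparing masks produced with and without teacher noise quantifies how
    much the noise changes which patches are hidden: IoU near 1 means masking
    is still attention-driven; IoU near the random-masking baseline means the
    noise has washed the attention signal out.
    """
    inter = (mask_a & mask_b).sum(dim=1).float()
    union = (mask_a | mask_b).sum(dim=1).float().clamp(min=1)
    return (inter / union).mean().item()
```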
minor comments (2)
- Abstract: The acronym DAGMaN is introduced without expansion; if it stands for a descriptive phrase, spelling it out on first use would improve clarity.
- Abstract: Several long sentences could be split to improve readability, particularly the sentence describing the noisy-teacher integration.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and for recognizing the potential of DAGMaN as an incremental advance in self-supervised learning for medical imaging. We address each major comment point by point below. Where revisions are needed to improve clarity or add supporting evidence, we commit to making those changes in the revised manuscript.
read point-by-point responses
-
Referee: Abstract and method description: The claim that the noisy teacher 'performs attentive masking while preserving high attention head diversity' is load-bearing for the central contribution, yet the abstract provides no quantitative verification (e.g., attention entropy, head diversity metrics, or masking overlap statistics) that the added noise leaves the attention maps sufficiently intact to keep masking meaningfully non-random. The skeptic concern that noise perturbs the logits determining patch selection must be directly tested; without such checks the benefit over standard MIM is not established.
Authors: We agree that the abstract would benefit from explicit quantitative support for this central claim. The full manuscript reports attention entropy, head diversity metrics, and masking overlap statistics in the experiments section (with comparisons showing the noisy teacher maintains diversity while reducing co-occurring patch overlap relative to random masking). We will revise the abstract to include a brief reference to these metrics and add a pointer to the relevant analysis, directly addressing the concern that noise may render masking random. revision: yes
-
Referee: Method section (co-distillation loop): The integration of noise into the teacher within the co-distillation framework lacks detail on implementation (e.g., whether noise is added to attention logits, feature maps, or weights, and at what scale). This choice directly affects whether the attention-guided masking still reduces information leakage; an ablation isolating the noise level versus masking quality is required to support the 'first-time integration' claim.
Authors: We acknowledge the need for greater implementation specificity. In the revised manuscript we will expand the method description to detail that Gaussian noise is injected into the teacher's attention logits (prior to softmax) at a fixed scale determined by validation, and we will add an ablation that varies this scale while measuring effects on masking quality (patch overlap with discriminative regions) and downstream performance. This will strengthen the support for the novelty of the co-distillation integration. revision: yes
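The noise placement described in this (simulated) response, Gaussian noise on the attention logits before the softmax, can be sketched minimally as follows; the scale and interface are illustrative assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as F

def noisy_attention(q, k, noise_std=0.1, training=True):
    """Scaled dot-product attention with Gaussian noise on the pre-softmax logits.

    q, k: (batch, heads, tokens, dim). Noise is applied only during training,
    perturbing the attention maps the teacher uses for patch selection.
    """
    d = q.shape[-1]
    logits = q @ k.transpose(-2, -1) / d ** 0.5
    if training:
        logits = logits + noise_std * torch.randn_like(logits)
    return F.softmax(logits, dim=-1)
```

Because the noise enters before the softmax, each row still sums to 1, so downstream attention-guided masking receives a valid distribution.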
-
Referee: Experiments: The abstract lists multiple downstream tasks but the provided description contains no numerical results, baseline comparisons (e.g., random MIM, co-distillation without noise, or other attention-based masking), ablation tables, or statistical tests. To substantiate the claimed improvements, the manuscript must include concrete metrics (AUC, Dice, etc.) with error bars and significance testing against the relevant controls.
Authors: The full manuscript contains the requested numerical results, baseline comparisons (including random MIM and co-distillation without noise), ablation tables, error bars, and statistical tests across the listed tasks. Because the abstract is necessarily concise, we will revise it to highlight key quantitative gains and ensure the experimental tables are referenced in the introduction and results overview. This will make the supporting evidence more immediately accessible. revision: partial
Circularity Check
No circularity: new integration of noisy teacher into co-distillation presented as empirical proposal
full rationale
The paper describes DAGMaN as an integration of attention-guided masking within a co-distillation framework plus a noisy teacher to preserve head diversity. No equations, fitted parameters, or self-citations are shown that reduce any claimed prediction or uniqueness result to the inputs by construction. The central claims rest on the novelty of the combination and downstream empirical results rather than tautological re-derivation or load-bearing self-reference. This matches the default case of a method paper whose improvements are not forced by prior definitions or fits.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Random masking methods make SSL less effective for medical images due to contextual similarity of neighboring patches leading to information leakage
- domain assumption Hierarchical shifted window (Swin) transformer lacks a global [CLS] token and therefore cannot use advanced masking methods
invented entities (2)
- DAGMaN: no independent evidence
- noisy teacher: no independent evidence
Reference graph
Works this paper leans on
-
[1]
MONAI: An open-source framework for deep learning in healthcare
M. Jorge Cardoso, Wenqi Li, Richard Brown, Nic Ma, Eric Kerfoot, Yiheng Wang, Benjamin Murrey, Andriy Myronenko, Can Zhao, Dong Yang, et al. MONAI: An open-source framework for deep learning in healthcare. arXiv preprint arXiv:2211.02701,
2022
-
[2]
Med3d: Transfer learning for 3d medical image analysis
Sihong Chen, Kai Ma, and Yefeng Zheng. Med3d: Transfer learning for 3d medical image analysis. arXiv preprint arXiv:1904.00625,
2019
-
[3]
Masked autoencoders are scalable vision learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross B. Girshick. Masked autoencoders are scalable vision learners. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15979–15988,
2022
-
[4]
Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation
Yuanfeng Ji, Haotian Bai, Jie Yang, Chongjian Ge, Ye Zhu, Ruimao Zhang, Zhen Li, Lingyan Zhang, Wanling Ma, Xiang Wan, et al. Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation. arXiv preprint arXiv:2206.08023,
2022
-
[5]
Self-supervised 3d anatomy segmentation using self-distilled masked image transformer (smit)
Jue Jiang, Neelam Tyagi, Kathryn Tringale, Christopher Crane, and Harini Veeraraghavan. Self-supervised 3d anatomy segmentation using self-distilled masked image transformer (smit). In Medical Image Computing and Computer Assisted Intervention – MICCAI 2022: 25th International Conference, Singapore, September 18–22, 2022, Proceedings, Part IV, pages 556–566. Springer,
2022
-
[6]
What to hide from your students: Attention-guided masked image modeling
Ioannis Kakogeorgiou, Spyros Gidaris, Bill Psomas, Yannis Avrithis, Andrei Bursuc, Konstantinos Karantzalos, and Nikos Komodakis. What to hide from your students: Attention-guided masked image modeling. In Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXX, pages 300–318. Springer,
2022
-
[7]
Noisy self-knowledge distillation for text summarization
Yang Liu, Sheng Shen, and Mirella Lapata. Noisy self-knowledge distillation for text summarization. In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, Dilek Hakkani-Tur, Iz Beltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty, and Yichao Zhou, editors, Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics,
2021
-
[8]
PyTorch: An imperative style, high-performance deep learning library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zach DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, et al. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32,
2019
-
[9]
Lesion Segmentation in Surgical and Diagnostic Applications: MICCAI 2022 Challenges
Yiming Xiao, Guanyu Yang, and Shuang Song. Lesion Segmentation in Surgical and Diagnostic Applications: MICCAI 2022 Challenges, CuRIOUS 2022, KiPA 2022 and MELA 2022, Held in Conjunction with MICCAI 2022, Singapore, September 18–22, 2022, Proceedings, volume 13648. Springer Nature,
2022