pith. sign in

arxiv: 2606.20027 · v1 · pith:NNCPELOLnew · submitted 2026-06-18 · 💻 cs.CV

QG-MIL: A Gated Transformer Aggregator for Domain-Agnostic Multiple Instance Learning in Medical Imaging

Pith reviewed 2026-06-26 18:24 UTC · model grok-4.3

classification 💻 cs.CV
keywords multiple instance learningmedical imagingtransformer aggregatorattention mechanismspathologyhematologydomain adaptation
0
0 comments X

The pith

QG-MIL uses four transformer components to reduce attention concentration and improve stability in medical multiple instance learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Attention-based multiple instance learning models for medical images often focus too narrowly on a few instances, which produces overconfident and unstable results. The paper introduces QG-MIL, a gated transformer aggregator built from RMSNorm pre-normalization, per-head QK normalization, fine-grained attention output gating, and SwiGLU-style feed-forward modules. These four elements work together to spread attention more evenly across instances during training. The resulting models are tested on six benchmarks that cover whole-slide pathology images and cell-level hematology data. The best QG-MIL versions beat prior leading methods on every benchmark while showing lower variance across domains.

Core claim

QG-MIL is a gated transformer aggregator for multiple instance learning that combines RMSNorm-based pre-normalization, per-head QK normalization, fine-grained attention output gating, and SwiGLU-style feed-forward modules. These components together stabilize training and produce more uniform attention weights across instances without auxiliary losses, masking, or multi-stage regularization. When evaluated on six benchmarks spanning whole-slide pathology and cell-level hematology, the best QG-MIL variants outperform leading baselines on all tasks with an average gain of 6.1 mean macro F1 points and show more distributed instance weighting in attention analysis.

What carries the argument

The QG-MIL gated transformer aggregator, which applies the four listed normalization and gating mechanisms inside each transformer block to control how attention is computed and passed forward.

If this is right

  • The best QG-MIL variants exceed leading baselines on all six benchmarks with an average +6.1 mean macro F1 improvement.
  • Attention overlays and mass analysis show more uniform weighting across instances than in prior aggregators.
  • The full combination of the four components yields the most consistent cross-domain results and the smallest variance compared with selected baselines.
  • Individual components can reach the full model's score on some single datasets, yet only the complete design maintains performance across all six tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same normalization and gating pattern could be tested in non-medical MIL settings where attention concentration is also observed.
  • The design suggests that internal transformer adjustments can sometimes substitute for external regularization techniques in attention models.
  • Releasing the configurable code makes it straightforward to apply the aggregator to new medical imaging datasets or to ablate the components further.

Load-bearing premise

The observed gains in accuracy and stability come from the four architectural components rather than from differences in hyperparameter tuning or training schedules.

What would settle it

Retraining the baseline models under identical hyperparameter, schedule, and regularization conditions as the QG-MIL runs and finding that the performance gap disappears would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.20027 by Andrea Loddo, Carsten Marr, Cecilia Di Ruberto, Davide Antonio Mura, Luca Zedda, Maurizio Atzori, Muhammed Furkan Dasdelen.

Figure 1
Figure 1. Figure 1: Mean top-10 attention mass extracted from QG-MIL and ABMIL models on the Breast dataset test set. [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Qualitative attention overlays on example cases. Left: sample whole slide image. Middle: normalized attention [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
read the original abstract

Attention-based Multiple Instance Learning aggregators in medical imaging are prone to attention concentration, producing overconfident and unstable predictions. We introduce QG-MIL, a gated transformer aggregator that addresses this through four synergistic architectural components: RMSNorm-based pre-normalization, per-head QK normalization, fine-grained attention output gating, and SwiGLU-style feed-forward modules. Together, these design choices stabilize training and distribute attention more uniformly across instances without auxiliary losses, masking, or multi-stage regularization. We evaluate QG-MIL across six benchmarks spanning whole-slide pathology and cell-level hematology, covering two fundamentally different MIL scales. The best-performing QG-MIL variants outperform leading baselines on all six benchmarks, with an average improvement of +6.1 mean macro F1 points. Attention overlays and attention mass analysis confirm more distributed instance weighting. Ablation studies show that while individual components can match the full model on specific datasets, the QG-MIL design provides the most consistent cross-domain performance and tightest variance when compared to selected baselines. We release a configurable implementation to support reproducibility at: https://github.com/unica-visual-intelligence-lab/QG-MIL

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces QG-MIL, a gated transformer aggregator for multiple instance learning in medical imaging. It proposes four components—RMSNorm-based pre-normalization, per-head QK normalization, fine-grained attention output gating, and SwiGLU-style FFN modules—to stabilize training and promote distributed attention without auxiliary losses, masking, or multi-stage regularization. Evaluated on six benchmarks spanning whole-slide pathology and cell-level hematology, the best QG-MIL variants are claimed to outperform leading baselines on all datasets with an average +6.1 mean macro F1 improvement; attention overlays and mass analysis support more uniform instance weighting, while ablations show consistent cross-domain performance. A configurable implementation is released on GitHub.

Significance. If the gains prove attributable to the architectural choices under matched optimization, the work could provide a useful domain-agnostic MIL aggregator that improves stability and interpretability in medical imaging without extra training stages. The release of reproducible code is a clear strength that supports verification and extension.

major comments (2)
  1. [Abstract] Abstract: the central claim of +6.1 mean macro F1 improvement and reduced variance due to the four components lacks error bars, standard deviations across runs, or statistical significance tests, preventing assessment of whether the reported lift exceeds baseline variability.
  2. [Abstract] Abstract and implied Experiments section: the attribution of performance and stability gains to RMSNorm pre-norm, per-head QK normalization, output gating, and SwiGLU FFN is load-bearing for the headline result, yet no evidence is provided that the leading baselines received equivalent hyperparameter search budgets, learning-rate schedules, or regularization strength; internal ablations do not address this confound.
minor comments (2)
  1. [Abstract] The abstract should name the specific leading baselines and report the number of independent runs underlying the mean F1 scores.
  2. Notation for the four components would benefit from explicit equations in the methods to allow precise reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback and for recognizing the potential utility of QG-MIL as a domain-agnostic aggregator. We address each major comment below and commit to revisions that strengthen the empirical support for our claims.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim of +6.1 mean macro F1 improvement and reduced variance due to the four components lacks error bars, standard deviations across runs, or statistical significance tests, preventing assessment of whether the reported lift exceeds baseline variability.

    Authors: We agree that the absence of error bars, run-wise standard deviations, and formal significance tests limits the strength of the headline claims. Although the manuscript already states that QG-MIL yields the tightest variance among compared methods, we will add these elements in the revision: results from five independent random seeds per method, mean ± std tables, and paired statistical tests (Wilcoxon signed-rank or t-tests with Bonferroni correction) to quantify whether the +6.1 average improvement exceeds baseline variability. revision: yes

  2. Referee: [Abstract] Abstract and implied Experiments section: the attribution of performance and stability gains to RMSNorm pre-norm, per-head QK normalization, output gating, and SwiGLU FFN is load-bearing for the headline result, yet no evidence is provided that the leading baselines received equivalent hyperparameter search budgets, learning-rate schedules, or regularization strength; internal ablations do not address this confound.

    Authors: We acknowledge that the manuscript does not demonstrate exhaustive, matched hyperparameter search budgets across all baselines. Baselines were re-implemented from their original papers and tuned using the hyperparameter ranges and schedules reported in those works plus standard MIL practices for the respective datasets. To address the concern, the revised manuscript will include an expanded experimental appendix that enumerates every hyperparameter, optimizer setting, and regularization choice used for each baseline and QG-MIL variant. We will also clarify that the internal component ablations isolate the contribution of each architectural choice inside the same training pipeline, while the cross-domain consistency and variance reduction remain the primary empirical arguments. Full re-optimization of every baseline under an identical search budget would require substantial additional compute and is not feasible within the current study; we will therefore note this limitation explicitly. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external benchmarks

full rationale

The paper advances an empirical architecture (QG-MIL) and reports performance on six held-out medical imaging benchmarks. No derivation chain, first-principles prediction, or fitted quantity is presented that reduces the claimed +6.1 macro-F1 improvement or attention-distribution results to a quantity defined inside the model by construction. The four components are introduced as design choices and validated via ablation and external comparison; no self-citation supplies a uniqueness theorem or ansatz that the main result depends upon. The reader's assessment that no equations collapse the improvement to internal fitted parameters is accurate, placing the work in the normal non-circular category for an empirical ML paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

This is an empirical architecture paper; the central claim rests on standard deep-learning training assumptions and benchmark evaluation rather than new free parameters, axioms, or invented entities.

axioms (1)
  • domain assumption Standard assumptions of supervised deep learning (i.i.d. train/test splits, gradient-based optimization, macro F1 as appropriate metric)
    Invoked implicitly when reporting benchmark results and ablation studies.

pith-pipeline@v0.9.1-grok · 5766 in / 1342 out tokens · 26680 ms · 2026-06-26T18:24:26.345092+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

20 extracted references · 10 canonical work pages

  1. [1]

    Attention-based Deep Multiple Instance Learning

    Maximilian Ilse, Jakub Tomczak, and Max Welling. Attention-based Deep Multiple Instance Learning. In Proceedings of the 35th International Conference on Machine Learning, July 2018. URLhttps://proceedings. mlr.press/v80/ilse18a.html

  2. [2]

    Attention-Challenging Multiple Instance Learning for Whole Slide Image Classification

    Yunlong Zhang, Honglin Li, Yunxuan Sun, Sunyi Zheng, Chenglu Zhu, and Lin Yang. Attention-Challenging Multiple Instance Learning for Whole Slide Image Classification. In Ales Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors,Computer Vision - ECCV 2024 - 18th European Conference, Milan, Italy, September 29-Octo...

  3. [3]

    Contrastive learning for compact single image dehazing,

    Bin Li, Yin Li, and Kevin W. Eliceiri. Dual-stream Multiple Instance Learning Network for Whole Slide Image Classification with Self-supervised Contrastive Learning. In2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 14313–14323, Nashville, TN, USA, June 2021. IEEE. ISBN 978-1-6654- 4509-2.10.1109/CVPR46437.2021.01409. URL...

  4. [4]

    Bi-directional weakly supervised knowledge distillation for whole slide image classification.Advances in Neural Information Processing Systems, 35:15368–15381, 2022

    Linhao Qu, Manning Wang, Zhijian Song, et al. Bi-directional weakly supervised knowledge distillation for whole slide image classification.Advances in Neural Information Processing Systems, 35:15368–15381, 2022

  5. [5]

    URL https://doi.org/10.1109/ ICCV51070.2023.00008

    Wenhao Tang, Sheng Huang, Xiaoxian Zhang, Fengtao Zhou, Yi Zhang, and Bo Liu. Multiple Instance Learning Framework with Masked Hard Instance Mining for Whole Slide Image Classification. In2023 IEEE/CVF International Conference on Computer Vision (ICCV), pages 4055–4064, Paris, France, October 2023. IEEE. ISBN 979-8-3503-0718-4. 10.1109/ICCV51070.2023.0037...

  6. [6]

    Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free, May 2025

    Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free, May 2025. URL http://arxiv.org/abs/2505.06708. arXiv:2505.06708 [cs]

  7. [7]

    Clinical-grade computational pathology using weakly supervised deep learning on whole slide images.Nature medicine, 25(8):1301–1309, 2019

    Gabriele Campanella, Matthew G Hanna, Luke Geneslaw, Allen Miraflor, Vitor Werneck Krauss Silva, Klaus J Busam, Edi Brogi, Victor E Reuter, David S Klimstra, and Thomas J Fuchs. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images.Nature medicine, 25(8):1301–1309, 2019

  8. [8]

    Lunghist700: A dataset of histological images for deep learning in pulmonary pathology.Scientific Data, 11(1):1088, 2024

    Jorge Diosdado, Pere Gilabert, Santi Seguí, and Henar Borrego. Lunghist700: A dataset of histological images for deep learning in pulmonary pathology.Scientific Data, 11(1):1088, 2024. 10.1038/s41597-024-03944-3 . URLhttps://doi.org/10.1038/s41597-024-03944-3

  9. [9]

    A clinical prostate biopsy dataset with undetected cancer.Scientific Data, 12(1):423, 2025

    Eduard Chelebian, Christophe Avenel, Helena Järemo, Pernilla Andersson, Carolina Wählby, and An- ders Bergh. A clinical prostate biopsy dataset with undetected cancer.Scientific Data, 12(1):423, 2025. 10.1038/s41597-025-04758-7. URLhttps://doi.org/10.1038/s41597-025-04758-7

  10. [10]

    Chen, Tong Ding, Ming Y

    Richard J. Chen, Tong Ding, Ming Y . Lu, and et al. Towards a general-purpose foundation model for computational pathology.Nature Medicine, 30:850–862, 2024. 10.1038/s41591-024-02857-3 . URL https://doi.org/ 10.1038/s41591-024-02857-3

  11. [11]

    A whole-slide foundation model for digital pathology from real-world data

    Haotian Xu, Naoto Usuyama, Jatin Bagga, and et al. A whole-slide foundation model for digital pathology from real-world data.Nature, 630:181–188, 2024. 10.1038/s41586-024-07441-w . URL https://doi.org/10. 1038/s41586-024-07441-w

  12. [12]

    and Chen, Bowen and Williamson, Drew F

    Ming Y . Lu, Bowen Chen, Drew F. K. Williamson, and et al. A visual-language foundation model for computational pathology.Nature Medicine, 30:863–874, 2024. 10.1038/s41591-024-02856-4 . URL https://doi.org/ 10.1038/s41591-024-02856-4

  13. [13]

    Deep residual learning for image recognition, 2015

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015. URLhttps://arxiv.org/abs/1512.03385

  14. [14]

    Explainable ai identifies diagnostic cells of genetic aml subtypes.PLOS Digital Health, 2, 2023

    Matthias Hehr, Ario Sadafi, Christian Matek, Peter Lienemann, Christian Pohlkamp, Torsten Haferlach, Karsten Spiekermann, and Carsten Marr. Explainable ai identifies diagnostic cells of genetic aml subtypes.PLOS Digital Health, 2, 2023. URLhttps://api.semanticscholar.org/CorpusID:257534165. 7 PRIME AI paper

  15. [15]

    Sidhom, I

    John W. Sidhom, I. J. Siddarthan, Brian S. Lai, and et al. Deep learning for diagnosis of acute promyelocytic leukemia via recognition of genomically imprinted morphologic features.npj Precision Oncology, 5:38, 2021. 10.1038/s41698-021-00179-y

  16. [16]

    Transformer-based hematological malignancy prediction from peripheral blood smears in a real-world cohort, 2025

    Muhammed Furkan Dasdelen, Ivan Kukuljan, Peter Lienemann, Fatih Ozlugedik, Ario Sadafi, Matthias Hehr, Karsten Spiekermann, Christian Pohlkamp, and Carsten Marr. Transformer-based hematological malignancy prediction from peripheral blood smears in a real-world cohort, 2025. URL https://arxiv.org/abs/2509. 20402

  17. [17]

    Wagner, Salome Kazeminia, Ece Sancar, Matthias Hehr, Julia A

    Valentin Koch, Sophia J. Wagner, Salome Kazeminia, Ece Sancar, Matthias Hehr, Julia A. Schnabel, Tingying Peng, and Carsten Marr. DinoBloom: A Foundation Model for Generalizable Cell Embeddings in Hematology . Inproceedings of Medical Image Computing and Computer Assisted Intervention – MICCAI 2024, volume LNCS 15012. Springer Nature Switzerland, October 2024

  18. [18]

    Lu, Tong Ding, , and Faisal Mahmood

    Daniel Shao, Richard J Chen, Andrew H Song, Joel Runevic, Ming Y . Lu, Tong Ding, , and Faisal Mahmood. Do multiple instance learning models transfer? InInternational conference on machine learning, 2025

  19. [19]

    Cost-sensitive multi-kernel elm based on reduced expectation kernel auto-encoder.PLOS ONE, 20 (2):1–20, 02 2025

    Liang Yixuan. Cost-sensitive multi-kernel elm based on reduced expectation kernel auto-encoder.PLOS ONE, 20 (2):1–20, 02 2025. 10.1371/journal.pone.0314851. URL https://doi.org/10.1371/journal.pone. 0314851

  20. [20]

    Explainable AI identifies diagnostic cells of genetic AML subtypes.PLOS digital health, 2023

    Matthias Hehr, Ario Sadafi, Christian Matek, Peter Lienemann, Christian Pohlkamp, Torsten Haferlach, Karsten Spiekermann, and Carsten Marr. Explainable AI identifies diagnostic cells of genetic AML subtypes.PLOS digital health, 2023. URLhttps://pmc.ncbi.nlm.nih.gov/articles/PMC10016704/. 8