arxiv: 2606.07633 · v1 · pith:WJ7JF5AZnew · submitted 2026-05-31 · 💻 cs.CV · cs.AI

AMN: An Adaptive Multi-Scale Fusion Network with Boundary and Uncertainty Modeling for Nuclei Segmentation

Pith reviewed 2026-06-28 17:11 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords nuclei segmentation · histopathology · multi-scale fusion · transformer CNN hybrid · boundary aware loss · uncertainty modeling · CoNIC benchmark · MoNuSeg

The pith

AMN fuses Swin Transformer and ResNet-50 features through per-channel gating to improve nuclei subtype segmentation over single-encoder baselines.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that combining a transformer encoder for long-range context with a CNN feature pyramid for local texture, then fusing them scale-by-scale with learned per-channel weights, produces more accurate nuclei segmentation than either architecture alone. Training adds boundary emphasis and an uncertainty term to reduce overconfident mistakes on hard classes such as lymphocytes. On the CoNIC benchmark this yields a mean Dice of 0.82 and F1 of 0.68 across seven nuclei types, beating eight published models, and the same weights transfer to MoNuSeg without retraining. The result matters because reliable subtype counts support tumor grading, immune quantification, and prognosis in pathology slides. The authors position the adaptive fusion and uncertainty loss as the elements that close the gap left by pure CNN or pure transformer encoders.

Core claim

AMN is a dual-encoder segmentation framework that jointly leverages a Swin Transformer and a ResNet-50 feature pyramid, fused via a learned per-channel gating mechanism that dynamically weighs each encoder's contribution at every scale. AMN is trained with a multi-objective loss combining class-weighted focal loss, boundary-aware loss with positive-pixel emphasis, and a novel uncertainty-modulated classification term that suppresses overconfident erroneous predictions. On the CoNIC benchmark across seven nuclei classes it reaches a mean Dice of 0.82 and mean F1 of 0.68, outperforming U-Net, ResU-Net, DeepLabV3+, SegNet, ViT-Small, HmsU-Net, ConvFormer-UNet, and BEFUnet, and shows strong gene

What carries the argument

The learned per-channel gating that dynamically weights Swin Transformer and ResNet-50 contributions at each scale, combined with the uncertainty-modulated term in the loss.
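The gate can be sketched in a few lines of NumPy. This is only an illustrative reading of the Figure 2 caption, which specifies global pooling over the concatenated features, an MLP that yields a channel-wise gate α, and the combination f = α ⊙ s + (1 − α) ⊙ c. The use of average pooling, the two-layer MLP with ReLU, the hidden width of 64, the sigmoid, and the random weights are all assumptions made here, not details taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_fusion(s, c, w1, b1, w2, b2):
    """Fuse Swin features s and CNN features c, both shaped (C, H, W)."""
    # Global average pooling over the concatenated channels -> vector of length 2C.
    # (Average pooling is an assumption; the caption only says "global pooling".)
    pooled = np.concatenate([s, c], axis=0).mean(axis=(1, 2))
    # Two-layer MLP (hidden width and ReLU are assumptions) -> one logit per channel.
    hidden = np.maximum(0.0, w1 @ pooled + b1)
    alpha = sigmoid(w2 @ hidden + b2)      # shape (C,), each value in (0, 1)
    alpha = alpha[:, None, None]           # broadcast over H and W
    # f = alpha * s + (1 - alpha) * c, a per-channel convex combination.
    return alpha * s + (1.0 - alpha) * c, alpha

# Hypothetical shapes: 256 channels as in the caption, an 8x8 feature map.
rng = np.random.default_rng(0)
C, H, W, hidden_dim = 256, 8, 8, 64
s = rng.standard_normal((C, H, W))
c = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((hidden_dim, 2 * C)) * 0.01
b1 = np.zeros(hidden_dim)
w2 = rng.standard_normal((C, hidden_dim)) * 0.01
b2 = np.zeros(C)

fused, alpha = adaptive_fusion(s, c, w1, b1, w2, b2)
print(fused.shape, alpha.shape)
```

Because α lies strictly between 0 and 1, every fused value sits between the corresponding Swin and CNN values, so the gate can only shift weight between the two encoders within a channel; it cannot amplify either one.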

If this is right

  • Higher subtype classification accuracy directly improves automated tumor grading and immune infiltrate quantification on whole-slide images.
  • Cross-dataset transfer without retraining indicates the representations are robust to staining and scanner variations common in clinical pathology.
  • Stronger performance on the lymphocyte class suggests the boundary and uncertainty terms help with small or densely packed nuclei that defeat standard losses.
  • Hybrid CNN-transformer designs with scale-specific adaptive fusion can outperform both pure-CNN and pure-transformer segmentation networks on histopathology tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gating pattern could be tested on other paired encoders such as CNN plus vision transformer for non-medical segmentation problems.
  • Uncertainty modulation may reduce the effect of label noise that often occurs when pathologists annotate nuclei subtypes.
  • Extending the framework to three or more encoders would test whether the per-channel weighting generalizes beyond two sources.

Load-bearing premise

The reported gains are produced by the per-channel gating and uncertainty-modulated loss rather than by the choice of the two encoders or by ordinary training of the same backbone combination.

What would settle it

An ablation that removes the gating module and the uncertainty term, retrains the identical dual-encoder backbone with only the remaining loss terms, and measures Dice and F1 on CoNIC; if performance drops to baseline levels the claim holds, otherwise the contribution of the new components is not isolated.
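For concreteness, a per-class Dice score of the kind such an ablation would compare might be computed as below. This is a generic sketch, not the authors' evaluation code: the function name `per_class_dice`, the toy label maps, and the choice to skip classes absent from both maps are assumptions. F1 is not sketched because the summary does not say how it is computed.

```python
import numpy as np

def per_class_dice(pred, target, num_classes):
    """Dice per class for two integer label maps of the same shape."""
    scores = []
    for k in range(num_classes):
        p = pred == k
        t = target == k
        denom = p.sum() + t.sum()
        if denom == 0:
            # Class absent from both maps: excluded from the mean (an assumption).
            scores.append(np.nan)
            continue
        scores.append(2.0 * np.logical_and(p, t).sum() / denom)
    return np.array(scores)

# Toy 3-class example (hypothetical data, not from CoNIC).
target = np.array([[0, 0, 1, 1],
                   [0, 2, 2, 1],
                   [0, 2, 2, 0]])
pred = np.array([[0, 0, 1, 1],
                 [0, 2, 1, 1],
                 [0, 2, 2, 0]])

dice = per_class_dice(pred, target, num_classes=3)
mean_dice = np.nanmean(dice)
print(dice, mean_dice)  # per-class: 1, 6/7, 6/7; mean: 19/21
```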

Figures

Figures reproduced from arXiv: 2606.07633 by Spoorthi M, Suja Palaniswamy.

Figure 2: Adaptive Fusion at level l. Swin and CNN features are projected to 256 channels (s, c) and spatially aligned. Their concatenation is processed via global pooling and an MLP to produce a channel-wise gate α. The fused output f = α ⊙ s + (1 − α) ⊙ c adaptively combines both features. … strides {4, 8, 16, 32}. NHWC outputs are transposed to NCHW before fusion. CNN Encoder. We employed ResNet-50 [16] pre-trained …
Figure 3: Training dynamics of the proposed AMN model. Left: training and …
Figure 4: Per-class F1 (left) and Dice (right) on CoNIC validation for AMN and all eight baseline methods. AMN achieves the highest scores on five of seven …
Figure 5: Ablation results on CoNIC validation. Dice and F1 across progres…
Figure 7: Qualitative results on CoNIC validation. Columns: (a) H&E input, …
Original abstract

Accurate classification of nuclei subtypes in histopathology images is critical for downstream tasks including tumor grading, immune infiltrate quantification, and prognosis prediction. Existing approaches rely on either convolutional or transformer-based encoders in isolation, limiting their ability to simultaneously capture fine-grained local texture and long-range spatial context. We present AMN (Adaptive Multi-Scale Nuclei Network), a dual-encoder segmentation framework that jointly leverages a Swin Transformer and a ResNet-50 feature pyramid, fused via a learned per-channel gating mechanism that dynamically weighs each encoder's contribution at every scale. AMN is trained with a multi-objective loss combining class-weighted focal loss, boundary-aware loss with positive-pixel emphasis, and a novel uncertainty-modulated classification term that suppresses overconfident erroneous predictions. Evaluated on the CoNIC benchmark across seven nuclei classes, AMN achieves a mean Dice of 0.82 and mean F1 of 0.68, with an F1 of 0.67 on the diagnostically challenging lymphocyte class. AMN outperforms eight baseline models spanning pure-CNN, pure-transformer, and recent hybrid architectures: U-Net, ResU-Net, DeepLabV3+, SegNet, ViT-Small, HmsU-Net, ConvFormer-UNet, and BEFUnet. Cross-dataset evaluation on MoNuSeg demonstrates strong generalization without retraining and validating the domain robustness of the learned representations.
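Of the three loss terms named in the abstract, only the focal loss has a standard published form (Lin et al. [18]), so only that term is sketched here. The version below is a minimal NumPy reading of a class-weighted focal loss, −w_y (1 − p_t)^γ log p_t; the function name, γ = 2, and the example class weights are assumptions, and the boundary-aware and uncertainty-modulated terms are left out because this page does not give their definitions.

```python
import numpy as np

def class_weighted_focal_loss(logits, labels, class_weights, gamma=2.0):
    """Mean focal loss; logits (N, K), labels (N,), class_weights (K,)."""
    # Numerically stable log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Log-probability and probability assigned to the true class of each sample.
    log_pt = log_probs[np.arange(len(labels)), labels]
    pt = np.exp(log_pt)
    # (1 - pt)^gamma down-weights samples the model already classifies confidently.
    loss = -class_weights[labels] * (1.0 - pt) ** gamma * log_pt
    return loss.mean()

# Hypothetical inputs: one confident correct prediction, one uncertain one.
logits = np.array([[4.0, 0.0, 0.0],
                   [0.0, 0.5, 0.0]])
labels = np.array([0, 1])
weights = np.array([1.0, 2.0, 1.0])  # hypothetical per-class weights

loss = class_weighted_focal_loss(logits, labels, weights)
print(loss)
```

Setting `gamma=0.0` and all weights to 1 reduces this to ordinary cross-entropy, which is a convenient sanity check.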

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces AMN, a dual-encoder nuclei segmentation network that fuses Swin Transformer and ResNet-50 features via a learned per-channel gating mechanism at multiple scales. It is trained with a composite loss (class-weighted focal, boundary-aware with positive-pixel emphasis, and uncertainty-modulated classification) and reports mean Dice of 0.82 and mean F1 of 0.68 on the CoNIC benchmark across seven nuclei classes, outperforming eight baselines (U-Net, ResU-Net, DeepLabV3+, SegNet, ViT-Small, HmsU-Net, ConvFormer-UNet, BEFUnet). Cross-dataset evaluation on MoNuSeg is claimed to show strong generalization without retraining.

Significance. If the performance gains can be rigorously attributed to the proposed gating and uncertainty components, the work would provide a concrete example of adaptive multi-scale fusion for histopathology segmentation, with potential utility for downstream tasks such as tumor grading and immune quantification. The hybrid encoder design and boundary/uncertainty terms address known challenges in nuclei subtype classification.

major comments (2)
  1. [Experiments section] The manuscript reports superior Dice (0.82) and F1 (0.68) on CoNIC but contains no ablation that trains the identical Swin+ResNet-50 dual-encoder backbone under a standard loss (without per-channel gating or the uncertainty-modulated term). This omission prevents attribution of the gains to the two proposed mechanisms rather than encoder choice, augmentation, or optimization.
  2. [Results on CoNIC and MoNuSeg] No statistical significance tests, standard deviations, or error bars accompany the reported metrics or the outperformance claims versus the eight baselines; the cross-dataset generalization statement likewise lacks quantitative support in the provided summary.
minor comments (1)
  1. The abstract states 'strong generalization' on MoNuSeg 'without retraining' but does not list the exact quantitative metrics or say whether any fine-tuning occurred.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important aspects of experimental validation and statistical reporting. We address each major comment point by point below.

Point-by-point responses
  1. Referee: [Experiments section] The manuscript reports superior Dice (0.82) and F1 (0.68) on CoNIC but contains no ablation that trains the identical Swin+ResNet-50 dual-encoder backbone under a standard loss (without per-channel gating or the uncertainty-modulated term). This omission prevents attribution of the gains to the two proposed mechanisms rather than encoder choice, augmentation, or optimization.

    Authors: We agree that the current experiments do not include an ablation isolating the dual-encoder backbone trained under a standard loss without the gating or uncertainty terms. This limits direct attribution of gains to the proposed components. In the revised manuscript we will add this ablation study, training the identical Swin+ResNet-50 backbone with a baseline loss (e.g., weighted cross-entropy plus Dice) and reporting the resulting metrics for comparison against the full AMN model. revision: yes

  2. Referee: [Results on CoNIC and MoNuSeg] No statistical significance tests, standard deviations, or error bars accompany the reported metrics or the outperformance claims versus the eight baselines; the cross-dataset generalization statement likewise lacks quantitative support in the provided summary.

    Authors: We acknowledge that the absence of statistical tests, standard deviations, and error bars weakens the strength of the reported outperformance. We will revise the results section to include these: standard deviations computed over multiple random seeds, error bars on bar plots, and paired statistical significance tests (e.g., Wilcoxon signed-rank) against each baseline on CoNIC. For the MoNuSeg cross-dataset evaluation we will add the corresponding quantitative Dice and F1 scores to support the generalization claim. revision: yes
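The paired test mentioned in the second response could look roughly like the sketch below: an exact two-sided Wilcoxon signed-rank test written with the standard library only. The per-image Dice values are hypothetical placeholders, not results from the paper, and the choice of images (rather than seeds) as the pairing unit is an assumption. The helper assumes no zero differences and no tied absolute differences.

```python
from itertools import product
from statistics import mean, stdev

def wilcoxon_signed_rank_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.

    Assumes no zero differences and no tied absolute differences.
    Returns (statistic, p_value), where statistic = min(W+, W-).
    """
    diffs = [a - b for a, b in zip(x, y)]
    n = len(diffs)
    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0] * n
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    total = n * (n + 1) // 2
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_stat = min(w_plus, total - w_plus)
    # Under the null hypothesis all 2^n sign patterns are equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(range(1, n + 1), signs) if s)
        if min(w, total - w) <= w_stat:
            extreme += 1
    return w_stat, extreme / 2 ** n

# Hypothetical per-image Dice scores for illustration only.
amn = [0.81, 0.84, 0.79, 0.86, 0.80, 0.83, 0.78, 0.85]
baseline = [0.78, 0.80, 0.80, 0.79, 0.74, 0.81, 0.73, 0.77]

w_stat, p_value = wilcoxon_signed_rank_exact(amn, baseline)
print(f"AMN {mean(amn):.3f} +/- {stdev(amn):.3f}, "
      f"baseline {mean(baseline):.3f} +/- {stdev(baseline):.3f}")
print(w_stat, p_value)  # statistic 1, p = 4/256 = 0.015625
```

The exact enumeration grows as 2^n, so it only suits small samples. With five paired values the smallest attainable two-sided p-value is 2/32 = 0.0625, so a handful of seeds alone cannot reach the usual 0.05 threshold with this test.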

Circularity Check

0 steps flagged

No circularity in empirical architecture and benchmark evaluation

Full rationale

The paper proposes an empirical dual-encoder segmentation network (Swin + ResNet-50 with per-channel gating and uncertainty-modulated loss) and reports measured performance (Dice 0.82, F1 0.68 on CoNIC; generalization on MoNuSeg) against external baselines. No equations, self-definitional loops, fitted-input-as-prediction, or self-citation chains are present that would reduce the reported metrics to quantities defined inside the method itself. The central claims rest on standard train/test splits and independent benchmark datasets rather than any internal reduction or tautology. This is a conventional empirical ML paper whose derivation chain is self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is supplied; no explicit free parameters, axioms, or invented entities can be extracted beyond the high-level description of learned gating weights and loss terms.

Reference graph

Works this paper leans on

28 extracted references · 6 canonical work pages

  [1] B. Fu, Y. Peng, J. He, C. Tian, X. Sun, and R. Wang, "HmsU-Net: A Hybrid Multi-Scale U-Net Based on a CNN and Transformer for Medical Image Segmentation," Computers in Biology and Medicine, vol. 170, p. 108013, Mar. 2024, doi: 10.1016/j.compbiomed.2024.108013
  [2] H. Tang et al., "HTC-Net: A Hybrid CNN-Transformer Framework for Medical Image Segmentation," Biomedical Signal Processing and Control, vol. 88, p. 105605, Feb. 2024, doi: 10.1016/j.bspc.2023.105605
  [3] X. Lin, Z. Yan, X. Deng, C. Zheng, and L. Yu, "ConvFormer: Plug-and-play CNN-style Transformers for Improving Medical Image Segmentation," in Proc. MICCAI, 2023, pp. 642–651
  [4] X. Liu et al., "Enhancing Medical Image Segmentation via Complementary CNN-Transformer Fusion and Boundary Perception," Frontiers in Computer Science, 2025, doi: 10.3389/fcomp.2025.1677905
  [5] W. Yao et al., "From CNN to Transformer: A Review of Medical Image Segmentation Models," Journal of Imaging Informatics in Medicine, vol. 37, no. 4, pp. 1529–1547, Aug. 2024
  [6] Q. Pu et al., "Advantages of Transformer and Its Application for Medical Image Segmentation: A Survey," BioMedical Engineering OnLine, vol. 23, p. 14, Feb. 2024
  [7] A. R. Khan and A. Khan, "Multi-Axis Vision Transformer for Medical Image Segmentation," Engineering Applications of Artificial Intelligence, 2025
  [8] P. Jiang et al., "Hybrid U-Net Model with Visual Transformers for Enhanced Multi-Organ Medical Image Segmentation," Information, vol. 16, no. 2, p. 111, Feb. 2025
  [9] W. Xu, Y.-L. Fu, and D. Zhu, "ResNet and Its Application to Medical Image Processing: Research Progress and Challenges," Computer Methods and Programs in Biomedicine, vol. 240, p. 107660, Oct. 2023
  [10] Z. Wang et al., "Skin Lesion Segmentation Using Atrous Convolution via DeepLab v3," arXiv preprint arXiv:1807.08891, 2018
  [11] M. Krithika (alias AnbuDevi) and K. Suganthi, "Review of Semantic Segmentation of Medical Images Using Modified Architectures of UNet," Diagnostics, vol. 12, no. 12, p. 3064, 2022
  [12] K. Fu et al., "A Survey of Vision Transformer Derivatives for Medical Image Segmentation," arXiv preprint arXiv:2205.11239, 2022
  [13] Z. Liu et al., "Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows," in Proc. IEEE/CVF ICCV, 2021, pp. 10012–10022
  [14] S. Graham et al., "CoNIC: Colon Nucleus Identification and Counting Challenge," in Proc. MICCAI, 2021
  [15] N. Kumar et al., "A Multi-Organ Nucleus Segmentation Challenge," IEEE Transactions on Medical Imaging, 2019
  [16] K. He et al., "Deep Residual Learning for Image Recognition," in Proc. IEEE CVPR, 2016
  [17] T.-Y. Lin et al., "Feature Pyramid Networks for Object Detection," in Proc. IEEE CVPR, 2017
  [18] T.-Y. Lin et al., "Focal Loss for Dense Object Detection," in Proc. IEEE ICCV, 2017
  [19] A. Kendall, Y. Gal, and R. Cipolla, "Multi-Task Learning Using Uncertainty to Weigh Losses," in Proc. IEEE CVPR, 2018
  [20] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," in Proc. MICCAI, 2015
  [21] L.-C. Chen et al., "Encoder-Decoder with Atrous Separable Convolution for Semantic Image Segmentation," in Proc. ECCV, 2018
  [22] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017
  [23] A. Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," in Proc. ICLR, 2021
  [24] W. Wang, Y. Luo, and X. Wang, "BefNet: A Hybrid CNN-Mamba Architecture for Accurate Skin Lesion Image Segmentation," in Proc. IEEE BIBM, 2024, pp. 3795–3798
  [25] K. Afnaan, K. L. S. P. Reddy, K. P. Dharmaraj, K. Ajith, T. Singh, and K. Hushme, "Deep Learning for Enhanced Delineation and Classification in Brain MRI Images," in IFIP Advances in Information and Communication Technology, Springer Nature Switzerland, 2025. https://doi.org/10.1007/978-3-031-98356-6_11
  [26] K. Afnaan, S. Palaniswamy, T. Singh, and B. Prakash, "VisioRenalNet: Spatial Vision Transformer UNet for Enhanced T2-Weighted Kidney MRI Segmentation," in Proc. ICMLDE, Procedia Computer Science, vol. 235, 2024, pp. 1674–1683
  [27] M. Satish and S. Palaniswamy, "Image Super-Resolution by Augmentation of Region Information by Rapid Segmentation," in Applied Soft Computing and Communication Networks (ACN 2023), Lecture Notes in Networks and Systems, vol. 966, Springer, Singapore, 2024. https://doi.org/10.1007/978-981-97-2004-0_27
  [28] B. S. Devi, R. P. Singh, and S. Palaniswamy, "Enhancing Aerial Ship Segmentation: Attention-Based U-Net Optimization with Reduced Resolution," in Proc. 6th Int. Conf. Emerging Technology (INCET), Belgaum, India, 2025, pp. 1–6. https://doi.org/10.1109/INCET64471.2025.11139870