Patch-MoE Mamba: A Patch-Ordered Mixture-of-Experts State Space Architecture for Medical Image Segmentation

Bin Fu; Diego Adame; Dongchul Kim; Erik Enriquez; Fabian Vazquez; Haoteng Tang; Huimin Li; Jinghao Yang; Jose A. Nunez; Pengfei Gu

arxiv: 2605.17719 · v1 · pith:F2TYA3ODnew · submitted 2026-05-18 · 💻 cs.CV


Pith reviewed 2026-05-19 21:51 UTC · model grok-4.3

classification 💻 cs.CV

keywords medical image segmentation · state space models · Mamba · mixture of experts · patch-ordered scanning · polyp segmentation · skin lesion segmentation


The pith

Patch-MoE Mamba addresses limitations in Mamba models by using hierarchical patch-ordered scanning and mixture-of-experts fusion for medical image segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to improve medical image segmentation using state space models by fixing two issues in current Mamba-based approaches. Pixel-wise scanning disrupts local spatial structures in 2D images, and fixed summation of different scan directions fails to handle varying object shapes well. The proposed Patch-MoE Mamba uses a hierarchical patch-ordered scanning to keep local neighborhoods intact while getting multi-scale context, and an MoE module that adaptively fuses outputs from directional experts. This is tested on polyp and skin lesion datasets to show better results with efficient computation.

Core claim

The central discovery is that a patch-ordered mixture-of-experts state space architecture can effectively model long-range dependencies in medical images while preserving local 2D spatial structure through hierarchical scanning and adaptive directional fusion using four directional experts, a learnable concatenation expert, and residual aggregation.

What carries the argument

Hierarchical patch-ordered scanning mechanism that processes image patches in an ordered way to maintain spatial neighborhoods, combined with MoE-based directional fusion module that adaptively combines multiple Mamba scanner outputs.
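To make the idea of patch-ordered scanning concrete, the following is a minimal sketch of one level of such an ordering, set against a plain row-major pixel scan. It is not the authors' implementation: the function names, the row-major ordering of patches, and the row-major ordering of pixels inside each patch are all assumptions made for illustration. A hierarchical variant could, under the same assumptions, repeat the call with several patch sizes.

```python
def row_major_order(height, width):
    """Pixel-wise scan: visit pixels left to right, top to bottom."""
    return [(r, c) for r in range(height) for c in range(width)]


def patch_ordered_scan(height, width, patch):
    """Hypothetical single-level patch-ordered scan.

    Patches are visited in row-major order, and the pixels inside each
    patch are also visited in row-major order, so every run of
    patch * patch consecutive tokens covers one compact 2D block.
    """
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by patch")
    order = []
    for patch_row in range(0, height, patch):
        for patch_col in range(0, width, patch):
            for r in range(patch_row, patch_row + patch):
                for c in range(patch_col, patch_col + patch):
                    order.append((r, c))
    return order


pixel_order = row_major_order(4, 4)
patch_order = patch_ordered_scan(4, 4, patch=2)
print(patch_order[:4])  # first patch: the top-left 2x2 block
```

Both orderings visit every pixel exactly once; only the sequence in which the state space model would see them differs.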

If this is right

Better performance on polyp segmentation benchmarks by preserving local structure.
Improved adaptability to diverse object sizes and boundaries in skin lesion segmentation.
Linear sequence complexity maintained while capturing multi-scale context.
Generality demonstrated across five polyp datasets and ISIC skin lesion datasets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This approach might allow state space models to be applied more broadly in computer vision tasks requiring spatial awareness.
Future work could explore extending the patch ordering to 3D medical volumes for volumetric segmentation.
The MoE fusion could be applied to other multi-directional scanning problems in sequence modeling.

Load-bearing premise

The hierarchical patch-ordered scanning mechanism preserves local spatial neighborhoods while capturing multi-scale context better than standard pixel-wise directional scanning.
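One way to probe this premise on a toy grid is to ask how spatially compact each run of consecutive tokens is under the two orderings. The sketch below is an illustrative proxy, not a metric taken from the paper: the helper `mean_chunk_spread`, the chunk size, and the use of Chebyshev distance are all assumptions.

```python
def row_major_order(height, width):
    """Pixel-wise scan: left to right, top to bottom."""
    return [(r, c) for r in range(height) for c in range(width)]


def patch_ordered_scan(height, width, patch):
    """Hypothetical patch-ordered scan (row-major over patches, then pixels)."""
    return [
        (r, c)
        for patch_row in range(0, height, patch)
        for patch_col in range(0, width, patch)
        for r in range(patch_row, patch_row + patch)
        for c in range(patch_col, patch_col + patch)
    ]


def mean_chunk_spread(order, chunk):
    """Average spatial spread of consecutive, non-overlapping token chunks.

    The spread of a chunk is the largest Chebyshev distance between any
    two of its pixels, which equals max(row range, column range).
    """
    spreads = []
    for start in range(0, len(order), chunk):
        pixels = order[start:start + chunk]
        rows = [r for r, _ in pixels]
        cols = [c for _, c in pixels]
        spreads.append(max(max(rows) - min(rows), max(cols) - min(cols)))
    return sum(spreads) / len(spreads)


row_spread = mean_chunk_spread(row_major_order(4, 4), chunk=4)
patch_spread = mean_chunk_spread(patch_ordered_scan(4, 4, patch=2), chunk=4)
print(row_spread, patch_spread)  # 3.0 1.0
```

On this 4x4 grid, four consecutive tokens span a full row under the pixel-wise scan but only a 2x2 block under the patch-ordered scan. Whether that compactness translates into better segmentation is the empirical question the paper's experiments would need to answer.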

What would settle it

Running the model on the same five polyp benchmarks and ISIC datasets and finding no improvement in standard metrics such as Dice coefficient or IoU compared to prior Mamba models would falsify the effectiveness claim.
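For reference, the two metrics named above can be computed from binary masks as in the sketch below. The function name and the convention of returning 1.0 when both masks are empty are assumptions; the evaluation code for the paper may handle edge cases and averaging differently.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and IoU for two flat binary masks of equal length."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    pred_area = sum(1 for p in pred if p)
    target_area = sum(1 for t in target if t)
    union = pred_area + target_area - intersection
    # Assumed convention: two empty masks count as a perfect match.
    dice = 2 * intersection / (pred_area + target_area) if union else 1.0
    iou = intersection / union if union else 1.0
    return dice, iou


dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
print(dice, iou)
```

In this example the masks share one foreground pixel out of two each, so Dice is 2·1/(2+2) = 0.5 and IoU is 1/3.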

Figures

Figures reproduced from arXiv: 2605.17719 by Bin Fu, Diego Adame, Dongchul Kim, Erik Enriquez, Fabian Vazquez, Haoteng Tang, Huimin Li, Jinghao Yang, Jose A. Nunez, Pengfei Gu.

**Figure 1.** (a) Overview of the proposed Patch-MoE Mamba architecture. (b) Structure of the Patch-MoE Visual State Space (VSS) block. (c) Structure of the … (caption truncated in source)

**Figure 2.** Illustration of the patch-ordered scanning method on a … (caption truncated in source)

**Figure 3.** Overview of the proposed MoE-based directional fusion module.

**Figure 4.** Visual examples of segmentation results.

Original abstract

CNN- and Transformer-based architectures have achieved strong performance in medical image segmentation, but CNNs are limited in modeling long-range dependencies, while Transformers often suffer from quadratic computational and memory complexity. State space models, especially Mamba-based networks, offer an efficient alternative with linear sequence complexity. However, existing Mamba segmentation models still face two limitations: pixel-wise directional scanning can disrupt local 2D spatial structure, and simple summation-based fusion of scan directions cannot adapt well to diverse object sizes, shapes, and boundaries. To address these issues, we propose Patch-MoE Mamba, a patch-ordered mixture-of-experts state space architecture for medical image segmentation. It introduces a hierarchical patch-ordered scanning mechanism that preserves local spatial neighborhoods while capturing multi-scale context, and an MoE-based directional fusion module that adaptively combines multiple Mamba scanner outputs using four directional experts, a learnable concatenation expert, and residual directional aggregation. Experiments on five public polyp segmentation benchmarks and the ISIC 2017/2018 skin lesion segmentation datasets demonstrate the effectiveness and generality of Patch-MoE Mamba.
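The fusion module described in the abstract can be pictured with the following sketch for a single token. It is an illustration under stated assumptions, not the paper's module: the abstract does not specify how the gate is computed, what each directional expert does internally, or how the concatenation expert is parameterized. Here the directional experts pass their scan output through unchanged, the concatenation expert is a single linear projection, and the gate is a softmax over five supplied logits.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_directional_fusion(direction_feats, gate_logits, concat_weights):
    """Hypothetical fusion of four directional scan outputs for one token.

    direction_feats: four feature vectors of length d, one per scan direction.
    gate_logits:     five logits (four directional experts + concat expert).
    concat_weights:  d x (4 * d) matrix projecting the concatenation back to d.
    """
    d = len(direction_feats[0])
    gates = softmax(gate_logits)
    # Concatenation expert (assumed): linear map of the concatenated features.
    concat = [v for feat in direction_feats for v in feat]
    concat_out = [sum(w * x for w, x in zip(row, concat)) for row in concat_weights]
    experts = list(direction_feats) + [concat_out]
    # Gate-weighted mixture of the five expert outputs.
    mixed = [sum(g * e[i] for g, e in zip(gates, experts)) for i in range(d)]
    # Residual directional aggregation (assumed): plain sum of the directions.
    residual = [sum(feat[i] for feat in direction_feats) for i in range(d)]
    return [m + r for m, r in zip(mixed, residual)]


feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
zero_projection = [[0.0] * 8 for _ in range(2)]
fused = moe_directional_fusion(feats, [0.0] * 5, zero_projection)
print(fused)
```

With uniform gates and a zero projection, the output is the plain directional sum plus a fifth of it per contributing expert. Setting the gate contribution aside recovers the fixed summation that the abstract identifies as a limitation of earlier models.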

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Patch-MoE Mamba offers a targeted fix for locality loss and rigid fusion in Mamba segmentation models via hierarchical patch scanning and a specific MoE directional module.


The main point is that this paper combines hierarchical patch-ordered scanning with an MoE-based fusion of directional Mamba outputs to handle medical image segmentation more effectively than prior state-space approaches. It directly tackles two stated problems: pixel-wise scans breaking 2D neighborhoods and simple summation failing to adapt to varying object shapes and sizes. The architecture uses four directional experts, a learnable concatenation expert, and residual aggregation on top of the patch mechanism. This specific pairing does not appear in the cited prior work, so the combination counts as new rather than incremental re-use of existing equations. The motivation section maps cleanly onto the proposed modules, which is a strength. The evaluation covers five polyp benchmarks plus ISIC 2017/2018, which are reasonable public datasets for this task and give some sense of generality. The paper therefore supplies a concrete alternative for anyone looking for linear-complexity segmentation backbones that still respect local spatial structure.

On the weaker side, the abstract gives no concrete metrics, no list of exact baselines, and no mention of ablations or statistical tests. Without those details it is difficult to judge whether the reported gains are robust or sensitive to post-hoc choices in routing and aggregation. The central assumption that patch ordering preserves neighborhoods better than pixel-wise scans also needs the full experimental section to hold up. Minor implementation details such as the exact number of experts and how the learnable concatenation is trained could affect reproducibility.

This paper is mainly for researchers working on efficient vision models for medical imaging who already follow Mamba or state-space variants. A reader building new segmentation architectures would find the scanning order and expert fusion sections worth examining even if the final numbers require closer checking.
I would send it to peer review because the motivation is clear, the architecture is defined independently of the results, and the datasets are external and public. The experimental claims can be pressure-tested in review without needing major new data collection.

Referee Report

0 major / 3 minor

Summary. The manuscript proposes Patch-MoE Mamba, a Mamba-based architecture for medical image segmentation. It introduces a hierarchical patch-ordered scanning mechanism intended to preserve local 2D spatial neighborhoods while capturing multi-scale context, and an MoE-based directional fusion module that adaptively combines outputs from four directional experts, a learnable concatenation expert, and residual aggregation. The central claim is that these components address limitations of pixel-wise scanning and non-adaptive fusion in prior Mamba segmentation models, with effectiveness demonstrated on five public polyp segmentation benchmarks and the ISIC 2017/2018 skin lesion datasets.

Significance. If the reported gains hold under rigorous controls, the work offers a concrete advance in efficient (linear-complexity) segmentation models for medical imaging by directly targeting the locality disruption and fusion rigidity of existing Mamba approaches. Evaluation across multiple public benchmarks supports generality claims and enables direct comparison; the explicit architectural choices (patch ordering, expert routing) are falsifiable and could be adopted or extended by others.

minor comments (3)

[Abstract] Abstract: quantitative results (Dice, IoU, etc.), exact baselines, and statistical significance are referenced only qualitatively; adding one sentence with key numbers would strengthen the summary.
[§3.2] The hierarchical patch-ordered scanning is described at a high level; a small diagram or pseudocode in §3.2 would clarify how patch ordering differs from standard directional scans while preserving neighborhoods.
[Experiments] Table captions and axis labels in the experimental section should explicitly state the evaluation metric (e.g., mean Dice) and whether results are averaged over multiple runs.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of our work and the recommendation for minor revision. We appreciate the acknowledgment that the hierarchical patch-ordered scanning and MoE-based directional fusion target key limitations in prior Mamba segmentation models, with evaluation on multiple public benchmarks supporting the claims.

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain


The paper defines its Patch-MoE Mamba architecture independently by introducing a hierarchical patch-ordered scanning mechanism to preserve local 2D neighborhoods and an MoE-based directional fusion module with four directional experts plus learnable concatenation. These components directly address the stated limitations of pixel-wise scanning and non-adaptive fusion in prior Mamba models. The central effectiveness claim rests on experimental validation across five external public polyp segmentation benchmarks and ISIC 2017/2018 datasets, with no reduction of predictions to fitted inputs, no load-bearing self-citations, and no self-definitional loops in the architecture equations or motivation. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The proposal rests on standard deep-learning assumptions about spatial structure in images and the benefits of adaptive fusion; no new physical entities or ungrounded constants are introduced.

free parameters (2)

Number of directional experts
Fixed at four in the MoE module description.
MoE routing and aggregation parameters
Learnable parameters for concatenation expert and residual directional aggregation.

axioms (2)

standard math: Mamba state space models provide linear sequence complexity for long-range modeling
Invoked when contrasting with quadratic Transformer complexity.
domain assumption: Pixel-wise directional scanning disrupts local 2D spatial structure
Stated as a core limitation of existing Mamba segmentation models.
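The linear-complexity axiom can be illustrated with a toy state space recurrence: each token updates a fixed-size hidden state once, so the cost grows linearly with sequence length. The sketch below uses a scalar state and fixed parameters `a`, `b`, `c`, which are simplifying assumptions; Mamba itself uses input-dependent (selective) parameters and vector-valued states.

```python
def ssm_scan(inputs, a, b, c):
    """Toy discrete state space model, one pass over the sequence.

    h_t = a * h_{t-1} + b * x_t
    y_t = c * h_t
    """
    hidden = 0.0
    outputs = []
    for x in inputs:  # one constant-cost update per token
        hidden = a * hidden + b * x
        outputs.append(c * hidden)
    return outputs


impulse_response = ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0)
print(impulse_response)  # [1.0, 0.5, 0.25]
```

The decaying response to a single input shows how earlier tokens keep influencing later outputs through the hidden state, without the pairwise token comparisons that make self-attention quadratic.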



Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · 2 internal anchors

[1] D.-P. Fan et al., “PraNet: Parallel reverse attention network for polyp segmentation,” in MICCAI, 2020, pp. 263–273.

[2] N. Tajbakhsh et al., “Automated polyp detection in colonoscopy videos using shape and context information,” IEEE Transactions on Medical Imaging, vol. 35, no. 2, pp. 630–644, 2015.

[3] Y. Zhang et al., “Keep your friends close & enemies farther: Debiasing contrastive learning with spatial priors in 3D radiology images,” in BIBM, 2022, pp. 1824–1829.

[4] D. An et al., “Sli2vol+: Segmenting 3D medical images based on an object estimation guided correspondence flow network,” in WACV, 2025, pp. 3624–3634.

[5] O. Ronneberger et al., “U-Net: Convolutional networks for biomedical image segmentation,” in MICCAI, 2015, pp. 234–241.

[6] P. Gu et al., “Self pre-training with topology- and spatiality-aware masked autoencoders for 3D medical image segmentation,” in BIBM, 2025, pp. 3608–3613.

[7] B. Dong et al., “Polyp-PVT: Polyp segmentation with pyramid vision Transformers,” arXiv preprint arXiv:2108.06932, 2021.

[8] Y. Zhang et al., “A point in the right direction: Vector prediction for spatially-aware self-supervised volumetric representation learning,” in ISBI, 2023, pp. 1–5.

[9] A. Gu and T. Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.

[10] Y. Liu et al., “VMamba: Visual state space model,” NeurIPS, vol. 37, 2024.

[11] J. Ruan et al., “VM-UNet: Vision Mamba U-Net for medical image segmentation,” arXiv preprint arXiv:2402.02491, 2024.

[12] M. Zhang et al., “VM-UNetV2: rethinking vision Mamba UNet for medical image segmentation,” in ISBI, 2024, pp. 335–346.

[13] D. Adame et al., “Topo-VM-UNetV2: Encoding topology into vision Mamba UNet for polyp segmentation,” in CBMS, 2025, pp. 258–263.

[14] F. Vazquez et al., “Learning with geometric priors in U-Net variants for polyp segmentation,” arXiv preprint arXiv:2601.17331, 2026.

[15] Y. Peng et al., “U-Net v2: Rethinking the skip connections of U-Net for medical image segmentation,” in ISBI, 2025, pp. 1–5.

[16] D. Jha et al., “Kvasir-SEG: A segmented polyp dataset,” in MMM, 2020.

[17] J. Bernal et al., “WM-DOVA maps for accurate polyp highlighting in colonoscopy: Validation vs. saliency maps from physicians,” Computerized Medical Imaging and Graphics, vol. 43, pp. 99–111, 2015.

[18] J. Silva et al., “Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer,” Journal of CARS, vol. 9, pp. 283–293, 2014.

[19] D. Vázquez et al., “A benchmark for endoluminal scene segmentation of colonoscopy images,” Journal of Healthcare Engineering, 2017.

[20] N. C. Codella et al., “Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (ISBI), hosted by the international skin imaging collaboration (ISIC),” in ISBI, 2018, pp. 168–172.

[21] N. Codella et al., “Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (ISIC),” arXiv preprint arXiv:1902.03368, 2019.

[22] P. Tschandl et al., “The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions,” Scientific Data, vol. 5, no. 1, pp. 1–9, 2018.
