
arxiv: 2606.17966 · v1 · pith:GZFQ5OJYnew · submitted 2026-06-16 · 💻 cs.CV

Reload-Mamba: Hierarchical Anti-Dilution State-Space Modeling for Multi-Class Semantic Segmentation

Pith reviewed 2026-06-27 01:12 UTC · model grok-4.3

classification 💻 cs.CV
keywords semantic segmentation · state space models · Mamba · boundary prior · anti-dilution · ADE20K · multi-class segmentation · hierarchical decoder

The pith

Reload-Mamba counters response dilution in Mamba state-space models for multi-class semantic segmentation through boundary-supervised priors, uncertainty-aware gates, and hierarchical reloads at multiple decoder levels.
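
The dilution this summary refers to can be illustrated with the standard discretized selective state-space recurrence. This is a sketch in common Mamba notation, not the paper's own analysis; the symbols \bar{A}_t, \bar{B}_t, C_t and the zero initial state h_0 = 0 are assumptions made here for illustration.

```latex
\begin{aligned}
h_t &= \bar{A}_t\, h_{t-1} + \bar{B}_t\, x_t, \qquad y_t = C_t\, h_t, \\
y_t &= \sum_{s=1}^{t} C_t \Bigl(\prod_{k=s+1}^{t} \bar{A}_k\Bigr) \bar{B}_s\, x_s .
\end{aligned}
```

When the diagonal entries of each \bar{A}_k lie in (0, 1), the weight on x_s shrinks multiplicatively as the distance t - s grows, so a sharp local response at position s can be attenuated by the time it influences distant positions.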

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that sequential propagation in Mamba models attenuates boundary and detail signals essential for dense multi-class prediction. It counters this with three targeted designs: a boundary-supervised local detail prior trained on ground-truth masks, a Reload Gate that adds per-pixel class entropy from an auxiliary head as a gating signal, and a hierarchical multi-level Reload that refines and fuses representations top-down across three decoder stages. These sit atop a ConvNeXt-Tiny encoder, multi-scale decoder, and four-directional Mamba scanning with pixel-wise attention. If the designs work as described, they restore the lost responses without quadratic attention cost and deliver a cumulative 2.2 mIoU gain on ADE20K over a direct port of prior anti-dilution work. The result is 47.9 percent single-scale mIoU on ADE20K, 83.2 percent on Cityscapes, and 87.8 percent on PASCAL VOC 2012 val under standard protocols.
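
The top-down fusion across three decoder stages can be sketched as follows. This is a minimal illustration under stated assumptions: single-channel maps stored as nested lists, nearest-neighbour 2x upsampling, additive fusion, and a caller-supplied `refine` function standing in for the per-level reload. None of these choices is confirmed by the summary; the names `upsample2x`, `add_maps`, and `top_down_fuse` are hypothetical.

```python
def upsample2x(fmap):
    """Nearest-neighbour 2x upsampling of a 2D list (assumed operator)."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise sum of two equally sized 2D lists."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def top_down_fuse(levels, refine):
    """Fuse feature maps coarsest-first.

    levels: list of 2D maps, each twice the resolution of the previous one.
    refine: hypothetical per-level refinement (stand-in for a reload step).
    """
    fused = refine(levels[0])
    for finer in levels[1:]:
        fused = add_maps(refine(finer), upsample2x(fused))
    return fused


# Three levels: 1x1, 2x2, 4x4. Identity refinement keeps the example simple.
coarse = [[1.0]]
mid = [[0.0, 0.0], [0.0, 0.0]]
fine = [[0.0] * 4 for _ in range(4)]
fused = top_down_fuse([coarse, mid, fine], lambda fmap: fmap)
```

In this toy case the single coarse activation propagates to every pixel of the finest map, which is the sense in which coarse, refined context reaches the high-resolution output.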

Core claim

The central claim is that propagation-induced response dilution in Mamba-based state-space models for semantic segmentation is mitigated by the Reload-Mamba framework's three segmentation-specific designs, which cumulatively improve over the direct-port baseline by 2.2 mIoU on ADE20K while reaching 47.9 percent single-scale mIoU on ADE20K, 83.2 percent on Cityscapes, and 87.8 percent on PASCAL VOC 2012 val.
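
Since every headline number here is an mIoU, a short sketch of the metric may help. It computes mean intersection-over-union for one prediction/label pair; benchmark protocols normally accumulate intersections and unions over the whole validation set before averaging, and the `ignore_index=255` default is a common convention assumed here, not something stated in the summary.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean IoU over classes that appear in the prediction or the label."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            if t == ignore_index:
                continue  # unlabeled pixels do not count
            if p == t:
                inter[t] += 1
                union[t] += 1
            else:
                union[t] += 1  # false negative for the true class
                union[p] += 1  # false positive for the predicted class
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


pred = [[0, 0], [1, 1]]
target = [[0, 1], [1, 1]]
score = mean_iou(pred, target, num_classes=2)  # IoU: class 0 = 1/2, class 1 = 2/3
```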

What carries the argument

The class-uncertainty-aware Reload Gate combined with hierarchical multi-level Reload, which uses boundary-supervised priors and auxiliary entropy signals to restore attenuated responses at three decoder levels before top-down fusion.
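
A per-pixel sketch of an entropy-aware gate follows. The summary does not give the paper's gate equation, so the form below (a sigmoid over a weighted sum of a boundary prior and normalized class entropy, followed by a gated residual add) is an assumption, and `reload_gate`, `reload_pixel`, and the weights `w_boundary`, `w_entropy`, `bias` are hypothetical names and values.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def normalized_entropy(probs):
    """Shannon entropy divided by log(C), so the value lies in [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(probs))


def reload_gate(boundary_prior, aux_logits, w_boundary=1.0, w_entropy=1.0, bias=-1.0):
    """Assumed gate: sigmoid of boundary prior plus class uncertainty."""
    u = normalized_entropy(softmax(aux_logits))
    z = w_boundary * boundary_prior + w_entropy * u + bias
    return 1.0 / (1.0 + math.exp(-z))


def reload_pixel(diluted, detail, boundary_prior, aux_logits):
    """Assumed restoration: add back local detail in proportion to the gate."""
    return diluted + reload_gate(boundary_prior, aux_logits) * detail


# An uncertain boundary pixel opens the gate more than a confident interior pixel.
g_uncertain = reload_gate(1.0, [0.0, 0.0, 0.0])
g_confident = reload_gate(0.0, [10.0, 0.0, 0.0])
```

Dividing by log(C) keeps the uncertainty term on the same scale for datasets with different class counts, which is a design choice made for this sketch.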

If this is right

  • Each of the three designs contributes measurable improvement beyond a direct port of the prior anti-dilution architecture.
  • The full model reaches 47.9 percent single-scale mIoU on ADE20K and 83.2 percent on Cityscapes.
  • With ResNet-101 and COCO pre-training, the same architecture reaches 87.8 percent mIoU on PASCAL VOC 2012 val.
  • The class-uncertainty-aware gate is formulated specifically for multi-class dense prediction and is informative only in that setting.
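
One plausible reading of the last point, offered as an interpretation rather than the paper's stated argument: with two classes, entropy is fully determined by the foreground probability, so it adds nothing a binary gate does not already see. With three or more classes, two pixels with the same top-1 confidence can have different entropies. The small check below illustrates that.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def binary_entropy(p):
    """Two-class entropy is a fixed function of the foreground probability."""
    return entropy([p, 1.0 - p])


# Same top-1 confidence (0.6), different spread over the remaining classes.
two_way_confusion = [0.6, 0.4, 0.0]
spread_confusion = [0.6, 0.2, 0.2]

h_two_way = entropy(two_way_confusion)
h_spread = entropy(spread_confusion)
```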

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The hierarchical reload pattern could be tested on other linear-time sequence models applied to dense prediction tasks where boundary fidelity matters.
  • The uncertainty gate might reduce the need for heavy auxiliary supervision if the entropy signal can be derived from the main head itself.
  • If the anti-dilution effect holds, similar reload stages could be inserted into Mamba backbones for related tasks such as panoptic segmentation or monocular depth.

Load-bearing premise

The auxiliary entropy head and boundary-supervised prior supply independent signals that genuinely restore diluted responses rather than simply adding capacity or fitting dataset artifacts.

What would settle it

An experiment that replaces the boundary masks and entropy head with random or constant signals and still records the full 2.2 mIoU gain on ADE20K would show the gains come from added capacity rather than targeted anti-dilution.
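
The control signals for such an experiment could be generated along these lines. This is a sketch only: the helper name `make_control` and the three modes are hypothetical, and "shuffled" (permuting values across pixels) is one choice of random control that preserves the signal's value distribution while destroying its spatial alignment.

```python
import random


def make_control(signal, mode, seed=0):
    """Return a control version of a 2D signal map (boundary mask or entropy map).

    mode: 'real' (unchanged copy), 'constant' (map mean at every pixel),
          'shuffled' (same values, randomly permuted across pixels).
    """
    height, width = len(signal), len(signal[0])
    flat = [v for row in signal for v in row]
    if mode == "real":
        out = list(flat)
    elif mode == "constant":
        mean = sum(flat) / len(flat)
        out = [mean] * len(flat)
    elif mode == "shuffled":
        out = list(flat)
        random.Random(seed).shuffle(out)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [out[r * width:(r + 1) * width] for r in range(height)]


boundary = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
constant = make_control(boundary, "constant")
shuffled = make_control(boundary, "shuffled", seed=1)
```

Training the full model once per mode, with everything else fixed, would separate the effect of the signal's content from the effect of the extra branch that consumes it.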

Figures

Figures reproduced from arXiv: 2606.17966 by Hsin-Jui Pan, Jen-Shiun Chiang, Sheng-Wei Chan.

Figure 1: Overall architecture of the proposed Reload-Mamba segmentation network. The upper part shows the main …
Figure 2: Qualitative segmentation results on ADE20K. Each example shows the input image, ground-truth annotation, …
Figure 3: Qualitative segmentation results on Cityscapes. Each example shows the ground-truth annotation, the Reload …
Figure 4: Qualitative segmentation results on PASCAL VOC 2012 for object-centric semantic segmentation.
read the original abstract

Mamba-based state space models offer linear-time long-range modeling for high-resolution dense prediction, but sequential state-space propagation can attenuate boundary-sensitive and detail-sensitive responses that are critical in multi-class semantic segmentation. We propose Reload-Mamba, a semantic segmentation framework that addresses this propagation-induced response dilution through three segmentation-specific designs: (i) a boundary-supervised local detail prior that is explicitly trained with ground-truth boundary masks to identify regions requiring response restoration; (ii) a class-uncertainty-aware Reload Gate that incorporates per-pixel class entropy from a pre-reload auxiliary head as an additional gating signal, a formulation that is informative only under multi-class dense prediction; and (iii) a hierarchical multi-level Reload mechanism that applies anti-dilution refinement at three decoder levels and fuses the restored representations top-down. Built upon a ConvNeXt-Tiny encoder with a multi-scale decoder and four-directional Mamba scanning with pixel-wise directional attention, Reload-Mamba achieves 47.9% single-scale (48.9% multi-scale) mIoU on ADE20K and 83.2% single-scale mIoU on Cityscapes. With ResNet-101 + COCO pre-training under the standard DeepLab-style protocol, Reload-Mamba reaches 87.8% mIoU on PASCAL VOC 2012 val. Controlled ablations show that each of the three segmentation-specific designs contributes beyond a direct port of the prior anti-dilution architecture proposed for binarization, cumulatively improving over the direct-port baseline by +2.2 mIoU on ADE20K.
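
Four-directional scanning turns a 2D feature map into four 1D sequences. The abstract does not say which four orders are used, so the sketch below assumes the layout common in visual state-space models: row-major and column-major traversals, each forward and reversed. The function name is hypothetical.

```python
def four_direction_scans(height, width):
    """Return four pixel visit orders as lists of (row, col) coordinates."""
    row_major = [(r, c) for r in range(height) for c in range(width)]
    col_major = [(r, c) for c in range(width) for r in range(height)]
    return {
        "row_forward": row_major,
        "row_backward": row_major[::-1],
        "col_forward": col_major,
        "col_backward": col_major[::-1],
    }


scans = four_direction_scans(2, 3)
# Vertical neighbours (0, 0) and (1, 0) are `width` steps apart in the
# row-major order but adjacent in the column-major order.
row_gap = scans["row_forward"].index((1, 0)) - scans["row_forward"].index((0, 0))
col_gap = scans["col_forward"].index((1, 0)) - scans["col_forward"].index((0, 0))
```

The gap comparison shows why a single scan direction is a poor fit for images: pixels that are spatial neighbours can sit far apart in the sequence, which is where sequential propagation has the most room to attenuate a response.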

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes Reload-Mamba, a Mamba-based semantic segmentation architecture that counters propagation-induced response dilution via three designs: (i) a boundary-supervised local detail prior trained on GT boundary masks, (ii) a class-uncertainty-aware Reload Gate that uses per-pixel entropy from a pre-reload auxiliary head, and (iii) a hierarchical multi-level reload mechanism applied at three decoder levels with top-down fusion. Built on a ConvNeXt-Tiny encoder with four-directional Mamba scanning and pixel-wise directional attention, the model reports 47.9% single-scale (48.9% multi-scale) mIoU on ADE20K, 83.2% on Cityscapes, and 87.8% on PASCAL VOC 2012 val. Controlled ablations claim the three designs cumulatively deliver +2.2 mIoU over a direct-port baseline.

Significance. If the +2.2 mIoU gain is shown to arise specifically from the anti-dilution mechanisms rather than auxiliary supervision or capacity, the work would offer a practical route for adapting linear-time state-space models to boundary-sensitive dense prediction. The concrete benchmark numbers and the explicit hierarchical reload formulation constitute a strength; the segmentation-specific gating that incorporates class entropy is a targeted contribution for multi-class settings.

major comments (1)
  1. [Ablation study] Ablation study (corresponding to the controlled ablations referenced in the abstract): the direct-port baseline must be demonstrated to match total parameter count, FLOPs, and auxiliary training protocol (including the entropy head and boundary supervision) of the full Reload-Mamba model. Without this control, the cumulative +2.2 mIoU cannot be unambiguously attributed to the three segmentation-specific designs rather than added capacity or extra supervision signals.
minor comments (2)
  1. [Abstract and §3] The abstract and methods should explicitly state whether the auxiliary entropy head remains active at inference or is detached, and whether its parameters are counted in the reported model size.
  2. [Results] No error bars or multi-seed statistics accompany the mIoU figures; adding these would strengthen the reported deltas even if not load-bearing for the central claim.
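
The multi-seed reporting requested in the second minor comment could look like the sketch below. The scores are placeholders invented for illustration and are not results from the paper; the helper names are hypothetical, and the standard error assumes the two sets of runs are independent.

```python
import statistics


def summarize(scores):
    """Mean and sample standard deviation over seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def delta_with_spread(baseline_scores, model_scores):
    """Difference of means and its standard error for independent runs."""
    b_mean, b_std = summarize(baseline_scores)
    m_mean, m_std = summarize(model_scores)
    delta = m_mean - b_mean
    se = (b_std ** 2 / len(baseline_scores) + m_std ** 2 / len(model_scores)) ** 0.5
    return delta, se


# Placeholder mIoU values, NOT taken from the paper.
baseline_runs = [40.0, 40.4, 39.6]
model_runs = [42.0, 42.2, 41.8]
delta, se = delta_with_spread(baseline_runs, model_runs)
```

Reporting the delta alongside its standard error lets a reader judge whether a gain of a given size is large relative to seed-to-seed variation.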

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for tighter controls in our ablation study. We address this point directly below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Ablation study] Ablation study (corresponding to the controlled ablations referenced in the abstract): the direct-port baseline must be demonstrated to match total parameter count, FLOPs, and auxiliary training protocol (including the entropy head and boundary supervision) of the full Reload-Mamba model. Without this control, the cumulative +2.2 mIoU cannot be unambiguously attributed to the three segmentation-specific designs rather than added capacity or extra supervision signals.

    Authors: We agree that unambiguous attribution requires the direct-port baseline to match the full model in parameter count, FLOPs, and auxiliary training protocol. The current manuscript describes the baseline as a direct port of the prior binarization architecture without the three proposed designs, but does not explicitly verify equivalence of auxiliary heads or report matching FLOPs/parameters for all variants. In the revised manuscript we will add a controlled ablation table that (i) reports parameter counts and FLOPs for every configuration, (ii) equips the direct-port baseline with auxiliary entropy and boundary heads under the same training protocol, and (iii) isolates the incremental effect of the boundary-supervised prior, entropy-aware gate, and hierarchical reload. This will confirm that the reported +2.2 mIoU gain arises from the segmentation-specific mechanisms rather than added capacity or supervision.

    Revision: yes
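
The controlled ablation table described in this response could be laid out as a grid like the one sketched here. A full factorial over the three designs is one way to isolate incremental effects and is an assumption of this sketch, as are the field names; the measurement columns are left empty because no such numbers appear in the source.

```python
from itertools import product

DESIGNS = ("boundary_prior", "entropy_gate", "hierarchical_reload")


def ablation_grid():
    """One row per on/off combination of the three designs."""
    rows = []
    for flags in product((False, True), repeat=len(DESIGNS)):
        row = dict(zip(DESIGNS, flags))
        # Assumption: auxiliary heads are trained in every row so that
        # supervision is matched between the baseline and the full model.
        row["aux_heads_trained"] = True
        # To be filled from measurements; intentionally left empty here.
        row["params_m"] = None
        row["gflops"] = None
        row["miou"] = None
        rows.append(row)
    return rows


grid = ablation_grid()
direct_port = grid[0]   # all three designs off
full_model = grid[-1]   # all three designs on
```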

Circularity Check

0 steps flagged

No circularity: empirical ablation gains are measured outcomes, not derived by construction

full rationale

The paper reports measured mIoU improvements (+2.2 on ADE20K) from three segmentation designs via controlled ablations against a direct-port baseline. No equations, fitted parameters renamed as predictions, or self-definitional reductions appear in the provided text. The boundary prior and entropy head add explicit supervision, but the gains are presented as experimental results rather than forced by the model definition itself. Self-citation to a prior binarization architecture is mentioned but not load-bearing for the current claims, as the evaluation uses external benchmarks (ADE20K, Cityscapes, PASCAL VOC). The derivation chain is self-contained against external data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated beyond the three new architectural components whose independent value is asserted by the ablation.

invented entities (2)
  • Reload Gate (no independent evidence)
    purpose: Incorporate per-pixel class entropy as gating signal for response restoration
    Introduced as a new gating formulation specific to multi-class dense prediction
  • boundary-supervised local detail prior (no independent evidence)
    purpose: Identify regions requiring response restoration using ground-truth boundary masks
    New auxiliary supervision branch not present in the referenced binarization baseline
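
Ground-truth boundary masks of the kind this prior is supervised with can be derived from a label map. The rule below (a pixel is a boundary pixel if any 4-connected neighbour carries a different valid label) is a common construction assumed for illustration; the source does not say how the paper builds its masks, and `ignore_index=255` is likewise an assumed convention.

```python
def boundary_mask(labels, ignore_index=255):
    """Binary mask: 1 where a 4-neighbour has a different valid class label."""
    height, width = len(labels), len(labels[0])
    mask = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            if labels[r][c] == ignore_index:
                continue  # unlabeled pixels are never marked as boundary
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < height and 0 <= cc < width:
                    neighbour = labels[rr][cc]
                    if neighbour != ignore_index and neighbour != labels[r][c]:
                        mask[r][c] = 1
                        break
    return mask


labels = [
    [0, 0, 1],
    [0, 0, 1],
    [2, 2, 1],
]
mask = boundary_mask(labels)
```

Only the top-left pixel is interior in this example; every other pixel touches a different class and is marked.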

pith-pipeline@v0.9.1-grok · 5831 in / 1490 out tokens · 34977 ms · 2026-06-27T01:12:16.739877+00:00 · methodology


Reference graph

Works this paper leans on

54 extracted references · 5 linked inside Pith

  1. [1]

    Segnet: A deep convolutional encoder-decoder architecture for image segmentation, in: IEEE Transactions on Pattern Analysis and Machine Intelligence, pp

    Badrinarayanan, V., Kendall, A., Cipolla, R., 2017. Segnet: A deep convolutional encoder-decoder architecture for image segmentation, in: IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 2481–2495

  2. [2]

    Deepmine-mamba: Mitigating information dilution in mamba-based state space models for document image binarization

    Chan, S.W., Wang, Y.C., Pan, H.J., Lin, C.M., Chiang, J.S., 2026. Deepmine-mamba: Mitigating information dilution in mamba-based state space models for document image binarization. arXiv preprint arXiv:2606.08781

  3. [3]

    Rethinking atrous convolution for semantic image segmentation

    Chen, L.C., Papandreou, G., Schroff, F., Adam, H., 2017. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587

  4. [4]

    Encoder-decoder with atrous separable convolution for semantic image segmentation, in: Proceedings of the European Conference on Computer Vision, pp

    Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H., 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation, in: Proceedings of the European Conference on Computer Vision, pp. 801–818

  5. [5]

    Tensor low-rank reconstruction for semantic segmentation, in: European Conference on Computer Vision, pp

    Chen, W., Zhu, X., Sun, R., He, J., Li, R., Shen, X., Yu, B., 2020. Tensor low-rank reconstruction for semantic segmentation, in: European Conference on Computer Vision, pp. 52–69

  6. [6]

    The cityscapes dataset for semantic urban scene understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp

    Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B., 2016. The cityscapes dataset for semantic urban scene understanding, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223

  7. [7]

    Imagenet: A large-scale hierarchical image database, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp

    Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L., 2009. Imagenet: A large-scale hierarchical image database, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255

  8. [8]

    Boundary-aware feature propagation for scene segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp

    Ding, H., Jiang, X., Liu, A.Q., Magnenat-Thalmann, N., Wang, G., 2019. Boundary-aware feature propagation for scene segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6819–6829

  9. [9]

    An image is worth 16x16 words: Transformers for image recognition at scale, in: International Conference on Learning Representations

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2021. An image is worth 16x16 words: Transformers for image recognition at scale, in: International Conference on Learning Representations

  10. [10]

    The pascal visual object classes (voc) challenge

    Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A., 2010. The pascal visual object classes (voc) challenge. International Journal of Computer Vision 88, 303–338

  11. [11]

    Dual attention network for scene segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Fu, J., Liu, J., Tian, H., Li, Y., Bao, Y., Fang, Z., Lu, H., 2019. Dual attention network for scene segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3146–3154

  12. [12]

    SegMAN: Omni-scale context modeling with state space models and local attention for semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Fu, Y., Lou, M., Yu, Y., 2025. SegMAN: Omni-scale context modeling with state space models and local attention for semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  13. [13]

    Mamba: Linear-time sequence modeling with selective state spaces

    Gu, A., Dao, T., 2023. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752

  14. [14]

    Efficiently modeling long sequences with structured state spaces, in: International Conference on Learning Representations

    Gu, A., Goel, K., Ré, C., 2022. Efficiently modeling long sequences with structured state spaces, in: International Conference on Learning Representations

  15. [15]

    Segnext: Rethinking convolutional attention design for semantic segmentation, in: Advances in Neural Information Processing Systems, pp

    Guo, M.H., Lu, C.Z., Hou, Q., Liu, Z., Cheng, M.M., Hu, S.M., 2022. Segnext: Rethinking convolutional attention design for semantic segmentation, in: Advances in Neural Information Processing Systems, pp. 1140–1156

  16. [16]

    Semantic contours from inverse detectors, in: Proceedings of the IEEE International Conference on Computer Vision, pp

    Hariharan, B., Arbelaez, P., Bourdev, L., Maji, S., Malik, J., 2011. Semantic contours from inverse detectors, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 991–998. S.W. Chan:Preprint submitted to ElsevierPage 21 of 23 Reload-Mamba

  17. [17]

    Mambavision: A hybrid mamba-transformer vision backbone, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Hatamizadeh, A., Kautz, J., 2025. Mambavision: A hybrid mamba-transformer vision backbone, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  18. [18]

    Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp

    He, K., Zhang, X., Ren, S., Sun, J., 2016. Deep residual learning for image recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778

  19. [19]

    Localmamba: Visual state space model with windowed selective scan

    Huang, T., Pei, X., You, S., Wang, F., Qian, C., Xu, C., 2024. Localmamba: Visual state space model with windowed selective scan. arXiv preprint arXiv:2403.09338

  20. [20]

    Ccnet: Criss-cross attention for semantic segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp

    Huang, Z., Wang, X., Huang, L., Huang, C., Wei, Y., Liu, W., 2019. Ccnet: Criss-cross attention for semantic segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 603–612

  21. [21]

    Alignseg: Feature-aligned segmentation networks

    Huang, Z., Wei, Y., Wang, X., Liu, W., Huang, T.S., Shi, H., 2022. Alignseg: Feature-aligned segmentation networks. IEEE Transactions on Pattern Analysis and Machine Intelligence 44, 550–557

  22. [22]

    Deeply-supervised nets, in: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pp

    Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z., 2015. Deeply-supervised nets, in: Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, pp. 562–570

  23. [23]

    Semantic flow for fast and accurate scene parsing, in: European Conference on Computer Vision, Springer

    Li, X., You, A., Zhu, Z., Zhao, H., Yang, M., Yang, K., Tong, Y., 2020. Semantic flow for fast and accurate scene parsing, in: European Conference on Computer Vision, Springer. pp. 775–793

  24. [24]

    Expectation-maximization attention networks for semantic segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp

    Li, X., Zhong, Z., Wu, J., Yang, Y., Lin, Z., Liu, H., 2019. Expectation-maximization attention networks for semantic segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9167–9176

  25. [25]

    Feature pyramid networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp

    Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S., 2017. Feature pyramid networks for object detection, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2117–2125

  26. [26]

    Microsoft coco: Common objects in context, in: European Conference on Computer Vision, Springer

    Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L., 2014. Microsoft coco: Common objects in context, in: European Conference on Computer Vision, Springer. pp. 740–755

  27. [27]

    Vmamba: Visual state space model, in: Advances in Neural Information Processing Systems

    Liu, Y., Tian, Y., Zhao, Y., Yu, H., Xie, L., Wang, Y., Ye, Q., Jiao, J., Liu, Y., 2024. Vmamba: Visual state space model, in: Advances in Neural Information Processing Systems

  28. [28]

    Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp

    Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B., 2021. Swin transformer: Hierarchical vision transformer using shifted windows, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022

  29. [29]

    A convnet for the 2020s, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S., 2022. A convnet for the 2020s, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986

  30. [30]

    Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp

    Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440

  31. [31]

    Decoupled weight decay regularization, in: International Conference on Learning Representations

    Loshchilov, I., Hutter, F., 2019. Decoupled weight decay regularization, in: International Conference on Learning Representations

  32. [32]

    SparX: A sparse cross-layer connection mechanism for hierarchical vision mamba and transformer networks

    Lou, M., Fu, Y., Yu, Y., 2024. SparX: A sparse cross-layer connection mechanism for hierarchical vision mamba and transformer networks. arXiv preprint arXiv:2409.09649

  33. [33]

    U-mamba: Enhancing long-range dependency for biomedical image segmentation

    Ma, J., Li, F., Wang, B., 2024. U-mamba: Enhancing long-range dependency for biomedical image segmentation. arXiv preprint arXiv:2401.04722

  34. [34]

    The mapillary vistas dataset for semantic understanding of street scenes, in: Proceedings of the IEEE International Conference on Computer Vision, pp

    Neuhold, G., Ollmann, T., Rota Bulo, S., Kontschieder, P., 2017. The mapillary vistas dataset for semantic understanding of street scenes, in: Proceedings of the IEEE International Conference on Computer Vision, pp. 4990–4999

  35. [35]

    Pytorch: An imperative style, high-performance deep learning library, in: Advances in Neural Information Processing Systems

    Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S., 2019. Pytorch: An imperative style, high-performance deep learning library, in: Advances in Neural In...

  36. [36]

    U-net: Convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention, Springer

    Ronneberger, O., Fischer, P., Brox, T., 2015. U-net: Convolutional networks for biomedical image segmentation, in: Medical Image Computing and Computer-Assisted Intervention, Springer. pp. 234–241

  37. [37]

    Vm-unet: Vision mamba unet for medical image segmentation

    Ruan, J., Xiang, S., 2024. Vm-unet: Vision mamba unet for medical image segmentation. arXiv preprint arXiv:2402.02491

  38. [38]

    Multi-scale vmamba: Hierarchy in hierarchy visual state space model

    Shi, Y., Dong, M., Xu, C., 2024. Multi-scale vmamba: Hierarchy in hierarchy visual state space model. arXiv preprint arXiv:2405.14174

  39. [39]

    Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations

    Simonyan, K., Zisserman, A., 2015. Very deep convolutional networks for large-scale image recognition, in: International Conference on Learning Representations

  40. [40]

    An isotropic 3x3 image gradient operator

    Sobel, I., Feldman, G., 1968. An isotropic 3x3 image gradient operator. Presented at the Stanford Artificial Intelligence Project

  41. [41]

    Gated-scnn: Gated shape cnns for semantic segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp

    Takikawa, T., Acuna, D., Jampani, V., Fidler, S., 2019. Gated-scnn: Gated shape cnns for semantic segmentation, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5229–5238

  42. [42]

    Spatial-Mamba: Effective visual state space models via structure-aware state fusion

    Xiao, C., Li, M., Zhang, Z., Meng, D., Zhang, L., 2024. Spatial-Mamba: Effective visual state space models via structure-aware state fusion. arXiv preprint arXiv:2410.15091

  43. [43]

    Segformer: Simple and efficient design for semantic segmentation with transformers, in: Advances in Neural Information Processing Systems, pp

    Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P., 2021. Segformer: Simple and efficient design for semantic segmentation with transformers, in: Advances in Neural Information Processing Systems, pp. 12077–12090

  44. [44]

    Aggregated residual transformations for deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp

    Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K., 2017. Aggregated residual transformations for deep neural networks, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500

  45. [45]

    Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation, in: Medical Image Computing and Computer Assisted Intervention, Springer

    Xing, Z., Ye, T., Yang, Y., Liu, G., Zhu, L., 2024. Segmamba: Long-range sequential modeling mamba for 3d medical image segmentation, in: Medical Image Computing and Computer Assisted Intervention, Springer. pp. 578–588

  46. [46]

    Focal self-attention for local-global interactions in vision transformers, in: Advances in Neural Information Processing Systems, pp

    Yang, J., Li, C., Zhang, P., Dai, X., Xiao, B., Yuan, L., Gao, J., 2021. Focal self-attention for local-global interactions in vision transformers, in: Advances in Neural Information Processing Systems, pp. 30008–30022

  47. [47]

    Learning a discriminative feature network for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp

    Yu, C., Wang, J., Peng, C., Gao, C., Yu, G., Sang, N., 2018. Learning a discriminative feature network for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1857–1866

  48. [48]

    Mambaout: Do we really need mamba for vision?, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Yu, W., Wang, X., 2025. Mambaout: Do we really need mamba for vision?, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  49. [49]

    Object-contextual representations for semantic segmentation, in: European Conference on Computer Vision, Springer

    Yuan, Y., Chen, X., Wang, J., 2020. Object-contextual representations for semantic segmentation, in: European Conference on Computer Vision, Springer. pp. 173–190

  50. [50]

    Context encoding for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp

    Zhang, H., Dana, K., Shi, J., Zhang, Z., Wang, X., Tyagi, A., Agrawal, A., 2018. Context encoding for semantic segmentation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7151–7160

  51. [51]

    Co-occurrent features in semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Zhang, H., Zhang, H., Wang, C., Xie, J., 2019. Co-occurrent features in semantic segmentation, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 548–557

  52. [52]

    Pyramid scene parsing network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp

    Zhao, H., Shi, J., Qi, X., Wang, X., Jia, J., 2017. Pyramid scene parsing network, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2881–2890

  53. [53]

    Scene parsing through ade20k dataset, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp

    Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A., 2017. Scene parsing through ade20k dataset, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 633–641

  54. [54]

    Vision mamba: Efficient visual representation learning with bidirectional state space model

    Zhu, L., Liao, B., Zhang, Q., Wang, X., Liu, W., Wang, X., 2024. Vision mamba: Efficient visual representation learning with bidirectional state space model. arXiv preprint arXiv:2401.09417