pith. machine review for the scientific record.

arxiv: 2604.18721 · v1 · submitted 2026-04-20 · 📡 eess.IV · cs.CV

Recognition: unknown

A Controlled Benchmark of Visual State-Space Backbones with Domain-Shift and Boundary Analysis for Remote-Sensing Segmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 03:05 UTC · model grok-4.3

classification 📡 eess.IV cs.CV
keywords visual state-space models · remote sensing segmentation · domain shift · boundary analysis · benchmark · Mamba · semantic segmentation · efficiency trade-offs

The pith

Visual state-space models show favorable efficiency in remote-sensing segmentation, yet gain little from encoder scaling, and boundary errors dominate under domain shift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a controlled benchmark for visual state-space model encoders in remote-sensing semantic segmentation by varying only the encoder while fixing the decoder and training protocol. It evaluates models including VMamba, MambaVision, and Spatial-Mamba against CNN and Transformer baselines on the LoveDA and ISPRS Potsdam datasets. The findings indicate modest benefits from scaling within SSM families, asymmetric generalization under domain shifts, and boundary errors as the primary failure mode. Sympathetic readers would care because these results redirect development efforts toward robustness and boundary-aware techniques rather than relying on encoder scaling for better performance.

Core claim

Under a unified 4-stage feature interface and fixed lightweight decoder, visual SSM backbones achieve better accuracy-efficiency balances than controlled CNN and Transformer baselines. However, increasing encoder size within each SSM family produces only modest segmentation gains, cross-domain performance is strongly asymmetric, and boundary errors dominate under distribution shifts, suggesting that robustness-oriented designs and boundary-aware decoding will drive future improvements more than encoder scaling.

What carries the argument

The strictly controlled experimental setup with a unified 4-stage feature interface and fixed lightweight decoder that isolates the effects of different visual state-space encoders.
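The isolation mechanism can be sketched concretely. Assuming, hypothetically, that the unified 4-stage interface demands feature maps at strides 4/8/16/32 with fixed channel widths (the paper's exact values are not given on this page), every encoder would be checked against the same contract before the fixed decoder ever sees its features:

```python
import numpy as np

# Hypothetical contract for the "unified 4-stage feature interface":
# strides and channel widths below are illustrative assumptions.
STRIDES = (4, 8, 16, 32)
CHANNELS = (96, 192, 384, 768)

def conforms_to_interface(features, input_hw):
    """Check that an encoder's feature pyramid matches the fixed contract."""
    h, w = input_hw
    if len(features) != 4:
        return False
    for f, s, c in zip(features, STRIDES, CHANNELS):
        if f.shape != (c, h // s, w // s):
            return False
    return True

# A stand-in "encoder" output: a conforming pyramid for a 512x512 input.
pyramid = [np.zeros((c, 512 // s, 512 // s)) for s, c in zip(STRIDES, CHANNELS)]
print(conforms_to_interface(pyramid, (512, 512)))  # True
```

Under this kind of contract, swapping VMamba for MambaVision changes only how the pyramid is computed, never its shape, which is what lets accuracy differences be attributed to the encoder.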

If this is right

  • Intra-family scaling of visual SSM encoders yields only modest gains in segmentation accuracy.
  • Cross-domain generalization exhibits strong asymmetry between the evaluated remote-sensing datasets.
  • Boundary delineation errors constitute the dominant failure mode under distribution shift.
  • Visual SSM backbones deliver favorable accuracy-efficiency trade-offs relative to the CNN and Transformer controls.
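The boundary-failure claim presupposes a boundary-sensitive metric. The paper's exact metric is not reproduced on this page; a generic boundary-F1 along the following lines (pure NumPy, 4-neighborhood boundaries) illustrates what "boundary errors dominate" would be measured against:

```python
import numpy as np

def boundary_mask(seg):
    """Mark pixels whose 4-neighborhood contains a different class label."""
    b = np.zeros(seg.shape, dtype=bool)
    horiz = seg[:-1, :] != seg[1:, :]
    vert = seg[:, :-1] != seg[:, 1:]
    b[:-1, :] |= horiz
    b[1:, :] |= horiz
    b[:, :-1] |= vert
    b[:, 1:] |= vert
    return b

def boundary_f1(pred, gt):
    """F-score between predicted and ground-truth boundary pixel sets."""
    pb, gb = boundary_mask(pred), boundary_mask(gt)
    tp = np.logical_and(pb, gb).sum()
    precision = tp / max(pb.sum(), 1)
    recall = tp / max(gb.sum(), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)

# Toy example: a predicted square shifted two pixels off the ground truth
# has high interior overlap but almost no boundary agreement.
gt = np.zeros((64, 64), dtype=int)
gt[16:48, 16:48] = 1
pred = np.zeros((64, 64), dtype=int)
pred[18:50, 18:50] = 1
```

A perfect prediction scores 1.0; the shifted square above scores near zero on boundaries despite a large area overlap, which is exactly the failure pattern a boundary-dominant error analysis would expose.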

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Adding boundary-aware components to the decoder could reduce the main source of errors identified in the study.
  • Reversing the domain shift direction in tests might confirm whether the asymmetry is inherent to specific dataset pairs.
  • The observed efficiency could support real-time segmentation on resource-limited platforms for satellite image analysis.
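One way to make the asymmetry claim operational is to compare the in-domain-to-cross-domain drop in each transfer direction. The mIoU values below are invented placeholders, not results from the paper:

```python
# Placeholder cross-domain mIoU table, keyed (train_domain, test_domain).
# Numbers are invented for illustration only.
miou = {
    ("LoveDA", "LoveDA"): 0.52,
    ("LoveDA", "Potsdam"): 0.31,
    ("Potsdam", "Potsdam"): 0.78,
    ("Potsdam", "LoveDA"): 0.18,
}

def transfer_gap(src, tgt):
    """Drop from in-domain performance when transferring src -> tgt."""
    return miou[(src, src)] - miou[(src, tgt)]

gap_ab = transfer_gap("LoveDA", "Potsdam")   # LoveDA -> Potsdam drop
gap_ba = transfer_gap("Potsdam", "LoveDA")   # Potsdam -> LoveDA drop
asymmetry = abs(gap_ab - gap_ba)             # nonzero => asymmetric transfer
```

A symmetric pair would give `asymmetry` near zero; a large value in one direction only is what "strongly asymmetric" generalization looks like in numbers.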

Load-bearing premise

A single fixed lightweight decoder and unified 4-stage feature interface are sufficient to isolate the effects of the encoders without decoder-specific biases altering the scaling, asymmetry, or boundary patterns.

What would settle it

Repeating the benchmark with multiple different decoder designs and observing whether the modest scaling gains, asymmetric generalization, and boundary dominance persist would determine if the encoder isolation is valid.
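That settling experiment could be scripted as a grid over encoders and decoders, asking whether the "modest scaling" verdict survives under every decoder. Encoder names, decoder variants, the threshold, and the mIoU values below are all hypothetical:

```python
# Hypothetical grid: rows are decoder heads, columns are encoder sizes
# within one SSM family. Values are invented mIoU scores.
results = {
    "lightweight":    {"tiny": 0.510, "small": 0.518, "base": 0.521},
    "upernet_style":  {"tiny": 0.522, "small": 0.531, "base": 0.536},
    "boundary_aware": {"tiny": 0.530, "small": 0.537, "base": 0.541},
}

SCALE_ORDER = ("tiny", "small", "base")
MODEST = 0.02  # hypothetical cutoff for a "modest" tiny-to-base gain

def scaling_gain(per_decoder):
    """mIoU gain from the smallest to the largest encoder."""
    return per_decoder[SCALE_ORDER[-1]] - per_decoder[SCALE_ORDER[0]]

# The finding "encoder scaling yields modest gains" is decoder-robust
# only if it holds for every decoder in the grid.
modest_everywhere = all(scaling_gain(v) < MODEST for v in results.values())
```

If `modest_everywhere` held across genuinely different decoders, the encoder-isolation premise would be substantially harder to attack.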

Figures

Figures reproduced from arXiv: 2604.18721 by Buddhi Wijenayake, Dineth Perera, Nichula Wasalathilaka, Oshadha Samarakoon, Parakrama Ekanayake, Roshan Godaliyadda, Vijitha Herath.

Figure 1. Overview of the controlled benchmark pipeline. To ensure fair comparison, all backbones including visual SSMs (VMamba, MambaVision, Spatial…
Figure 2. Qualitative comparison under the controlled protocol. (a) LoveDA: Urban/Rural examples illustrating boundary delineation and robustness under…
Figure 3. Efficiency analysis on LoveDA: a comparison of mIoU versus FPS.
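Figure 3's mIoU-versus-FPS comparison implies a throughput measurement. The paper's timing harness is not described on this page; a minimal generic probe, assuming a callable model and a warm-up phase, could look like this:

```python
import time

def measure_fps(model_fn, batch, warmup=3, iters=10):
    """Crude frames-per-second probe for any callable taking a batch of frames."""
    for _ in range(warmup):            # discard warm-up iterations (caches, JIT)
        model_fn(batch)
    start = time.perf_counter()
    for _ in range(iters):
        model_fn(batch)
    elapsed = time.perf_counter() - start
    return iters * len(batch) / elapsed

# Stand-in "model": a trivial per-frame computation over a 4-frame batch.
fps = measure_fps(lambda b: [x * x for x in b], [0, 1, 2, 3])
```

This sketch times CPU callables only; on a GPU, a device synchronization call would be needed before reading the clock, or the reported FPS would reflect kernel launch time rather than execution time.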
original abstract

Visual state-space models (SSMs) are increasingly promoted as efficient alternatives to Vision Transformers, yet their practical advantages remain unclear under fair comparison because existing studies rarely isolate encoder effects from decoder and training choices. We present a strictly controlled benchmark of representative visual SSM families, including VMamba, MambaVision, and Spatial-Mamba, for remote-sensing semantic segmentation, in which only the encoder varies across experiments. Evaluated on LoveDA and ISPRS Potsdam under a unified 4-stage feature interface and a fixed lightweight decoder, the benchmark reveals three main findings: intra-family scaling yields only modest gains, cross-domain generalization is strongly asymmetric, and boundary delineation is the dominant failure mode under distribution shift. Although visual SSMs achieve favorable accuracy-efficiency trade-offs relative to the controlled CNN and Transformer baselines considered here, the results suggest that future improvements are more likely to come from robustness-oriented design and boundary-aware decoding than from encoder scaling alone. By isolating encoder behavior under a unified and reproducible protocol, this study establishes a practical reference benchmark for the design and evaluation of future Mamba-based segmentation backbones.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents a controlled empirical benchmark of visual state-space model (SSM) backbones (VMamba, MambaVision, Spatial-Mamba) against CNN and Transformer baselines for semantic segmentation on the LoveDA and ISPRS Potsdam remote-sensing datasets. With only the encoder varying under a fixed lightweight decoder and unified 4-stage feature interface, the study reports three main findings: intra-family scaling produces only modest accuracy gains, cross-domain generalization is strongly asymmetric, and boundary delineation errors dominate under distribution shift. The authors conclude that visual SSMs offer favorable accuracy-efficiency trade-offs relative to the baselines but that future progress is more likely to arise from robustness-oriented design and boundary-aware decoding than from encoder scaling alone, while establishing a reproducible protocol for such evaluations.

Significance. If the encoder isolation holds, the work supplies a practical, reproducible reference benchmark for visual SSMs in remote-sensing segmentation, a setting where domain shift is prevalent. It usefully directs attention away from pure scaling toward boundary handling and robustness, which aligns with observed failure modes in the experiments. The emphasis on a unified interface and controlled comparison is a constructive contribution to the empirical literature on efficient vision backbones.

major comments (1)
  1. [Experimental Setup / Methods] The headline claim that intra-family scaling yields only modest gains and that robustness/boundary-aware decoding will matter more than encoder scaling rests on the assumption that the fixed lightweight decoder and unified 4-stage feature interface cleanly isolate encoder effects. The interface normalizes spatial resolution and channel count but leaves higher-order statistics (activation distributions, long-range dependency patterns) unnormalized across SSM, CNN, and Transformer families. No ablation that varies the decoder or interface while holding encoders fixed is described, so decoder-encoder compatibility biases cannot be ruled out as contributors to the observed modest scaling, asymmetry, and boundary-failure patterns. This is load-bearing for the central conclusions (abstract and §3–4).
minor comments (2)
  1. The abstract and results sections would benefit from explicit reporting of the number of training runs, whether error bars or statistical significance tests accompany the 'modest gains' and 'strongly asymmetric' statements, and the precise channel/resolution normalization steps in the 4-stage interface.
  2. Figure captions and tables should clarify which metrics are reported on the source versus target domains for the cross-domain experiments to make the asymmetry claim immediately verifiable.
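The referee's request for error bars can be met cheaply: report mean ± sample standard deviation over repeated seeds, plus a paired statistic on the per-seed differences. The per-seed mIoU values below are invented for illustration:

```python
import statistics as st

# Invented per-seed mIoU for two encoder sizes (5 paired seeds each).
small = [0.512, 0.508, 0.515, 0.510, 0.509]
large = [0.517, 0.514, 0.520, 0.512, 0.516]

def mean_std(xs):
    """Summary to report as mean ± sample standard deviation."""
    return st.mean(xs), st.stdev(xs)

mu, sigma = mean_std(large)

# Paired comparison: mean per-seed difference over its standard error.
diffs = [l - s for s, l in zip(small, large)]
t_stat = st.mean(diffs) / (st.stdev(diffs) / len(diffs) ** 0.5)
```

Pairing by seed removes between-run variance shared by both models, so even a "modest" absolute gain can be shown to be consistent (or not) across runs, which is precisely the evidence the 'modest gains' wording currently lacks.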

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the manuscript's contribution. We address the single major comment below with an honest accounting of the experimental design choices and planned revisions.

point-by-point responses
  1. Referee: [Experimental Setup / Methods] The headline claim that intra-family scaling yields only modest gains and that robustness/boundary-aware decoding will matter more than encoder scaling rests on the assumption that the fixed lightweight decoder and unified 4-stage feature interface cleanly isolate encoder effects. The interface normalizes spatial resolution and channel count but leaves higher-order statistics (activation distributions, long-range dependency patterns) unnormalized across SSM, CNN, and Transformer families. No ablation that varies the decoder or interface while holding encoders fixed is described, so decoder-encoder compatibility biases cannot be ruled out as contributors to the observed modest scaling, asymmetry, and boundary-failure patterns. This is load-bearing for the central conclusions (abstract and §3–4).

    Authors: We agree that the 4-stage interface normalizes only spatial resolution and channel count, leaving higher-order statistics unnormalized, and that decoder-encoder compatibility effects cannot be fully ruled out without additional ablations. This is a genuine limitation of the current experimental design. The study deliberately fixes the decoder and interface to create a reproducible, practical benchmark that isolates encoder variation under a common lightweight head, following the protocol used in many backbone comparison papers. No decoder-variation ablation is described because it would shift the focus away from the encoder-centric question. We will make a partial revision by adding an explicit limitations paragraph in §4 (and updating the abstract and conclusions) that acknowledges residual family-specific biases, clarifies the scope of the isolation claim, and notes that the observed trends (modest scaling, asymmetric generalization, boundary dominance) hold under the reported controlled decoder setup. We will not add new decoder ablations, as they fall outside the original scope. revision: partial

Circularity Check

0 steps flagged

No significant circularity: purely empirical benchmark with no derivations

full rationale

The paper presents a controlled experimental benchmark of visual SSM encoders for remote-sensing segmentation, with findings drawn directly from accuracy, efficiency, and boundary metrics on LoveDA and ISPRS Potsdam under a fixed protocol. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or described methodology. Central claims (modest intra-family scaling, asymmetric generalization, boundary failures) are observational results rather than reductions to inputs by construction. The fixed decoder and 4-stage interface constitute an experimental design choice whose validity can be debated on empirical grounds but does not create circularity in any derivation chain. This is the expected outcome for a benchmark study whose value lies in reproducible measurements, not in a claimed proof or predictive model.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard computer-vision benchmark assumptions rather than new fitted parameters or invented entities.

axioms (2)
  • domain assumption LoveDA and ISPRS Potsdam are sufficiently representative benchmarks for assessing remote-sensing semantic segmentation under domain shift.
    The paper uses these two datasets as the sole evaluation targets without additional justification for broader generality.
  • domain assumption A fixed lightweight decoder and 4-stage feature interface do not interact with encoder choice in ways that confound the reported findings.
    This assumption is required for the claim that observed differences are attributable to the encoder alone.

pith-pipeline@v0.9.0 · 5534 in / 1418 out tokens · 38964 ms · 2026-05-10T03:05:46.932791+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

19 extracted references · 4 canonical work pages · 1 internal anchor

  1. [1] N. Aburaed, M. Al-Saad, M. Alkhatib, M. Zitouni, S. Almansoori, and H. Al-Ahmad, "Semantic segmentation of remote sensing imagery using an enhanced encoder-decoder architecture," ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 10, pp. 1015–1020, 2023.
  2. [2] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
  3. [3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, vol. 30, 2017.
  4. [4] L. Wang, R. Li, C. Zhang, S. Fang, C. Duan, X. Meng, and P. M. Atkinson, "UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 190, pp. 196–214, 2022.
  5. [5] R. Ratnayake, W. Wijenayake, D. Sumanasekara, G. Godaliyadda, H. Herath, and M. Ekanayake, "Enhanced SCanNet with CBAM and Dice loss for semantic change detection," in 2025 Moratuwa Engineering Research Conference (MERCon), 2025, pp. 84–89.
  6. [6] A. Gu and T. Dao, "Mamba: Linear-time sequence modeling with selective state spaces," arXiv preprint arXiv:2312.00752, 2023.
  7. [7] X. Ma, X. Zhang, and C. Man, "RS-Mamba: Large remote sensing image semantic segmentation via bidirectional state space model," arXiv preprint arXiv:2404.02668, 2024.
  8. [8] Q. Zhu, Y. Cai, Y. Fang, Y. Yang, C. Hartmann, and L. Zhao, "Samba: Semantic segmentation of remote sensing images with state space model," IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–15, 2024.
  9. [9] B. Wijenayake, A. Ratnayake, P. Sumanasekara, R. Godaliyadda, P. Ekanayake, V. Herath, and N. Wasalathilaka, "Mamba-FCS: Joint spatio-frequency feature fusion and SEK-inspired loss for semantic change detection," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 19, pp. 7680–7698, 2026.
  10. [10] W. M. B. S. K. Wijenayake et al., "Precision spatio-temporal feature fusion for robust remote sensing change detection," in 2025 IEEE 19th International Conference on Industrial and Information Systems (ICIIS), 2026, pp. 557–562.
  11. [11] Y. Liu, H. Yu, L. Xie, Y. Tian, Y. Zhao, Y. Wang, Q. Ye, J. Jiao, and Y. Liu, "VMamba: Visual state space model," arXiv preprint arXiv:2401.10166, 2024.
  12. [12] A. Hatamizadeh and J. Kautz, "MambaVision: A hybrid Mamba-Transformer vision backbone," arXiv preprint arXiv:2407.08083, 2024.
  13. [13] C. Xiao, M. Li, Z. Zhang, D. Meng, and L. Zhang, "Spatial-Mamba: Effective visual state space models via structure-aware state fusion," in International Conference on Learning Representations (ICLR), 2025.
  14. [14] J. Wang, Z. Zheng, A. Ma, X. Lu, and Y. Zhong, "LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation," in Proceedings of the Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks, vol. 1, 2021.
  15. [15] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer, 2015, pp. 234–241.
  16. [16] F. Rottensteiner, G. Sohn, J. Jung, M. Gerke, C. Baillard, S. Benitez, and U. Breitkopf, "The ISPRS benchmark on urban object classification and 3D building reconstruction," ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 1, no. 1, pp. 293–298, 2012.
  17. [17] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in International Conference on Learning Representations (ICLR), 2019.
  18. [18] M. Berman, A. R. Triki, and M. B. Blaschko, "The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4413–4421.
  19. [19] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980–2988.