A Controlled Benchmark of Visual State-Space Backbones with Domain-Shift and Boundary Analysis for Remote-Sensing Segmentation
Pith reviewed 2026-05-10 03:05 UTC · model grok-4.3
The pith
Visual state-space models show favorable efficiency in remote-sensing segmentation, yet they gain little from encoder scaling, and boundary errors dominate under domain shift.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under a unified 4-stage feature interface and fixed lightweight decoder, visual SSM backbones achieve a better accuracy-efficiency balance than controlled CNN and Transformer baselines. However, increasing encoder size within each SSM family produces only modest segmentation gains, cross-domain performance is strongly asymmetric, and boundary errors dominate under distribution shift, suggesting that robustness-oriented design and boundary-aware decoding will drive future improvements more than encoder scaling.
What carries the argument
The strictly controlled experimental setup with a unified 4-stage feature interface and fixed lightweight decoder that isolates the effects of different visual state-space encoders.
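To make that control concrete, here is a minimal PyTorch sketch of how such a setup could be wired. Everything named here (`StageAdapter`, `LightweightDecoder`, the 128-channel width, the 7-class head) is an illustrative assumption, not the paper's reported implementation; the only structural commitment is that the encoder is the sole component that varies.

```python
# Hypothetical wiring of a unified 4-stage interface and a fixed lightweight
# decoder. Channel widths, stage strides, and the head design are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class StageAdapter(nn.Module):
    """Projects one encoder stage to a common channel width with a 1x1 conv.

    Note: this normalizes only channel count (and, via the encoder's fixed
    stage strides, resolution); activation statistics stay encoder-specific.
    """

    def __init__(self, in_ch: int, out_ch: int = 128):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


class LightweightDecoder(nn.Module):
    """Fixed FPN-style head: upsample all stages to 1/4 scale, sum, classify."""

    def __init__(self, width: int = 128, num_classes: int = 7):
        super().__init__()
        self.classify = nn.Conv2d(width, num_classes, kernel_size=1)

    def forward(self, feats: list) -> torch.Tensor:
        target = feats[0].shape[-2:]  # highest-resolution stage (stride 4)
        fused = sum(
            F.interpolate(f, size=target, mode="bilinear", align_corners=False)
            for f in feats
        )
        return self.classify(fused)


class BenchmarkSegmenter(nn.Module):
    """Only `encoder` varies across experiments; adapters and head are fixed."""

    def __init__(self, encoder: nn.Module, stage_channels: list, num_classes: int = 7):
        super().__init__()
        # `encoder` must return 4 feature maps at strides 4/8/16/32.
        self.encoder = encoder
        self.adapters = nn.ModuleList(StageAdapter(c) for c in stage_channels)
        self.decoder = LightweightDecoder(num_classes=num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = [a(f) for a, f in zip(self.adapters, self.encoder(x))]
        logits = self.decoder(feats)
        return F.interpolate(logits, size=x.shape[-2:], mode="bilinear",
                             align_corners=False)
```

Fixing the adapters and head across runs is what makes the comparison controlled, but, as the referee report below notes, anything the 1x1 projections do not normalize still passes through to the decoder.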
If this is right
- Intra-family scaling of visual SSM encoders yields only modest gains in segmentation accuracy.
- Cross-domain generalization exhibits strong asymmetry between the evaluated remote-sensing datasets.
- Boundary delineation errors constitute the dominant failure mode under distribution shift (one standard way to quantify this is sketched after this list).
- Visual SSM backbones deliver favorable accuracy-efficiency trade-offs relative to the CNN and Transformer controls.
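On the boundary claim above, a minimal sketch of a standard boundary F-score, which matches thin contour bands within a pixel tolerance, is given below. It illustrates how boundary-dominated error could be measured; it is not necessarily the paper's exact metric, and `tol` is an assumed tolerance.

```python
# Minimal boundary F-score sketch (contour bands matched within a pixel
# tolerance). Illustrative only; the paper's boundary metric may differ.
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion


def contour(mask: np.ndarray) -> np.ndarray:
    """One-pixel-wide contour of a boolean mask."""
    return mask & ~binary_erosion(mask)


def boundary_f1(pred: np.ndarray, gt: np.ndarray, cls: int, tol: int = 2) -> float:
    """Boundary F-score for one class on a single pair of label maps."""
    pc, gc = contour(pred == cls), contour(gt == cls)
    if not pc.any() and not gc.any():
        return 1.0  # class absent in both: perfect by convention
    if not pc.any() or not gc.any():
        return 0.0
    precision = (pc & binary_dilation(gc, iterations=tol)).sum() / pc.sum()
    recall = (gc & binary_dilation(pc, iterations=tol)).sum() / gc.sum()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Reporting this alongside mIoU would make "boundary errors dominate" directly checkable: a backbone whose mIoU drops modestly under shift while its boundary F-score collapses fits the paper's described failure mode.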
Where Pith is reading between the lines
- Adding boundary-aware components to the decoder could reduce the main source of errors identified in the study.
- Reversing the domain shift direction in tests might confirm whether the asymmetry is inherent to specific dataset pairs (a protocol sketch follows this list).
- The observed efficiency could support real-time segmentation on resource-limited platforms for satellite image analysis.
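A minimal sketch of the symmetric protocol mentioned in the second bullet, assuming hypothetical `train_model` and `evaluate_miou` helpers (neither is from the paper):

```python
# Sketch of a symmetric cross-domain protocol: train on each domain, test on
# both, and compare the two transfer directions. `train_model` and
# `evaluate_miou` are hypothetical placeholders, not APIs from the paper.


def cross_domain_matrix(datasets: dict, train_model, evaluate_miou) -> dict:
    """mIoU for every (train_domain, test_domain) pair."""
    models = {name: train_model(ds["train"]) for name, ds in datasets.items()}
    return {(src, tgt): evaluate_miou(models[src], datasets[tgt]["val"])
            for src in datasets for tgt in datasets}


def transfer_gaps(scores: dict, a: str, b: str) -> dict:
    """In-domain minus cross-domain mIoU per direction; unequal gaps = asymmetry."""
    return {f"{a}->{b}": scores[(a, a)] - scores[(a, b)],
            f"{b}->{a}": scores[(b, b)] - scores[(b, a)]}
```

For the paper's setting, `a` and `b` would be LoveDA and Potsdam; a large difference between the two gap values is precisely the asymmetry the findings describe.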
Load-bearing premise
A single fixed lightweight decoder and unified 4-stage feature interface are sufficient to isolate the effects of the encoders without decoder-specific biases altering the scaling, asymmetry, or boundary patterns.
What would settle it
Repeating the benchmark with multiple different decoder designs and observing whether the modest scaling gains, asymmetric generalization, and boundary dominance persist would determine if the encoder isolation is valid.
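A hedged sketch of what that decisive experiment could look like: hold each encoder fixed, swap in several decoder heads, and check whether the encoder ordering survives. `build_encoder`, `build_decoder`, and `train_and_eval` are hypothetical placeholders, not components reported in the paper.

```python
# If the encoder ranking (and the scaling/asymmetry/boundary trends) is stable
# across decoder choices, the encoder-isolation claim gains support; if the
# ranking reshuffles, decoder-encoder compatibility is confounding the results.
from scipy.stats import kendalltau


def encoder_ranking_stability(encoders, decoders, build_encoder, build_decoder,
                              train_and_eval) -> float:
    """Mean Kendall tau between encoder score orderings under each decoder."""
    per_decoder = [[train_and_eval(build_encoder(e), build_decoder(d))
                    for e in encoders] for d in decoders]
    taus = [kendalltau(per_decoder[0], other)[0] for other in per_decoder[1:]]
    return sum(taus) / len(taus)  # near 1.0 => ordering is decoder-robust
```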
Original abstract
Visual state-space models (SSMs) are increasingly promoted as efficient alternatives to Vision Transformers, yet their practical advantages remain unclear under fair comparison because existing studies rarely isolate encoder effects from decoder and training choices. We present a strictly controlled benchmark of representative visual SSM families, including VMamba, MambaVision, and Spatial-Mamba, for remote-sensing semantic segmentation, in which only the encoder varies across experiments. Evaluated on LoveDA and ISPRS Potsdam under a unified 4-stage feature interface and a fixed lightweight decoder, the benchmark reveals three main findings: intra-family scaling yields only modest gains; cross-domain generalization is strongly asymmetric; and boundary delineation is the dominant failure mode under distribution shift. Although visual SSMs achieve favorable accuracy-efficiency trade-offs relative to the controlled CNN and Transformer baselines considered here, the results suggest that future improvements are more likely to come from robustness-oriented design and boundary-aware decoding than from encoder scaling alone. By isolating encoder behavior under a unified and reproducible protocol, this study establishes a practical reference benchmark for the design and evaluation of future Mamba-based segmentation backbones.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a controlled empirical benchmark of visual state-space model (SSM) backbones (VMamba, MambaVision, Spatial-Mamba) against CNN and Transformer baselines for semantic segmentation on the LoveDA and ISPRS Potsdam remote-sensing datasets. With only the encoder varying under a fixed lightweight decoder and unified 4-stage feature interface, the study reports three main findings: intra-family scaling produces only modest accuracy gains, cross-domain generalization is strongly asymmetric, and boundary delineation errors dominate under distribution shift. The authors conclude that visual SSMs offer favorable accuracy-efficiency trade-offs relative to the baselines but that future progress is more likely to arise from robustness-oriented design and boundary-aware decoding than from encoder scaling alone, while establishing a reproducible protocol for such evaluations.
Significance. If the encoder isolation holds, the work supplies a practical, reproducible reference benchmark for visual SSMs in remote-sensing segmentation, a setting where domain shift is prevalent. It usefully directs attention away from pure scaling toward boundary handling and robustness, which aligns with observed failure modes in the experiments. The emphasis on a unified interface and controlled comparison is a constructive contribution to the empirical literature on efficient vision backbones.
major comments (1)
- [Experimental Setup / Methods] The headline claim that intra-family scaling yields only modest gains and that robustness/boundary-aware decoding will matter more than encoder scaling rests on the assumption that the fixed lightweight decoder and unified 4-stage feature interface cleanly isolate encoder effects. The interface normalizes spatial resolution and channel count but leaves higher-order statistics (activation distributions, long-range dependency patterns) unnormalized across SSM, CNN, and Transformer families. No ablation that varies the decoder or interface while holding encoders fixed is described, so decoder-encoder compatibility biases cannot be ruled out as contributors to the observed modest scaling, asymmetry, and boundary-failure patterns. This is load-bearing for the central conclusions (abstract and §3–4).
minor comments (2)
- The abstract and results sections would benefit from explicit reporting of the number of training runs, whether error bars or statistical significance tests accompany the 'modest gains' and 'strongly asymmetric' statements, and the precise channel/resolution normalization steps in the 4-stage interface (a minimal reporting sketch follows this list).
- Figure captions and tables should clarify which metrics are reported on the source versus target domains for the cross-domain experiments to make the asymmetry claim immediately verifiable.
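On the first minor comment, a minimal sketch of the requested reporting: mean and standard deviation over seeds plus a paired test on per-seed scores. The numbers below are invented placeholders, not results from the paper.

```python
# Mean +/- std over seeds and a paired t-test on per-seed mIoU.
# All numbers here are illustrative placeholders, not the paper's results.
import numpy as np
from scipy.stats import ttest_rel


def summarize(scores) -> str:
    a = np.asarray(scores, dtype=float)
    return f"{a.mean():.2f} +/- {a.std(ddof=1):.2f} (n={a.size})"


# Hypothetical per-seed mIoU for two encoder sizes within one SSM family.
small = [51.2, 50.8, 51.5, 50.9, 51.1]
large = [51.9, 51.3, 52.0, 51.4, 51.6]
stat, pvalue = ttest_rel(large, small)
print(summarize(small), "vs", summarize(large), f"paired t-test p={pvalue:.3f}")
```

This is the cheapest way to substantiate "modest gains": if the paired test is inconclusive across seeds, the scaling claim should be stated as such.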
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the manuscript's contribution. We address the single major comment below with an honest accounting of the experimental design choices and planned revisions.
Point-by-point responses
- Referee: [Experimental Setup / Methods] As stated in the major comment above: the fixed lightweight decoder and unified 4-stage feature interface are assumed to cleanly isolate encoder effects, yet the interface normalizes only spatial resolution and channel count, and no decoder or interface ablation is described, so decoder-encoder compatibility biases cannot be ruled out.
Authors: We agree that the 4-stage interface normalizes only spatial resolution and channel count, leaving higher-order statistics unnormalized, and that decoder-encoder compatibility effects cannot be fully ruled out without additional ablations. This is a genuine limitation of the current experimental design. The study deliberately fixes the decoder and interface to create a reproducible, practical benchmark that isolates encoder variation under a common lightweight head, following the protocol used in many backbone-comparison papers. No decoder-variation ablation is described because it would shift the focus away from the encoder-centric question. We will make a partial revision by adding an explicit limitations paragraph in §4 (and updating the abstract and conclusions) that acknowledges residual family-specific biases, clarifies the scope of the isolation claim, and notes that the observed trends (modest scaling, asymmetric generalization, boundary dominance) hold under the reported controlled decoder setup. We will not add new decoder ablations, as they fall outside the original scope.
Revision: partial
Circularity Check
No significant circularity: purely empirical benchmark with no derivations
Full rationale
The paper presents a controlled experimental benchmark of visual SSM encoders for remote-sensing segmentation, with findings drawn directly from accuracy, efficiency, and boundary metrics on LoveDA and ISPRS Potsdam under a fixed protocol. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential definitions appear in the abstract or described methodology. Central claims (modest intra-family scaling, asymmetric generalization, boundary failures) are observational results rather than reductions to inputs by construction. The fixed decoder and 4-stage interface constitute an experimental design choice whose validity can be debated on empirical grounds but does not create circularity in any derivation chain. This is the expected outcome for a benchmark study whose value lies in reproducible measurements, not in a claimed proof or predictive model.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: LoveDA and ISPRS Potsdam are sufficiently representative benchmarks for assessing remote-sensing semantic segmentation under domain shift.
- Domain assumption: A fixed lightweight decoder and 4-stage feature interface do not interact with encoder choice in ways that confound the reported findings.
Reference graph
Works this paper leans on
[1] N. Aburaed, M. Al-Saad, M. Alkhatib, M. Zitouni, S. Almansoori, and H. Al-Ahmad, "Semantic segmentation of remote sensing imagery using an enhanced encoder-decoder architecture," ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 10, pp. 1015–1020, 2023.
[2] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam, "Encoder-decoder with atrous separable convolution for semantic image segmentation," in Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, vol. 30, 2017.
[4] L. Wang, R. Li, C. Zhang, S. Fang, C. Duan, X. Meng, and P. M. Atkinson, "UNetFormer: A UNet-like transformer for efficient semantic segmentation of remote sensing urban scene imagery," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 190, pp. 196–214, 2022.
[5] R. Ratnayake, W. Wijenayake, D. Sumanasekara, G. Godaliyadda, H. Herath, and M. Ekanayake, "Enhanced SCanNet with CBAM and dice loss for semantic change detection," in 2025 Moratuwa Engineering Research Conference (MERCon), 2025, pp. 84–89.
[6] A. Gu and T. Dao, "Mamba: Linear-time sequence modeling with selective state spaces," arXiv preprint arXiv:2312.00752, 2023.
[7] X. Ma, X. Zhang, and C. Man, "RS-Mamba: Large remote sensing image semantic segmentation via bidirectional state space model," arXiv preprint arXiv:2404.02668, 2024.
[8] Q. Zhu, Y. Cai, Y. Fang, Y. Yang, C. Hartmann, and L. Zhao, "Samba: Semantic segmentation of remote sensing images with state space model," IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–15, 2024.
[9] B. Wijenayake, A. Ratnayake, P. Sumanasekara, R. Godaliyadda, P. Ekanayake, V. Herath, and N. Wasalathilaka, "Mamba-FCS: Joint spatio-frequency feature fusion and SEK-inspired loss for semantic change detection," IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, vol. 19, pp. 7680–7698, 2026.
[10] W. M. B. S. K. Wijenayake et al., "Precision spatio-temporal feature fusion for robust remote sensing change detection," in 2025 IEEE 19th International Conference on Industrial and Information Systems (ICIIS), 2026, pp. 557–562.
[11] Y. Liu, H. Yu, L. Xie, Y. Tian, Y. Zhao, Y. Wang, Q. Ye, J. Jiao, and Y. Liu, "VMamba: Visual state space model," arXiv preprint arXiv:2401.10166, 2024.
[12] A. Hatamizadeh and J. Kautz, "MambaVision: A hybrid Mamba-Transformer vision backbone," arXiv preprint arXiv:2407.08083, 2024.
[13] C. Xiao, M. Li, Z. Zhang, D. Meng, and L. Zhang, "Spatial-Mamba: Effective visual state space models via structure-aware state fusion," in International Conference on Learning Representations (ICLR), 2025.
[14] J. Wang, Z. Zheng, A. Ma, X. Lu, and Y. Zhong, "LoveDA: A remote sensing land-cover dataset for domain adaptive semantic segmentation," in Proceedings of the Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks, vol. 1, 2021.
[15] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer, 2015, pp. 234–241.
[16] F. Rottensteiner, G. Sohn, J. Jung, M. Gerke, C. Baillard, S. Benitez, and U. Breitkopf, "The ISPRS benchmark on urban object classification and 3D building reconstruction," ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. 1, no. 1, pp. 293–298, 2012.
[17] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in International Conference on Learning Representations (ICLR), 2019.
[18] M. Berman, A. R. Triki, and M. B. Blaschko, "The Lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4413–4421.
[19] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, "Focal loss for dense object detection," in Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017, pp. 2980–2988.