Transformer-Progressive Mamba Network for Lightweight Image Super-Resolution
Pith reviewed 2026-05-18 01:38 UTC · model grok-4.3
The pith
Integrating window self-attention with progressive Mamba enables scale interactions that improve lightweight image super-resolution without added cost.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By integrating window-based self-attention with Progressive Mamba, the method establishes a fine-grained modeling paradigm that progressively enhances feature representation through interactions among receptive fields of different scales without introducing additional computational cost. The Adaptive High-Frequency Refinement Module recovers high-frequency details lost during Transformer and Mamba processing. This yields better performance than recent Transformer- or Mamba-based methods while incurring lower computational cost.
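As a rough illustration of the window-based half of this design, the sketch below partitions a feature grid into non-overlapping windows and counts query-key pairs with and without windowing. It follows the generic Swin-style scheme rather than the paper's code. The names `window_partition` and `attention_pairs` are hypothetical, and Progressive Mamba itself is not modeled here.

```python
def window_partition(grid, window_size):
    """Split an H x W grid (a list of rows) into non-overlapping square windows."""
    height, width = len(grid), len(grid[0])
    if height % window_size or width % window_size:
        raise ValueError("grid dimensions must be divisible by window_size")
    windows = []
    for top in range(0, height, window_size):
        for left in range(0, width, window_size):
            windows.append([row[left:left + window_size]
                            for row in grid[top:top + window_size]])
    return windows


def attention_pairs(height, width, window_size=None):
    """Count query-key pairs: global attention if window_size is None, else windowed."""
    tokens = height * width
    if window_size is None:
        return tokens * tokens
    tokens_per_window = window_size * window_size
    return (tokens // tokens_per_window) * tokens_per_window ** 2


grid = [[r * 4 + c for c in range(4)] for r in range(4)]  # toy 4x4 "feature map"
windows = window_partition(grid, 2)
print(len(windows), windows[0])  # 4 windows; the first is [[0, 1], [4, 5]]
print(attention_pairs(4, 4))     # 256 pairs for global attention
print(attention_pairs(4, 4, 2))  # 64 pairs when attention stays inside 2x2 windows
```

The pair count grows quadratically with the token count under global attention but only linearly once the window size is fixed, which is why windowing is the usual starting point for lightweight designs.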
What carries the argument
Progressive Mamba, which creates progressive interactions among receptive fields at different scales to enhance feature representation without added cost.
If this is right
- Receptive fields expand progressively across network layers through scale interactions.
- Feature expressiveness grows without raising overall computational cost.
- High-frequency image details are restored more effectively after main processing.
- The network achieves higher super-resolution quality than recent Transformer or Mamba baselines.
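To make "high-frequency details" concrete, here is a generic sketch, assuming a simple box filter, that treats the high-frequency part of a 1D signal as whatever a low-pass filter removes. This is not the paper's AHFRM, whose internals are not described on this page. The function names are hypothetical.

```python
def moving_average(signal, radius=1):
    """Low-pass a 1D signal with a box filter, clamping indices at the borders."""
    last = len(signal) - 1
    smoothed = []
    for i in range(len(signal)):
        neighbors = [signal[min(max(j, 0), last)]
                     for j in range(i - radius, i + radius + 1)]
        smoothed.append(sum(neighbors) / len(neighbors))
    return smoothed


def high_frequency_residual(signal, radius=1):
    """Return signal minus its low-pass version; large values mark edges and fine detail."""
    return [x - s for x, s in zip(signal, moving_average(signal, radius))]


flat = [5.0, 5.0, 5.0, 5.0]
step = [0.0, 0.0, 3.0, 3.0]
print(high_frequency_residual(flat))  # [0.0, 0.0, 0.0, 0.0]: a flat region has no detail
print(high_frequency_residual(step))  # [0.0, -1.0, 1.0, 0.0]: nonzero only at the edge
```

A refinement module in this spirit would try to restore such residuals when earlier stages have smoothed them away.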
Where Pith is reading between the lines
- The scale-interaction pattern could transfer to other efficient vision tasks such as denoising or deblurring.
- Real-time super-resolution on mobile hardware might become practical with this efficiency gain.
- Further tests on video sequences could show whether temporal consistency benefits from the same progressive mechanism.
Load-bearing premise
That existing Mamba-based methods lack fine-grained scale transitions, and that window self-attention combined with Progressive Mamba plus high-frequency refinement can deliver better results at no extra computational cost.
What would settle it
If, on standard SR benchmarks such as Set5 or DIV2K, T-PMambaSR fails to exceed the PSNR or SSIM of recent Mamba-based SR networks at equal or lower FLOPs, the central performance claim would be disproven.
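The test above can be sketched in a few lines. The block below is a minimal illustration, assuming 8-bit pixel values and flat pixel sequences. The paper's exact evaluation protocol (color channel, border cropping) is not stated on this page. The helper `claim_survives` and its dictionary keys are hypothetical.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def claim_survives(ours, baseline):
    """True if `ours` has higher PSNR at no more FLOPs than `baseline`.

    Both arguments are dicts with hypothetical keys "psnr" and "flops".
    """
    return ours["psnr"] > baseline["psnr"] and ours["flops"] <= baseline["flops"]


print(round(psnr([100, 100], [110, 90]), 2))  # MSE = 100, so about 28.13 dB
```

A full check would repeat the comparison for SSIM and for each benchmark and scale factor.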
Original abstract
Recently, Mamba-based super-resolution (SR) methods have demonstrated the ability to capture global receptive fields with linear complexity, addressing the quadratic computational cost of Transformer-based SR approaches. However, existing Mamba-based methods lack fine-grained transitions across different modeling scales, which limits the efficiency of feature representation. In this paper, we propose T-PMambaSR, a lightweight SR framework that integrates window-based self-attention with Progressive Mamba. By enabling interactions among receptive fields of different scales, our method establishes a fine-grained modeling paradigm that progressively enhances feature representation without introducing additional computational cost. Furthermore, we introduce an Adaptive High-Frequency Refinement Module (AHFRM) to recover high-frequency details lost during Transformer and Mamba processing. Extensive experiments demonstrate that T-PMambaSR progressively enhances the model's receptive field and expressiveness, yielding better performance than recent Transformer- or Mamba-based methods while incurring lower computational cost.
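For orientation, the complexity contrast the abstract draws can be written out with the standard cost estimates used for Swin-style attention. The symbols are introduced here for illustration only: an h × w feature map with C channels, window size M, and state dimension D_s. The paper's own analysis in §3.3 may use different constants.

```latex
\begin{align}
\mathrm{Cost}(\text{global self-attention}) &= 4hwC^{2} + 2(hw)^{2}C, \\
\mathrm{Cost}(\text{window self-attention}) &= 4hwC^{2} + 2M^{2}hwC, \\
\mathrm{Cost}(\text{selective state-space scan}) &= O(hw \cdot C \cdot D_s).
\end{align}
```

Only the first expression is quadratic in the number of tokens hw. The windowed and state-space costs are linear in hw once M and D_s are fixed.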
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes T-PMambaSR, a lightweight image super-resolution framework integrating window-based self-attention with Progressive Mamba to enable fine-grained interactions among receptive fields of different scales. This is claimed to progressively enhance feature representation without additional computational cost. An Adaptive High-Frequency Refinement Module (AHFRM) is introduced to recover high-frequency details lost during processing. Extensive experiments are reported to demonstrate superior performance over recent Transformer- and Mamba-based SR methods while incurring lower computational cost.
Significance. If the efficiency and performance claims hold under rigorous verification, the work offers a meaningful advance in lightweight SR by establishing a cross-scale modeling paradigm that combines Transformer locality with Mamba's linear complexity. Credit is given for the explicit focus on zero-overhead scale interactions and the empirical comparisons on standard benchmarks, which provide falsifiable performance predictions.
major comments (1)
- [Abstract and §3.3] The central claim that Progressive Mamba integration with window self-attention enables scale interactions 'without introducing additional computational cost' is load-bearing for the lightweight positioning. The complexity analysis must explicitly derive or measure the overhead of progressive state updates and AHFRM adaptation steps; if these introduce hidden FLOPs or memory traffic not reflected in the reported tables, the attribution of gains to the fine-grained paradigm is weakened.
minor comments (3)
- [§4.1] Ensure all dataset splits and training protocols (e.g., patch sizes, augmentation) are stated with sufficient detail for reproducibility, including any differences from prior Mamba-SR baselines.
- [Figure 2] The network diagram would benefit from explicit annotation of the data flow between window-attention and Progressive Mamba blocks to clarify the claimed scale interactions.
- [Table 1] Add standard deviation or multiple-run statistics to PSNR/SSIM entries if single-run results are reported, to strengthen the performance superiority claims.
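The multiple-run statistics requested for Table 1 could be reported along these lines. This is a minimal sketch using only the standard library. The helper name `summarize_runs` is hypothetical and the scores are placeholders, not results from the paper.

```python
import statistics


def summarize_runs(scores):
    """Return (mean, sample standard deviation) of a metric over repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)


# Placeholder PSNR values (dB) for three seeds; NOT results from the paper.
placeholder_psnr = [32.0, 32.5, 33.0]
mean, std = summarize_runs(placeholder_psnr)
print(f"{mean:.2f} ± {std:.2f} dB")  # 32.50 ± 0.50 dB
```

Reporting the spread alongside the mean shows whether a gap between two methods is larger than seed-to-seed variation.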
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the major comment point by point below and will revise the paper to strengthen the complexity analysis as requested.
- Referee: [Abstract and §3.3] The central claim that Progressive Mamba integration with window self-attention enables scale interactions 'without introducing additional computational cost' is load-bearing for the lightweight positioning. The complexity analysis must explicitly derive or measure the overhead of progressive state updates and AHFRM adaptation steps; if these introduce hidden FLOPs or memory traffic not reflected in the reported tables, the attribution of gains to the fine-grained paradigm is weakened.
Authors: We agree that an explicit derivation and measurement of overhead is necessary to rigorously support the lightweight claims. In the revised manuscript, we will expand the complexity analysis in §3.3 with detailed equations deriving the FLOPs for Progressive Mamba state updates (showing that cross-scale interactions reuse the same linear-complexity SSM transitions without extra matrix operations) and for AHFRM (which employs parameter-efficient adaptive filtering with O(1) overhead relative to the backbone). We will also add empirical measurements of runtime and peak memory on standard benchmarks to rule out hidden costs from memory traffic. This revision will clarify that the reported gains are attributable to the fine-grained paradigm while preserving the overall complexity profile.
Revision: yes
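The runtime and peak-memory measurements promised in the rebuttal could take roughly the following shape. This is a standard-library sketch under stated assumptions: `toy_workload` is a hypothetical stand-in for a model forward pass, and `tracemalloc` only sees Python-managed allocations, so GPU memory would need the deep learning framework's own profiler.

```python
import time
import tracemalloc


def profile(fn, *args, repeats=5):
    """Return (best wall-clock seconds, peak traced bytes) for calling fn(*args)."""
    best = float("inf")
    tracemalloc.start()
    try:
        for _ in range(repeats):
            start = time.perf_counter()
            fn(*args)
            best = min(best, time.perf_counter() - start)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return best, peak


def toy_workload(n):
    """Stand-in for a model forward pass (hypothetical; allocates a list of n squares)."""
    return sum([i * i for i in range(n)])


seconds, peak_bytes = profile(toy_workload, 100_000)
print(f"best of 5: {seconds:.6f} s, peak traced memory: {peak_bytes} bytes")
```

Tracing allocations slows execution, so timing and memory are better measured in separate passes when precise latency numbers matter.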
Circularity Check
No significant circularity; new architecture integration with empirical validation
The paper proposes T-PMambaSR as an integration of window-based self-attention and Progressive Mamba plus a new AHFRM module to enable cross-scale receptive field interactions and high-frequency recovery. No equations or sections reduce the claimed fine-grained paradigm or zero-cost property to a self-definition, fitted parameter, or self-citation chain; the central claims rest on the architectural design choices and reported experimental comparisons rather than any input being renamed or forced as output. The derivation chain is therefore self-contained as a standard model-construction-plus-validation process.
Axiom & Free-Parameter Ledger
free parameters (1)
- network design hyperparameters
axioms (1)
- domain assumption: Mamba-based methods can capture global receptive fields with linear complexity, while Transformers incur quadratic cost
invented entities (2)
- Progressive Mamba: no independent evidence
- Adaptive High-Frequency Refinement Module (AHFRM): no independent evidence