
arxiv: 2603.11680 · v2 · submitted 2026-03-12 · 💻 cs.CV

UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution

Pith reviewed 2026-05-15 12:27 UTC · model grok-4.3

classification 💻 cs.CV
keywords super-resolution · lightweight network · convolutional attention · receptive field · image restoration · attention mechanism · parameter sharing · high-frequency preservation

The pith

UCAN unifies window attention, Hedgehog Attention, and distilled large kernels with cross-layer sharing to expand receptive fields efficiently in lightweight super-resolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Hybrid CNN-Transformer models produce good image super-resolution results but grow expensive when attention windows or convolution kernels are enlarged to widen the receptive field. UCAN addresses this by blending window-based spatial attention with Hedgehog Attention to capture both local textures and long-range dependencies. It adds a distillation-based large-kernel module to retain high-frequency details at low cost, and uses cross-layer parameter sharing to cut complexity further. The paper reports 31.63 dB PSNR on Manga109 at 4x with 48.4G MACs for UCAN-L and 27.79 dB on BSDS100, ahead of recent lightweight models and of some significantly larger ones. A reader would care because the design targets practical deployment on resource-limited devices for high-resolution image restoration. The paper claims this combination yields a superior trade-off between accuracy, efficiency, and scalability.
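To make the cost argument concrete, here is a rough sketch (not taken from the paper) of how multiply-accumulate counts scale with kernel size for a dense convolution versus a pair of depthwise 1 x k and k x 1 strips. The function names, feature-map size, and channel count are hypothetical, and the strip decomposition is only one common way to build large separable kernels; UCAN's actual module may differ.

```python
def conv_macs(height, width, c_in, c_out, kernel):
    """MACs of a dense kernel x kernel convolution (stride 1, same padding)."""
    return height * width * c_in * c_out * kernel * kernel


def separable_strip_macs(height, width, channels, kernel):
    """MACs of a depthwise 1 x k conv followed by a depthwise k x 1 conv."""
    return 2 * height * width * channels * kernel


if __name__ == "__main__":
    h, w, c = 64, 64, 48  # hypothetical feature-map size and channel width
    for k in (3, 7, 13):
        dense = conv_macs(h, w, c, c, k)
        strips = separable_strip_macs(h, w, c, k)
        print(f"k={k:2d}  dense={dense / 1e6:8.1f}M  strips={strips / 1e6:6.2f}M")
```

The dense cost grows with the square of the kernel size and with the product of input and output channels, while the depthwise strips grow linearly in the kernel size and in the channel count. That difference is the general reason large receptive fields are usually built from separable or dilated pieces.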

Core claim

UCAN establishes that a lightweight network can expand its effective receptive field by unifying convolution and attention. It does so with three parts: window-based spatial attention combined with a Hedgehog Attention mechanism for local and long-range modeling, a distillation-based large-kernel module that preserves high-frequency structure without heavy computation, and cross-layer parameter sharing that reduces overall complexity. The reported result is higher PSNR on standard super-resolution benchmarks than recent lightweight models at lower MAC counts.
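As a generic illustration of why cross-layer sharing reduces model size, the sketch below counts the parameters of a stack of linear projections when each projection is reused by several consecutive layers. This is an assumption-laden toy: the names `projection_params`, `stack_params`, and `share_every` are hypothetical, and the text here does not specify which tensors UCAN actually shares between its Shared and Received Hybrid Attention blocks.

```python
def projection_params(dim):
    """Weights plus biases of one dim -> dim linear projection."""
    return dim * dim + dim


def stack_params(num_layers, dim, share_every=1):
    """Parameter count when each projection serves `share_every` consecutive layers.

    share_every=1 means no sharing; share_every=2 means layers (0, 1) reuse one
    projection, layers (2, 3) reuse the next, and so on.
    """
    owners = -(-num_layers // share_every)  # ceil division: layers that own weights
    return owners * projection_params(dim)


if __name__ == "__main__":
    for share in (1, 2, 4):
        print(f"share_every={share}: {stack_params(8, 48, share)} parameters")
```

Reusing weights alone lowers the parameter count but not the MACs, since each layer still applies the projection. Compute only drops when an intermediate result, such as an attention map, is itself reused by the receiving layer.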

What carries the argument

The Hedgehog Attention mechanism paired with window-based spatial attention, a distillation-based large-kernel module, and cross-layer parameter sharing, which together model local texture and long-range dependencies while keeping computation low.
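For readers unfamiliar with linear attention, the following is a minimal pure-Python sketch of the general idea behind feature-map attention. It assumes a feature map of the paired form [exp(Wx), exp(-Wx)], which loosely follows the positive/negative construction associated with Hedgehog-style feature maps; it is not UCAN's exact definition, and `feature_map`, `linear_attention`, `quadratic_attention`, and `weights` are hypothetical names.

```python
import math


def feature_map(x, weights):
    """Map a vector to positive features [exp(Wx), exp(-Wx)] (assumed form)."""
    proj = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return [math.exp(p) for p in proj] + [math.exp(-p) for p in proj]


def linear_attention(queries, keys, values, weights):
    """out_i = phi(q_i)^T S / phi(q_i)^T z, with S = sum_j phi(k_j) v_j^T."""
    phi_k = [feature_map(k, weights) for k in keys]
    feat_dim, val_dim = len(phi_k[0]), len(values[0])
    # Key/value summary, built once: cost is linear in the sequence length.
    summary = [[sum(pk[f] * v[c] for pk, v in zip(phi_k, values))
                for c in range(val_dim)] for f in range(feat_dim)]
    norm = [sum(pk[f] for pk in phi_k) for f in range(feat_dim)]
    outputs = []
    for q in queries:
        phi_q = feature_map(q, weights)
        denom = sum(a * b for a, b in zip(phi_q, norm))
        outputs.append([sum(phi_q[f] * summary[f][c] for f in range(feat_dim)) / denom
                        for c in range(val_dim)])
    return outputs


def quadratic_attention(queries, keys, values, weights):
    """Same attention weights computed pairwise in O(N^2), as a reference."""
    outputs = []
    for q in queries:
        phi_q = feature_map(q, weights)
        scores = [sum(a * b for a, b in zip(phi_q, feature_map(k, weights)))
                  for k in keys]
        total = sum(scores)
        outputs.append([sum(s * v[c] for s, v in zip(scores, values)) / total
                        for c in range(len(values[0]))])
    return outputs
```

Because the features are strictly positive, every output row is a weighted average of the value rows. The linear form reaches the same result as the pairwise form without materializing the N x N score matrix, which is what makes a global receptive field affordable.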

If this is right

  • UCAN-L reaches 31.63 dB PSNR on Manga109 at 4x scale using only 48.4G MACs, exceeding recent lightweight models.
  • UCAN attains 27.79 dB on BSDS100 while outperforming methods that employ significantly larger models.
  • The design maintains a superior trade-off among accuracy, efficiency, and scalability for image restoration.
  • Cross-layer sharing and the unified attention-convolution approach keep the model suitable for resource-constrained devices.
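The PSNR figures quoted above follow the standard definition, 10 log10(peak^2 / MSE). A minimal sketch is given below, assuming 8-bit images stored as nested lists. Super-resolution papers commonly evaluate on the luminance channel after cropping a border; whether UCAN does so is not stated here, so that step is omitted.

```python
import math


def psnr(reference, estimate, peak=255.0):
    """PSNR in dB between two equally sized images given as nested lists."""
    ref_flat = [p for row in reference for p in row]
    est_flat = [p for row in estimate for p in row]
    if len(ref_flat) != len(est_flat):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - e) ** 2 for r, e in zip(ref_flat, est_flat)) / len(ref_flat)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

Since the scale is logarithmic, small dB differences between models correspond to small relative changes in mean squared error, which is one reason run-to-run variance matters when comparing close results.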

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same unification pattern could be tested on related low-level tasks such as denoising or deblurring where receptive-field size directly affects detail recovery.
  • Parameter sharing across layers might reduce model size in other attention-heavy vision networks beyond super-resolution.
  • If the efficiency holds on real-world noisy or compressed images, the network could support on-device upscaling in mobile applications.
  • Scaling the approach to higher upscaling factors like 8x would test whether the receptive-field gains remain effective without additional cost.

Load-bearing premise

That the Hedgehog Attention, distillation-based large kernel, and cross-layer sharing can be combined to enlarge receptive fields without introducing accuracy or efficiency losses that full ablation tests would reveal.

What would settle it

Full ablation experiments that remove or isolate each added component and show either lower PSNR than reported or higher MACs than claimed on Manga109 and BSDS100, or failure to beat larger-model baselines on additional test sets.

Figures

Figures reproduced from arXiv: 2603.11680 by Cao Thien Tan, Do Nghiem Duc, Hanyang Zhuang, Ho Ngoc Anh, Nguyen Duc Dung, Phan Thi Thu Trang.

Figure 1: Performance comparison of PSNR versus model parame...
Figure 2: Comparison of feature maps output by Linear Attention...
Figure 3: Detailed architecture of (a) Shared and Received Hybrid Attention (SHA and RHA) and (b) Large Kernel Distillation (LKD).
Figure 4: Visual comparison between ground truth and different...
Figure 5: Visualize in detail ERF of MambaIR [...
Figure 6: Local attribution maps (LAM) comparison of different...
Figure 7: Visualization of attention maps for Linear Attention using...
Figure 8: Ranking consistency analysis. We compare the output ranking of Linear Attention using standard ReLU, Symmetric ReLU, and the Hedgehog Feature Map (sequence length N = 256). While adding negative information (Sym-ReLU) improves consistency, Hedgehog achieves superior performance through learnable stability.
Figure 9: Visual comparison between the ground truth and different methods on Set5 - baby.
Figure 10: Visual comparison between the ground truth and different methods on Set14 - man.
Figure 11: Visual comparison between the ground truth and different methods on B100 - 300091.
Figure 12: Visual comparison between the ground truth and different methods on Manga109 - Yumeko Cooking.
Figure 13: Visual comparison between the ground truth and different methods on Manga109 - Gakuen Noise.
Figure 14: Visual comparison between the ground truth and different methods on Manga109 - Yasakii Akuma.
Figure 15: Visual comparison between the ground truth and different methods on Urban100 - 015.
Figure 16: Visual comparison between the ground truth and different methods on Urban100 - img19.
Figure 17: Visual comparison between the ground truth and different methods on Urban100 - img24.
Figure 18: Visual comparison between the ground truth and different methods on Urban100 - img72.
read the original abstract

Hybrid CNN-Transformer architectures achieve strong results in image super-resolution, but scaling attention windows or convolution kernels significantly increases computational cost, limiting deployment on resource-constrained devices. We present UCAN, a lightweight network that unifies convolution and attention to expand the effective receptive field efficiently. UCAN combines window-based spatial attention with a Hedgehog Attention mechanism to model both local texture and long-range dependencies, and introduces a distillation-based large-kernel module to preserve high-frequency structure without heavy computation. In addition, we employ cross-layer parameter sharing to further reduce complexity. On Manga109 ($4\times$), UCAN-L achieves 31.63 dB PSNR with only 48.4G MACs, surpassing recent lightweight models. On BSDS100, UCAN attains 27.79 dB, outperforming methods with significantly larger models. Extensive experiments show that UCAN achieves a superior trade-off between accuracy, efficiency, and scalability, making it well-suited for practical high-resolution image restoration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes UCAN, a lightweight hybrid CNN-Transformer architecture for image super-resolution that unifies window-based spatial attention with Hedgehog Attention to capture local textures and long-range dependencies, incorporates a distillation-based large-kernel module to preserve high-frequency details, and applies cross-layer parameter sharing to reduce complexity. It reports concrete performance gains, including UCAN-L reaching 31.63 dB PSNR on Manga109 (4×) at 48.4G MACs and 27.79 dB on BSDS100, outperforming recent lightweight models while maintaining low computational cost.

Significance. If the central claims hold under rigorous verification, the work would advance efficient super-resolution by demonstrating a practical unification of attention mechanisms that expands receptive fields without proportional increases in parameters or MACs, offering a scalable design suitable for resource-constrained devices. The reported accuracy-efficiency trade-offs on standard benchmarks represent a potentially useful empirical contribution to lightweight SR literature.

major comments (2)
  1. [§4] §4 (Experiments and Ablations): The ablation studies report incremental additions of Hedgehog Attention, the distillation module, and cross-layer sharing but lack full factorial designs that isolate each component while strictly holding total parameters and MACs fixed. This is load-bearing for the central claim, as the headline PSNR/MAC numbers (e.g., 31.63 dB at 48.4G on Manga109) could arise from a single dominant module, training dynamics, or unaccounted compute rather than the unified architecture.
  2. [Results tables] Results tables (e.g., Table 1 or equivalent benchmark tables): Reported PSNR values such as 31.63 dB and 27.79 dB lack error bars, standard deviations from multiple runs, or details on dataset splits and training seeds, making it impossible to assess whether the improvements over baselines are statistically reliable or reproducible.
minor comments (2)
  1. [Abstract and §3] The abstract and §3 would benefit from a brief explicit statement of the total parameter count for UCAN-L alongside the MAC figure to allow direct comparison with cited baselines.
  2. [Figures] Figure captions for architecture diagrams should clarify whether the Hedgehog Attention and distillation modules operate in parallel or sequentially within each block.
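The difference between the sequential ablation the paper reports and the full factorial design requested in major comment 1 can be sketched as follows. The component labels are shorthand for the three modules named in the report, and the function names are hypothetical.

```python
from itertools import product

COMPONENTS = ("hedgehog_attention", "large_kernel_distillation", "cross_layer_sharing")


def factorial_configs(components=COMPONENTS):
    """Return every on/off combination of the components (2**n configurations)."""
    return [dict(zip(components, flags))
            for flags in product((False, True), repeat=len(components))]


def sequential_configs(components=COMPONENTS):
    """Return the incremental path: baseline, then one more component per step."""
    return [{name: i < step for i, name in enumerate(components)}
            for step in range(len(components) + 1)]


if __name__ == "__main__":
    covered = sequential_configs()
    for config in factorial_configs():
        tag = "sequential" if config in covered else "factorial only"
        print(tag, config)
```

With three components the factorial design has eight configurations, of which an incremental path visits four. The remaining four are the ones that would show whether a single module accounts for most of the gain.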

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below, indicating planned revisions where feasible while being transparent about limitations.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments and Ablations): The ablation studies report incremental additions of Hedgehog Attention, the distillation module, and cross-layer sharing but lack full factorial designs that isolate each component while strictly holding total parameters and MACs fixed. This is load-bearing for the central claim, as the headline PSNR/MAC numbers (e.g., 31.63 dB at 48.4G on Manga109) could arise from a single dominant module, training dynamics, or unaccounted compute rather than the unified architecture.

    Authors: We acknowledge that a full factorial ablation with strictly fixed parameters and MACs would offer stronger isolation of each module. However, the components in UCAN are intentionally interdependent within the unified CNN-Transformer design, and enforcing identical compute budgets across all 2^3 combinations would require substantial redesigns that alter the architecture's core efficiency claims. Our sequential ablations demonstrate incremental PSNR gains at each step while preserving the low-MAC target, and the final model outperforms strong baselines. In revision we will expand §4 with additional justification for the sequential approach, a discussion of module interactions, and a note on the prohibitive cost of exhaustive factorial experiments under fixed compute. This constitutes a partial revision. revision: partial

  2. Referee: [Results tables] Results tables (e.g., Table 1 or equivalent benchmark tables): Reported PSNR values such as 31.63 dB and 27.79 dB lack error bars, standard deviations from multiple runs, or details on dataset splits and training seeds, making it impossible to assess whether the improvements over baselines are statistically reliable or reproducible.

    Authors: We agree that reproducibility details strengthen the results. We will revise the manuscript to explicitly report the training seeds, dataset splits, and full experimental protocol used for all benchmarks. However, computing error bars and standard deviations would require multiple independent training runs for every model and dataset, which exceeds our available computational resources. We note that the reported gains are consistent across five standard benchmarks and multiple scales, aligning with practices in the lightweight SR literature. A limitation statement will be added to the text. revision: partial

standing simulated objections not resolved
  • Providing numerical error bars or standard deviations from multiple independent runs, as this cannot be supplied without new multi-seed experiments beyond current resources.
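If multi-seed runs were available, the requested error bars could be reported along the following lines. This is only a sketch: the function names are hypothetical, no numbers from the paper are used, and the pooled-deviation comparison is a crude heuristic rather than a formal significance test.

```python
from statistics import mean, stdev


def summarize_runs(psnr_by_seed):
    """Return (mean, sample standard deviation) of PSNR over independent seeds."""
    values = list(psnr_by_seed.values())
    if len(values) < 2:
        raise ValueError("need at least two seeds for a standard deviation")
    return mean(values), stdev(values)


def gap_exceeds_noise(model_runs, baseline_runs, num_std=2.0):
    """Crude check: is the mean gap larger than num_std pooled standard deviations?"""
    m_mean, m_std = summarize_runs(model_runs)
    b_mean, b_std = summarize_runs(baseline_runs)
    pooled = ((m_std ** 2 + b_std ** 2) / 2) ** 0.5
    return (m_mean - b_mean) > num_std * pooled
```

Each argument is a mapping from training seed to the PSNR obtained with that seed, so the same protocol applies to the proposed model and to every baseline.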

Circularity Check

0 steps flagged

No circularity; empirical architecture claims rest on benchmarks without self-referential derivations

full rationale

The paper introduces UCAN as a hybrid CNN-Transformer architecture for lightweight super-resolution, describing components such as window-based spatial attention, Hedgehog Attention, a distillation-based large-kernel module, and cross-layer parameter sharing. Performance claims (e.g., 31.63 dB PSNR on Manga109 4× at 48.4G MACs) are presented solely as outcomes of experimental evaluation on standard datasets. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. The derivation chain is absent; results are independent empirical measurements rather than reductions to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; all claims are empirical performance statements.

pith-pipeline@v0.9.0 · 5491 in / 996 out tokens · 23238 ms · 2026-05-15T12:27:11.174444+00:00 · methodology




Appendix excerpt (large-kernel configurations and ERF derivation)

Standard Configuration (FLK-S): This is a two-stage stack, used for smaller receptive fields.

$\mathrm{FLK\text{-}S}(X) = S_d(S(X, k_{\mathrm{core}}), k_{\mathrm{core}}, d)$ (12)

Large Configuration (FLK-L): To achieve maximum receptive fields, this configuration extends the standard block into a three-stage stack. The first two stages are identical to the Standard Configuration, after which a third dilated separable depthwise convolution block using $k_{\mathrm{extra}}$ is appended.

$\mathrm{FLK\text{-}L}(X) = S_d(S_d(S(X, k_{\mathrm{core}}), k_{\mathrm{core}}, d), k_{\mathrm{extra}}, d)$ (13)

The...

  1. The base $S(\cdot, k_{\mathrm{core}})$ block (specifically $f_{dw}^{1 \times k_{\mathrm{core}}}$) establishes $\mathrm{ERF}_{in} = k_{\mathrm{core}}$.
  2. The second stage, $S_d(\cdot, k_{\mathrm{core}}, d)$ (specifically $f_{dw}^{1 \times k_{\mathrm{core}}, d}$), adds $(k_{\mathrm{core}} - 1)d$.

The total ERF is therefore:

$\mathrm{ERF}_S = k_{\mathrm{core}} + (k_{\mathrm{core}} - 1)d$ (16)

Large Configuration (FLK-L). This configuration stacks $S(\cdot, k_{\mathrm{core}})$ and $S_d(\cdot, k_{\mathrm{extra}}, d)$

  1. The base $S(\cdot, k_{\mathrm{core}})$ block establishes $\mathrm{ERF}_{in} = k_{\mathrm{core}}$.
  2. The second stage, $S_d(\cdot, k_{\mathrm{core}}, d)$, adds $(k_{\mathrm{core}} - 1)d$.
  3. The third stage, $S_d(\cdot, k_{\mathrm{extra}}, d)$, adds $(k_{\mathrm{extra}} - 1)d$.

The total ERF is therefore:

$\mathrm{ERF}_L = k_{\mathrm{core}} + (k_{\mathrm{core}} - 1)d + (k_{\mathrm{extra}} - 1)d$ (17)

This derivation confirms the formulas used to generate the configurations in Table 4.

A.3. Feature Fusion and Final Output

Finally, the outputs from the three branches are fused. The local and large-kernel spatial features are c...
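The ERF formulas in Eqs. (16) and (17) can be checked numerically. The sketch below evaluates both formulas and compares them with a brute-force span computed by stacking 1-D kernels tap by tap. The kernel sizes and dilation used in the demo are hypothetical (Table 4 is not reproduced here), and the brute-force helper assumes odd, centred kernels.

```python
def erf_standard(k_core, dilation):
    """ERF of the two-stage FLK-S stack, following Eq. (16)."""
    return k_core + (k_core - 1) * dilation


def erf_large(k_core, k_extra, dilation):
    """ERF of the three-stage FLK-L stack, following Eq. (17)."""
    return erf_standard(k_core, dilation) + (k_extra - 1) * dilation


def stacked_span(stages):
    """Brute-force 1-D span of stacked (kernel, dilation) convolutions."""
    offsets = {0}
    for kernel, dilation in stages:
        half = kernel // 2  # assumes odd kernels centred on the output pixel
        taps = [(i - half) * dilation for i in range(kernel)]
        offsets = {o + t for o in offsets for t in taps}
    return max(offsets) - min(offsets) + 1


if __name__ == "__main__":
    k_core, k_extra, d = 5, 7, 3  # hypothetical values for illustration
    print("FLK-S:", erf_standard(k_core, d), stacked_span([(k_core, 1), (k_core, d)]))
    print("FLK-L:", erf_large(k_core, k_extra, d),
          stacked_span([(k_core, 1), (k_core, d), (k_extra, d)]))
```

Each added stage of kernel size k and dilation d extends the span by (k - 1)d, which is why the third stage in FLK-L contributes the extra $(k_{\mathrm{extra}} - 1)d$ term. The span measures the extent of the receptive field, not how densely it is sampled.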