UCAN: Unified Convolutional Attention Network for Expansive Receptive Fields in Lightweight Super-Resolution

Cao Thien Tan; Do Nghiem Duc; Hanyang Zhuang; Ho Ngoc Anh; Nguyen Duc Dung; Phan Thi Thu Trang

arxiv: 2603.11680 · v2 · submitted 2026-03-12 · 💻 cs.CV

Pith reviewed 2026-05-15 12:27 UTC · model grok-4.3

classification 💻 cs.CV

keywords super-resolution · lightweight network · convolutional attention · receptive field · image restoration · attention mechanism · parameter sharing · high-frequency preservation

The pith

UCAN unifies window attention, Hedgehog Attention, and distilled large kernels with cross-layer sharing to expand receptive fields efficiently in lightweight super-resolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Hybrid CNN-Transformer models achieve strong image super-resolution results, but enlarging attention windows or convolution kernels to widen the receptive field sharply raises their computational cost. UCAN addresses this by combining window-based spatial attention with Hedgehog Attention to capture both local texture and long-range dependencies, adding a distillation-based large-kernel module that retains high-frequency detail at low cost, and sharing parameters across layers to cut complexity further. The abstract reports 31.63 dB PSNR on Manga109 (4×) for UCAN-L at 48.4G MACs and 27.79 dB on BSDS100, ahead of recent lightweight models and of methods with significantly larger models. This matters to readers because the design targets high-resolution image restoration on resource-constrained devices. The paper claims the combination yields a superior trade-off among accuracy, efficiency, and scalability.
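To make the cost growth concrete, here is a back-of-the-envelope sketch that is not taken from the paper. For non-overlapping w×w windows on an H×W feature map with C channels, the two attention matrix products (QK^T and attention·V) cost roughly 2·H·W·w²·C multiply-accumulates. The function name and the example sizes are illustrative assumptions, and the estimate ignores projections, softmax, heads, and padding.

```python
def window_attention_macs(height, width, channels, window):
    """Approximate MACs of the QK^T and attention*V products in window attention.

    Assumes non-overlapping window x window tiles that divide the feature map
    evenly; linear projections and softmax are ignored (rough estimate only).
    """
    tokens_per_window = window * window
    num_windows = (height // window) * (width // window)
    # Per window: QK^T costs n*n*C MACs and attention*V costs another n*n*C.
    per_window = 2 * tokens_per_window * tokens_per_window * channels
    return num_windows * per_window


# Hypothetical sizes chosen for illustration, not taken from the paper.
for w in (8, 16, 32):
    macs = window_attention_macs(256, 256, 64, w)
    print(f"window={w:2d}  attention MACs ~ {macs / 1e9:.2f} G")
```

Under these assumptions each doubling of the window side quadruples the attention cost, which is the scaling pressure the summary describes.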

Core claim

The paper's claim is that a lightweight network can expand its effective receptive field by unifying convolution and attention through three components: window-based spatial attention combined with a Hedgehog Attention mechanism for local and long-range modeling, a distillation-based large-kernel module that preserves high-frequency structure without heavy computation, and cross-layer parameter sharing that reduces overall complexity. The reported result is higher PSNR on standard super-resolution benchmarks than recent lightweight models at lower MAC counts.

What carries the argument

The Hedgehog Attention mechanism paired with window-based spatial attention, a distillation-based large-kernel module, and cross-layer parameter sharing, which together model local texture and long-range dependencies while keeping computation low.

If this is right

UCAN-L reaches 31.63 dB PSNR on Manga109 at 4x scale using only 48.4G MACs, exceeding recent lightweight models.
UCAN attains 27.79 dB on BSDS100 while outperforming methods that employ significantly larger models.
The design maintains a superior trade-off among accuracy, efficiency, and scalability for image restoration.
Cross-layer sharing and the unified attention-convolution approach keep the model suitable for resource-constrained devices.
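For readers less familiar with the metric behind these numbers, the following is a minimal sketch of the generic PSNR definition, 10·log10(MAX² / MSE). It operates on flat pixel sequences. The evaluation protocol the paper uses (color channel, border cropping) is not stated here, so nothing in this sketch should be read as the authors' exact procedure.

```python
import math


def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(reference) != len(estimate) or not reference:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example with made-up pixel values (not benchmark data).
print(psnr([10, 20, 30, 40], [11, 19, 30, 42]))
```

Because the scale is logarithmic, a 0.1 dB gain corresponds to a mean squared error about 2.3% lower (a factor of 10^(-0.01)).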

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same unification pattern could be tested on related low-level tasks such as denoising or deblurring where receptive-field size directly affects detail recovery.
Parameter sharing across layers might reduce model size in other attention-heavy vision networks beyond super-resolution.
If the efficiency holds on real-world noisy or compressed images, the network could support on-device upscaling in mobile applications.
Scaling the approach to higher upscaling factors like 8x would test whether the receptive-field gains remain effective without additional cost.

Load-bearing premise

That the Hedgehog Attention, distillation-based large kernel, and cross-layer sharing can be combined to enlarge receptive fields without introducing accuracy or efficiency losses that full ablation tests would reveal.

What would settle it

Full ablation experiments that remove or isolate each added component and show either lower PSNR than reported or higher MACs than claimed on Manga109 and BSDS100, or failure to beat larger-model baselines on additional test sets.

Figures

Figures reproduced from arXiv: 2603.11680 by Cao Thien Tan, Do Nghiem Duc, Hanyang Zhuang, Ho Ngoc Anh, Nguyen Duc Dung, Phan Thi Thu Trang.

**Figure 2.** Comparison of feature maps output by Linear Attention

**Figure 3.** Detailed architecture of (a) Shared and Received Hybrid Attention (SHA and RHA) and (b) Large Kernel Distillation (LKD).

**Figure 4.** Visual comparison between ground truth and different …

**Figure 5.** Visualize in detail ERF of MambaIR [ …

**Figure 6.** Local attribution maps (LAM) comparison of different …

**Figure 7.** Visualization of attention maps for Linear Attention using …

**Figure 8.** Ranking consistency analysis. We compare the output ranking of Linear Attention using standard ReLU, Symmetric ReLU, and the Hedgehog Feature Map (sequence length N = 256). While adding negative information (Sym-ReLU) improves consistency, Hedgehog achieves superior performance through learnable stability.

**Figure 9.** Visual comparison between the ground truth and different methods on Set5 - baby.

**Figure 10.** Visual comparison between the ground truth and different methods on Set14 - man.

**Figure 11.** Visual comparison between the ground truth and different methods on B100 - 300091.

**Figure 12.** Visual comparison between the ground truth and different methods on Manga109 - Yumeko Cooking.

**Figure 13.** Visual comparison between the ground truth and different methods on Manga109 - Gakuen Noise.

**Figure 14.** Visual comparison between the ground truth and different methods on Manga109 - Yasakii Akuma.

**Figure 15.** Visual comparison between the ground truth and different methods on Urban100 - 015.

**Figure 16.** Visual comparison between the ground truth and different methods on Urban100 - img19.

**Figure 17.** Visual comparison between the ground truth and different methods on Urban100 - img24.

**Figure 18.** Visual comparison between the ground truth and different methods on Urban100 - img72.

Original abstract

Hybrid CNN-Transformer architectures achieve strong results in image super-resolution, but scaling attention windows or convolution kernels significantly increases computational cost, limiting deployment on resource-constrained devices. We present UCAN, a lightweight network that unifies convolution and attention to expand the effective receptive field efficiently. UCAN combines window-based spatial attention with a Hedgehog Attention mechanism to model both local texture and long-range dependencies, and introduces a distillation-based large-kernel module to preserve high-frequency structure without heavy computation. In addition, we employ cross-layer parameter sharing to further reduce complexity. On Manga109 ($4\times$), UCAN-L achieves 31.63 dB PSNR with only 48.4G MACs, surpassing recent lightweight models. On BSDS100, UCAN attains 27.79 dB, outperforming methods with significantly larger models. Extensive experiments show that UCAN achieves a superior trade-off between accuracy, efficiency, and scalability, making it well-suited for practical high-resolution image restoration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

UCAN is a straightforward incremental design for lightweight super-resolution that mixes window attention, Hedgehog Attention, distillation kernels, and parameter sharing to hit competitive PSNR at low MACs, but the efficiency story rests on ablations that probably need tightening.

The core of this paper is a new lightweight SR network called UCAN that tries to grow receptive fields without the usual compute penalty. It stacks window-based spatial attention with Hedgehog Attention for long-range stuff, adds a distillation-based large-kernel module for high-frequency details, and uses cross-layer parameter sharing to keep the model small. The headline numbers are 31.63 dB on Manga109 at 4x with 48.4G MACs and 27.79 dB on BSDS100, which beat several heavier recent models. That combination looks like the actual new piece, not just another CNN-Transformer hybrid rehash.

Referee Report

2 major / 2 minor

Summary. The paper proposes UCAN, a lightweight hybrid CNN-Transformer architecture for image super-resolution that unifies window-based spatial attention with Hedgehog Attention to capture local textures and long-range dependencies, incorporates a distillation-based large-kernel module to preserve high-frequency details, and applies cross-layer parameter sharing to reduce complexity. It reports concrete performance gains, including UCAN-L reaching 31.63 dB PSNR on Manga109 (4×) at 48.4G MACs and 27.79 dB on BSDS100, outperforming recent lightweight models while maintaining low computational cost.

Significance. If the central claims hold under rigorous verification, the work would advance efficient super-resolution by demonstrating a practical unification of attention mechanisms that expands receptive fields without proportional increases in parameters or MACs, offering a scalable design suitable for resource-constrained devices. The reported accuracy-efficiency trade-offs on standard benchmarks represent a potentially useful empirical contribution to lightweight SR literature.

major comments (2)

[§4] §4 (Experiments and Ablations): The ablation studies report incremental additions of Hedgehog Attention, the distillation module, and cross-layer sharing but lack full factorial designs that isolate each component while strictly holding total parameters and MACs fixed. This is load-bearing for the central claim, as the headline PSNR/MAC numbers (e.g., 31.63 dB at 48.4G on Manga109) could arise from a single dominant module, training dynamics, or unaccounted compute rather than the unified architecture.
[Results tables] Results tables (e.g., Table 1 or equivalent benchmark tables): Reported PSNR values such as 31.63 dB and 27.79 dB lack error bars, standard deviations from multiple runs, or details on dataset splits and training seeds, making it impossible to assess whether the improvements over baselines are statistically reliable or reproducible.

minor comments (2)

[Abstract and §3] The abstract and §3 would benefit from a brief explicit statement of the total parameter count for UCAN-L alongside the MAC figure to allow direct comparison with cited baselines.
[Figures] Figure captions for architecture diagrams should clarify whether the Hedgehog Attention and distillation modules operate in parallel or sequentially within each block.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback. We address the two major comments point-by-point below, indicating planned revisions where feasible while being transparent about limitations.

Referee: [§4] §4 (Experiments and Ablations): The ablation studies report incremental additions of Hedgehog Attention, the distillation module, and cross-layer sharing but lack full factorial designs that isolate each component while strictly holding total parameters and MACs fixed. This is load-bearing for the central claim, as the headline PSNR/MAC numbers (e.g., 31.63 dB at 48.4G on Manga109) could arise from a single dominant module, training dynamics, or unaccounted compute rather than the unified architecture.

Authors: We acknowledge that a full factorial ablation with strictly fixed parameters and MACs would offer stronger isolation of each module. However, the components in UCAN are intentionally interdependent within the unified CNN-Transformer design, and enforcing identical compute budgets across all 2^3 combinations would require substantial redesigns that alter the architecture's core efficiency claims. Our sequential ablations demonstrate incremental PSNR gains at each step while preserving the low-MAC target, and the final model outperforms strong baselines. In revision we will expand §4 with additional justification for the sequential approach, a discussion of module interactions, and a note on the prohibitive cost of exhaustive factorial experiments under fixed compute. This constitutes a partial revision. revision: partial
Referee: [Results tables] Results tables (e.g., Table 1 or equivalent benchmark tables): Reported PSNR values such as 31.63 dB and 27.79 dB lack error bars, standard deviations from multiple runs, or details on dataset splits and training seeds, making it impossible to assess whether the improvements over baselines are statistically reliable or reproducible.

Authors: We agree that reproducibility details strengthen the results. We will revise the manuscript to explicitly report the training seeds, dataset splits, and full experimental protocol used for all benchmarks. However, computing error bars and standard deviations would require multiple independent training runs for every model and dataset, which exceeds our available computational resources. We note that the reported gains are consistent across five standard benchmarks and multiple scales, aligning with practices in the lightweight SR literature. A limitation statement will be added to the text. revision: partial
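The 2^3 factorial design discussed in the first exchange can be enumerated mechanically. The sketch below lists every on/off combination of the three components. The component labels are hypothetical names chosen for illustration, and the sketch says nothing about how each variant would be built or matched in parameters and MACs.

```python
from itertools import product

# Hypothetical labels for the three components named in the referee comment.
COMPONENTS = ("hedgehog_attention", "large_kernel_distillation", "cross_layer_sharing")


def factorial_configs(components=COMPONENTS):
    """Return every on/off combination of the given components as dicts."""
    return [
        dict(zip(components, flags))
        for flags in product((False, True), repeat=len(components))
    ]


configs = factorial_configs()
for cfg in configs:
    enabled = [name for name, on in cfg.items() if on] or ["baseline"]
    print(" + ".join(enabled))
```

A sequential ablation covers only four of these eight rows (baseline, then each component added in turn), which is the gap the referee points to.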

standing simulated objections not resolved

Providing numerical error bars or standard deviations from multiple independent runs, as this cannot be supplied without new multi-seed experiments beyond current resources.

Circularity Check

0 steps flagged

No circularity; empirical architecture claims rest on benchmarks without self-referential derivations

The paper introduces UCAN as a hybrid CNN-Transformer architecture for lightweight super-resolution, describing components such as window-based spatial attention, Hedgehog Attention, a distillation-based large-kernel module, and cross-layer parameter sharing. Performance claims (e.g., 31.63 dB PSNR on Manga109 4× at 48.4G MACs) are presented solely as outcomes of experimental evaluation on standard datasets. No equations, first-principles derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described content. The derivation chain is absent; results are independent empirical measurements rather than reductions to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no identifiable free parameters, axioms, or invented entities; all claims are empirical performance statements.

pith-pipeline@v0.9.0 · 5491 in / 996 out tokens · 23238 ms · 2026-05-15T12:27:11.174444+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: "UCAN combines window-based spatial attention with a Hedgehog Attention mechanism... distillation-based large-kernel module... cross-layer parameter sharing"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "Hedgehog Feature Map... $\phi_H(X) = [\exp(W^\top X + b_1), \ldots, \exp(-W^\top X - b_m)]$"
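The Hedgehog feature map quoted in the passage above can be illustrated with a small pure-Python sketch. It reads the expression as positive and negated exponentials of a learned projection, then plugs that map into generic non-causal linear attention. The function names, the toy weights, and the attention wrapper are assumptions for illustration; how UCAN wires this into its blocks is not described on this page.

```python
import math


def hedgehog_features(x, weight, bias):
    """Map a vector x to [exp(Wx + b), exp(-(Wx + b))]; every feature is positive."""
    proj = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]
    return [math.exp(p) for p in proj] + [math.exp(-p) for p in proj]


def linear_attention(queries, keys, values, weight, bias):
    """Non-causal linear attention using the feature map above in place of softmax."""
    phi_k = [hedgehog_features(k, weight, bias) for k in keys]
    feat_dim, val_dim = len(phi_k[0]), len(values[0])
    # Summaries shared by all queries: kv[f][c] = sum_j phi(k_j)[f] * v_j[c].
    kv = [[sum(pk[f] * v[c] for pk, v in zip(phi_k, values)) for c in range(val_dim)]
          for f in range(feat_dim)]
    k_sum = [sum(pk[f] for pk in phi_k) for f in range(feat_dim)]
    outputs = []
    for q in queries:
        phi_q = hedgehog_features(q, weight, bias)
        norm = sum(a * b for a, b in zip(phi_q, k_sum))  # positive by construction
        outputs.append([sum(phi_q[f] * kv[f][c] for f in range(feat_dim)) / norm
                        for c in range(val_dim)])
    return outputs


# Toy inputs with an identity projection (illustrative values only).
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
print(linear_attention([[0.5, -0.2]], [[1.0, 0.0], [0.0, 1.0]],
                       [[1.0, 0.0], [0.0, 1.0]], W, b))
```

Because the key-value summaries are computed once and reused for every query, the cost grows linearly with sequence length rather than quadratically, and each output remains a positive-weighted average of the values.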

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 1 internal anchor

[1]

Breaking complexity barriers: High-resolution image restoration with rank enhanced linear attention.arXiv preprint arXiv:2505.16157, 2025

Yuang Ai, Huaibo Huang, Tao Wu, Qihang Fan, and Ran He. Breaking complexity barriers: High-resolution image restoration with rank enhanced linear attention.arXiv preprint arXiv:2505.16157, 2025. 2

work page arXiv 2025
[2]

Low-complexity single-image super-resolution based on nonnegative neighbor embedding

Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie Line Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding

work page
[3]

Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in neural information processing systems, 35:16344–16359, 2022. 8

work page 2022
[4]

Learning a deep convolutional network for image super- resolution

Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super- resolution. InComputer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Pro- ceedings, Part IV 13, pages 184–199. Springer, 2014. 2

work page 2014
[5]

Compression artifacts reduction by a deep convolu- tional network

Chao Dong, Yubin Deng, Chen Change Loy, and Xiaoou Tang. Compression artifacts reduction by a deep convolu- tional network. InProceedings of the IEEE international conference on computer vision, pages 576–584, 2015. 2

work page 2015
[6]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy. An image is worth 16x16 words: Trans- formers for image recognition at scale.arXiv preprint arXiv:2010.11929, 2020. 2

work page internal anchor Pith review Pith/arXiv arXiv 2010
[7]

Interpreting super-resolution net- works with local attribution maps

Jinjin Gu and Chao Dong. Interpreting super-resolution net- works with local attribution maps. InProceedings of the IEEE/CVF conference on computer vision and pattern recog- nition, pages 9199–9208, 2021. 7

work page 2021
[8]

Mambairv2: Attentive state space restoration.arXiv preprint arXiv:2411.15269, 2024

Hang Guo, Yong Guo, Yaohua Zha, Yulun Zhang, Wenbo Li, Tao Dai, Shu-Tao Xia, and Yawei Li. Mambairv2: Attentive state space restoration.arXiv preprint arXiv:2411.15269,

work page arXiv
[9]

Mambairv2: Attentive state space restoration, 2024

Hang Guo, Yong Guo, Yaohua Zha, Yulun Zhang, Wenbo Li, Tao Dai, Shu-Tao Xia, and Yawei Li. Mambairv2: Attentive state space restoration, 2024. 7

work page 2024
[10]

Mambair: A simple baseline for image restoration with state-space model

Hang Guo, Jinmin Li, Tao Dai, Zhihao Ouyang, Xudong Ren, and Shu-Tao Xia. Mambair: A simple baseline for image restoration with state-space model. InEuropean Conference on Computer Vision, pages 222–241. Springer, 2025. 6, 7, 3

work page 2025
[11]

Fourier position embedding: Enhancing attention’s periodic extension for length generalization,

Ermo Hua, Che Jiang, Xingtai Lv, Kaiyan Zhang, Youbang Sun, Yuchen Fan, Xuekai Zhu, Biqing Qi, Ning Ding, and Bowen Zhou. Fourier position embedding: Enhancing at- tention’s periodic extension for length generalization.arXiv preprint arXiv:2412.17739, 2024. 5

work page arXiv 2024
[12]

Single image super-resolution from transformed self-exemplars

Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5197–5206, 2015. 6

work page 2015
[13]

Fast and accurate single image super-resolution via information distillation net- work

Zheng Hui, Xiumei Wang, and Xinbo Gao. Fast and accurate single image super-resolution via information distillation net- work. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 723–731, 2018. 2

work page 2018
[14]

Deeply- recursive convolutional network for image super-resolution

Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply- recursive convolutional network for image super-resolution. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1637–1645, 2016. 2

work page 2016
[15]

Training transformer models by wavelet losses improves quantitative and visual performance in single image super-resolution

Cansu Korkmaz and A Murat Tekalp. Training transformer models by wavelet losses improves quantitative and visual performance in single image super-resolution. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6661–6670, 2024. 6

work page 2024
[16]

Large separable kernel attention: Rethinking the large kernel attention design in cnn.Expert Systems with Applications, 236:121352, 2024

Kin Wai Lau, Lai-Man Po, and Yasar Abbas Ur Rehman. Large separable kernel attention: Rethinking the large kernel attention design in cnn.Expert Systems with Applications, 236:121352, 2024. 3

work page 2024
[17]

Emulat- ing self-attention with convolution for efficient image super- resolution.arXiv preprint arXiv:2503.06671, 2025

Dongheon Lee, Seokju Yun, and Youngmin Ro. Emulat- ing self-attention with convolution for efficient image super- resolution.arXiv preprint arXiv:2503.06671, 2025. 3, 5

work page arXiv 2025
[18]

Swinir: Image restoration using swin transformer

Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. InProceedings of the IEEE/CVF international conference on computer vision, pages 1833– 1844, 2021. 5, 6, 4

work page 2021
[19]

Details or artifacts: A locally discriminative learning approach to realistic im- age super-resolution

Jie Liang, Hui Zeng, and Lei Zhang. Details or artifacts: A locally discriminative learning approach to realistic im- age super-resolution. InProceedings of the IEEE/CVF con- ference on computer vision and pattern recognition, pages 5657–5666, 2022. 6

work page 2022
[20]

Enhanced deep residual networks for single image super-resolution

Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. InProceedings of the IEEE confer- ence on computer vision and pattern recognition workshops, pages 136–144, 2017. 2

work page 2017
[21]

Residual feature aggregation network for image super- resolution

Jie Liu, Wenjie Zhang, Yuting Tang, Jie Tang, and Gangshan Wu. Residual feature aggregation network for image super- resolution. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2359–2368,

work page
[22]

Swin transformer: Hierarchical vision transformer using shifted windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021. 2

work page 2021
[23]

Progressive focused transformer for single image super- resolution

Wei Long, Xingyu Zhou, Leheng Zhang, and Shuhang Gu. Progressive focused transformer for single image super- resolution. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 2279–2288, 2025. 4

work page 2025
[24]

Transformer for single image super-resolution

Zhisheng Lu, Juncheng Li, Hong Liu, Chaoyan Huang, Linlin Zhang, and Tieyong Zeng. Transformer for single image super-resolution. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 457–466,

work page
[25]

A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics

David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. InProceedings eighth IEEE international conference on computer vision. ICCV 2001, pages 416–423. IEEE, 2001. 6

work page 2001
[26]

Sketch-based manga retrieval using manga109 dataset.Mul- timedia tools and applications, 76(20):21811–21838, 2017

Yusuke Matsui, Kota Ito, Yuji Aramaki, Azuma Fujimoto, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. Sketch-based manga retrieval using manga109 dataset.Mul- timedia tools and applications, 76(20):21811–21838, 2017. 6

work page 2017
[27]

Effi- cient attention-sharing information distillation transformer for lightweight single image super-resolution

Karam Park, Jae Woong Soh, and Nam Ik Cho. Effi- cient attention-sharing information distillation transformer for lightweight single image super-resolution. InProceed- ings of the AAAI Conference on Artificial Intelligence, pages 6416–6424, 2025. 2, 3, 4, 6, 7

work page 2025
[28]

Vi- sion transformers for dense prediction

René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vi- sion transformers for dense prediction. InProceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021. 2

work page 2021
[29]

Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network

Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016. 3

work page 2016
[30]

Vmambair: Visual state space model for image restoration.arXiv preprint arXiv:2403.11423, 2024

Yuan Shi, Bin Xia, Xiaoyu Jin, Xing Wang, Tianyu Zhao, Xin Xia, Xuefeng Xiao, and Wenming Yang. Vmambair: Visual state space model for image restoration.arXiv preprint arXiv:2403.11423, 2024. 2

work page arXiv 2024
[31]

Shufflemixer: An efficient convnet for image super-resolution.Advances in Neural Information Processing Systems, 35:17314–17326,

Long Sun, Jinshan Pan, and Jinhui Tang. Shufflemixer: An efficient convnet for image super-resolution.Advances in Neural Information Processing Systems, 35:17314–17326,

work page
[32]

Rethinking the inception ar- chitecture for computer vision

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception ar- chitecture for computer vision. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016. 3

work page 2016
[33]

Image super- resolution via deep recursive residual network

Ying Tai, Jian Yang, and Xiaoming Liu. Image super- resolution via deep recursive residual network. InProceed- ings of the IEEE conference on computer vision and pattern recognition, pages 3147–3155, 2017. 2

work page 2017
[34]

Image processing gnn: Breaking rigidity in super-resolution

Yuchuan Tian, Hanting Chen, Chao Xu, and Yunhe Wang. Image processing gnn: Breaking rigidity in super-resolution. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24117, 2024. 4

work page 2024
[35]

Omni aggregation networks for lightweight image super-resolution

Hang Wang, Xuanhong Chen, Bingbing Ni, Yutian Liu, and Jinfan Liu. Omni aggregation networks for lightweight image super-resolution. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, pages 22378–22387, 2023. 2, 6, 7

work page 2023
[36]

Transforming image super-resolution: a convformer-based efficient approach.IEEE Transactions on Image Processing,

Gang Wu, Junjun Jiang, Junpeng Jiang, and Xianming Liu. Transforming image super-resolution: a convformer-based efficient approach.IEEE Transactions on Image Processing,

work page
[37]

Large kernel distillation network for efficient single image super-resolution

Chengxing Xie, Xiaoming Zhang, Linze Li, Haiteng Meng, Tianlin Zhang, Tianrui Li, and Xiaole Zhao. Large kernel distillation network for efficient single image super-resolution. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 1283–1292, 2023. 2

work page 2023
[38]

Restormer: Efficient transformer for high-resolution image restoration

Syed Waqas Zamir, Aditya Arora, Salman Khan, Mu- nawar Hayat, Fahad Shahbaz Khan, and Ming-Hsuan Yang. Restormer: Efficient transformer for high-resolution image restoration. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5728–5739,

work page
[39]

On single image scale-up using sparse-representations

Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. InInternational conference on curves and surfaces, pages 711–730. Springer,

work page
[40]

Transcending the limit of local window: Ad- vanced super-resolution transformer with adaptive token dic- tionary

Leheng Zhang, Yawei Li, Xingyu Zhou, Xiaorui Zhao, and Shuhang Gu. Transcending the limit of local window: Ad- vanced super-resolution transformer with adaptive token dic- tionary. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2856–2865,

work page
[41]

org/P19-1472

Michael Zhang, Kush Bhatia, Hermann Kumbong, and Christopher Ré. The hedgehog & the porcupine: Expres- sive linear attentions with softmax mimicry.arXiv preprint arXiv:2402.04347, 2024. 3

work page arXiv 2024
[42]

Efficient long-range attention network for image super-resolution

Xindong Zhang, Hui Zeng, Shi Guo, and Lei Zhang. Efficient long-range attention network for image super-resolution. In European conference on computer vision, pages 649–667. Springer, 2022. 2, 6, 4

work page 2022
[43] Xiang Zhang, Yulun Zhang, and Fisher Yu. Hit-sr: Hierarchical transformer for efficient image super-resolution. In European Conference on Computer Vision, pages 483–500. Springer, 2024.
[44] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European conference on computer vision (ECCV), pages 286–301, 2018.
[45] Yulun Zhang, Yapeng Tian, Yu Kong, Bineng Zhong, and Yun Fu. Residual dense network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2472–2481, 2018.
[46] Yupeng Zhou, Zhen Li, Chun-Le Guo, Song Bai, Ming-Ming Cheng, and Qibin Hou. Srformer: Permuted self-attention for single image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 12780–12791, 2023.
Standard Configuration (FLK-S): This is a two-stage stack, used for smaller receptive fields.

$$\mathrm{FLK\text{-}S}(X) = S_d\big(S(X, k_{core}),\ k_{core},\ d\big) \tag{12}$$

Large Configuration (FLK-L): To achieve maximum receptive fields, this configuration extends the standard block into a three-stage stack. The first two stages are identical to the Standard Configuration, after which a third dilated separable depthwise convolution block using $k_{extra}$ is appended.

$$\mathrm{FLK\text{-}L}(X) = S_d\big(S_d(S(X, k_{core}),\ k_{core},\ d),\ k_{extra},\ d\big) \tag{13}$$

The...

The base $S(\cdot, k_{core})$ block (specifically $f_{dw}^{1\times k_{core}}$) establishes an $\mathrm{ERF}_{in} = k_{core}$. The second stage, $S_d(\cdot, k_{core}, d)$ (specifically $f_{dw}^{1\times k_{core},\,d}$), adds $(k_{core}-1)d$. The total ERF is therefore:

$$\mathrm{ERF}_S = k_{core} + (k_{core}-1)d \tag{16}$$

Large Configuration (FLK-L). This configuration stacks $S(\cdot, k_{core})$ and $S_d(\cdot, k_{extra}, d)$. The base $S(\cdot, k_{core})$ block establishes an $\mathrm{ERF}_{in} = k_{core}$. The second stage, $S_d(\cdot, k_{core}, d)$, adds $(k_{core}-1)d$. The third stage, $S_d(\cdot, k_{extra}, d)$, adds $(k_{extra}-1)d$. The total ERF is therefore:

$$\mathrm{ERF}_L = k_{core} + (k_{core}-1)d + (k_{extra}-1)d \tag{17}$$

This derivation confirms the formulas used to generate the configurations in Table 4.

A.3. Feature Fusion and Final Output

Finally, the outputs from the three branches are fused. The local and large-kernel spatial features are c...
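As a quick numerical check of the ERF derivation in Eqs. (16) and (17), the sketch below evaluates the two closed forms and compares them against a brute-force count of the 1-D extent reached by stacking the stages. The function names and the example values (k_core = 5, k_extra = 7, d = 3) are illustrative assumptions, not configurations taken from Table 4, and the stack is modeled along one spatial axis only.

```python
def erf_flk_s(k_core: int, d: int) -> int:
    """Closed form of Eq. (16): two-stage stack S -> S_d."""
    return k_core + (k_core - 1) * d


def erf_flk_l(k_core: int, k_extra: int, d: int) -> int:
    """Closed form of Eq. (17): three-stage stack S -> S_d -> S_d."""
    return k_core + (k_core - 1) * d + (k_extra - 1) * d


def stacked_extent(stages) -> int:
    """Brute-force 1-D extent of a stack of (kernel_size, dilation) stages.

    Tracks every input offset that can influence one output position and
    returns the width of the span those offsets cover.
    """
    offsets = {0}
    for kernel_size, dilation in stages:
        # Tap positions of one (possibly dilated) 1-D kernel, roughly centered.
        taps = [(i - (kernel_size - 1) // 2) * dilation for i in range(kernel_size)]
        offsets = {o + t for o in offsets for t in taps}
    return max(offsets) - min(offsets) + 1


# Hypothetical example values, chosen only for illustration.
k_core, k_extra, d = 5, 7, 3

erf_s = erf_flk_s(k_core, d)
erf_l = erf_flk_l(k_core, k_extra, d)

# The non-dilated base stage is modeled as dilation 1.
assert erf_s == stacked_extent([(k_core, 1), (k_core, d)])
assert erf_l == stacked_extent([(k_core, 1), (k_core, d), (k_extra, d)])

print(erf_s, erf_l)
```

Each stage with kernel size k and dilation d widens the span by (k − 1)d, which is why the closed forms are simple sums over the stages.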

[1] Yuang Ai, Huaibo Huang, Tao Wu, Qihang Fan, and Ran He. Breaking complexity barriers: High-resolution image restoration with rank enhanced linear attention. arXiv preprint arXiv:2505.16157, 2025.

[2] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie Line Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding

[3] Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344–16359, 2022.

[4] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part IV 13, pages 184–199. Springer, 2014.

[5] Chao Dong, Yubin Deng, Chen Change Loy, and Xiaoou Tang. Compression artifacts reduction by a deep convolutional network. In Proceedings of the IEEE international conference on computer vision, pages 576–584, 2015.

[6] Alexey Dosovitskiy. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.

[7] Jinjin Gu and Chao Dong. Interpreting super-resolution networks with local attribution maps. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9199–9208, 2021.

[8] Hang Guo, Yong Guo, Yaohua Zha, Yulun Zhang, Wenbo Li, Tao Dai, Shu-Tao Xia, and Yawei Li. Mambairv2: Attentive state space restoration. arXiv preprint arXiv:2411.15269, 2024.

[9] Hang Guo, Yong Guo, Yaohua Zha, Yulun Zhang, Wenbo Li, Tao Dai, Shu-Tao Xia, and Yawei Li. Mambairv2: Attentive state space restoration, 2024.

[10] Hang Guo, Jinmin Li, Tao Dai, Zhihao Ouyang, Xudong Ren, and Shu-Tao Xia. Mambair: A simple baseline for image restoration with state-space model. In European Conference on Computer Vision, pages 222–241. Springer, 2025.

[11] Ermo Hua, Che Jiang, Xingtai Lv, Kaiyan Zhang, Youbang Sun, Yuchen Fan, Xuekai Zhu, Biqing Qi, Ning Ding, and Bowen Zhou. Fourier position embedding: Enhancing attention’s periodic extension for length generalization. arXiv preprint arXiv:2412.17739, 2024.

[12] Jia-Bin Huang, Abhishek Singh, and Narendra Ahuja. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5197–5206, 2015.

[13] Zheng Hui, Xiumei Wang, and Xinbo Gao. Fast and accurate single image super-resolution via information distillation network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 723–731, 2018.

[14] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1637–1645, 2016.

[15] Cansu Korkmaz and A Murat Tekalp. Training transformer models by wavelet losses improves quantitative and visual performance in single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6661–6670, 2024.

[16] Kin Wai Lau, Lai-Man Po, and Yasar Abbas Ur Rehman. Large separable kernel attention: Rethinking the large kernel attention design in cnn. Expert Systems with Applications, 236:121352, 2024.

[17] Dongheon Lee, Seokju Yun, and Youngmin Ro. Emulating self-attention with convolution for efficient image super-resolution. arXiv preprint arXiv:2503.06671, 2025.

[18] Jingyun Liang, Jiezhang Cao, Guolei Sun, Kai Zhang, Luc Van Gool, and Radu Timofte. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1833–1844, 2021.

[19] Jie Liang, Hui Zeng, and Lei Zhang. Details or artifacts: A locally discriminative learning approach to realistic image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5657–5666, 2022.

[20] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 136–144, 2017.

[21] Jie Liu, Wenjie Zhang, Yuting Tang, Jie Tang, and Gangshan Wu. Residual feature aggregation network for image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2359–2368,

[22] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021.

[23] Wei Long, Xingyu Zhou, Leheng Zhang, and Shuhang Gu. Progressive focused transformer for single image super-resolution. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2279–2288, 2025.

[24] Zhisheng Lu, Juncheng Li, Hong Liu, Chaoyan Huang, Linlin Zhang, and Tieyong Zeng. Transformer for single image super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 457–466,

[25] David Martin, Charless Fowlkes, Doron Tal, and Jitendra Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings eighth IEEE international conference on computer vision. ICCV 2001, pages 416–423. IEEE, 2001.

[26] Yusuke Matsui, Kota Ito, Yuji Aramaki, Azuma Fujimoto, Toru Ogawa, Toshihiko Yamasaki, and Kiyoharu Aizawa. Sketch-based manga retrieval using manga109 dataset. Multimedia tools and applications, 76(20):21811–21838, 2017.

[27] Karam Park, Jae Woong Soh, and Nam Ik Cho. Efficient attention-sharing information distillation transformer for lightweight single image super-resolution. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 6416–6424, 2025.

[28] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In Proceedings of the IEEE/CVF international conference on computer vision, pages 12179–12188, 2021.

[29] Wenzhe Shi, Jose Caballero, Ferenc Huszár, Johannes Totz, Andrew P Aitken, Rob Bishop, Daniel Rueckert, and Zehan Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1874–1883, 2016.

[30] Yuan Shi, Bin Xia, Xiaoyu Jin, Xing Wang, Tianyu Zhao, Xin Xia, Xuefeng Xiao, and Wenming Yang. Vmambair: Visual state space model for image restoration. arXiv preprint arXiv:2403.11423, 2024.

[31] Long Sun, Jinshan Pan, and Jinhui Tang. Shufflemixer: An efficient convnet for image super-resolution. Advances in Neural Information Processing Systems, 35:17314–17326,

[32] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.

[33] Ying Tai, Jian Yang, and Xiaoming Liu. Image super-resolution via deep recursive residual network. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3147–3155, 2017.

[34] Yuchuan Tian, Hanting Chen, Chao Xu, and Yunhe Wang. Image processing gnn: Breaking rigidity in super-resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24117, 2024.

[35] Hang Wang, Xuanhong Chen, Bingbing Ni, Yutian Liu, and Jinfan Liu. Omni aggregation networks for lightweight image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22378–22387, 2023.

[36] Gang Wu, Junjun Jiang, Junpeng Jiang, and Xianming Liu. Transforming image super-resolution: a convformer-based efficient approach. IEEE Transactions on Image Processing,
